I. Introduction and Problem Description
This use case offers a step-by-step guide to applying segmentation algorithms, with a particular focus on their use in a car insurance context. The code for this analysis is available upon request and can be readily adapted for datasets beyond the one discussed here. This project, successfully implemented for a car insurance provider, has been anonymized and adapted for use as a case study.
1. Data Overview
We consider a dataset from a German car insurance provider:
The dataset contains approximately 2,000 rows, with each row representing an insurance contract.
There are 7 columns:
6 columns describe the characteristics of the contract (e.g., contract details and customer attributes).
The 7th column indicates whether there was a claim reported for the contract in the last year.
Here is an overview of the dataset, showing its first six rows:
```
##   schadenfreie_jahre rabattschutz fahrleistung_im_jahr eintrittsalter garage region
## 1                 14            N                   30             19      N   Sued
## 2                  0            N                 5000              3      N   West
## 3                 17            N                 5000             25      J   West
## 4                 36            N                 5000             37      N   West
## 5                 11            N                 5000             15      J   West
## 6                  0            J                 5000             33      N   Nord
##   schadenmeldung
## 1              0
## 2              0
## 3              0
## 4              0
## 5              0
## 6              1
```
2. Objectives of the Analysis
The goal is to detect and analyze which customer segments are more prone to claims.
A reliable prediction of which contract will result in a claim would provide significant economic value.
This insight could help improve risk assessment and inform strategies for claim prevention.
3. Our Strategy and Implementation Plan
Objective Definition:
- We begin by clearly defining the goal of the analysis in more detail, focusing on identifying customer segments that are more prone to claims.
Data Exploration:
We will perform a data plausibility check to ensure data consistency and accuracy.
We will then conduct a descriptive and exploratory analysis with relevant visualizations to better understand the dataset.
Supervised Learning:
We will implement and compare five supervised learning algorithms
- Naive Bayes,
- Logistic Regression,
- (Simple) Decision Tree,
- Random Forest, and
- Random Classification (a featureless baseline)
to estimate the probability of a claim for each contract.
- We will then select the best-performing model based on accuracy or other performance metrics and use it to generate the probability of a claim for each contract.
Clustering for Segmentation:
We will incorporate the probability of claim as an additional feature in the clustering process.
Apply and compare four clustering algorithms:
k-Means Clustering
Hierarchical Clustering
Fuzzy C-Means Clustering
**Featureless or random clustering**
Segment Analysis with the Best-Performing Clustering Method:
Analyze the resulting clusters to identify which customer segments are more likely to file claims.
4. Variable Descriptions
schadenfreie_jahre (numeric): Number of claim-free years for each contract.
rabattschutz (character): Indicates whether the customer has discount protection (“J” for yes, “N” for no).
fahrleistung_im_jahr (numeric): The annual mileage in kilometers for each customer.
eintrittsalter (numeric): Age at which the customer entered the contract.
garage (character): Indicates if the vehicle has been kept in a garage (“J” for yes, “N” for no).
region (character): Region where the customer resides, e.g., “Sued” (South), “West”, etc.
schadenmeldung (numeric): Indicates whether a claim was reported for the contract in the last year (1 for yes, 0 for no).
II. Exploratory Data Analysis
1. Univariate Analysis and Plausibility Check
Here, we inspect each variable individually and check the plausibility of its values.
Outcome variable: schadenmeldung: Claim reported last year? (1 for yes, 0 for no)
```
##    0    1 
## 1735  336
```
**rabattschutz**: discount protection (“J” for yes, “N” for no)
```
##    J    N 
##  312 1759
```
garage: Garage-kept vehicle status (“J” for yes, “N” for no)
```
##    J    N 
##  801 1270
```
region: Region where the customer resides
```
## Nord NORD  Ost Sued West 
##  517   29  479  515  531
```
Note the inconsistent spelling of the northern region ("Nord" vs. "NORD"); we retain both levels here as delivered in the source data.
schadenfreie_jahre: claim-free years
```
##  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.00    0.00    6.00   10.27   18.00   44.00
```
fahrleistung_im_jahr: annual mileage in kilometers
```
##  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    30    5000    5000    7852    5000   30000
```
An annual mileage of 30 kilometers is highly implausible for a car insurance contract. Upon further examination of this variable, we observe that annual mileage in the dataset takes only three distinct values: 30, 5000, and 30,000 kilometers, as detailed in the following frequency analysis.
```
##    30  5000 30000 
##    39  1788   244
```
Following discussions with stakeholders and the data architects responsible for data transformation, it was clarified that this variable is, in fact, a categorical indicator: 30 represents contracts with low annual mileage, 5000 average mileage, and 30,000 high mileage. We will therefore treat this variable as categorical in the analysis. This highlights the critical importance of collaborative engagement with stakeholders.
eintrittsalter: Customer’s contract entry age
```
##  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.00   16.00   26.00   26.46   37.00   71.00
```
Here again, there are some unrealistic values: the age at which the customer entered the contract should not be less than 18 years, since the minimum age for a driving licence in Germany is 18. We therefore delete the contracts for which the entry age is less than 18 years.
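A minimal sketch of this filter in R, assuming the data are held in a data frame named daten (a hypothetical name):

```r
# Keep only contracts whose entry age is at least the German minimum
# driving age of 18 years.
daten <- daten[daten$eintrittsalter >= 18, ]
```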
2. Univariate Analysis after Plausibility Check
Outcome variable: **schadenmeldung**
```
##    0    1 
## 1310  156
```
rabattschutz
```
##    J    N 
##  229 1237
```
garage
```
##   J   N 
## 644 822
```
region
```
## Nord NORD  Ost Sued West 
##  380   21  342  357  366
```
schadenfreie_jahre
```
##  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.0     2.0    11.0    13.3    22.0    44.0
```
fahrleistung_im_jahr
```
##    30  5000 30000 
##    30  1252   184
```
eintrittsalter: Customer’s contract entry age
```
##  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 18.00   25.00   33.00   33.41   40.00   71.00
```
3. Bivariate Analysis after Plausibility Check
Claim Proportion Versus Age at Entry
We observe that the claim proportion is higher for contracts where the entry age is between 35 and 45 years.
The mean and median entry age are higher for contracts with a reported claim in the last year compared to those without a claim.
Claim Proportion Versus Claim-Free Years (schadenfreie_jahre)
There is a tendency for contracts with reported claims last year to have few or no claim-free years preceding that period.
A more comprehensive and detailed pair plot including all available variables will be provided prior to modelling.
Statistical Analysis of Explanatory Variables across the Claim Groups
We analyze the distribution of explanatory variables across the two claim groups to identify differences. For continuous variables, we use parametric tests, specifically one-way ANOVA (equivalent to the t-test in our case, as there are only two groups), to compare means. For categorical variables, we apply Chi-square tests.
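Such a comparison table can be produced, for example, with the tableone package. The following is a sketch under the assumption that the cleaned data are stored in daten; the original tooling is not shown:

```r
library(tableone)

# Compare covariate distributions across the two claim groups. By default,
# CreateTableOne uses one-way ANOVA for continuous variables (a t-test in
# the two-group case) and Chi-square tests for categorical variables.
vars <- c("schadenfreie_jahre", "rabattschutz", "fahrleistung_im_jahr",
          "eintrittsalter", "garage", "region")
tab <- CreateTableOne(vars = vars, strata = "schadenmeldung", data = daten,
                      factorVars = c("rabattschutz", "fahrleistung_im_jahr",
                                     "garage", "region"))
print(tab)
```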
Table: Comparison of Covariate Distributions Across Claim Groups
| | level | 0 | 1 | p |
|---|---|---|---|---|
| n | | 1310 | 156 | |
| schadenfreie_jahre (mean (SD)) | | 14.72 (11.37) | 1.36 (5.31) | <0.001 |
| rabattschutz (%) | J | 183 (14.0) | 46 (29.5) | <0.001 |
| | N | 1127 (86.0) | 110 (70.5) | |
| fahrleistung_im_jahr (%) | 30 | 25 (1.9) | 5 (3.2) | 0.196 |
| | 5000 | 1126 (86.0) | 126 (80.8) | |
| | 30000 | 159 (12.1) | 25 (16.0) | |
| eintrittsalter (mean (SD)) | | 33.18 (9.77) | 35.30 (9.39) | 0.010 |
| garage (%) | J | 583 (44.5) | 61 (39.1) | 0.230 |
| | N | 727 (55.5) | 95 (60.9) | |
| region (%) | Nord | 332 (25.3) | 48 (30.8) | 0.382 |
| | NORD | 20 (1.5) | 1 (0.6) | |
| | Ost | 302 (23.1) | 40 (25.6) | |
| | Sued | 324 (24.7) | 33 (21.2) | |
| | West | 332 (25.3) | 34 (21.8) | |
We observe significant differences between the two claim groups (claims vs. no claims in the last year) for the explanatory variables schadenfreie_jahre, eintrittsalter, and rabattschutz. These findings suggest that these three variables may play a crucial role in influencing the decision to file a claim.
Propensity Score Matching followed by a Comparison of Covariate Distributions across Claim Groups
Although Propensity Score Matching (PSM) is traditionally used in case-control studies, our analysis is not a case-control design. Instead, we use PSM to ensure that the explanatory variables are balanced across claim groups. This balance reduces the likelihood that observed differences are due to variations in these variables, thereby allowing us to assess the true relationships in our study more accurately.
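As a sketch, the matching could be carried out with the MatchIt package (hypothetical object names; the original implementation is not shown):

```r
library(MatchIt)

# 1:1 nearest-neighbour propensity score matching of contracts with a
# claim (schadenmeldung == 1) to contracts without one.
m <- matchit(schadenmeldung ~ schadenfreie_jahre + rabattschutz +
               fahrleistung_im_jahr + eintrittsalter + garage + region,
             data = daten, method = "nearest", ratio = 1)
daten_matched <- match.data(m)  # yields the 156 + 156 matched contracts
```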
Table: Comparison of Covariate Distributions Across Claim Groups after Propensity Score Matching
| | level | 0 | 1 | p |
|---|---|---|---|---|
| n | | 156 | 156 | |
| schadenfreie_jahre (mean (SD)) | | 2.31 (5.02) | 1.36 (5.31) | 0.106 |
| rabattschutz (%) | J | 27 (17.3) | 46 (29.5) | 0.016 |
| | N | 129 (82.7) | 110 (70.5) | |
| fahrleistung_im_jahr (%) | 30 | 4 (2.6) | 5 (3.2) | 0.757 |
| | 5000 | 131 (84.0) | 126 (80.8) | |
| | 30000 | 21 (13.5) | 25 (16.0) | |
| eintrittsalter (mean (SD)) | | 34.76 (6.84) | 35.30 (9.39) | 0.563 |
| garage (%) | J | 54 (34.6) | 61 (39.1) | 0.481 |
| | N | 102 (65.4) | 95 (60.9) | |
| region (%) | Nord | 46 (29.5) | 48 (30.8) | 0.967 |
| | NORD | 2 (1.3) | 1 (0.6) | |
| | Ost | 40 (25.6) | 40 (25.6) | |
| | Sued | 36 (23.1) | 33 (21.2) | |
| | West | 32 (20.5) | 34 (21.8) | |
After propensity score matching, a significant difference is observed only for the explanatory variable rabattschutz.
Using a non-parametric test (Kruskal-Wallis Rank Sum Test, equivalent to the Wilcoxon test or Mann-Whitney U test for two groups) to compare median values, rather than a parametric test (one-way ANOVA) to compare means, a significant difference is identified in the variable schadenfreie_jahre (the number of claim-free years per contract) (results not shown).
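Such a test is a one-liner in R; a sketch, reusing the hypothetical matched data frame daten_matched from the sketch above:

```r
# Non-parametric comparison of claim-free years across the matched claim
# groups (for two groups this is equivalent to a Mann-Whitney U test).
kruskal.test(schadenfreie_jahre ~ schadenmeldung, data = daten_matched)
```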
It is logical to consider that the number of claim-free years for each contract plays a critical role in influencing claim probability: we anticipate that contracts with a long history of claim-free years will have a lower likelihood of future claims. In contrast, the presence of discount protection (Rabattschutz) is less straightforward to interpret, though it also provides meaningful insights. Stakeholders may provide additional explanations.
Interestingly, the annual mileage (fahrleistung_im_jahr) shows no significant difference between the two groups, both before and after Propensity Score Matching. This may be attributable to the categorization of mileage values into three broad classes.
III. Modelling
Seventy percent of the dataset is randomly selected for model training, while the remaining thirty percent is reserved for testing.
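A sketch of this split with mlr3, the framework used throughout the modelling; object names are illustrative:

```r
library(mlr3)

# Define the classification task and split it 70/30 into training and
# test row ids.
daten$schadenmeldung <- factor(daten$schadenmeldung)
task <- as_task_classif(daten, target = "schadenmeldung", positive = "1")
set.seed(1)
split <- partition(task, ratio = 0.7)  # split$train and split$test
```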
Before proceeding with model building, let us conduct an additional summary of the bivariate statistical analysis for the training dataset. We anticipate similar behavior to that observed in the complete dataset described above.
The green color represents contracts with claims filed last year.
For instance, the contract entry age correlates more strongly with the other covariates for contracts without a claim than for those with a claim.
On average, the entry age is higher for contracts with claims filed last year compared to those without. This trend is observed among both customers with and without discount protection. However, the median and mean entry ages are higher for contracts without discount protection.
The distribution of prior claim-free years differs between contracts that filed a claim last year and those that did not.
1. Comparing the Performance of Common and Appropriate Supervised Learning Algorithms
We have selected four methods pertaining to different classes/ideas of supervised learning algorithms, plus a baseline:
Naive Bayes, which pertains to the class of Bayes Rules,
Logistic Regression, which pertains to the class of Odds and Regression,
(Simple) Decision Tree, which pertains to the class of Trees,
Random Forest, which pertains to the class of Trees and Bagging, and
Random Classification, a featureless baseline generally used for comparison.
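A minimal sketch of this comparison as an mlr3 benchmark; the learner keys come from mlr3 and mlr3learners, and the resampling choice is an assumption:

```r
library(mlr3)
library(mlr3learners)

# The five candidate learners; the featureless learner with method =
# "sample" serves as the random-classification baseline.
learners <- list(
  lrn("classif.naive_bayes", predict_type = "prob"),
  lrn("classif.log_reg",     predict_type = "prob"),
  lrn("classif.rpart",       predict_type = "prob"),
  lrn("classif.ranger",      predict_type = "prob"),
  lrn("classif.featureless", predict_type = "prob", method = "sample")
)
design <- benchmark_grid(task, learners, rsmp("cv", folds = 5))
bmr <- benchmark(design)
bmr$aggregate(msr("classif.acc"))  # compare mean accuracy per learner
```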
The Random Forest algorithm (classif.ranger) demonstrated the best performance and has been selected for further optimization.
2. Optimization of the Best Performing Supervised Learning Model
**Variable Importance: Filter Selection**
Here, we use a filter method to evaluate features based on heuristics derived from the general characteristics of the data. This approach is independent of the supervised learning algorithm (Random Forest).
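A sketch with mlr3filters; the concrete heuristic shown here (information gain) is an assumption, and other filter scores work analogously:

```r
library(mlr3filters)

# Score each feature independently of the learner and rank by score.
filter <- flt("information_gain")
filter$calculate(task)
as.data.table(filter)  # features ranked by filter score
```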
We observe that, regardless of the supervised learning method applied, the Number of Claim-Free Years (schadenfreie_jahre) per contract consistently emerges as a highly significant variable. The Age at Contract Entry (eintrittsalter) also ranks prominently, while the score for the Discount Protection Indicator (rabattschutz) remains substantial. These findings align well with the earlier statistical tests, which showed significant differences between the two claim groups (claims vs. no claims in the last year) for precisely these three variables.
The remaining variables appear to contribute less substantially. In particular, the variable garage has the smallest score.
Addressing Class Imbalance: Undersampling the Majority Class
After comparing three methods:
Undersampling the majority class,
Oversampling the minority class, and
SMOTE
We selected the optimal approach based on accuracy for the supervised learning algorithms detailed below. Although undersampling excludes some data and results in information loss, it provided both the highest accuracy and the fastest training time for the chosen models in our case.
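A sketch of how the undersampling step can be prepended to the Random Forest learner with mlr3pipelines; the ratio is illustrative and would be tuned in practice:

```r
library(mlr3pipelines)

# Downsample the majority class (no claim) relative to the minority class,
# then feed the balanced data into the Random Forest.
po_under <- po("classbalancing", id = "undersample", adjust = "major",
               reference = "minor", shuffle = FALSE, ratio = 1)
glrn <- as_learner(po_under %>>% lrn("classif.ranger", predict_type = "prob"))
```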
Here is the class outcome frequency (filed claim or not) after undersampling.
Class Distribution after Undersampling
```
## 
##   0   1 
## 183 109
```
Variable Importance: Wrapper Selection
This method operates by fitting models on selected feature subsets and evaluating their performance. The chosen performance metric here is the Area Under the Curve (AUC), used to compare various feature sets to identify the best-performing combination. Feature selection can be conducted sequentially, such as by iteratively adding features in a forward selection approach, or in parallel. This method is applied specifically to the selected supervised learning model, Random Forest. Here is the approach applied in this specific example:
Feature Subset Proposal
- The selection algorithm generates one or more feature subsets for evaluation, potentially processing multiple subsets in parallel.
Model Training and Evaluation
- For each proposed subset, the specified learner is trained on the training set using a resampling method (holdout method in this case) and evaluated based on the Area Under the Curve (AUC) metric.
Result Archiving
- All evaluation results are stored in an archive for reference and analysis.
Termination Check
- The process continues iteratively until the termination criteria are met. If the criteria are not triggered, the algorithm returns to the beginning to propose new feature subsets.
Best Subset Selection
- The feature subset with the best performance, as observed through the evaluations, is identified.
Final Output
- The best-performing feature subset is stored as the final result.
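A compact sketch of this wrapper selection with mlr3fselect; the search strategy, resampling, and termination settings are illustrative assumptions:

```r
library(mlr3fselect)

# Sequential forward selection, evaluating each candidate feature subset
# with the Random Forest pipeline on a holdout split, scored by AUC.
instance <- fselect(
  fselector  = fs("sequential", strategy = "sfs"),
  task       = task,
  learner    = glrn,
  resampling = rsmp("holdout"),
  measures   = msr("classif.auc"),
  term_evals = 20
)
instance$result_feature_set  # best-performing feature subset
```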
## [1] "eintrittsalter" "fahrleistung_im_jahr" "rabattschutz" ## [4] "region" "schadenfreie_jahre"
The variables displayed above are the key features selected and included in the final Random Forest training model. Only the variable garage, which also received the lowest score in the filtering algorithm, was not identified as important and has therefore been excluded from the Random Forest model.
Performance Achieved (AUC) with the Final Selected Set of Variables Using the Random Forest Algorithm
```
## classif.auc 
##   0.9561648
```
The performance achieved is excellent.
Performance of the Model on the Test Dataset
The confusion matrix is given by:
```
##         truth
## response   0   1
##        0 360   9
##        1  33  38
```
The classification accuracy, i.e. the proportion of correct predictions (360 + 38) / 440, is then given by:
```
## classif.acc 
##   0.9045455
```
The performance achieved on the test dataset is very satisfactory.
Let us consider an additional performance measure of the model: the ROC curve.
The ROC curve is very close to the top-left corner, indicating strong classification performance and an Area Under the Curve close to 1, which again is very satisfactory.
3. Prediction: Probability of Claim to be Included in the Clustering Algorithm
Integrating the Predicted Claim Probability with Initial Data for Enhanced Clustering Analysis
After validating the model’s performance on the test set and confirming its high accuracy, we now apply it to calculate the probability of filing a claim across the complete dataset.
We incorporate the claim probability into the dataset to include it as a feature in the clustering algorithm.
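A sketch of this step, reusing the hypothetical objects task, glrn, and split from the sketches above:

```r
# Restrict the task to the selected features, train the final model on the
# training rows, then predict claim probabilities for all contracts.
task$select(instance$result_feature_set)
glrn$train(task, row_ids = split$train)
pred <- glrn$predict(task)                    # complete dataset
datanew <- cbind(daten, as.data.table(pred))  # adds truth, response, prob.*
datanew$prob <- datanew$prob.1                # probability of a claim
```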
Here are six rows of the data set:
```
##      schadenfreie_jahre rabattschutz fahrleistung_im_jahr eintrittsalter garage region
## 241                  19            N                 5000             27      N   West
## 1965                 20            J                 5000             43      J    Ost
## 346                  10            J                 5000             26      N   Sued
## 1030                 38            N                 5000             40      J    Ost
## 265                   0            N                 5000             32      J    Ost
## 1539                  1            J                 5000             24      N   Sued
##      schadenmeldung row_ids truth response    prob.0       prob
## 241               0       1     0        0 0.9820064 0.01799357
## 1965              0       2     0        0 0.7684706 0.23152937
## 346               0       3     0        0 0.9707732 0.02922682
## 1030              0       4     0        0 0.6944640 0.30553599
## 265               0       5     0        0 0.6667064 0.33329356
## 1539              1       6     1        1 0.1039377 0.89606226
```
At this stage, after predicting the claim probability for each contract, this probability can serve as a criterion to identify contracts that are more likely to result in claims.
In practice, and sometimes for simplicity, the predicted probabilities can be used to classify contracts into two or more groups. However, this approach requires selecting appropriate cutoff points, which can be challenging as well. To address this, we propose exploring the presence of natural groups or segments (clusters) among the contracts by incorporating claim probability into a cluster analysis to identify these segments effectively.
4. Clustering for Segmentation
Here are the features in the dataset, including the predicted claim probability for each contract.
## [1] "schadenfreie_jahre" "rabattschutz" "fahrleistung_im_jahr" ## [4] "eintrittsalter" "garage" "region" ## [7] "prob"
We standardize the data to improve clustering performance. Below are the first six rows of the standardized dataset.
```
##    schadenfreie_jahre eintrittsalter        prob rabattschutzN   garageN regionNORD regionOst
##                 <num>          <num>       <num>         <num>     <num>      <num>     <num>
## 1:          0.4898250     -0.6574591 -0.66688618     0.4301151  0.884828 -0.1205112 -0.551419
## 2:          0.5757613      0.9841590  0.08285544    -2.3233730 -1.129392 -0.1205112  1.812266
## 3:         -0.2836014     -0.7600602 -0.62744533    -2.3233730  0.884828 -0.1205112 -0.551419
## 4:          2.1226142      0.6763556  0.34269876     0.4301151 -1.129392 -0.1205112  1.812266
## 5:         -1.1429641     -0.1444534  0.44015785     0.4301151 -1.129392 -0.1205112  1.812266
## 6:         -1.0570279     -0.9652624  2.41608478    -2.3233730  0.884828 -0.1205112 -0.551419
##    regionSued regionWest fahrleistung_im_jahr5000 fahrleistung_im_jahr30000
##         <num>      <num>                    <num>                     <num>
## 1:  -0.567179  1.7330362                0.4132916                -0.3787187
## 2:  -0.567179 -0.5766284                0.4132916                -0.3787187
## 3:   1.761909 -0.5766284                0.4132916                -0.3787187
## 4:  -0.567179 -0.5766284                0.4132916                -0.3787187
## 5:  -0.567179 -0.5766284                0.4132916                -0.3787187
## 6:   1.761909 -0.5766284                0.4132916                -0.3787187
```
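A sketch of how such an encoding and standardization can be produced in base R (hypothetical object names):

```r
library(data.table)

# Dummy-encode the categorical features and standardize all columns to
# zero mean and unit variance before clustering.
datanew$fahrleistung_im_jahr <- factor(datanew$fahrleistung_im_jahr)
X <- model.matrix(~ schadenfreie_jahre + eintrittsalter + prob + rabattschutz +
                    garage + region + fahrleistung_im_jahr,
                  data = datanew)[, -1]       # drop the intercept column
datanew_scaled <- as.data.table(scale(X))
```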
Candidate Cluster Methods
We apply and compare four clustering algorithms:
k-Means Clustering
Fuzzy C-Means Clustering
Hierarchical Clustering
Featureless or random clustering
Comparison Criteria
We consider two criteria to compare the performance of the four clustering methods:
- Within Sum of Squares (WSS, clust.wss): the sum of squared differences between observations and their centroids, which quantifies cluster cohesion (smaller values indicate more compact clusters), and
- Silhouette coefficient (clust.silhouette): quantifies how well each point belongs to its assigned cluster versus neighboring clusters; scores closer to 1 indicate well-clustered points, and scores closer to -1 indicate poorly clustered ones.
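A sketch of this benchmark with mlr3cluster, assuming k = 4 clusters as in the subsequent segment analysis:

```r
library(mlr3cluster)

# Benchmark the four clustering learners on the standardized data and
# aggregate the two comparison measures.
task_clust <- as_task_clust(datanew_scaled, id = "datanew_scaled")
cl_learners <- list(
  lrn("clust.featureless", num_clusters = 4),
  lrn("clust.hclust", k = 4),
  lrn("clust.kmeans", centers = 4),
  lrn("clust.cmeans", centers = 4)
)
design <- benchmark_grid(task_clust, cl_learners, rsmp("insample"))
bmr_clust <- benchmark(design)
bmr_clust$aggregate(msrs(c("clust.wss", "clust.silhouette")))
```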
Below are the comparison results:
```
## INFO  [21:50:13.635] [mlr3] Running benchmark with 4 resampling iterations
## INFO  [21:50:13.647] [mlr3] Applying learner 'clust.featureless' on task 'datanew_scaled' (iter 1/1)
## INFO  [21:50:13.668] [mlr3] Applying learner 'clust.hclust' on task 'datanew_scaled' (iter 1/1)
## INFO  [21:50:13.923] [mlr3] Applying learner 'clust.kmeans' on task 'datanew_scaled' (iter 1/1)
## INFO  [21:50:13.988] [mlr3] Applying learner 'clust.cmeans' on task 'datanew_scaled' (iter 1/1)
## INFO  [21:50:14.186] [mlr3] Finished benchmark
```
```
##           learner_id clust.wss clust.silhouette
##               <char>     <num>            <num>
## 1: clust.featureless  16115.00       0.00000000
## 2:      clust.hclust  11974.45       0.31645447
## 3:      clust.kmeans  11830.11       0.17654499
## 4:      clust.cmeans  14287.15       0.07575615
```
The hierarchical clustering method emerges as the most effective among those considered, as it achieves the silhouette coefficient closest to one. Together with k-means clustering, it also yields the smallest Within Sum of Squares, indicating better clustering performance.
Next, we proceed with the Hierarchical Clustering method to conduct our cluster analysis, focusing on optimizing cluster performance.
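A sketch of this step, again with four clusters:

```r
# Fit hierarchical clustering on the scaled data and extract the cluster
# assignments (returned in the `partition` column of the prediction).
lrn_hclust <- lrn("clust.hclust", k = 4)
lrn_hclust$train(task_clust)
pred_hclust <- lrn_hclust$predict(task_clust)
```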
Here are some observations of the clustering results stored in the variable partition:
```
## <PredictionClust> for 1466 observations:
##  row_ids partition
##        1         1
##        2         1
##        3         2
##      ---       ---
##     1464         1
##     1465         1
##     1466         1
```
We merged the predictions with the initial dataset:
```
##      schadenfreie_jahre rabattschutz fahrleistung_im_jahr eintrittsalter garage region
## 241                  19            N                 5000             27      N   West
## 1965                 20            J                 5000             43      J    Ost
## 346                  10            J                 5000             26      N   Sued
## 1030                 38            N                 5000             40      J    Ost
## 265                   0            N                 5000             32      J    Ost
## 1539                  1            J                 5000             24      N   Sued
##      schadenmeldung       prob row_ids partition
## 241               0 0.01799357       1         1
## 1965              0 0.23152937       2         1
## 346               0 0.02922682       3         2
## 1030              0 0.30553599       4         1
## 265               0 0.33329356       5         1
## 1539              1 0.89606226       6         2
```
Let us perform a bivariate descriptive analysis of the clusters (as defined in the variable partition):
With four clusters, one cluster appears relatively diffuse and overlaps with all the others. This may indicate that the data have been split into more clusters than their structure supports.
Silhouette plot from predictions
The plot includes:
- a dotted line, which visualizes the average silhouette coefficient across all data points, and
- a bar for each data point's silhouette value, colored by its assigned cluster.
In our particular case, the average silhouette index is about 0.15. If the average silhouette value for a given cluster falls below this overall average line, the cluster is not well defined.
In our application, most observations fall below the average line for the majority of the segments, indicating that the quality of the cluster assignments is suboptimal for these segments. This suggests that many observations may have been assigned to incorrect clusters. However, it is important to note that no segment lies entirely below the average line.
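Such a silhouette plot can be generated, for example, with mlr3viz, reusing pred_hclust and task_clust from the sketches above:

```r
library(mlr3viz)

# Silhouette plot: one bar per observation, colored by assigned cluster,
# with a dotted line at the average silhouette coefficient.
autoplot(pred_hclust, task_clust, type = "sil")
```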
Analysis of Feature Distribution Across Clusters
Table: Feature Distribution Across Different Cluster Segments
| | level | 1 | 2 | 3 | 4 | p |
|---|---|---|---|---|---|---|
| n | | 833 | 430 | 21 | 182 | |
| schadenfreie_jahre (median [IQR]) | | 15.00 [6.00, 25.00] | 2.00 [0.00, 11.00] | 14.00 [7.00, 21.00] | 17.00 [6.25, 25.00] | <0.001 |
| rabattschutz (%) | J | 52 (6.2) | 142 (33.0) | 5 (23.8) | 30 (16.5) | <0.001 |
| | N | 781 (93.8) | 288 (67.0) | 16 (76.2) | 152 (83.5) | |
| fahrleistung_im_jahr (%) | 30 | 21 (2.5) | 4 (0.9) | 2 (9.5) | 3 (1.6) | <0.001 |
| | 5000 | 812 (97.5) | 426 (99.1) | 14 (66.7) | 0 (0.0) | |
| | 30000 | 0 (0.0) | 0 (0.0) | 5 (23.8) | 179 (98.4) | |
| eintrittsalter (median [IQR]) | | 35.00 [26.00, 42.00] | 30.00 [24.00, 37.00] | 31.00 [26.00, 36.00] | 34.00 [26.00, 39.00] | <0.001 |
| garage (%) | J | 394 (47.3) | 166 (38.6) | 7 (33.3) | 77 (42.3) | 0.019 |
| | N | 439 (52.7) | 264 (61.4) | 14 (66.7) | 105 (57.7) | |
| region (%) | Nord | 264 (31.7) | 63 (14.7) | 0 (0.0) | 53 (29.1) | <0.001 |
| | NORD | 0 (0.0) | 0 (0.0) | 21 (100.0) | 0 (0.0) | |
| | Ost | 228 (27.4) | 76 (17.7) | 0 (0.0) | 38 (20.9) | |
| | Sued | 93 (11.2) | 220 (51.2) | 0 (0.0) | 44 (24.2) | |
| | West | 248 (29.8) | 71 (16.5) | 0 (0.0) | 47 (25.8) | |
| schadenmeldung (%) | 0 | 813 (97.6) | 321 (74.7) | 20 (95.2) | 156 (85.7) | <0.001 |
| | 1 | 20 (2.4) | 109 (25.3) | 1 (4.8) | 26 (14.3) | |
| prob (median [IQR]) | | 0.03 [0.01, 0.15] | 0.32 [0.02, 0.73] | 0.03 [0.01, 0.26] | 0.06 [0.04, 0.29] | <0.001 |
We observe statistically significant differences across the four cluster segments for all features.
Key Observations on Cluster Segments
Segment 2: High-Risk Participants
This is the most critical segment, representing participants with a high probability of filing a claim.
The median claim probability (with IQR) in Segment 2 is 0.32 [0.02, 0.73], compared to 0.03 [0.01, 0.15], 0.03 [0.01, 0.26], and 0.06 [0.04, 0.29] in the other segments 1, 3, and 4, respectively.
Discount protection is more common in Segment 2 (33.0%) than in Segment 1 (6.2%), Segment 3 (23.8%), and Segment 4 (16.5%).
The median prior claim-free years in Segment 2 is markedly lower, 2.00 [0.00, 11.00], compared to Segment 1 (15.00 [6.00, 25.00]), Segment 3 (14.00 [7.00, 21.00]), and Segment 4 (17.00 [6.25, 25.00]).
The annual mileage for contracts in Segment 2 is nearly always 5000, with 99.1% of contracts reporting this value.
Contracts in Segment 2 predominantly originate from the Sued region (51.2%).
Claims were reported last year mostly for contracts in Segment 2 (25.3%, versus at most 14.3% in the other segments).
It is important to note that the clustering algorithm did not use the outcome variable schadenmeldung (claims filed last year) directly. Instead, it leveraged the predicted claim probability generated by a supervised learning model with high accuracy. The fact that Segment 2 has the highest percentage of claims filed last year serves as an indication of a well-functioning methodology and effective application.
IV. Summary and Conclusion
We successfully identified four distinct contract clusters, although the clusters were not entirely homogeneous. The initial goal of creating four clusters was achieved, enabling a focus on the higher-risk cluster first, with the possibility of addressing other clusters in subsequent stages.
An alternative approach could have been to use a supervised learning algorithm to classify contracts into two or more groups based on predicted claim probabilities. However, this method involves selecting appropriate cutoff points, which can be just as challenging as implementing clustering techniques.
Upon examining the clusters, we observed that some segments share notable similarities and could potentially be merged for specific applications, simplifying further analysis and decision-making.
Segment 2 emerged as the group with the highest likelihood of filing claims. This segment should be prioritized for targeted marketing efforts and risk mitigation strategies, as it represents the most critical group requiring intervention.
The next step involves deploying the model into production, a process that includes several critical stages such as model retraining, monitoring, and integration into operational workflows. Each of these steps will be addressed in detail within separate use cases to ensure a comprehensive understanding and effective implementation.