I. Introduction and Problem Description

This use case offers a step-by-step guide to applying segmentation algorithms, with a particular focus on their use in a car insurance context. The code for this analysis is available upon request and can be readily adapted for datasets beyond the one discussed here. This project, successfully implemented for a car insurance provider, has been anonymized and adapted for use as a case study.


1. Data Overview

We consider a data set from a German car insurance provider:

  • The dataset contains approximately 2,000 rows, with each row representing an insurance contract.

  • There are 7 columns:

    • 6 columns describe the characteristics of the contract (e.g., contract details and customer attributes).

    • The 7th column indicates whether there was a claim reported for the contract in the last year.


Here is an overview of the data set; we display its first six rows:


##   schadenfreie_jahre rabattschutz fahrleistung_im_jahr eintrittsalter garage region
## 1                 14            N                   30             19      N   Sued
## 2                  0            N                 5000              3      N   West
## 3                 17            N                 5000             25      J   West
## 4                 36            N                 5000             37      N   West
## 5                 11            N                 5000             15      J   West
## 6                  0            J                 5000             33      N   Nord
##   schadenmeldung
## 1              0
## 2              0
## 3              0
## 4              0
## 5              0
## 6              1

2. Objectives of the Analysis


  • The goal is to detect and analyze which customer segments are more prone to claims.

  • A reliable prediction of which contract will result in a claim would provide significant economic value.

  • This insight could help improve risk assessment and inform strategies for claim prevention.


3. Our Strategy and Implementation Plan


  • Objective Definition:

    • We begin by clearly defining the goal of the analysis in more detail, focusing on identifying customer segments that are more prone to claims.

  • Data Exploration:

    • We will perform a data plausibility check to ensure data consistency and accuracy.

    • We will then conduct a descriptive and exploratory analysis with relevant visualizations to better understand the dataset.


  • Supervised Learning:

    • We will comparatively implement five supervised learning algorithms

      • Naive Bayes,
      • Logistic Regression,
      • (Simple) Decision Tree,
      • Random Forest, and
      • Random Classification (a baseline)

    to estimate the probability of a claim for each contract.

    • We will then select the best-performing model based on accuracy or other performance metrics and use it to generate the probability of a claim for each contract.

  • Clustering for Segmentation:

    • We will incorporate the probability of claim as an additional feature in the clustering process.

    • Apply and compare four clustering algorithms:

      • k-Means Clustering

      • Hierarchical Clustering

      • Fuzzy C-Means Clustering

      • Featureless or random clustering

  • Segment Analysis with the Better Performing Clustering Method:

    • Analyze the resulting clusters to identify which customer segments are more likely to file claims.


4. Variable Descriptions

  • schadenfreie_jahre (numeric): Number of claim-free years for each contract.

  • rabattschutz (character): Indicates whether the customer has discount protection (“J” for yes, “N” for no).

  • fahrleistung_im_jahr (numeric): The annual mileage in kilometers for each customer.

  • eintrittsalter (numeric): Age at which the customer entered the contract.

  • garage (character): Indicates if the vehicle has been kept in a garage (“J” for yes, “N” for no).

  • region (character): Region where the customer resides, e.g., “Sued” (South), “West”, etc.

  • schadenmeldung (numeric): Indicates whether a claim was reported for the contract in the last year (1 for yes, 0 for no).


II. Exploratory Data Analysis


1. Univariate Analysis and Plausibility Check

Here, we inspect each variable individually and check the plausibility of the values of the variables.


Outcome variable: schadenmeldung: Claim reported last year? (1 for yes, 0 for no)

##    0    1 
## 1735  336

rabattschutz: discount protection (“J” for yes, “N” for no)

##    J    N 
##  312 1759

garage: Garage-kept vehicle status (“J” for yes, “N” for no)

##    J    N 
##  801 1270

region: Region where the customer resides

## Nord NORD  Ost Sued West 
##  517   29  479  515  531

schadenfreie_jahre: claim-free years

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    6.00   10.27   18.00   44.00

fahrleistung_im_jahr: annual mileage in kilometers

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      30    5000    5000    7852    5000   30000

An annual mileage of 30 kilometers is highly atypical for insurance companies. Upon further examination of this variable, we observe that annual mileage in the dataset is limited to just three distinct values: 30, 5000, and 30,000 kilometers, as detailed in the following frequency analysis.


##    30  5000 30000 
##    39  1788   244

Following discussions with stakeholders and the data architects responsible for data transformation, it was clarified that this variable is in fact a categorical indicator: 30 represents contracts with low annual mileage, 5000 average mileage, and 30000 high mileage. We will therefore treat this variable as categorical in the analysis. This highlights the critical importance of collaborative engagement with stakeholders.



eintrittsalter: Customer’s contract entry age

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   16.00   26.00   26.46   37.00   71.00

Here again, there are some unrealistic values:

The age at which a customer enters the contract should not be below 18 years, since 18 is the minimum age for holding a driving licence in Germany. We therefore delete the contracts with an entry age below 18 years.
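A minimal sketch of these two plausibility fixes in R (assuming the data are loaded in a data frame named data; object names are illustrative):

# treat annual mileage as a categorical variable, as agreed with stakeholders
data$fahrleistung_im_jahr <- factor(data$fahrleistung_im_jahr)

# drop contracts with an implausible entry age (below 18 years)
data <- subset(data, eintrittsalter >= 18)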


2. Univariate Analysis after Plausibility Check


Outcome variable: schadenmeldung

##    0    1 
## 1310  156

rabattschutz

##    J    N 
##  229 1237

garage

##   J   N 
## 644 822

region

## Nord NORD  Ost Sued West 
##  380   21  342  357  366

schadenfreie_jahre

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     2.0    11.0    13.3    22.0    44.0

fahrleistung_im_jahr

##    30  5000 30000 
##    30  1252   184

eintrittsalter: Customer’s contract entry age

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   25.00   33.00   33.41   40.00   71.00

3. Bivariate Analysis after Plausibility Check


Claim Proportion Versus Age at Entry


[Figure: claim proportion by age at entry]


We observe that the claim proportion is higher for contracts where the entry age is between 35 and 45 years.


[Figure: distribution of entry age by claim status]


The mean and median entry age are higher for contracts with a reported claim in the last year compared to those without a claim.


Claim Proportion versus Claim-Free Years (schadenfreie_jahre)


[Figure: claim proportion by claim-free years]


[Figure: distribution of claim-free years by claim status]


There is a tendency for contracts with reported claims last year to have few or no claim-free years preceding that period.


A more comprehensive and detailed pair plot including all available variables will be provided prior to modelling.


Statistical Analysis of Explanatory Variables across the Claim Groups


We analyze the distribution of explanatory variables across the two claim groups to identify differences. For continuous variables, we use parametric tests, specifically one-way ANOVA (equivalent to the t-test in our case, as there are only two groups), to compare means. For categorical variables, we apply Chi-square tests.
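The comparison table below has the structure produced by the R package tableone; a minimal sketch along those lines (assuming the cleaned data in data; the exact settings used in the project may differ):

library(tableone)
vars <- c("schadenfreie_jahre", "rabattschutz", "fahrleistung_im_jahr",
          "eintrittsalter", "garage", "region")
# continuous variables: ANOVA/t-test; categorical variables: chi-square test
tab <- CreateTableOne(vars = vars, strata = "schadenmeldung", data = data)
print(tab)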


Table: Comparison of Covariate Distributions Across Claim Groups

| Variable                       | Level | 0 (no claim)  | 1 (claim)    | p      |
|--------------------------------|-------|---------------|--------------|--------|
| n                              |       | 1310          | 156          |        |
| schadenfreie_jahre (mean (SD)) |       | 14.72 (11.37) | 1.36 (5.31)  | <0.001 |
| rabattschutz (%)               | J     | 183 (14.0)    | 46 (29.5)    | <0.001 |
|                                | N     | 1127 (86.0)   | 110 (70.5)   |        |
| fahrleistung_im_jahr (%)       | 30    | 25 (1.9)      | 5 (3.2)      | 0.196  |
|                                | 5000  | 1126 (86.0)   | 126 (80.8)   |        |
|                                | 30000 | 159 (12.1)    | 25 (16.0)    |        |
| eintrittsalter (mean (SD))     |       | 33.18 (9.77)  | 35.30 (9.39) | 0.010  |
| garage (%)                     | J     | 583 (44.5)    | 61 (39.1)    | 0.230  |
|                                | N     | 727 (55.5)    | 95 (60.9)    |        |
| region (%)                     | Nord  | 332 (25.3)    | 48 (30.8)    | 0.382  |
|                                | NORD  | 20 (1.5)      | 1 (0.6)      |        |
|                                | Ost   | 302 (23.1)    | 40 (25.6)    |        |
|                                | Sued  | 324 (24.7)    | 33 (21.2)    |        |
|                                | West  | 332 (25.3)    | 34 (21.8)    |        |

We observe significant differences between the two claim groups (claims vs. no claims in the last year) for the explanatory variables schadenfreie_jahre, eintrittsalter, and rabattschutz. These findings suggest that these three variables may play a crucial role in influencing the decision to file a claim.


Propensity Score Matching followed by a Comparison of Covariate Distributions across Claim Groups


Although Propensity Score Matching (PSM) is traditionally used in case-control studies, our analysis is not a case-control design. Instead, we use PSM to ensure that the explanatory variables are balanced across the claim groups. This balance reduces the likelihood that observed differences are due to variations in these variables, thereby allowing us to assess the true relationships in our data more accurately.
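A minimal sketch of such a matching in R, assuming the MatchIt package (the exact matching settings used in the project are not shown):

library(MatchIt)
# nearest-neighbour 1:1 matching on the propensity score of reporting a claim
m <- matchit(schadenmeldung ~ schadenfreie_jahre + rabattschutz +
               fahrleistung_im_jahr + eintrittsalter + garage + region,
             data = data, method = "nearest", ratio = 1)
data_matched <- match.data(m)  # 156 claim contracts and 156 matched no-claim contracts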


Table: Comparison of Covariate Distributions Across Claim Groups after Propensity Score Matching

| Variable                       | Level | 0 (matched no claim) | 1 (claim)    | p     |
|--------------------------------|-------|----------------------|--------------|-------|
| n                              |       | 156                  | 156          |       |
| schadenfreie_jahre (mean (SD)) |       | 2.31 (5.02)          | 1.36 (5.31)  | 0.106 |
| rabattschutz (%)               | J     | 27 (17.3)            | 46 (29.5)    | 0.016 |
|                                | N     | 129 (82.7)           | 110 (70.5)   |       |
| fahrleistung_im_jahr (%)       | 30    | 4 (2.6)              | 5 (3.2)      | 0.757 |
|                                | 5000  | 131 (84.0)           | 126 (80.8)   |       |
|                                | 30000 | 21 (13.5)            | 25 (16.0)    |       |
| eintrittsalter (mean (SD))     |       | 34.76 (6.84)         | 35.30 (9.39) | 0.563 |
| garage (%)                     | J     | 54 (34.6)            | 61 (39.1)    | 0.481 |
|                                | N     | 102 (65.4)           | 95 (60.9)    |       |
| region (%)                     | Nord  | 46 (29.5)            | 48 (30.8)    | 0.967 |
|                                | NORD  | 2 (1.3)              | 1 (0.6)      |       |
|                                | Ost   | 40 (25.6)            | 40 (25.6)    |       |
|                                | Sued  | 36 (23.1)            | 33 (21.2)    |       |
|                                | West  | 32 (20.5)            | 34 (21.8)    |       |

After propensity score matching, a significant difference is observed only for the explanatory variable rabattschutz.

When we instead use a non-parametric test (the Kruskal-Wallis rank sum test, equivalent to the Wilcoxon/Mann-Whitney U test for two groups) to compare median values, rather than a parametric test (one-way ANOVA) to compare means, a significant difference is also identified for schadenfreie_jahre, the number of claim-free years per contract (results not shown).

It is logical to consider that the number of claim-free years for each contract plays a critical role in influencing claim probability: we anticipate that contracts with a long history of claim-free years will have a lower likelihood of future claims. In contrast, the presence of discount protection (rabattschutz) is less straightforward to interpret, though it also provides meaningful insights. Stakeholders may provide additional explanations.

Interestingly, the annual mileage (fahrleistung_im_jahr) shows no significant difference between the two groups, both before and after Propensity Score Matching. This may be attributable to the categorization of mileage values into three broad classes.


III. Modelling



Seventy percent of the dataset is randomly selected for model training, while the remaining thirty percent will be used for testing.
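A minimal sketch of this split, assuming the mlr3 framework (object names and the seed are illustrative):

library(mlr3)
set.seed(123)                                     # illustrative seed for reproducibility
data$schadenmeldung <- factor(data$schadenmeldung)
task <- as_task_classif(data, target = "schadenmeldung", positive = "1")
split <- partition(task, ratio = 0.7)             # 70% training, 30% testing
task_train <- task$clone()$filter(split$train)
task_test  <- task$clone()$filter(split$test)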

Before proceeding with model building, let us conduct an additional summary of the bivariate statistical analysis for the training dataset. We anticipate similar behavior to that observed in the complete dataset described above.


[Figure: pair plot of the training data, colored by claim status]

The green color represents contracts with claims filed last year.
For instance, the correlations involving contract entry age appear stronger for contracts without a claim than for those with a claim.


[Figure: entry age by claim status and discount protection]


On average, the entry age is higher for contracts with claims filed last year compared to those without. This trend is observed among both customers with and without discount protection. However, the median and mean entry ages are higher for contracts without discount protection.


[Figure: claim-free years by claim status]

The distribution of prior claim-free years differs between contracts that filed a claim last year and those that did not.



1. Comparing the Performance of Common and Appropriate Supervised Learning Algorithms


We have selected five methods representing different classes of supervised learning algorithms (a minimal benchmark sketch follows the list):

  • Naive Bayes, which belongs to the class of Bayes rules,

  • Logistic Regression, which belongs to the class of odds and regression,

  • (Simple) Decision Tree, which belongs to the class of trees,

  • Random Forest, which belongs to the class of trees and bagging, and

  • Random Classification, a baseline generally used for comparison.
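A minimal sketch of such a comparison with the mlr3 ecosystem, assuming the training task task_train from the split above; the learner keys come from mlr3/mlr3learners, while the resampling setup is illustrative:

library(mlr3)
library(mlr3learners)  # provides classif.naive_bayes, classif.log_reg and classif.ranger
learners <- lrns(
  c("classif.naive_bayes", "classif.log_reg", "classif.rpart",
    "classif.ranger", "classif.featureless"),  # featureless serves as the baseline comparator
  predict_type = "prob"
)
design <- benchmark_grid(task_train, learners, rsmp("cv", folds = 5))  # folds illustrative
bmr <- benchmark(design)
bmr$aggregate(msr("classif.acc"))  # compare accuracy across the five learners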



[Figure: performance comparison of the five classifiers]

The Random Forest algorithm (classif.ranger) demonstrated the best performance and has been selected for further optimization.


2. Optimization of the Best Performing Supervised Learning Model


Variable Importance: Filter Selection

Here, we use a filter method to evaluate features based on heuristics derived from the general characteristics of the data. This approach is independent of the supervised learning algorithm (Random Forest).
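A minimal filter sketch, assuming the mlr3filters package (the concrete filter used in the project is not shown; information gain is one plausible choice):

library(mlr3filters)
filter <- flt("information_gain")   # learner-independent heuristic
filter$calculate(task_train)
as.data.table(filter)               # feature scores; higher means more informative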


[Figure: filter scores of the explanatory variables]


We observe that, regardless of the supervised learning method applied, the number of claim-free years (schadenfreie_jahre) per contract consistently emerges as a highly significant variable. The age at contract entry (eintrittsalter) also ranks prominently, and the score for the discount protection indicator (rabattschutz) remains substantial. These findings align well with the earlier statistical tests, which showed significant differences between the two claim groups (claims vs. no claims in the last year) for precisely these three variables.
The remaining variables appear to contribute less substantially; in particular, the variable garage has the smallest score.


Addressing Class Imbalance: Undersampling the Majority Class


After comparing three methods:

  • Undersampling the majority class,

  • Oversampling the minority class, and

  • SMOTE

we selected the optimal approach based on accuracy for the supervised learning algorithm detailed below. Although undersampling excludes some data and thus loses information, it provided both the highest accuracy and the fastest training time for the chosen model in our case.
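A minimal undersampling sketch with mlr3pipelines (the ratio is illustrative; the class counts shown below indicate that exact 1:1 balancing was not enforced in the project):

library(mlr3pipelines)
po_under <- po("classbalancing", id = "undersample",
               adjust = "major", reference = "major",
               shadow = FALSE, ratio = 1/6)       # downsample the majority class
task_under <- po_under$train(list(task_train))[[1]]
table(task_under$truth())                         # class distribution after undersampling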


Here is the class outcome frequency (filed claim or not) after undersampling.



Class Distribution after Undersampling

## 
##   0   1 
## 183 109

Variable Importance: Wrapper Selection


This method operates by fitting models on selected feature subsets and evaluating their performance. The chosen performance metric here is the Area Under the Curve (AUC), used to compare various feature sets to identify the best-performing combination. Feature selection can be conducted sequentially, such as by iteratively adding features in a forward selection approach, or in parallel. This method is applied specifically to the selected supervised learning model, Random Forest. Here is the approach applied in this specific example:



  • Feature Subset Proposal

    • The selection algorithm generates one or more feature subsets for evaluation, potentially processing multiple subsets in parallel.
  • Model Training and Evaluation

    • For each proposed subset, the specified learner is trained on the training set using a resampling method (holdout method in this case) and evaluated based on the Area Under the Curve (AUC) metric.
  • Result Archiving

    • All evaluation results are stored in an archive for reference and analysis.
  • Termination Check

    • The process continues iteratively until the termination criteria are met. If the criteria are not triggered, the algorithm returns to the beginning to propose new feature subsets.
  • Best Subset Selection

    • The feature subset with the best performance, as observed through the evaluations, is identified.
  • Final Output

    • The best-performing feature subset is stored as the final result.
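The feature set reported below can be obtained with a sketch like the following, assuming the mlr3fselect package (the evaluation budget is illustrative):

library(mlr3fselect)
instance <- fselect(
  fselector  = fs("sequential", strategy = "sfs"),  # sequential forward selection
  task       = task_under,
  learner    = lrn("classif.ranger", predict_type = "prob"),
  resampling = rsmp("holdout"),                     # holdout resampling, as described
  measure    = msr("classif.auc"),
  term_evals = 20                                   # illustrative budget
)
instance$result_feature_set   # best-performing feature subset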

## [1] "eintrittsalter"       "fahrleistung_im_jahr" "rabattschutz"        
## [4] "region"               "schadenfreie_jahre"

The variables displayed above are the key features selected and included in the final Random Forest training model. Only the variable garage, which also received the lowest score in the filtering algorithm, was not identified as important and has therefore been excluded from the Random Forest model.


Performance Achieved (AUC) with the Final Selected Set of Variables Using the Random Forest Algorithm

## classif.auc 
##   0.9561648

The achieved AUC of approximately 0.956 is excellent.



Performance of the Model on the Test Dataset



The confusion matrix is given by:


##         truth
## response   0   1
##        0 360   9
##        1  33  38

The classification accuracy is then given by:

## classif.acc 
##   0.9045455
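As a quick check, this accuracy follows directly from the confusion matrix above: (360 + 38) / (360 + 9 + 33 + 38) = 398 / 440 ≈ 0.9045.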

The performance achieved on the test dataset is very satisfactory.

Let us consider an additional performance measure of the model: the ROC curve


[Figure: ROC curve on the test dataset]


The ROC curve is very close to the top-left corner, indicating strong classification performance and an Area Under The Curve close to 1, which again, is very satisfactory.


3. Prediction: Probability of Claim to be Included in the Clustering Algorithm


Integrating the Predicted Claim Probability with Initial Data for Enhanced Clustering Analysis


After validating the model’s performance on the test set and confirming its high accuracy, we now apply it to calculate the probability of filing a claim across the complete dataset.

We incorporate the claim probability into the dataset to include it as a feature in the clustering algorithm.
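A possible sketch of this step (the trained learner and data objects are named illustratively):

# predict claim probabilities for the complete (cleaned) dataset
pred <- learner_rf$predict_newdata(data)
data$prob <- pred$prob[, "1"]   # predicted probability of a claim per contract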

Here are six rows of the data set:


##      schadenfreie_jahre rabattschutz fahrleistung_im_jahr eintrittsalter garage region
## 241                  19            N                 5000             27      N   West
## 1965                 20            J                 5000             43      J    Ost
## 346                  10            J                 5000             26      N   Sued
## 1030                 38            N                 5000             40      J    Ost
## 265                   0            N                 5000             32      J    Ost
## 1539                  1            J                 5000             24      N   Sued
##      schadenmeldung row_ids truth response    prob.0       prob
## 241               0       1     0        0 0.9820064 0.01799357
## 1965              0       2     0        0 0.7684706 0.23152937
## 346               0       3     0        0 0.9707732 0.02922682
## 1030              0       4     0        0 0.6944640 0.30553599
## 265               0       5     0        0 0.6667064 0.33329356
## 1539              1       6     1        1 0.1039377 0.89606226

At this stage, after predicting the claim probability for each contract, this probability can serve as a criterion to identify contracts that are more likely to result in claims.
In practice, and sometimes for simplicity, the predicted probabilities can be used to classify contracts into two or more groups. However, this approach requires selecting appropriate cutoff points, which can itself be challenging. To address this, we propose exploring the presence of natural groups or segments (clusters) among the contracts by incorporating the claim probability into a cluster analysis.


4. Clustering for Segmentation


Here are the features in the dataset, including the predicted claim probability for each contract.

## [1] "schadenfreie_jahre"   "rabattschutz"         "fahrleistung_im_jahr"
## [4] "eintrittsalter"       "garage"               "region"              
## [7] "prob"

We standardize the data to improve clustering performance. Below are the first six rows of the standardized dataset.


##    schadenfreie_jahre eintrittsalter        prob rabattschutzN   garageN regionNORD regionOst
##                 <num>          <num>       <num>         <num>     <num>      <num>     <num>
## 1:          0.4898250     -0.6574591 -0.66688618     0.4301151  0.884828 -0.1205112 -0.551419
## 2:          0.5757613      0.9841590  0.08285544    -2.3233730 -1.129392 -0.1205112  1.812266
## 3:         -0.2836014     -0.7600602 -0.62744533    -2.3233730  0.884828 -0.1205112 -0.551419
## 4:          2.1226142      0.6763556  0.34269876     0.4301151 -1.129392 -0.1205112  1.812266
## 5:         -1.1429641     -0.1444534  0.44015785     0.4301151 -1.129392 -0.1205112  1.812266
## 6:         -1.0570279     -0.9652624  2.41608478    -2.3233730  0.884828 -0.1205112 -0.551419
##    regionSued regionWest fahrleistung_im_jahr5000 fahrleistung_im_jahr30000
##         <num>      <num>                    <num>                     <num>
## 1:  -0.567179  1.7330362                0.4132916                -0.3787187
## 2:  -0.567179 -0.5766284                0.4132916                -0.3787187
## 3:   1.761909 -0.5766284                0.4132916                -0.3787187
## 4:  -0.567179 -0.5766284                0.4132916                -0.3787187
## 5:  -0.567179 -0.5766284                0.4132916                -0.3787187
## 6:   1.761909 -0.5766284                0.4132916                -0.3787187
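The standardized design matrix above can be produced with a sketch like this; the dummy coding then matches the columns shown (the first level of each factor is dropped):

library(data.table)
# dummy-encode the categorical variables and drop the intercept column
X <- model.matrix(~ schadenfreie_jahre + eintrittsalter + prob + rabattschutz +
                    garage + region + fahrleistung_im_jahr, data = data)[, -1]
datanew_scaled <- as.data.table(scale(X))  # zero mean, unit variance per column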


Candidate Cluster Methods

We apply and compare four clustering algorithms:

  • k-Means Clustering

  • Fuzzy C-Means Clustering

  • Hierarchical Clustering

  • Featureless or random clustering


Comparison Criteria

We consider two criteria to compare the performance of the four clustering algorithms:

  • Within Sum of Squares (clust.wss):

    WSS is the sum of squared differences between observations and their cluster centroids; it quantifies cluster cohesion (smaller values indicate more compact clusters).

  • Silhouette coefficient (clust.silhouette):

    The silhouette coefficient quantifies how well each point belongs to its assigned cluster versus neighboring clusters; scores close to 1 indicate well-clustered points, and scores close to -1 indicate poorly clustered ones.

Below are the comparison results:


## INFO  [21:50:13.635] [mlr3] Running benchmark with 4 resampling iterations
## INFO  [21:50:13.647] [mlr3] Applying learner 'clust.featureless' on task 'datanew_scaled' (iter 1/1)
## INFO  [21:50:13.668] [mlr3] Applying learner 'clust.hclust' on task 'datanew_scaled' (iter 1/1)
## INFO  [21:50:13.923] [mlr3] Applying learner 'clust.kmeans' on task 'datanew_scaled' (iter 1/1)
## INFO  [21:50:13.988] [mlr3] Applying learner 'clust.cmeans' on task 'datanew_scaled' (iter 1/1)
## INFO  [21:50:14.186] [mlr3] Finished benchmark
##           learner_id clust.wss clust.silhouette
##               <char>     <num>            <num>
## 1: clust.featureless  16115.00       0.00000000
## 2:      clust.hclust  11974.45       0.31645447
## 3:      clust.kmeans  11830.11       0.17654499
## 4:      clust.cmeans  14287.15       0.07575615
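These results can be reproduced with a sketch along the following lines, assuming the mlr3cluster package and the standardized data in datanew_scaled (the number of clusters per learner is left at its default here):

library(mlr3cluster)
task_cl <- as_task_clust(datanew_scaled)
learners <- lrns(c("clust.featureless", "clust.hclust",
                   "clust.kmeans", "clust.cmeans"))
design <- benchmark_grid(task_cl, learners, rsmp("insample"))
bmr <- benchmark(design)
bmr$aggregate(msrs(c("clust.wss", "clust.silhouette")))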

The hierarchical clustering method emerges as the most effective of the candidates, as it achieves the silhouette coefficient closest to one. In addition, together with k-means, it yields the smallest Within Sum of Squares, indicating good clustering performance.

Next, we proceed with the Hierarchical Clustering method to conduct our cluster analysis, focusing on optimizing cluster performance.


Here are some observations of the clustering results stored in the variable partition:


## <PredictionClust> for 1466 observations:
##  row_ids partition
##        1         1
##        2         1
##        3         2
##      ---       ---
##     1464         1
##     1465         1
##     1466         1

We merge the predictions with the initial dataset:


##      schadenfreie_jahre rabattschutz fahrleistung_im_jahr eintrittsalter garage region
## 241                  19            N                 5000             27      N   West
## 1965                 20            J                 5000             43      J    Ost
## 346                  10            J                 5000             26      N   Sued
## 1030                 38            N                 5000             40      J    Ost
## 265                   0            N                 5000             32      J    Ost
## 1539                  1            J                 5000             24      N   Sued
##      schadenmeldung       prob row_ids partition
## 241               0 0.01799357       1         1
## 1965              0 0.23152937       2         1
## 346               0 0.02922682       3         2
## 1030              0 0.30553599       4         1
## 265               0 0.33329356       5         1
## 1539              1 0.89606226       6         2

Let us perform a bivariate descriptive analysis of the clusters (as defined in the variable partition):


[Figure: bivariate analysis of the clusters]



[Figure: cluster visualization]


With four clusters, one cluster appears relatively diffuse and overlaps all the others. This may indicate that the data were split into more clusters than they naturally support.


Silhouette plot from predictions


[Figure: silhouette plot of the clustering]


The plot includes a dotted line that visualizes the average silhouette coefficient across all data points; each data point's silhouette value is represented by a bar colored by its assigned cluster. In our case, the average silhouette index is about 0.15. If the average silhouette value for a given cluster falls below the overall average line, that cluster is not well defined.

In our application, most observations fall below the average line for the majority of the segments, indicating that the quality of the cluster assignments is suboptimal for these segments. This suggests that many observations may have been assigned to incorrect clusters. However, no segment lies entirely below the average line.


Analysis of Feature Distribution Across Clusters

Table: Feature Distribution Across Different Cluster Segments

| Variable                          | Level | Segment 1            | Segment 2            | Segment 3            | Segment 4            | p      |
|-----------------------------------|-------|----------------------|----------------------|----------------------|----------------------|--------|
| n                                 |       | 833                  | 430                  | 21                   | 182                  |        |
| schadenfreie_jahre (median [IQR]) |       | 15.00 [6.00, 25.00]  | 2.00 [0.00, 11.00]   | 14.00 [7.00, 21.00]  | 17.00 [6.25, 25.00]  | <0.001 |
| rabattschutz (%)                  | J     | 52 (6.2)             | 142 (33.0)           | 5 (23.8)             | 30 (16.5)            | <0.001 |
|                                   | N     | 781 (93.8)           | 288 (67.0)           | 16 (76.2)            | 152 (83.5)           |        |
| fahrleistung_im_jahr (%)          | 30    | 21 (2.5)             | 4 (0.9)              | 2 (9.5)              | 3 (1.6)              | <0.001 |
|                                   | 5000  | 812 (97.5)           | 426 (99.1)           | 14 (66.7)            | 0 (0.0)              |        |
|                                   | 30000 | 0 (0.0)              | 0 (0.0)              | 5 (23.8)             | 179 (98.4)           |        |
| eintrittsalter (median [IQR])     |       | 35.00 [26.00, 42.00] | 30.00 [24.00, 37.00] | 31.00 [26.00, 36.00] | 34.00 [26.00, 39.00] | <0.001 |
| garage (%)                        | J     | 394 (47.3)           | 166 (38.6)           | 7 (33.3)             | 77 (42.3)            | 0.019  |
|                                   | N     | 439 (52.7)           | 264 (61.4)           | 14 (66.7)            | 105 (57.7)           |        |
| region (%)                        | Nord  | 264 (31.7)           | 63 (14.7)            | 0 (0.0)              | 53 (29.1)            | <0.001 |
|                                   | NORD  | 0 (0.0)              | 0 (0.0)              | 21 (100.0)           | 0 (0.0)              |        |
|                                   | Ost   | 228 (27.4)           | 76 (17.7)            | 0 (0.0)              | 38 (20.9)            |        |
|                                   | Sued  | 93 (11.2)            | 220 (51.2)           | 0 (0.0)              | 44 (24.2)            |        |
|                                   | West  | 248 (29.8)           | 71 (16.5)            | 0 (0.0)              | 47 (25.8)            |        |
| schadenmeldung (%)                | 0     | 813 (97.6)           | 321 (74.7)           | 20 (95.2)            | 156 (85.7)           | <0.001 |
|                                   | 1     | 20 (2.4)             | 109 (25.3)           | 1 (4.8)              | 26 (14.3)            |        |
| prob (median [IQR])               |       | 0.03 [0.01, 0.15]    | 0.32 [0.02, 0.73]    | 0.03 [0.01, 0.26]    | 0.06 [0.04, 0.29]    | <0.001 |

We observe statistically significant differences across the four cluster segments for all features.

Key Observations on Cluster Segments

Segment 2: High-Risk Contracts

This is the most critical segment, representing contracts with a high probability of filing a claim.

  • The median (with IQR) claim probability in Segment 2 is 0.32 [0.02, 0.73], compared to 0.03 [0.01, 0.15], 0.03 [0.01, 0.26], and 0.06 [0.04, 0.29] in Segments 1, 3, and 4, respectively.

  • Discount protection is more frequent in Segment 2 (33.0%) than in Segment 1 (6.2%), Segment 3 (23.8%), and Segment 4 (16.5%).

  • The median number of prior claim-free years in Segment 2 is significantly lower, 2.00 [0.00, 11.00], compared to Segment 1 (15.00 [6.00, 25.00]), Segment 3 (14.00 [7.00, 21.00]), and Segment 4 (17.00 [6.25, 25.00]).

  • The annual mileage for contracts in Segment 2 is nearly always 5000, with 99.1% of contracts reporting this value.

  • Contracts in Segment 2 predominantly originate from the Sued (South) region.

  • Most of the claims reported last year (109 of 156) fall into Segment 2.

It is important to note that the clustering algorithm did not use the outcome variable schadenmeldung (claims filed last year) directly. Instead, it leveraged the predicted claim probability generated by the supervised learning model, which achieved high accuracy. The fact that Segment 2 has the highest percentage of claims filed last year indicates a well-functioning methodology and effective application.


IV. Summary and Conclusion

We successfully identified four distinct contract clusters, although the clusters were not entirely homogeneous. The initial goal of creating four clusters was achieved, enabling a focus on the higher-risk cluster first, with the possibility of addressing other clusters in subsequent stages.

An alternative approach could have been to use a supervised learning algorithm to classify contracts into two or more groups based on predicted claim probabilities. However, this method involves selecting appropriate cutoff points, which can be just as challenging as implementing clustering techniques.

Upon examining the clusters, we observed that some segments share notable similarities and could potentially be merged for specific applications, simplifying further analysis and decision-making.

Segment 2 emerged as the group with the highest likelihood of filing claims. This segment should be prioritized for targeted marketing efforts and risk mitigation strategies, as it represents the most critical group requiring intervention.

The next step involves deploying the model into production, a process that includes several critical stages such as model retraining, monitoring, and integration into operational workflows. Each of these steps will be addressed in detail within separate use cases to ensure a comprehensive understanding and effective implementation.