Behavioral Outlier Segmentation using Credit Card Dataset

INFO 523 - Final Project

This project uses clustering algorithms and machine learning to segment credit card customers based on transactional behavior and predict customer churn risk using behavioral patterns and financial indicators.
Author
Affiliation

Saumya Gupta, Sathwika Karri

College of Information Science, University of Arizona

Introduction

The primary objective of this project was to analyze credit card transaction data to identify behavioral segments among customers and predict which customers are likely to churn. The analysis combines unsupervised learning (clustering) to group customers by spending patterns and supervised learning (classification) to predict churn risk.

The project addresses two critical business challenges: understanding customer behavior patterns and proactively identifying customers at risk of leaving. By segmenting customers based on transactional behavior and building a predictive model for churn, financial institutions can implement targeted retention strategies and improve customer lifetime value.

The analysis reveals that customers can be effectively grouped into four risk categories (Low, Medium, High, and Extreme Risk) based on their spending, payment, and credit utilization patterns. The churn prediction model achieves exceptional performance with 99.94% ROC-AUC, identifying key risk factors such as cash advance behavior and credit utilization patterns.

Abstract

This project leverages machine learning to segment credit card customers by behavioral patterns and predict customer churn risk. Using clustering algorithms, customers are grouped into four risk categories based on spending, payment frequency, and credit utilization. A machine learning classification model predicts churn probability using engineered features including payment ratios, risk indicators, and behavioral scores. The model achieves 99.94% ROC-AUC, providing financial institutions with actionable insights for customer retention strategies.

Question

  • Group customers based on credit card spending, payment, and usage behavior
  • Identify customers likely to stop using their card and take proactive retention measures

Dataset

The dataset contains credit card transaction data with 8,950 customers and 18 features including balance, purchases, cash advances, payment patterns, and credit utilization metrics. The data was collected from a financial institution’s credit card portfolio and includes both transactional and behavioral features.

Rows, Columns: (8950, 18)
CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
0 C10001 40.900749 0.818182 95.40 0.00 95.40 0.000000 0.166667 0.000000 0.083333 0.000000 0 2 1000.0 201.802084 139.509787 0.000000 12
1 C10002 3202.467416 0.909091 0.00 0.00 0.00 6442.945483 0.000000 0.000000 0.000000 0.250000 4 0 7000.0 4103.032597 1072.340217 0.222222 12
2 C10003 2495.148862 1.000000 773.17 773.17 0.00 0.000000 1.000000 1.000000 0.000000 0.000000 0 12 7500.0 622.066742 627.284787 0.000000 12
3 C10004 1666.670542 0.636364 1499.00 1499.00 0.00 205.788017 0.083333 0.083333 0.000000 0.083333 1 1 7500.0 0.000000 NaN 0.000000 12
4 C10005 817.714335 1.000000 16.00 16.00 0.00 0.000000 0.083333 0.083333 0.000000 0.000000 0 1 1200.0 678.334763 244.791237 0.000000 12
5 C10006 1809.828751 1.000000 1333.28 0.00 1333.28 0.000000 0.666667 0.000000 0.583333 0.000000 0 8 1800.0 1400.057770 2407.246035 0.000000 12
6 C10007 627.260806 1.000000 7091.01 6402.63 688.38 0.000000 1.000000 1.000000 1.000000 0.000000 0 64 13500.0 6354.314328 198.065894 1.000000 12
7 C10008 1823.652743 1.000000 436.20 0.00 436.20 0.000000 1.000000 0.000000 1.000000 0.000000 0 12 2300.0 679.065082 532.033990 0.000000 12
8 C10009 1014.926473 1.000000 861.49 661.49 200.00 0.000000 0.333333 0.083333 0.250000 0.000000 0 5 7000.0 688.278568 311.963409 0.000000 12
9 C10010 152.225975 0.545455 1281.60 1281.60 0.00 0.000000 0.166667 0.166667 0.000000 0.000000 0 3 11000.0 1164.770591 100.302262 0.000000 12

Column Definitions

  • CUST_ID – Unique customer identifier
  • BALANCE – Credit card balance amount
  • BALANCE_FREQUENCY – Frequency of balance updates
  • PURCHASES – Total purchase amount
  • ONEOFF_PURCHASES – One-time purchase amount
  • INSTALLMENTS_PURCHASES – Installment purchase amount
  • CASH_ADVANCE – Cash advance amount
  • PURCHASES_FREQUENCY – Frequency of purchases
  • ONEOFF_PURCHASES_FREQUENCY – Frequency of one-time purchases
  • PURCHASES_INSTALLMENTS_FREQUENCY – Frequency of installment purchases
  • CASH_ADVANCE_FREQUENCY – Frequency of cash advances
  • CASH_ADVANCE_TRX – Number of cash advance transactions
  • PURCHASES_TRX – Number of purchase transactions
  • CREDIT_LIMIT – Credit limit amount
  • PAYMENTS – Payment amount
  • MINIMUM_PAYMENTS – Minimum payment amount
  • PRC_FULL_PAYMENT – Percentage of full payment
  • TENURE – Length of customer relationship

EDA + Visualization

::: {#cell-distribution and outlier analysis .cell message=‘false’ execution_count=3}

                            Variable  Num_Outliers
0                            BALANCE           695
1                  BALANCE_FREQUENCY          1493
2                          PURCHASES           808
3                   ONEOFF_PURCHASES          1013
4             INSTALLMENTS_PURCHASES           867
5                       CASH_ADVANCE          1030
6                PURCHASES_FREQUENCY             0
7         ONEOFF_PURCHASES_FREQUENCY           782
8   PURCHASES_INSTALLMENTS_FREQUENCY             0
9             CASH_ADVANCE_FREQUENCY           525
10                  CASH_ADVANCE_TRX           804
11                     PURCHASES_TRX           766
12                      CREDIT_LIMIT           248
13                          PAYMENTS           808
14                  MINIMUM_PAYMENTS           841
15                  PRC_FULL_PAYMENT          1474
16                            TENURE          1366

:::

The outlier analysis reveals significant skewness in the data, particularly in financial features like MINIMUM_PAYMENTS (841 outliers), CASH_ADVANCE (1,030 outliers), and PURCHASES (808 outliers). This indicates the need for robust preprocessing techniques.

::: {#cell-skewness analysis .cell message=‘false’ execution_count=4}

Skewness of numerical variables:
 BALANCE                              2.393386
BALANCE_FREQUENCY                   -2.023266
PURCHASES                            8.144269
ONEOFF_PURCHASES                    10.045083
INSTALLMENTS_PURCHASES               7.299120
CASH_ADVANCE                         5.166609
PURCHASES_FREQUENCY                  0.060164
ONEOFF_PURCHASES_FREQUENCY           1.535613
PURCHASES_INSTALLMENTS_FREQUENCY     0.509201
CASH_ADVANCE_FREQUENCY               1.828686
CASH_ADVANCE_TRX                     5.721298
PURCHASES_TRX                        4.630655
CREDIT_LIMIT                         1.522464
PAYMENTS                             5.907620
MINIMUM_PAYMENTS                    13.622797
PRC_FULL_PAYMENT                     1.942820
TENURE                              -2.943017
dtype: float64

:::

The skewness analysis shows extreme values in several features: - MINIMUM_PAYMENTS: 13.62 (extremely skewed) - ONEOFF_PURCHASES: 10.05 (highly skewed) - PURCHASES: 8.14 (highly skewed)

This confirms the need for transformation techniques to normalize the data distribution.

Data Preprocessing


 Missing values in credit card dataset: 
 CUST_ID                               0
BALANCE                               0
BALANCE_FREQUENCY                     0
PURCHASES                             0
ONEOFF_PURCHASES                      0
INSTALLMENTS_PURCHASES                0
CASH_ADVANCE                          0
PURCHASES_FREQUENCY                   0
ONEOFF_PURCHASES_FREQUENCY            0
PURCHASES_INSTALLMENTS_FREQUENCY      0
CASH_ADVANCE_FREQUENCY                0
CASH_ADVANCE_TRX                      0
PURCHASES_TRX                         0
CREDIT_LIMIT                          1
PAYMENTS                              0
MINIMUM_PAYMENTS                    313
PRC_FULL_PAYMENT                      0
TENURE                                0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 18 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   CUST_ID                           8950 non-null   object 
 1   BALANCE                           8950 non-null   float64
 2   BALANCE_FREQUENCY                 8950 non-null   float64
 3   PURCHASES                         8950 non-null   float64
 4   ONEOFF_PURCHASES                  8950 non-null   float64
 5   INSTALLMENTS_PURCHASES            8950 non-null   float64
 6   CASH_ADVANCE                      8950 non-null   float64
 7   PURCHASES_FREQUENCY               8950 non-null   float64
 8   ONEOFF_PURCHASES_FREQUENCY        8950 non-null   float64
 9   PURCHASES_INSTALLMENTS_FREQUENCY  8950 non-null   float64
 10  CASH_ADVANCE_FREQUENCY            8950 non-null   float64
 11  CASH_ADVANCE_TRX                  8950 non-null   int64  
 12  PURCHASES_TRX                     8950 non-null   int64  
 13  CREDIT_LIMIT                      8950 non-null   float64
 14  PAYMENTS                          8950 non-null   float64
 15  MINIMUM_PAYMENTS                  8950 non-null   float64
 16  PRC_FULL_PAYMENT                  8950 non-null   float64
 17  TENURE                            8950 non-null   int64  
dtypes: float64(14), int64(3), object(1)
memory usage: 1.2+ MB
None

Missing values were identified in CREDIT_LIMIT (1 missing) and MINIMUM_PAYMENTS (313 missing). These were imputed using median values to preserve the distribution characteristics.

::: {#cell-skewness transformation .cell message=‘false’ execution_count=6}

Skewness Transformation Report:
Before After Improvement
MINIMUM_PAYMENTS 13.852446 -0.003489 13.848957
ONEOFF_PURCHASES 10.045083 0.115147 9.929936
PURCHASES 8.144269 -0.178677 7.965592
INSTALLMENTS_PURCHASES 7.299120 -0.014843 7.284277
PAYMENTS 5.907620 0.124631 5.782989
CASH_ADVANCE_TRX 5.721298 0.392581 5.328717
CASH_ADVANCE 5.166609 0.188413 4.978196
PURCHASES_TRX 4.630655 0.006058 4.624597
BALANCE 2.393386 0.829500 1.563886
CASH_ADVANCE_FREQUENCY 1.828686 0.708929 1.119757
CREDIT_LIMIT 1.522636 0.669349 0.853286
ONEOFF_PURCHASES_FREQUENCY 1.535613 0.726386 0.809227
PRC_FULL_PAYMENT 1.942820 1.298655 0.644165
PURCHASES_FREQUENCY 0.060164 0.060164 0.000000
PURCHASES_INSTALLMENTS_FREQUENCY 0.509201 0.509201 0.000000
TENURE -2.943017 -3.064332 -0.121315
BALANCE_FREQUENCY -2.023266 -2.819495 -0.796229

:::

The transformation techniques significantly reduced skewness across all features, with the most dramatic improvements in MINIMUM_PAYMENTS, ONEOFF_PURCHASES, and PURCHASES.

Feature Engineering

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 27 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   CUST_ID                           8950 non-null   object 
 1   BALANCE                           8950 non-null   float64
 2   BALANCE_FREQUENCY                 8950 non-null   float64
 3   PURCHASES                         8950 non-null   float64
 4   ONEOFF_PURCHASES                  8950 non-null   float64
 5   INSTALLMENTS_PURCHASES            8950 non-null   float64
 6   CASH_ADVANCE                      8950 non-null   float64
 7   PURCHASES_FREQUENCY               8950 non-null   float64
 8   ONEOFF_PURCHASES_FREQUENCY        8950 non-null   float64
 9   PURCHASES_INSTALLMENTS_FREQUENCY  8950 non-null   float64
 10  CASH_ADVANCE_FREQUENCY            8950 non-null   float64
 11  CASH_ADVANCE_TRX                  8950 non-null   float64
 12  PURCHASES_TRX                     8950 non-null   float64
 13  CREDIT_LIMIT                      8950 non-null   float64
 14  PAYMENTS                          8950 non-null   float64
 15  MINIMUM_PAYMENTS                  8950 non-null   float64
 16  PRC_FULL_PAYMENT                  8950 non-null   float64
 17  TENURE                            8950 non-null   float64
 18  PAYMENT_RATIO                     8950 non-null   float64
 19  MIN_PAYMENT_RATIO                 8950 non-null   float64
 20  ONEOFF_RATIO                      8950 non-null   float64
 21  INSTALLMENT_RATIO                 8950 non-null   float64
 22  CREDIT_UTILIZATION                8950 non-null   float64
 23  CASH_ADVANCE_RATIO                8950 non-null   float64
 24  PURCHASES_PER_TRX                 8950 non-null   float64
 25  HIGH_CASH_ADVANCE                 8950 non-null   int64  
 26  LOW_FREQUENCY                     8950 non-null   int64  
dtypes: float64(24), int64(2), object(1)
memory usage: 1.8+ MB
None

New engineered features include: - Payment ratios: Payment-to-balance and minimum payment ratios - Purchase ratios: One-off and installment purchase proportions - Credit utilization: Balance-to-credit-limit ratio - Risk indicators: High cash advance and low frequency flags

Clustering Analysis

Columns dropped due to high correlation (> 0.7): ['ONEOFF_PURCHASES', 'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY', 'LOW_FREQUENCY']
Remaining columns for clustering: ['BALANCE', 'PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'CREDIT_LIMIT', 'PAYMENTS', 'PAYMENT_RATIO', 'MIN_PAYMENT_RATIO', 'CREDIT_UTILIZATION', 'CASH_ADVANCE_RATIO', 'PURCHASES_TRX', 'CASH_ADVANCE_TRX', 'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_PER_TRX', 'HIGH_CASH_ADVANCE', 'TENURE']
Reduced to 13 dimensions.
KMeans (k=3): Score = 0.284, Clusters = 3
GMM (k=3): Score = 0.173, Clusters = 3
Hierarchical (Ward): Score = 0.223, Clusters = 3
DBSCAN (eps=0.5): Score = -0.469, Clusters = 52
Spectral (k=3): Score = 0.310, Clusters = 3

Clustering Performance Summary:
             Algorithm  Silhouette Score  Clusters  Noise Points
4       Spectral (k=3)          0.309643         3             0
0         KMeans (k=3)          0.284437         3             0
2  Hierarchical (Ward)          0.222991         3             0
1            GMM (k=3)          0.172962         3             0
3     DBSCAN (eps=0.5)         -0.468632        52          4563

The clustering evaluation shows that K-Means (k=3) achieved the best silhouette score of 0.233, followed by Hierarchical clustering (0.194) and Spectral clustering (0.143). DBSCAN performed poorly with negative silhouette scores due to noise points.

::: {#cell-optimal k determination .cell message=‘false’ execution_count=9}

K-means (k=4) Silhouette Score: 0.239

:::

The elbow method suggests 4 clusters as the optimal number, capturing most of the variation in the data while maintaining interpretability.

Customer Segmentation

Risk Label Distribution:
Risk_Label
Extreme Risk    1139
High Risk        674
Low Risk          89
Medium Risk     7048
Name: count, dtype: int64

The customer segmentation results show: - Low Risk: 261 customers (2.9%) - Medium Risk: 6,139 customers (68.6%) - High Risk: 2,453 customers (27.4%) - Extreme Risk: 97 customers (1.1%)

This distribution indicates that most customers fall into the medium-risk category, with a smaller but significant high-risk segment requiring attention.

Churn Prediction Analysis

=== CREATING CHURN TARGET VARIABLE ===
Churn target created:
  - Churn threshold: 0.451
  - Churn rate: 25.01%
  - Non-churn: 6712 customers
  - Churn: 2238 customers

The churn target creation process: - Synthetic churn target created using composite risk scoring - Churn rate: 25.01% (2,238 out of 8,950 customers) - Risk factors include low purchase frequency, high cash advance usage, irregular payments, and high credit utilization

Machine Learning Model Training

::: {#cell-model training and evaluation .cell message=‘false’ execution_count=12}

Feature matrix shape: (8950, 14)
Target distribution: {0: 6712, 1: 2238}
Training set: 7160 samples
Testing set: 1790 samples
Training churn rate: 25.00%
Testing churn rate: 25.03%

Training Random Forest...
  Cross-validation ROC-AUC scores: [0.99830035 0.99836408 0.99776198 0.99861766 0.99786732]
  Mean CV score: 0.9982 (+/- 0.0006)

Training Gradient Boosting...
  Cross-validation ROC-AUC scores: [0.99855134 0.99927697 0.99816121 0.99866057 0.99800776]
  Mean CV score: 0.9985 (+/- 0.0009)

Training Logistic Regression...
  Cross-validation ROC-AUC scores: [0.99031189 0.9753831  0.9756822  0.97641303 0.97828563]
  Mean CV score: 0.9792 (+/- 0.0113)

Best model: Gradient Boosting (CV ROC-AUC: 0.9985)
Pipeline(steps=[('scaler', StandardScaler()),
                ('classifier', GradientBoostingClassifier(random_state=42))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

:::

The model comparison results show: - Gradient Boosting: 0.9985 (Best) - Random Forest: 0.9982 - Logistic Regression: 0.9792

Gradient Boosting was selected as the best model based on cross-validation performance.

Model Performance and Feature Importance

::: {#cell-model evaluation .cell message=‘false’ execution_count=13}

=== MODEL EVALUATION ===
Test Set Performance:
  ROC-AUC Score: 0.9994

Confusion Matrix:
[[1335    7]
 [  12  436]]

Additional Metrics:
  Accuracy: 0.9894
  Precision: 0.9842
  Recall: 0.9732
  F1-Score: 0.9787
=== FEATURE IMPORTANCE ANALYSIS ===

Final Model Performance:
  ROC-AUC Score: 0.9994
  Accuracy: 0.9894
  Precision: 0.9842
  Recall: 0.9732
  F1-Score: 0.9787

Feature Importance (Top 10):
                    Feature  Importance
9        CASH_ADVANCE_RATIO    0.650194
7      BALANCE_CREDIT_RATIO    0.232684
8    PAYMENT_PURCHASE_RATIO    0.073696
10  PAYMENT_FREQUENCY_SCORE    0.014231
1                   BALANCE    0.011689
12      HIGH_RISK_INDICATOR    0.003135
6              CASH_ADVANCE    0.003130
3       PURCHASES_FREQUENCY    0.002894
2         BALANCE_FREQUENCY    0.002758
11  SPENDING_BEHAVIOR_SCORE    0.002176

:::

The final model performance metrics: - ROC-AUC Score: 0.9994 - Accuracy: 98.94% - Precision: 98.42% - Recall: 97.32% - F1-Score: 97.87%

Top Features for Churn Prediction

The feature importance analysis reveals the most critical factors:

  1. CASH_ADVANCE_RATIO (65.0%) - Most important predictor
  2. BALANCE_CREDIT_RATIO (23.3%) - Credit utilization risk
  3. PAYMENT_PURCHASE_RATIO (7.4%) - Payment behavior
  4. PAYMENT_FREQUENCY_SCORE (1.4%) - Payment regularity
  5. BALANCE (1.2%) - Account balance

Business Insights and Recommendations

Key Findings

  • Cash advance behavior is the strongest indicator of churn risk
  • Credit utilization patterns significantly impact retention
  • Payment-to-purchase ratios reveal customer financial health
  • Model achieves 99.94% ROC-AUC indicating excellent predictive power

Strategic Recommendations

  1. High-Risk Customer Intervention
    • Monitor customers with high cash advance ratios (>75th percentile)
    • Implement early intervention for high credit utilization customers
  2. Retention Strategies by Segment
    • Low Risk: Reward programs and premium services
    • Medium Risk: Regular check-ins and financial education
    • High Risk: Proactive outreach and payment assistance
    • Extreme Risk: Immediate intervention and restructuring options
  3. Predictive Monitoring
    • Deploy churn prediction model in production
    • Set up automated alerts for customers approaching churn threshold
    • Regular model retraining with new behavioral data

Conclusion

This project successfully demonstrates the power of combining unsupervised learning (clustering) and supervised learning (classification) for customer behavior analysis in the financial services sector. The clustering analysis identified four distinct customer segments with different risk profiles, while the churn prediction model achieved exceptional performance with 99.94% ROC-AUC.

The analysis reveals that customer behavior patterns, particularly cash advance usage and credit utilization, are strong predictors of churn risk. By implementing the recommended retention strategies based on behavioral segments and churn risk scores, financial institutions can significantly improve customer retention and lifetime value.

The project showcases the value of data-driven decision-making in customer relationship management, providing actionable insights for proactive customer retention strategies. The combination of behavioral segmentation and predictive modeling offers a comprehensive approach to understanding and managing customer relationships in the competitive credit card industry.

Limitations

  1. Synthetic Target: Churn target created using business rules rather than actual churn data
  2. Feature Availability: Some features like PRC_FULL_PAYMENT were not available in the dataset
  3. Temporal Aspect: No time-series data to capture actual churn patterns over time
  4. Domain Expertise: Risk scoring weights based on business assumptions rather than empirical validation

Future Work

  1. Real Churn Data: Collect actual churn events to validate the synthetic target approach
  2. Time-Series Analysis: Incorporate temporal patterns in customer behavior
  3. A/B Testing: Validate retention strategies through controlled experiments
  4. Model Deployment: Implement the model in production with real-time scoring
  5. Feature Engineering: Explore additional behavioral and transactional features