Behavioral Outlier Segmentation using credit card dataset

Proposal

Project description: The Behavioral Outlier Segmentation project analyzes credit card usage data from Kaggle to identify customer segments that exhibit unusual behavior patterns. By uncovering deviations such as irregular payments, abnormal spending, or infrequent card usage, the project aims to detect behavioral outliers and predict which customers are likely to stop using their cards or switch to competitors.
Author: The Classifiers - Saumya Gupta, Jeevana Sai Devi Sathwika Karri

Affiliation: College of Information Science, University of Arizona

import numpy as np
import pandas as pd

Dataset

credit_card = pd.read_csv("data/CC GENERAL.csv")

print(credit_card.info())
print('')
print("\nShape of the dataset:", credit_card.shape)
print('')
print("\nData types:\n", credit_card.dtypes)
print('')
print(credit_card.describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 18 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   CUST_ID                           8950 non-null   object 
 1   BALANCE                           8950 non-null   float64
 2   BALANCE_FREQUENCY                 8950 non-null   float64
 3   PURCHASES                         8950 non-null   float64
 4   ONEOFF_PURCHASES                  8950 non-null   float64
 5   INSTALLMENTS_PURCHASES            8950 non-null   float64
 6   CASH_ADVANCE                      8950 non-null   float64
 7   PURCHASES_FREQUENCY               8950 non-null   float64
 8   ONEOFF_PURCHASES_FREQUENCY        8950 non-null   float64
 9   PURCHASES_INSTALLMENTS_FREQUENCY  8950 non-null   float64
 10  CASH_ADVANCE_FREQUENCY            8950 non-null   float64
 11  CASH_ADVANCE_TRX                  8950 non-null   int64  
 12  PURCHASES_TRX                     8950 non-null   int64  
 13  CREDIT_LIMIT                      8949 non-null   float64
 14  PAYMENTS                          8950 non-null   float64
 15  MINIMUM_PAYMENTS                  8637 non-null   float64
 16  PRC_FULL_PAYMENT                  8950 non-null   float64
 17  TENURE                            8950 non-null   int64  
dtypes: float64(14), int64(3), object(1)
memory usage: 1.2+ MB
None


Shape of the dataset: (8950, 18)


Data types:
 CUST_ID                              object
BALANCE                             float64
BALANCE_FREQUENCY                   float64
PURCHASES                           float64
ONEOFF_PURCHASES                    float64
INSTALLMENTS_PURCHASES              float64
CASH_ADVANCE                        float64
PURCHASES_FREQUENCY                 float64
ONEOFF_PURCHASES_FREQUENCY          float64
PURCHASES_INSTALLMENTS_FREQUENCY    float64
CASH_ADVANCE_FREQUENCY              float64
CASH_ADVANCE_TRX                      int64
PURCHASES_TRX                         int64
CREDIT_LIMIT                        float64
PAYMENTS                            float64
MINIMUM_PAYMENTS                    float64
PRC_FULL_PAYMENT                    float64
TENURE                                int64
dtype: object

            BALANCE  BALANCE_FREQUENCY     PURCHASES  ONEOFF_PURCHASES  \
count   8950.000000        8950.000000   8950.000000       8950.000000   
mean    1564.474828           0.877271   1003.204834        592.437371   
std     2081.531879           0.236904   2136.634782       1659.887917   
min        0.000000           0.000000      0.000000          0.000000   
25%      128.281915           0.888889     39.635000          0.000000   
50%      873.385231           1.000000    361.280000         38.000000   
75%     2054.140036           1.000000   1110.130000        577.405000   
max    19043.138560           1.000000  49039.570000      40761.250000   

       INSTALLMENTS_PURCHASES  CASH_ADVANCE  PURCHASES_FREQUENCY  \
count             8950.000000   8950.000000          8950.000000   
mean               411.067645    978.871112             0.490351   
std                904.338115   2097.163877             0.401371   
min                  0.000000      0.000000             0.000000   
25%                  0.000000      0.000000             0.083333   
50%                 89.000000      0.000000             0.500000   
75%                468.637500   1113.821139             0.916667   
max              22500.000000  47137.211760             1.000000   

       ONEOFF_PURCHASES_FREQUENCY  PURCHASES_INSTALLMENTS_FREQUENCY  \
count                 8950.000000                       8950.000000   
mean                     0.202458                          0.364437   
std                      0.298336                          0.397448   
min                      0.000000                          0.000000   
25%                      0.000000                          0.000000   
50%                      0.083333                          0.166667   
75%                      0.300000                          0.750000   
max                      1.000000                          1.000000   

       CASH_ADVANCE_FREQUENCY  CASH_ADVANCE_TRX  PURCHASES_TRX  CREDIT_LIMIT  \
count             8950.000000       8950.000000    8950.000000   8949.000000   
mean                 0.135144          3.248827      14.709832   4494.449450   
std                  0.200121          6.824647      24.857649   3638.815725   
min                  0.000000          0.000000       0.000000     50.000000   
25%                  0.000000          0.000000       1.000000   1600.000000   
50%                  0.000000          0.000000       7.000000   3000.000000   
75%                  0.222222          4.000000      17.000000   6500.000000   
max                  1.500000        123.000000     358.000000  30000.000000   

           PAYMENTS  MINIMUM_PAYMENTS  PRC_FULL_PAYMENT       TENURE  
count   8950.000000       8637.000000       8950.000000  8950.000000  
mean    1733.143852        864.206542          0.153715    11.517318  
std     2895.063757       2372.446607          0.292499     1.338331  
min        0.000000          0.019163          0.000000     6.000000  
25%      383.276166        169.123707          0.000000    12.000000  
50%      856.901546        312.343947          0.000000    12.000000  
75%     1901.134317        825.485459          0.142857    12.000000  
max    50721.483360      76406.207520          1.000000    12.000000  


The dataset used in this project is the Credit Card Customer Data sourced from Kaggle. It consists of 8,950 rows and 18 columns; each row holds anonymized data on one customer's credit card usage. The features are behavioral indicators such as balance, purchase amounts, cash advances, credit limits, and payment patterns. Two columns contain missing values: CREDIT_LIMIT (1 missing) and MINIMUM_PAYMENTS (313 missing).

Why we chose this dataset

We chose this credit card dataset from Kaggle because it contains detailed information about nearly 9,000 credit card users. It includes data such as their spending habits, payment frequency, and cash advances. This makes it a good dataset for identifying different types of customers and detecting unusual behavior. Additionally, we can use it to predict customers who might stop using their cards or switch to other providers, assess the risk of issuing credit cards to customers, and identify opportunities for targeted offers and credit limit increases.

Aim

Our group is working on a project titled “Behavioral Outlier Segmentation,” which involves analyzing credit card usage data from Kaggle to identify unusual customer behavior patterns. The primary goal of this project is to uncover customer segments that behave similarly but exhibit patterns that deviate from typical usage. These unusual behaviors may include excessive use of cash advances, irregular payment activity, abnormally high or low spending, or infrequent use of the credit card. Additionally, we aim to predict customers who may stop using their cards and switch to competitors.


This dataset has 8,950 rows and 18 columns.

Questions


  1. Can we identify clusters of credit card customers based on their transaction behavior (recency, frequency, and monetary value), detect atypical patterns, and classify customers into risk levels (high, medium, low)?

  2. Can we predict which customers might stop using their credit cards or switch to a competitor?

Risk Level Definitions

We will define customer risk levels based on the following criteria:

  • High Risk: Customers with excessive cash advances (>75th percentile), irregular payments (low PRC_FULL_PAYMENT), and high balance-to-credit-limit ratios (>0.8)
  • Medium Risk: Customers with moderate cash advances (25th-75th percentile), occasional late payments, and balance-to-credit-limit ratios between 0.4-0.8
  • Low Risk: Customers with minimal cash advances (<25th percentile), consistent full payments, and balance-to-credit-limit ratios <0.4
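
As a rough illustration of how these criteria might be turned into labels, the sketch below assigns one level per customer with pandas. The function name `assign_risk_level`, the two PRC_FULL_PAYMENT cutoffs (0.1 for "low", 0.5 for "consistent"), and the toy data are assumptions made for illustration, not values fixed by this proposal. Late payments cannot be observed directly in this dataset, so every customer who is neither High nor Low falls into Medium. Because the 25th percentile of CASH_ADVANCE is 0 in this dataset, the Low rule uses `<=` rather than `<`.

```python
import numpy as np
import pandas as pd


def assign_risk_level(df, low_full_payment=0.1, consistent_full_payment=0.5):
    """Label customers High / Medium / Low (cutoffs are illustrative assumptions)."""
    # Balance-to-credit-limit ratio; a missing CREDIT_LIMIT gives NaN,
    # which fails both comparisons below and therefore ends up as Medium.
    utilization = df["BALANCE"] / df["CREDIT_LIMIT"]
    q25 = df["CASH_ADVANCE"].quantile(0.25)
    q75 = df["CASH_ADVANCE"].quantile(0.75)

    high = (
        (df["CASH_ADVANCE"] > q75)
        & (df["PRC_FULL_PAYMENT"] <= low_full_payment)
        & (utilization > 0.8)
    )
    low = (
        (df["CASH_ADVANCE"] <= q25)
        & (df["PRC_FULL_PAYMENT"] >= consistent_full_payment)
        & (utilization < 0.4)
    )
    labels = np.select([high, low], ["High", "Low"], default="Medium")
    return pd.Series(labels, index=df.index, name="RISK_LEVEL")


# Toy data (made up) with the same column names as the dataset.
toy = pd.DataFrame({
    "BALANCE": [900.0, 100.0, 500.0, 50.0],
    "CREDIT_LIMIT": [1000.0, 1000.0, 1000.0, 1000.0],
    "CASH_ADVANCE": [5000.0, 0.0, 100.0, 0.0],
    "PRC_FULL_PAYMENT": [0.0, 1.0, 0.2, 0.0],
})
levels = assign_risk_level(toy)
```

On the real data the same call would be `assign_risk_level(credit_card)`; the cutoffs would need to be checked against the observed distributions before use.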

Target Variable Creation for Prediction

Since the dataset doesn’t contain churn/attrition labels, we will create a synthetic target variable based on behavioral indicators that typically precede customer churn:

  • Churn Indicators: Low purchase frequency (<0.3), declining payment amounts, high cash advance usage, and irregular payment patterns
  • Target Variable: Binary classification (1 = likely to churn, 0 = likely to stay) based on composite risk score
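
One possible way to build such a target is sketched below: each indicator becomes a boolean flag, the flags are summed into a composite score, and the score is thresholded. The function name `churn_target`, the threshold of 2 flags, and the specific proxies are assumptions. In particular, the dataset is a single snapshot, so "declining payment amounts" cannot be measured; the sketch substitutes "PAYMENTS below MINIMUM_PAYMENTS" as a stand-in, and treats PRC_FULL_PAYMENT equal to 0 as an irregular payment pattern.

```python
import pandas as pd


def churn_target(df, threshold=2):
    """Binary churn proxy: 1 if at least `threshold` indicators fire (assumed rule)."""
    flags = pd.DataFrame({
        # Low purchase frequency, as stated in the proposal.
        "low_purchase_freq": df["PURCHASES_FREQUENCY"] < 0.3,
        # High cash advance usage: above the 75th percentile (assumption).
        "high_cash_advance": df["CASH_ADVANCE"] > df["CASH_ADVANCE"].quantile(0.75),
        # Irregular payments: never paid the balance in full (assumption).
        "irregular_payments": df["PRC_FULL_PAYMENT"] == 0,
        # Stand-in for declining payments; NaN minimum payments compare as False.
        "low_payments": df["PAYMENTS"] < df["MINIMUM_PAYMENTS"],
    })
    score = flags.sum(axis=1)
    return (score >= threshold).astype(int).rename("CHURN")


# Toy data (made up) with the same column names as the dataset.
toy = pd.DataFrame({
    "PURCHASES_FREQUENCY": [0.1, 0.9, 0.2, 1.0],
    "CASH_ADVANCE": [3000.0, 0.0, 0.0, 0.0],
    "PRC_FULL_PAYMENT": [0.0, 1.0, 0.5, 0.0],
    "PAYMENTS": [50.0, 800.0, 300.0, 900.0],
    "MINIMUM_PAYMENTS": [200.0, 100.0, 100.0, 100.0],
})
target = churn_target(toy)
```

Because the label is derived from the same behavioral columns that a model would later see, any model trained on it can partly just recover this rule; that is worth keeping in mind when interpreting accuracy.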

Dataset Overview

Name: Credit Card Dataset
Source: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata
Size: 8950 instances of customer credit card details

Data Preprocessing Plan

  1. Data Quality Assessment: Handle missing values (CREDIT_LIMIT, MINIMUM_PAYMENTS), check for duplicates, identify outliers
  2. Feature Engineering: Create derived features such as balance-to-credit-limit ratio and payment-to-purchase ratio
  3. Feature Selection: Use correlation analysis and domain knowledge to select the most relevant features
  4. Scaling: Standardize numerical features using StandardScaler
  5. Dimensionality Reduction: Apply PCA to the scaled features, reducing the 17 numeric features to 8-10 principal components for clustering
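
The preprocessing steps could be chained roughly as follows. This is a minimal sketch under stated assumptions: median imputation, the derived column names `BALANCE_TO_LIMIT` and `PAYMENT_TO_PURCHASE`, the handling of zero purchases, and the function name `preprocess` are illustrative choices, and the feature selection step is omitted. The demo runs on random synthetic data rather than the Kaggle file.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def preprocess(df, n_components=8):
    """Impute, engineer ratios, standardize, then project with PCA."""
    # Drop exact duplicate rows, then the identifier column.
    data = df.drop_duplicates().drop(columns=["CUST_ID"])
    # Median imputation for missing values (assumed strategy).
    data = data.fillna(data.median())
    # Derived features named in the plan.
    data["BALANCE_TO_LIMIT"] = data["BALANCE"] / data["CREDIT_LIMIT"]
    # Zero purchases would divide by zero; treat that ratio as 0 (assumption).
    purchases = data["PURCHASES"].replace(0, np.nan)
    data["PAYMENT_TO_PURCHASE"] = (data["PAYMENTS"] / purchases).fillna(0.0)
    # Standardize before PCA so large-valued columns do not dominate.
    scaled = StandardScaler().fit_transform(data)
    pca = PCA(n_components=n_components)
    components = pca.fit_transform(scaled)
    return components, pca


# Synthetic stand-in data with a subset of the dataset's column names.
rng = np.random.default_rng(0)
n = 50
demo = pd.DataFrame({
    "CUST_ID": [f"C{i:05d}" for i in range(n)],
    "BALANCE": rng.uniform(0, 5000, n),
    "PURCHASES": rng.uniform(0, 3000, n),
    "CASH_ADVANCE": rng.uniform(0, 2000, n),
    "CREDIT_LIMIT": rng.uniform(500, 10000, n),
    "PAYMENTS": rng.uniform(0, 4000, n),
    "MINIMUM_PAYMENTS": rng.uniform(10, 1000, n),
})
demo.loc[0, "PURCHASES"] = 0.0            # exercise the zero-purchase branch
demo.loc[1, "MINIMUM_PAYMENTS"] = np.nan  # exercise imputation

components, pca = preprocess(demo, n_components=3)
```

`pca.explained_variance_ratio_` can then be inspected to decide how many components to keep for clustering.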

Analysis plan


We will use a type of machine learning called clustering to group customers who have similar spending and payment habits. This method helps us find clear groups of customers who behave alike. It also helps us spot customers who don’t fit into any group.
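
The proposal does not fix a clustering algorithm, so the sketch below assumes k-means from scikit-learn purely as an example. K-means assigns every point to some cluster, so customers who "don't fit" are approximated here as those unusually far from their nearest centroid. The function name `cluster_and_flag`, the 99th-percentile cutoff, and the synthetic two-blob data are all assumptions; inputs are expected to be scaled already.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_and_flag(X, n_clusters=4, outlier_quantile=0.99, seed=0):
    """Cluster with k-means and flag points far from their own centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    # transform() returns the distance to every centroid; the minimum is
    # the distance to the centroid the point was assigned to.
    distance = km.transform(X).min(axis=1)
    cutoff = np.quantile(distance, outlier_quantile)
    return km.labels_, distance > cutoff


# Synthetic demo: two tight groups plus one point that belongs to neither.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.5, size=(100, 2))
blob_b = rng.normal(10.0, 0.5, size=(100, 2))
odd_point = np.array([[0.0, 6.0]])
X_demo = np.vstack([blob_a, blob_b, odd_point])

labels, is_outlier = cluster_and_flag(X_demo, n_clusters=2)
```

A density-based method that labels noise points explicitly would be an alternative design; the distance cutoff is used here only because it keeps the example short.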

For the second part of our project, we want to predict which customers might stop using their credit cards or switch to a different company. To do this, we will use prediction models that learn from data about customer behavior.
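
As one example of such a model, the sketch below fits a logistic regression with scikit-learn and reports held-out accuracy. The choice of logistic regression, the function name `fit_churn_model`, the 75/25 split, and the synthetic features and labels are assumptions for illustration; the proposal itself does not commit to a specific classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_churn_model(X, y, seed=0):
    """Fit a scaled logistic regression and return (model, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # Scaling lives inside the pipeline so it is fit on training data only.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Synthetic stand-in for behavioral features and a binary churn label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 2))
noise = 0.3 * rng.normal(size=400)
y_demo = (X_demo[:, 0] - X_demo[:, 1] + noise > 0).astype(int)

model, accuracy = fit_churn_model(X_demo, y_demo)
churn_probability = model.predict_proba(X_demo)[:, 1]
```

With the real data, `X` would hold the prediction variables listed in this section and `y` the constructed churn label.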

Question    Variables Used
Clustering  BALANCE, BALANCE_FREQUENCY, PURCHASES, ONEOFF_PURCHASES, INSTALLMENTS_PURCHASES, CASH_ADVANCE, PURCHASES_FREQUENCY
Prediction  TENURE, BALANCE, BALANCE_FREQUENCY, PURCHASES_FREQUENCY, PAYMENTS, MINIMUM_PAYMENTS, PRC_FULL_PAYMENT, CASH_ADVANCE

Data Dictionary

Variable                          Description
CUST_ID                           Identification of Credit Card holder
BALANCE                           Balance amount left in their account to make purchases
BALANCE_FREQUENCY                 How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
PURCHASES                         Amount of purchases made from account
ONEOFF_PURCHASES                  Maximum purchase amount done in one-go
INSTALLMENTS_PURCHASES            Amount of purchase done in installment
CASH_ADVANCE                      Cash in advance given by the user
PURCHASES_FREQUENCY               How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
ONEOFF_PURCHASES_FREQUENCY        How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
PURCHASES_INSTALLMENTS_FREQUENCY  How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
CASH_ADVANCE_FREQUENCY            How frequently the cash in advance is being paid
CASH_ADVANCE_TRX                  Number of Transactions made with “Cash in Advance”
PURCHASES_TRX                     Number of purchase transactions made
CREDIT_LIMIT                      Limit of Credit Card for user
PAYMENTS                          Amount of Payment done by user
MINIMUM_PAYMENTS                  Minimum amount of payments made by user
PRC_FULL_PAYMENT                  Percent of full payment paid by user
TENURE                            Tenure of credit card service for user

Plan of Attack

Week    Dates           Activity                                                Status
Week 2  25 July 2025    • Review the Dataset and finalize the team              Completed
                        • Select data mining techniques and clustering methods
Week 3  1 August 2025   • Proposal and Peer Review with other teams             Completed
                        • Data Preprocessing
Week 4  8 August 2025   • Perform feature engineering/selection                 Completed
                        • Transform and scale features
                        • Apply clustering algorithms
Week 5  15 August 2025  • Evaluate clustering performance                       Completed
                        • Visualize our results
Week 6  20 August 2025  • Conduct a peer code review                            Completed
                        • Present projects and turn in final write-up