Behavioral Outlier Segmentation using credit card dataset
Proposal
Dataset
import pandas as pd

# Load the Kaggle credit card dataset
credit_card = pd.read_csv("data/CC GENERAL.csv")
# info() prints its own summary and returns None, so print() adds a trailing "None"
print(credit_card.info())
print("\nShape of the dataset:", credit_card.shape)
print("\nData types:\n", credit_card.dtypes)
print(credit_card.describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CUST_ID 8950 non-null object
1 BALANCE 8950 non-null float64
2 BALANCE_FREQUENCY 8950 non-null float64
3 PURCHASES 8950 non-null float64
4 ONEOFF_PURCHASES 8950 non-null float64
5 INSTALLMENTS_PURCHASES 8950 non-null float64
6 CASH_ADVANCE 8950 non-null float64
7 PURCHASES_FREQUENCY 8950 non-null float64
8 ONEOFF_PURCHASES_FREQUENCY 8950 non-null float64
9 PURCHASES_INSTALLMENTS_FREQUENCY 8950 non-null float64
10 CASH_ADVANCE_FREQUENCY 8950 non-null float64
11 CASH_ADVANCE_TRX 8950 non-null int64
12 PURCHASES_TRX 8950 non-null int64
13 CREDIT_LIMIT 8949 non-null float64
14 PAYMENTS 8950 non-null float64
15 MINIMUM_PAYMENTS 8637 non-null float64
16 PRC_FULL_PAYMENT 8950 non-null float64
17 TENURE 8950 non-null int64
dtypes: float64(14), int64(3), object(1)
memory usage: 1.2+ MB
None
Shape of the dataset: (8950, 18)
Data types:
CUST_ID object
BALANCE float64
BALANCE_FREQUENCY float64
PURCHASES float64
ONEOFF_PURCHASES float64
INSTALLMENTS_PURCHASES float64
CASH_ADVANCE float64
PURCHASES_FREQUENCY float64
ONEOFF_PURCHASES_FREQUENCY float64
PURCHASES_INSTALLMENTS_FREQUENCY float64
CASH_ADVANCE_FREQUENCY float64
CASH_ADVANCE_TRX int64
PURCHASES_TRX int64
CREDIT_LIMIT float64
PAYMENTS float64
MINIMUM_PAYMENTS float64
PRC_FULL_PAYMENT float64
TENURE int64
dtype: object
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES \
count 8950.000000 8950.000000 8950.000000 8950.000000
mean 1564.474828 0.877271 1003.204834 592.437371
std 2081.531879 0.236904 2136.634782 1659.887917
min 0.000000 0.000000 0.000000 0.000000
25% 128.281915 0.888889 39.635000 0.000000
50% 873.385231 1.000000 361.280000 38.000000
75% 2054.140036 1.000000 1110.130000 577.405000
max 19043.138560 1.000000 49039.570000 40761.250000
INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY \
count 8950.000000 8950.000000 8950.000000
mean 411.067645 978.871112 0.490351
std 904.338115 2097.163877 0.401371
min 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.083333
50% 89.000000 0.000000 0.500000
75% 468.637500 1113.821139 0.916667
max 22500.000000 47137.211760 1.000000
ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY \
count 8950.000000 8950.000000
mean 0.202458 0.364437
std 0.298336 0.397448
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.083333 0.166667
75% 0.300000 0.750000
max 1.000000 1.000000
CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT \
count 8950.000000 8950.000000 8950.000000 8949.000000
mean 0.135144 3.248827 14.709832 4494.449450
std 0.200121 6.824647 24.857649 3638.815725
min 0.000000 0.000000 0.000000 50.000000
25% 0.000000 0.000000 1.000000 1600.000000
50% 0.000000 0.000000 7.000000 3000.000000
75% 0.222222 4.000000 17.000000 6500.000000
max 1.500000 123.000000 358.000000 30000.000000
PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
count 8950.000000 8637.000000 8950.000000 8950.000000
mean 1733.143852 864.206542 0.153715 11.517318
std 2895.063757 2372.446607 0.292499 1.338331
min 0.000000 0.019163 0.000000 6.000000
25% 383.276166 169.123707 0.000000 12.000000
50% 856.901546 312.343947 0.000000 12.000000
75% 1901.134317 825.485459 0.142857 12.000000
max 50721.483360 76406.207520 1.000000 12.000000
The dataset used in this project is the Credit Card Dataset sourced from Kaggle. It consists of 8,950 rows and 18 columns; each row holds anonymized data on one customer's credit card usage. The features include behavioral indicators such as balance, purchase amounts, cash advances, credit limits, and payment patterns.
Why we chose this dataset
We chose this credit card dataset from Kaggle because it contains detailed information about nearly 9,000 credit card users. It includes data such as their spending habits, payment frequency, and cash advances. This makes it a good dataset for identifying different types of customers and detecting unusual behavior. Additionally, we can use it to predict customers who might stop using their cards or switch to other providers, assess the risk of issuing credit cards to customers, and identify opportunities for targeted offers and credit limit increases.
Aim
Our group is working on a project titled “Behavioral Outlier Segmentation,” which involves analyzing credit card usage data from Kaggle to identify unusual customer behavior patterns. The primary goal of this project is to uncover customer segments that behave similarly but exhibit patterns that deviate from typical usage. These unusual behaviors may include excessive use of cash advances, irregular payment activity, abnormally high or low spending, or infrequent use of the credit card. Additionally, we aim to predict customers who may stop using their cards and switch to competitors.
This dataset has 8,950 rows and 18 columns.
Questions
Can we identify clusters of credit card customers based on their transaction behavior (recency, frequency, and monetary value), detect atypical patterns, and classify customers into risk levels (high, medium, low)?
Can we predict which customers might stop using their credit cards or switch to a competitor?
Risk Level Definitions
We will define customer risk levels based on the following criteria:
- High Risk: Customers with excessive cash advances (>75th percentile), irregular payments (low PRC_FULL_PAYMENT), and high balance-to-credit-limit ratios (>0.8)
- Medium Risk: Customers with moderate cash advances (25th-75th percentile), occasional late payments, and balance-to-credit-limit ratios between 0.4-0.8
- Low Risk: Customers with minimal cash advances (<25th percentile), consistent full payments, and balance-to-credit-limit ratios <0.4
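The three definitions above could be turned into labels roughly as follows. This is a minimal sketch on toy rows, not the project's final rule: the PRC_FULL_PAYMENT cutoffs (`low_full_pay`, `high_full_pay`) are hypothetical, and any customer who is neither clearly high nor clearly low is assumed to be medium. Because the 25th percentile of CASH_ADVANCE in this dataset is 0, a strict `<` test would match no one, so the sketch uses `<=` for the low-risk group.

```python
import numpy as np
import pandas as pd

def assign_risk_level(df, low_full_pay=0.1, high_full_pay=0.5):
    """Label each row 'high', 'medium', or 'low' risk.

    low_full_pay and high_full_pay are hypothetical cutoffs for
    PRC_FULL_PAYMENT; the proposal does not fix them.
    """
    q25 = df["CASH_ADVANCE"].quantile(0.25)
    q75 = df["CASH_ADVANCE"].quantile(0.75)
    # Balance-to-credit-limit ratio; NaN where CREDIT_LIMIT is missing.
    ratio = df["BALANCE"] / df["CREDIT_LIMIT"]

    high = (
        (df["CASH_ADVANCE"] > q75)
        & (df["PRC_FULL_PAYMENT"] < low_full_pay)
        & (ratio > 0.8)
    )
    low = (
        (df["CASH_ADVANCE"] <= q25)
        & (df["PRC_FULL_PAYMENT"] >= high_full_pay)
        & (ratio < 0.4)
    )
    # Assumption: rows that are neither clearly high nor clearly low are medium.
    labels = np.select([high, low], ["high", "low"], default="medium")
    return pd.Series(labels, index=df.index, name="RISK_LEVEL")

# Toy rows for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "BALANCE": [4500.0, 100.0, 1500.0, 900.0],
    "CREDIT_LIMIT": [5000.0, 4000.0, 3000.0, 1000.0],
    "CASH_ADVANCE": [6000.0, 0.0, 500.0, 0.0],
    "PRC_FULL_PAYMENT": [0.0, 1.0, 0.2, 0.0],
})
toy["RISK_LEVEL"] = assign_risk_level(toy)
print(toy)
```

Rows with a missing CREDIT_LIMIT produce a NaN ratio, fail both comparisons, and therefore fall into the medium group under this sketch.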
Target Variable Creation for Prediction
Since the dataset doesn’t contain churn/attrition labels, we will create a synthetic target variable based on behavioral indicators that typically precede customer churn:
- Churn Indicators: Low purchase frequency (<0.3), declining payment amounts, high cash advance usage, and irregular payment patterns
- Target Variable: Binary classification (1 = likely to churn, 0 = likely to stay) based on composite risk score
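One way the composite score might be built is to count how many churn indicators each customer triggers and label those above a threshold. The sketch below runs on toy rows and makes several assumptions: the `min_flags` threshold is hypothetical, "high cash advance usage" is taken as above the 75th percentile, and "irregular payment patterns" as never paying in full. The columns hold one aggregate value per customer, so a declining payment trend cannot be read directly; paying less than MINIMUM_PAYMENTS is used here as a stand-in.

```python
import pandas as pd

def make_churn_target(df, min_flags=2):
    """Build a binary proxy churn label from a count of indicator flags.

    min_flags and every cutoff except PURCHASES_FREQUENCY < 0.3 are
    hypothetical choices, not values fixed by the proposal.
    """
    flags = pd.DataFrame({
        "low_purchase_freq": df["PURCHASES_FREQUENCY"] < 0.3,
        "high_cash_advance": df["CASH_ADVANCE"] > df["CASH_ADVANCE"].quantile(0.75),
        # Stand-in for "declining payment amounts"; False when either value is NaN.
        "paid_below_minimum": df["PAYMENTS"] < df["MINIMUM_PAYMENTS"],
        "never_pays_in_full": df["PRC_FULL_PAYMENT"] == 0,
    })
    score = flags.sum(axis=1)  # composite score: number of flags raised
    return (score >= min_flags).astype(int).rename("CHURN")

# Toy rows for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "PURCHASES_FREQUENCY": [0.1, 0.9, 0.2, 1.0],
    "CASH_ADVANCE": [3000.0, 0.0, 0.0, 100.0],
    "PAYMENTS": [200.0, 1500.0, 800.0, 50.0],
    "MINIMUM_PAYMENTS": [400.0, 300.0, 300.0, 100.0],
    "PRC_FULL_PAYMENT": [0.0, 0.8, 0.5, 0.0],
})
toy["CHURN"] = make_churn_target(toy)
print(toy)
```

A label built this way is a deterministic function of the same columns that later serve as predictors, so a model trained on them may simply recover the rule; holding the flag columns out of the feature set is one possible way to limit that.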
Dataset Overview
Name: Credit Card Dataset
Source: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata
Size: 8950 instances of customer credit card details
Data Preprocessing Plan
- Data Quality Assessment: Handle missing values, check for duplicates, identify outliers
- Feature Engineering: Create derived features like balance-to-credit-limit ratio, payment-to-purchase ratio
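- Scaling: Standardize numerical features using StandardScaler (done before PCA, which is sensitive to feature scale)
- Dimensionality Reduction: Apply PCA to reduce the 17 numeric features (CUST_ID is an identifier and is excluded) to 8-10 principal components for clustering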
- Feature Selection: Use correlation analysis and domain knowledge to select most relevant features
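The imputation, scaling, and PCA steps of this plan could be chained in a scikit-learn `Pipeline`, as in the sketch below. It runs on synthetic data standing in for the 17 numeric columns; median imputation is an assumption (the plan only says missing values will be handled), and `n_components=8` is simply the lower end of the 8-10 range mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the 17 numeric columns (skewed, like spending data).
n_rows, n_features = 200, 17
X = pd.DataFrame(
    rng.lognormal(size=(n_rows, n_features)),
    columns=[f"f{i}" for i in range(n_features)],
)
X.iloc[::25, 3] = np.nan  # inject a few missing values

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # assumption: median fill
    ("scale", StandardScaler()),                   # zero mean, unit variance
    ("pca", PCA(n_components=8)),                  # keep 8 components
])

components = preprocess.fit_transform(X)
explained = preprocess.named_steps["pca"].explained_variance_ratio_.sum()

print("Reduced shape:", components.shape)
print("Share of variance kept:", round(float(explained), 3))
```

The share of variance kept is a reasonable guide for choosing between 8, 9, or 10 components on the real data.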
Analysis plan
We will use a type of machine learning called clustering to group customers who have similar spending and payment habits. This method helps us find clear groups of customers who behave alike. It also helps us spot customers who don’t fit into any group.
For the second part of our project, we want to predict which customers might stop using their credit cards or switch to a different company. To do this, we will use prediction models that learn from data about customer behavior.
Question | Variables Used |
---|---|
Clustering | BALANCE, BALANCE_FREQUENCY, PURCHASES, ONEOFF_PURCHASES, INSTALLMENTS_PURCHASES, CASH_ADVANCE, PURCHASES_FREQUENCY |
Prediction | TENURE, BALANCE, BALANCE_FREQUENCY, PURCHASES_FREQUENCY, PAYMENTS, MINIMUM_PAYMENTS, PRC_FULL_PAYMENT, CASH_ADVANCE |
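The clustering step, including spotting customers who do not fit any group, might look like the sketch below. The proposal does not name an algorithm, so K-means is an assumption here, as are the number of clusters and the 99th-percentile distance cutoff. The data are two synthetic blobs plus one stray point, not the real clustering variables.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data: two tight groups and one point that belongs to neither.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
stray = np.array([[0.0, 5.0]])
X = np.vstack([blob_a, blob_b, stray])

X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Distance from each point to the centroid of its assigned cluster.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
dist = np.linalg.norm(X_scaled - assigned_centroids, axis=1)

# Assumption: points beyond the 99th percentile of distances are outliers.
threshold = np.quantile(dist, 0.99)
outliers = np.flatnonzero(dist > threshold)

print("Cluster sizes:", np.bincount(kmeans.labels_))
print("Outlier row indices:", outliers)
```

A density-based method such as DBSCAN, which marks noise points directly, would be an alternative to the distance cutoff used here.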
Data Dictionary
Variable | Description |
---|---|
CUST_ID | Identification of Credit Card holder |
BALANCE | Balance amount left in their account to make purchases |
BALANCE_FREQUENCY | How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated) |
PURCHASES | Amount of purchases made from account |
ONEOFF_PURCHASES | Maximum purchase amount done in one-go |
INSTALLMENTS_PURCHASES | Amount of purchase done in installment |
CASH_ADVANCE | Cash in advance given by the user |
PURCHASES_FREQUENCY | How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased) |
ONEOFF_PURCHASES_FREQUENCY | How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased) |
PURCHASES_INSTALLMENTS_FREQUENCY | How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done) |
CASH_ADVANCE_FREQUENCY | How frequently the cash in advance is being paid |
CASH_ADVANCE_TRX | Number of Transactions made with “Cash in Advance” |
PURCHASES_TRX | Number of purchase transactions made |
CREDIT_LIMIT | Limit of Credit Card for user |
PAYMENTS | Amount of Payment done by user |
MINIMUM_PAYMENTS | Minimum amount of payments made by user |
PRC_FULL_PAYMENT | Percent of full payment paid by user |
TENURE | Tenure of credit card service for user |
Plan of Attack
Week | Dates | Activity | Status |
---|---|---|---|
Week 2 | 25 July 2025 | • Review the Dataset and finalize the team • Select data mining techniques and clustering methods | Completed |
Week 3 | 1 August 2025 | • Proposal and Peer Review with other teams • Data Preprocessing | Completed |
Week 4 | 8 August 2025 | • Perform feature engineering/selection • Transform and scale features • Apply clustering algorithms | Completed |
Week 5 | 15 August 2025 | • Evaluate clustering performance • Visualize our results | Completed |
Week 6 | 20 August 2025 | • Conduct a peer code review • Present projects and turn in final write-up | Completed |