This project applies unsupervised machine learning techniques to perform customer segmentation on a marketing dataset. It uses the BIRCH, K-Means and DBSCAN clustering algorithms to group customers based on their demographic and behavioral features, with a focus on interpretability, performance, and scalability.
The dataset is a cleaned and scaled version of the Customer Personality Analysis dataset available on Kaggle. It contains features such as:
- Demographic Features: Age, Income, Family Size
- Behavioral Features: Total Spend, Recency, Tenure, Campaign Responses, Purchases via different channels (Web, Catalog, Store)
- Target: None (the task is unsupervised, so the dataset has no labeled output)
- ✅ BIRCH Clustering
- ✅ KMeans Clustering
- ✅ DBSCAN
- ✅ PCA for dimensionality reduction (visualization)
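
As a rough illustration of how the three algorithms listed above can be run side by side, here is a minimal sketch using scikit-learn. It uses synthetic data from `make_blobs` as a stand-in for the scaled customer features, and the hyperparameters (`n_clusters=3`, `threshold=0.2`, `eps=0.8`, `min_samples=5`) are illustrative assumptions, not the values used in the notebooks.

```python
import numpy as np
from sklearn.cluster import Birch, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features (assumption: the real
# pipeline would load data/processed_customers.csv instead).
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=42)
X = StandardScaler().fit_transform(X)

# Hyperparameters are illustrative only, not the notebooks' settings.
models = {
    "birch": Birch(threshold=0.2, n_clusters=3),
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=42),
    "dbscan": DBSCAN(eps=0.8, min_samples=5),
}

# fit_predict returns one label per row; DBSCAN marks noise points with -1.
labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, lab in labels.items():
    n_clusters = len(set(lab)) - (1 if -1 in lab else 0)
    n_noise = int(np.sum(lab == -1))
    print(f"{name}: {n_clusters} clusters, {n_noise} noise points")
```

Keeping the models in a dictionary makes it easy to add or swap an algorithm without touching the rest of the loop.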
- Segment customers into meaningful groups
- Perform 3D visualization of clusters using PCA
- Extract cluster-wise insights based on statistical summaries
- Generate automatic customer personas per cluster
- Make the code modular and reusable for other models
customer-segmentation
│
├── data/
│ └── processed_customers.csv # Cleaned and scaled dataset
│
├── models/
│ ├── birch_clustering.ipynb # BIRCH analysis
│ ├── kmeans_clustering.ipynb # KMeans analysis
│ ├── dbscan_clustering.ipynb # DBSCAN analysis
│ └── persona_generator.py # assign persona to each cluster
│
├── notebooks/
│ ├── eda.ipynb # Exploratory data analysis
│ ├── data_preprocessing.ipynb
│ └── model_comparison.ipynb
│
├── processed/
│ ├── birch.csv # dataset containing cluster labels by BIRCH Algorithm
│ ├── kmeans.csv # dataset containing cluster labels by K-Means Algorithm
│ ├── dbcsan.csv # dataset containing cluster labels by DBSCAN Algorithm
│ ├── processed_customers.csv # dataset after performing data manipulation
│ ├── processed_customers_raw.csv #
│ └── processed_customers_unscaled.csv
│
├── README.md # This file
└── requirements.txt # Dependencies
Each cluster is analyzed using range estimates (mean ± std) for features such as:
- Income: Customer’s income
- Recency: Days since last purchase
- Total Spend: Combined spend across all channels
- Customer Tenure: Days since becoming a customer
- Web, Catalog, and Store Purchases
- Campaign Responses, Complaints, Family Size, Age
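
The mean ± std summaries described above can be computed with a pandas `groupby`. The sketch below uses a tiny made-up table; the column names (`cluster`, `Income`, `Recency`) and the `format_range` helper are assumptions for illustration, not the repository's actual schema or code.

```python
import pandas as pd

# Tiny made-up table; column names are assumptions, not the repo's schema.
df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "Income": [30000, 32000, 34000, 80000, 85000, 90000],
    "Recency": [60, 70, 80, 10, 20, 30],
})

# Per-cluster mean and sample standard deviation for every feature.
stats = df.groupby("cluster").agg(["mean", "std"])


def format_range(mean, std):
    """Render one feature's cluster statistics as 'mean ± std' (hypothetical helper)."""
    return f"{mean:.1f} ± {std:.1f}"


features = ["Income", "Recency"]
summary = pd.DataFrame(
    {
        feature: [
            format_range(stats.loc[c, (feature, "mean")], stats.loc[c, (feature, "std")])
            for c in stats.index
        ]
        for feature in features
    },
    index=stats.index,
)
print(summary)
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, which may differ slightly from a population estimate on small clusters.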
Insights are used to label each cluster with customer personas, like:
- 🎯 "High-Value, Multi-Channel Spenders"
- 🧍 "Low-Engagement, Budget Shoppers"
- 📦 "Catalog Loyalists with High Income"
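
One simple way to turn cluster statistics into persona labels like these is a set of threshold rules applied to each cluster's mean profile. The sketch below is hypothetical: the column names, the thresholds, and the `assign_persona` function are all assumptions, and it does not reflect what `persona_generator.py` actually does.

```python
import pandas as pd


def assign_persona(row):
    """Map one cluster's mean profile to a persona label.

    All thresholds here are invented for illustration.
    """
    multi_channel = min(
        row["web_purchases"], row["catalog_purchases"], row["store_purchases"]
    ) >= 4
    if row["total_spend"] > 1000 and multi_channel:
        return "High-Value, Multi-Channel Spenders"
    if row["catalog_purchases"] > 5 and row["income"] > 70000:
        return "Catalog Loyalists with High Income"
    return "Low-Engagement, Budget Shoppers"


# Made-up per-cluster means (unscaled units), indexed by cluster id.
cluster_means = pd.DataFrame(
    {
        "income": [85000, 30000, 78000],
        "total_spend": [1500, 100, 900],
        "web_purchases": [6, 2, 2],
        "catalog_purchases": [5, 0, 7],
        "store_purchases": [8, 3, 3],
    },
    index=pd.Index([0, 1, 2], name="cluster"),
)

personas = cluster_means.apply(assign_persona, axis=1)
print(personas)
```

Rules like these are easiest to read when they operate on unscaled values, so the thresholds stay in the original units.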
Clustering results are visualized with 3D PCA plots to show how customer segments are distributed in the reduced feature space.
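
A 3D PCA plot of this kind can be sketched with scikit-learn and matplotlib as follows. The data is synthetic, KMeans stands in for whichever algorithm produced the labels, and the output filename `pca_clusters_3d.png` is an assumption.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=200, centers=3, n_features=6, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Project the features onto the first three principal components.
pca = PCA(n_components=3)
X_3d = pca.fit_transform(X)

fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection="3d")
ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=labels, cmap="viridis", s=15)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
ax.set_title("Clusters in PCA space")
fig.savefig("pca_clusters_3d.png", dpi=150)

# Share of total variance retained by the three components.
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped so the figure renders inline. Checking the explained variance is worthwhile, since a low value means the 3D view hides much of the structure the clustering actually used.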
- Implement CURE hierarchical clustering
- Add cluster evaluation metrics (Silhouette, Davies-Bouldin)
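
For the planned evaluation metrics, scikit-learn already provides `silhouette_score` and `davies_bouldin_score`. The sketch below shows one possible way to wire them in; the `evaluate` helper is hypothetical, and dropping DBSCAN noise points (label `-1`) before scoring is a design assumption rather than the project's chosen approach.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score


def evaluate(X, labels):
    """Score a clustering; returns None if fewer than 2 clusters remain.

    Noise points (label -1, as produced by DBSCAN) are excluded first.
    """
    labels = np.asarray(labels)
    mask = labels != -1
    if len(set(labels[mask])) < 2:
        return None
    return {
        # Silhouette: in [-1, 1], higher is better.
        "silhouette": silhouette_score(X[mask], labels[mask]),
        # Davies-Bouldin: >= 0, lower is better.
        "davies_bouldin": davies_bouldin_score(X[mask], labels[mask]),
    }


# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

scores = evaluate(X, labels)
print(scores)
```

Both metrics are undefined for a single cluster, which is why the helper returns `None` in that case instead of raising.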
- Clone the repository:
git clone https://github.com/shubhro2002/Comparing-Clustering-Algorithms-by-Customer-Segmentation.git
- Install required dependencies:
pip install -r requirements.txt
- Run the Jupyter notebooks or Python scripts to get started.