sgandhi797/project-python-customer-segmentation-churn-prediction

👥 Customer Segmentation & Churn Prediction (Python Project)

Understanding customer behavior is vital for enhancing retention strategies. This project aims to:

  • Segment customers based on purchasing behavior using RFM analysis.
  • Identify high-risk customers likely to churn.
  • Develop predictive models to anticipate customer churn.

📈 Project Overview

  • This project uses Python to segment customers with RFM (Recency, Frequency, Monetary) analysis and to predict customer churn with a machine learning classifier. The analysis is based on e-commerce transaction data.

📦 Dataset

  • Source: Kaggle – E-Commerce Data
  • The dataset was originally published through the UC Irvine Machine Learning Repository.
  • The dataset includes transactions from an online retailer between 01/12/2010 and 09/12/2011 and contains fields such as InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country.
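
Before running the analysis, it can help to confirm that a loaded file actually has the fields listed above. The following is a minimal sketch, not part of the project code: `missing_columns` is a hypothetical helper, and the one-row frame is invented for illustration.

```python
import pandas as pd

# Field names as listed in the dataset description
EXPECTED_COLUMNS = [
    "InvoiceNo", "StockCode", "Description", "Quantity",
    "InvoiceDate", "UnitPrice", "CustomerID", "Country",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Invented one-row frame that lacks the Country column
sample = pd.DataFrame({
    "InvoiceNo": ["X1"], "StockCode": ["S1"], "Description": ["Sample item"],
    "Quantity": [1], "InvoiceDate": ["2011-01-01"], "UnitPrice": [1.5],
    "CustomerID": [1.0],
})
print(missing_columns(sample))  # prints ['Country']
```

An empty list means every expected field is present.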

💡 Key Steps

  • 📊 RFM Analysis
    • Calculated Recency, Frequency, and Monetary Value for each customer
    • Standardized features for clustering
  • 📈 Customer Segmentation (K-Means)
    • Used the Elbow Method to find optimal number of clusters
    • Segmented customers into 4 behavior-based groups
  • 🔮 Churn Prediction
    • Labeled a customer as churned when Recency exceeds 180 days
    • Trained a Random Forest classifier to predict churn
    • Evaluated model performance with precision, recall, F1-score

🔍 Key Python Queries

  • STEP 1: Load and Explore Dataset

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import datetime
  • Read the Dataset

    df = pd.read_csv(r"C:\Users\sgand\OneDrive\Documents\Data Analysis\Python\Customer Segmentation and Churn Prediction\ecommerce-data.csv")
  • Basic Cleanup

    df.dropna(subset=['CustomerID'], inplace=True)
    df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
    df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
    df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
  • STEP 2: RFM Analysis

    snapshot_date = df['InvoiceDate'].max() + datetime.timedelta(days=1)
    rfm = df.groupby('CustomerID').agg({
        'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
        'InvoiceNo': 'nunique',
        'TotalPrice': 'sum'
    })
    rfm.rename(columns={
        'InvoiceDate': 'Recency',
        'InvoiceNo': 'Frequency',
        'TotalPrice': 'MonetaryValue'
    }, inplace=True)
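
To see what the aggregation produces, here is a minimal sketch on a made-up frame with three customers. The rows, customer IDs, and amounts are invented; only the column names and the aggregation logic follow the step above. It uses pandas named aggregation, which gives the same result as `agg` followed by `rename`.

```python
import pandas as pd

# Hypothetical toy transactions (invented values, same column names as the dataset)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C1", "C2"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-10", "2011-03-01", "2011-02-15",
        "2011-03-20", "2011-03-20", "2011-03-30",
    ]),
    "Quantity": [2, 1, 5, 1, 3, 2],
    "UnitPrice": [10.0, 20.0, 2.0, 5.0, 1.0, 4.0],
})
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

# Snapshot is one day after the latest invoice, as in the RFM step
snapshot_date = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda x: (snapshot_date - x.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    MonetaryValue=("TotalPrice", "sum"),
)
print(toy_rfm)
```

Customer 1 comes out with Recency 30, Frequency 2, and MonetaryValue 40.0. Customer 3 has three rows but only two distinct invoices, so its Frequency is 2.
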
  • STEP 3: Scaling and Clustering

    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    
    scaler = StandardScaler()
    rfm_scaled = scaler.fit_transform(rfm)
  • Elbow Method

    sse = {}
    for k in range(1, 10):
        kmeans = KMeans(n_clusters=k, random_state=1)
        kmeans.fit(rfm_scaled)
        sse[k] = kmeans.inertia_
    
    plt.figure(figsize=(8,5))
    plt.plot(list(sse.keys()), list(sse.values()), marker='o')
    plt.xlabel("Number of clusters")
    plt.ylabel("SSE")
    plt.title("Elbow Method for Optimal K")
    plt.show()
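
The elbow is read by eye from the plot. One way to back that reading up is to print the relative drop in SSE for each added cluster and look for where the drops level off. The sketch below is illustrative only: `make_blobs` stands in for `rfm_scaled`, and the blob count and dimensions are assumptions, not properties of the real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled RFM matrix: 4 blobs in 3 dimensions
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=1)

ks = list(range(1, 10))
sse = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_ for k in ks]

# Relative drop in SSE when moving from k-1 to k clusters
drops = -np.diff(sse) / np.array(sse[:-1])
for k, drop in zip(ks[1:], drops):
    print(f"k={k}: SSE falls by {drop:.1%}")
```

Large drops followed by consistently small ones suggest the elbow sits at the last large drop.
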
  • Apply KMeans with K=4

    kmeans = KMeans(n_clusters=4, random_state=1)
    rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)
  • Visualize Clusters

    sns.pairplot(rfm.reset_index(), hue='Cluster', palette='Set1', height=3)
    plt.savefig(r"C:\Users\sgand\OneDrive\Documents\Data Analysis\Python\Customer Segmentation and Churn Prediction/clusters_plot.png")
  • STEP 4: Churn Prediction

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    import joblib
  • Define Churn Label (Recency > 180 days)

    rfm['Churn'] = rfm['Recency'].apply(lambda x: 1 if x > 180 else 0)
  • Train/Test Split

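    X, y = rfm[['Frequency', 'MonetaryValue']], rfm['Churn']  # assumed features: Recency defines the label, so it is excluded
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)  # assumed 80/20 split
    model = RandomForestClassifier()
    model.fit(X_train, y_train)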
  • Evaluate

    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
  • Save Model

    joblib.dump(model, r"C:\Users\sgand\OneDrive\Documents\Data Analysis\Python\Customer Segmentation and Churn Prediction/customer-segmentation-churn_model.pkl")
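
Once saved, the model can be reloaded with `joblib.load` and used to score new customers. The sketch below is self-contained and only approximates that workflow: the small forest, its training rows, the feature columns, and the temporary file path are all invented placeholders. With the real model, the input columns must match the features it was trained on.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the trained model: a tiny forest fit on invented data
train = pd.DataFrame({
    "Frequency": [1, 2, 8, 12, 1, 15],
    "MonetaryValue": [20.0, 35.0, 900.0, 1500.0, 15.0, 2100.0],
})
labels = [1, 1, 0, 0, 1, 0]  # 1 = churned, 0 = active (invented)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(train, labels)

# Save and reload, mirroring the joblib.dump call (temporary path for illustration)
model_path = os.path.join(tempfile.mkdtemp(), "churn_model.pkl")
joblib.dump(model, model_path)
loaded = joblib.load(model_path)

# Score new customers; columns must match the training features in name and order
new_customers = pd.DataFrame({
    "Frequency": [1, 10],
    "MonetaryValue": [18.0, 1200.0],
})
churn_probability = loaded.predict_proba(new_customers)[:, 1]
print(churn_probability)
```

`predict_proba` returns one column per class, so `[:, 1]` selects the probability of the churned class.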

📂 Project Structure

project-python-customer-segmentation-churn-prediction/
├── Data/
│   └── ecommerce-data.csv                   # Original Kaggle dataset
│
├── Notebooks/
│   └── customer_segmentation.ipynb          # Complete Jupyter Notebook analysis
│
├── Source/
│   ├── data_preprocessing.py                # Cleaning, feature engineering, total price calculation
│   ├── rfm_analysis.py                      # Recency, Frequency, Monetary value calculation
│   ├── clustering.py                        # K-Means clustering logic and visualizations
│   └── churn_prediction.py                  # Model training, evaluation, and saving
│
├── Models/
│   └── churn_model.pkl                      # Trained Random Forest model
│
├── Visuals/
│   └── clusters_plot.png                    # Visualization of customer segments
│
├── Reports/
│   └── insights_summary.md                  # Business-style insights and summary report
│
├── requirements.txt                         # Python libraries needed to run this project
├── .gitignore                               # Ignore checkpoints, system files, and data
├── README.md                                # Full project overview and usage guide
└── LICENSE                                  # MIT License file

📊 Visualizations

[Figure: clusters_plot.png, pair plot of the RFM features colored by cluster]


📌 Key Insights

  • High-value customers often have low recency (recent activity) and high frequency
  • A cluster of customers showed high spend but long inactivity → ideal for retention targeting
  • The churn prediction model achieved strong recall on identifying at-risk customers
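
Insights like these are usually read from a per-cluster profile of the RFM values. The following is a minimal sketch of such a profile. The table is invented for illustration and does not reproduce the project's actual clusters.

```python
import pandas as pd

# Hypothetical RFM table with cluster labels (values invented for illustration)
rfm_demo = pd.DataFrame({
    "Recency": [5, 12, 200, 250, 30, 300],
    "Frequency": [20, 15, 2, 1, 6, 9],
    "MonetaryValue": [5000.0, 3200.0, 150.0, 80.0, 900.0, 4100.0],
    "Cluster": [0, 0, 1, 1, 2, 3],
})

# Average RFM values and customer count per cluster
profile = rfm_demo.groupby("Cluster").agg(
    Recency=("Recency", "mean"),
    Frequency=("Frequency", "mean"),
    MonetaryValue=("MonetaryValue", "mean"),
    Customers=("Recency", "size"),
)
print(profile.round(1))
```

In this made-up table, cluster 0 has low Recency and high Frequency, while cluster 3 combines high spend with long inactivity, the pattern flagged above for retention targeting.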

🚀 How to Use

  • Clone this repository:
    • git clone https://github.com/sgandhi797/project-python-customer-segmentation-churn-prediction.git
    • cd project-python-customer-segmentation-churn-prediction
  • Install requirements:
    • Download Anaconda Navigator
    • Install Jupyter Notebook from the Navigator
  • Open and run the notebook:
    • jupyter notebook/Project - Jupyter Notebook - Customer Segmentation and Churn Prediction.ipynb

📚 Tools & Technologies

  • Python 3
  • Pandas and NumPy for data handling
  • Matplotlib and Seaborn for visualization
  • Scikit-learn for clustering and classification
  • Jupyter Notebook for interactive analysis

📄 License

  • This project is licensed under the MIT License.
