This project focuses on predicting and analyzing customer churn in the telecom industry using Databricks. By leveraging big data processing, machine learning, and real-time dashboards, we provide actionable insights to help businesses reduce churn and improve customer retention strategies.
- Databricks (Apache Spark, PySpark, Databricks SQL)
- Python (pandas, NumPy, scikit-learn, XGBoost)
- SQL (Data querying and transformation)
- MLflow (Model tracking and drift monitoring)
- Power BI (Data visualization and dashboarding)
- Source: Telco Customer Churn Dataset
- Description: Customer demographics, subscription details, service usage, and churn status.
- Key Features:
  - CustomerID: Unique customer identifier
  - Tenure, Contract, Payment Method: Subscription details
  - MonthlyCharges, TotalCharges: Financial details
  - Churn: Target variable (Yes/No)
- Data ingestion, cleaning, and transformation using Apache Spark
- Handling missing values & data type conversion
- Feature engineering (categorical encoding, scaling)
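
The preparation steps above can be sketched in plain Python so the logic is visible without a cluster. This is an illustrative sketch, not the project's actual Spark code: the sample rows, the median imputation choice, and the helper names `to_float` and `prepare` are assumptions.

```python
from statistics import median

# Hypothetical sample rows; the blank TotalCharges mimics a missing value.
RAW_ROWS = [
    {"CustomerID": "A1", "Contract": "Month-to-month",
     "MonthlyCharges": "70.0", "TotalCharges": "140.0", "Churn": "Yes"},
    {"CustomerID": "A2", "Contract": "Two year",
     "MonthlyCharges": "20.0", "TotalCharges": " ", "Churn": "No"},
    {"CustomerID": "A3", "Contract": "One year",
     "MonthlyCharges": "45.0", "TotalCharges": "540.0", "Churn": "No"},
]


def to_float(text):
    """Return a float, or None when the field is blank or not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def prepare(rows):
    """Type-convert, impute, scale, and one-hot encode the raw rows."""
    # Data type conversion: string columns become numbers.
    totals = [to_float(r["TotalCharges"]) for r in rows]
    monthly = [float(r["MonthlyCharges"]) for r in rows]

    # Missing values: fill with the median of the observed values (assumed choice).
    fill = median(t for t in totals if t is not None)

    # Scaling: min-max scale MonthlyCharges into [0, 1].
    lo, hi = min(monthly), max(monthly)
    span = (hi - lo) or 1.0  # avoid division by zero when all values match

    # Categorical encoding: one indicator column per Contract value.
    contracts = sorted({r["Contract"] for r in rows})

    prepared = []
    for row, total, charge in zip(rows, totals, monthly):
        features = {
            "TotalCharges": fill if total is None else total,
            "MonthlyCharges_scaled": (charge - lo) / span,
            "Churn": 1 if row["Churn"] == "Yes" else 0,
        }
        for contract in contracts:
            features[f"Contract_{contract}"] = 1 if row["Contract"] == contract else 0
        prepared.append(features)
    return prepared


PREPARED = prepare(RAW_ROWS)
```

On Databricks the same steps would normally be expressed with Spark DataFrame operations; the pure-Python form is only meant to show what each step does to a row.
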
- Models Used: Logistic Regression, Random Forest, XGBoost
- Best Model: XGBoost (AUC-ROC: 0.85)
- Hyperparameter tuning using Grid Search
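
To make the AUC-ROC metric and the grid search concrete, here is a minimal sketch. AUC-ROC is computed from its pairwise definition (the share of churner/non-churner pairs in which the churner gets the higher score), and the grid search simply tries every parameter combination. The function names and the toy scoring function are assumptions; in a real run the score would be a cross-validated AUC from the trained model.

```python
from itertools import product


def auc_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and keep the best score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for a cross-validated AUC; peaks at depth 4, rate 0.1."""
    return -((params["max_depth"] - 4) ** 2) - (params["learning_rate"] - 0.1) ** 2


EXAMPLE_AUC = auc_roc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
BEST_PARAMS, BEST_SCORE = grid_search(
    {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]},
    toy_score,
)
```

The pairwise loop is quadratic in the number of rows, which is fine for illustration; library implementations use a sort-based computation instead.
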
- Kaplan-Meier Curve: Customer retention probabilities over time
- Cox Proportional Hazards Model: Identified churn risk factors
- A/B Testing: Evaluated retention strategies (discounts, offers, incentives)
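
As a rough illustration of what the Kaplan-Meier curve computes, the sketch below estimates retention probability over tenure from a list of durations and churn flags. Customers who have not churned are treated as censored. The function name and the sample data are assumptions, not taken from the project.

```python
def kaplan_meier(durations, churned):
    """Return [(t, S(t))] for each time t at which at least one churn occurred.

    durations: tenure (e.g. months) per customer.
    churned:   1 if the customer churned at that tenure, 0 if still active (censored).
    """
    events = sorted(zip(durations, churned))
    at_risk = len(events)  # customers still observed just before time t
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        churn_count = 0
        leaving = 0
        # Group every customer whose duration equals t.
        while i < len(events) and events[i][0] == t:
            churn_count += events[i][1]
            leaving += 1
            i += 1
        if churn_count:
            # Product-limit step: multiply by the share that survived time t.
            survival *= 1 - churn_count / at_risk
            curve.append((t, survival))
        # Churned and censored customers both leave the risk set after t.
        at_risk -= leaving
    return curve


# Hypothetical tenures in months with churn flags.
CURVE = kaplan_meier([1, 2, 2, 3, 5], [1, 1, 0, 1, 0])
```

With this sample, the estimated retention probability drops at months 1, 2, and 3; the customer censored at month 5 adds no step but does count toward the risk set until then.
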
- Drift detection using MLflow tracking
- Kolmogorov-Smirnov & Chi-Square tests for feature distribution monitoring
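
The two drift tests can be sketched from their definitions. The Kolmogorov-Smirnov statistic is the largest gap between the empirical CDFs of a numeric feature in a reference sample and a current sample; the chi-square statistic compares category counts between the two samples. Function names and sample values are assumptions, and only the test statistics are computed here, not p-values.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: max absolute gap between the empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    gap = 0.0
    # Both CDFs are step functions, so the max gap occurs at an observed value.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def chi_square_statistic(ref_counts, cur_counts):
    """Chi-square statistic for a 2 x k table of category counts."""
    categories = sorted(set(ref_counts) | set(cur_counts))
    ref_total = sum(ref_counts.values())
    cur_total = sum(cur_counts.values())
    grand_total = ref_total + cur_total
    statistic = 0.0
    for category in categories:
        observed_ref = ref_counts.get(category, 0)
        observed_cur = cur_counts.get(category, 0)
        column_total = observed_ref + observed_cur
        if column_total == 0:
            continue  # category never observed; contributes nothing
        expected_ref = ref_total * column_total / grand_total
        expected_cur = cur_total * column_total / grand_total
        statistic += (observed_ref - expected_ref) ** 2 / expected_ref
        statistic += (observed_cur - expected_cur) ** 2 / expected_cur
    return statistic


# Hypothetical MonthlyCharges samples: training window vs. recent scoring window.
KS_SHIFT = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
# Hypothetical Contract counts in the same two windows.
CHI2_SAME = chi_square_statistic({"Month-to-month": 10, "One year": 10},
                                 {"Month-to-month": 10, "One year": 10})
```

A statistic near zero suggests the feature distribution is stable; larger values would be compared against a critical value or p-value threshold chosen for the monitoring job.
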
- KPIs Tracked: Overall Churn Rate, Monthly Recurring Revenue, Customer Retention Rate
- Visualizations: Heatmap, Line Chart, Scatter Plot, Pivot Table
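
For reference, the three KPIs could be computed along these lines. The definitions are assumptions, since the dashboard's exact formulas are not stated: churn rate is churned customers over all customers, retention rate is its complement, and Monthly Recurring Revenue sums `MonthlyCharges` over customers who have not churned.

```python
def compute_kpis(customers):
    """Return churn rate, retention rate, and MRR for a list of customer rows."""
    total = len(customers)
    if total == 0:
        raise ValueError("no customers to summarize")
    churned = sum(1 for c in customers if c["Churn"] == "Yes")
    churn_rate = churned / total
    # Assumed definition: only active (non-churned) customers contribute to MRR.
    mrr = sum(c["MonthlyCharges"] for c in customers if c["Churn"] == "No")
    return {
        "churn_rate": churn_rate,
        "retention_rate": 1 - churn_rate,
        "mrr": mrr,
    }


# Hypothetical customer rows.
KPIS = compute_kpis([
    {"MonthlyCharges": 50.0, "Churn": "No"},
    {"MonthlyCharges": 30.0, "Churn": "No"},
    {"MonthlyCharges": 20.0, "Churn": "No"},
    {"MonthlyCharges": 70.0, "Churn": "Yes"},
])
```
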
✅ Month-to-month contracts have the highest churn
✅ Customers with high monthly charges are more likely to churn
✅ Long-term contracts improve customer retention
✅ Early churn can be predicted using survival analysis
✅ Retention strategies (discounts, offers) significantly reduce churn
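
One common way to check whether a retention strategy reduces churn significantly is a two-proportion z-test on the control and treatment groups. The sketch below is an assumption about how such a check might look, not the project's actual A/B analysis; the counts are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z_test(churned_a, n_a, churned_b, n_b):
    """Return (z, two-sided p-value) for the difference in churn rates A - B."""
    rate_a = churned_a / n_a
    rate_b = churned_b / n_b
    # Pooled churn rate under the null hypothesis of no difference.
    pooled = (churned_a + churned_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_error
    # Standard normal CDF via the error function.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value


# Hypothetical counts: 30 of 100 control customers churned, 15 of 100 with a discount.
Z_SCORE, P_VALUE = two_proportion_z_test(30, 100, 15, 100)
```

A small p-value would support the claim that the difference in churn between the two groups is unlikely to be due to chance alone.
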
- Upload dataset to Databricks File Store
- Load data using Apache Spark (`pyspark.sql`)
- Execute EDA, Feature Engineering, and ML model training
- Run churn predictions and evaluate models
- Deploy trained model for real-time churn prediction
- Use Power BI or Databricks SQL Dashboard for visualization
📌 Automate churn prediction pipeline for real-time scoring
📌 Integrate customer feedback analysis using NLP
📌 Expand A/B testing for personalized retention offers
📌 Deploy model using REST API for business integration
This project demonstrates a data-driven approach to predicting customer churn using Databricks and machine learning. The insights help businesses optimize retention strategies and reduce churn rates, and the dashboard presents them through real-time visualization to support decision-making.
👤 Sayan Kashyap
🔗 [LinkedIn](https://www.linkedin.com/in/sayankashyap/)
Feel free to open issues, submit PRs, or suggest improvements! 🚀