A comprehensive machine learning project to analyze customer churn patterns and predict customer attrition using Python, SQL, and Scikit-learn. The project involves data preprocessing, exploratory analysis, feature engineering, model training, hyperparameter tuning, and evaluation.
Customer churn prediction is crucial for businesses aiming to retain customers and reduce revenue loss. This project processes telecom customer data to build a predictive model that accurately identifies customers likely to churn. It helps stakeholders focus retention efforts efficiently.
- Data Preprocessing: Handles missing values, categorical encoding, and feature scaling.
- Feature Engineering: Creation of new features such as average charges per month.
- Model Training: Uses Logistic Regression, Random Forest, and Gradient Boosting with hyperparameter tuning.
- Evaluation: Metrics including Accuracy, Precision, Recall, F1-score, and ROC AUC.
- Visualization: Important feature ranking and ROC curve for model performance visualization.
The feature importance plot illustrates the top 10 variables that most influence the model's churn prediction. This ranking is derived from the Random Forest classifier's feature importance scores, highlighting which factors contribute most to customer attrition. Key observations:
- Contract type is the most significant factor, indicating customers with month-to-month contracts have a higher risk of churn.
- Monthly Charges and Tenure also play major roles, where higher charges and shorter tenure correlate with churn likelihood.
- Payment Method and Internet Service Type impact churn, reflecting customer preferences and service satisfaction.
- This insight helps business strategists prioritize retention efforts on critical customer segments.

The Receiver Operating Characteristic (ROC) curve measures the classifier’s ability to distinguish between churners and non-churners at various threshold settings:
- The ROC AUC (Area Under Curve) value of approximately 0.85 shows strong predictive power.
- The curve demonstrates a good balance of True Positive Rate (sensitivity) vs. False Positive Rate, indicating the model effectively identifies most churn cases while minimizing false alarms.
- This plot guides threshold selection for real-world deployment to optimize between precision and recall depending on business needs.

-
Clone the repo and navigate to the project root.
-
Install dependencies:
pip install -r requirements.txt
-
Run the analysis script:
python src/churn_analysis.py
-
View model results and visualizations.
- Python 3.6+
- pandas, numpy, matplotlib, seaborn
- scikit-learn
- Jupyter Notebook (optional for exploratory analysis)
This project delivers a robust churn prediction model backed by clear visualizations, providing actionable business insights. Feature importance guides targeted interventions, and the ROC curve validates model reliability. It serves as a practical tool for telecom companies to reduce churn and increase customer lifetime value.