This project applies machine learning to predict customer churn for SyriaTel, a leading telecommunications provider. It follows the full data science pipeline, from data exploration and preprocessing through model development to actionable recommendations, enabling the business to proactively identify at-risk customers and reduce churn.
Customer churn presents a critical threat to SyriaTel’s revenue and customer base. Retaining existing customers is far more cost-effective than acquiring new ones. High churn leads to revenue loss, increased customer acquisition costs, and lower lifetime value.
SyriaTel seeks a data-driven churn prediction model that can:
- Identify customers likely to churn
- Uncover behavioral and service-related churn drivers
- Guide personalized, cost-effective retention actions
This project aims to support SyriaTel’s retention strategy by answering:
- Which customer behaviors and service patterns predict churn?
- Can we identify segments at higher risk of leaving?
- What proactive actions can reduce churn based on model insights?
- Removed non-predictive identifier columns: `phone number`, `area code`, and `state`
- Converted binary categorical variables (`international plan`, `voice mail plan`) to numeric
- Checked and confirmed the absence of null values
- Created dummy variables for categorical features
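As a rough sketch, the cleaning steps above could look like the following in pandas. The column names mirror the SyriaTel schema, but the sample values are illustrative, not actual data:

```python
import pandas as pd

# Tiny illustrative sample mirroring the SyriaTel schema
df = pd.DataFrame({
    "state": ["KS", "OH"],
    "area code": [415, 408],
    "phone number": ["382-4657", "371-7191"],
    "international plan": ["no", "yes"],
    "voice mail plan": ["yes", "no"],
    "total day minutes": [265.1, 161.6],
    "churn": [False, True],
})

# Drop identifier-like columns that carry no predictive signal
df = df.drop(columns=["phone number", "area code", "state"])

# Map binary yes/no plan columns to 0/1
for col in ["international plan", "voice mail plan"]:
    df[col] = df[col].map({"no": 0, "yes": 1})

# Confirm there are no missing values before modeling
assert df.isna().sum().sum() == 0
```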
EDA was conducted to understand distributions, spot imbalances, and detect patterns:
- Target Variable: Dataset is moderately imbalanced, with ~85.5% non-churned and ~14.5% churned customers.
- Categorical Features: Strong relationships found between churn and features such as:
  - International plan: Users with international plans had significantly higher churn rates.
  - Customer service calls: More calls often correlated with dissatisfaction.
- Numerical Features:
  - High day charges and long day minutes were associated with increased churn risk.
- Visualizations (e.g., histograms and boxplots) revealed useful churn signals.
- Correlation Heatmap: Identified multicollinearity between features such as `total_day_minutes` and `total_day_charge`.
- Multicollinearity Check:
  - Applied the Variance Inflation Factor (VIF) to drop highly correlated features; for example, dropped `total_day_charge` in favor of `total_day_minutes`.
- Class Imbalance Handling:
  - Applied SMOTE (Synthetic Minority Over-sampling Technique) to balance the target variable.
- Encoding:
  - Label encoding was used for binary categorical features such as `churn`.
Tableau Dashboard:
View interactive visual insights here:
Tableau Dashboard
The following models were trained and evaluated:
| Model | Accuracy | Recall | Precision | ROC AUC |
|---|---|---|---|---|
| Logistic Regression | 0.681 | 0.680 | 0.266 | 0.768 |
| Decision Tree | 0.882 | 0.732 | 0.573 | 0.837 |
| Tuned Decision Tree | 0.873 | 0.680 | 0.550 | 0.821 |
| Random Forest | 0.882 | 0.608 | 0.590 | 0.860 |
| Tuned Random Forest | 0.894 | 0.660 | 0.627 | 0.877 |
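The training-and-evaluation loop behind the table above could be sketched as follows. Synthetic stand-in data is used here, so the scores will not match the table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn features (~85/15 imbalance)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.85], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    scores[name] = {
        "recall": recall_score(y_te, model.predict(X_te)),
        "roc_auc": roc_auc_score(y_te, proba),
    }
print(scores)
```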
While the Tuned Random Forest slightly outperformed on ROC AUC, the Decision Tree model was selected for its:
- High recall (important for identifying churners),
- Competitive performance,
- Interpretability.
- Optimal Threshold: 0.421
- Default Recall: 0.732
- Improved Recall: 0.753
- Improved F1 Score: 0.655
Confusion Matrix (Optimized Threshold):

```
[[517  53]
 [ 24  73]]
```

Classification Report (Optimized Threshold):

```
              precision    recall  f1-score   support

           0       0.96      0.91      0.93       570
           1       0.58      0.75      0.65        97
```
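The threshold-tuning step amounts to sweeping candidate cutoffs over the model's predicted probabilities and keeping the one that maximizes F1. A sketch on synthetic data (so the resulting cutoff will not match 0.421 exactly):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Sweep candidate thresholds and keep the one that maximizes F1
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_te, (proba >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(round(best, 2), round(max(f1s), 3))
```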
- Annual Revenue at Risk: $59,243.04
- False Positive Costs: $2,650.00
- Missed Churn Revenue: $19,056.84
- Recall Rate After Threshold Tuning: 75.3%
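The figures above follow from simple arithmetic over the optimized-threshold confusion matrix. The per-customer retention-offer cost and monthly revenue below are assumed for illustration and are not SyriaTel's actual unit economics:

```python
# Hypothetical unit economics (illustrative values only)
monthly_revenue_per_customer = 60.0   # assumed average revenue
retention_offer_cost = 50.0           # assumed cost per false-positive outreach

# Counts from the optimized-threshold confusion matrix: [[517, 53], [24, 73]]
false_positives = 53
false_negatives = 24
true_positives = 73

false_positive_cost = false_positives * retention_offer_cost
missed_churn_revenue = false_negatives * monthly_revenue_per_customer * 12
recall = true_positives / (true_positives + false_negatives)
print(false_positive_cost, round(missed_churn_revenue, 2), round(recall, 3))
```

Note that the recall of 75.3% falls straight out of the matrix (73 caught of 97 actual churners); the dollar figures scale with whatever unit costs are assumed.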
- Target Variable Distribution: Shows the imbalance in churn vs. non-churn customers.
- Top 10 Features Most Correlated with Churn: Highlights which variables have the strongest relationships with churn.
- Top 10 Feature Importances from the Decision Tree: Demonstrates which features had the most predictive power in the Decision Tree model.
- Target Customers With International Plans: Offer bundled or discounted international rates to reduce churn in this high-risk group.
- Address High Customer Support Interaction Early: Use follow-ups and satisfaction surveys to retain users with more than 3 support calls.
- Personalize Strategies for Heavy Day/Night Callers: Loyalty bonuses or discounted rates can reduce churn in high-usage segments.
- Leverage Churn Predictions in CRM: Integrate the model into CRM workflows to trigger personalized offers in real time.
- Prioritize High-Risk Users: Focus retention resources on customers with high day usage and frequent support calls.
- Regional & Onboarding Strategy: Focus on churn-heavy states such as NJ, CA, and TX with improved onboarding experiences.
- Incentivize High-Charge Users: Implement personalized or tiered pricing for customers with high monthly bills.
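One way the CRM integration could work, sketched on synthetic stand-in data: score each incoming batch of customers and flag those whose churn probability exceeds the tuned threshold (the `0.421` value reuses the project's optimal threshold; everything else is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Train on stand-in data, then score a "new" batch as a nightly CRM job might
X, y = make_classification(n_samples=500, weights=[0.85], random_state=1)
model = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X[:400], y[:400])

new_customers = X[400:]
risk = model.predict_proba(new_customers)[:, 1]

THRESHOLD = 0.421  # tuned decision threshold from the evaluation
flagged = np.flatnonzero(risk >= THRESHOLD)  # indices to route to retention offers
print(len(flagged), "of", len(new_customers), "customers flagged")
```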
- SMOTE May Cause Overfitting: Synthetic samples may reduce generalization without proper validation.
- Limited Model Diversity: Only three model families were explored; more could boost performance.
- Support Call Outcomes Missing: Call-resolution data could improve prediction accuracy.
- Model Drift Over Time: Telecom behavior evolves, so periodic retraining will be needed.
- Try Advanced Models: Explore XGBoost, LightGBM, and neural networks to enhance accuracy.
- A/B Test Retention Tactics: Validate model-informed strategies with real-world experiments.
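As a starting point for the gradient-boosting direction, here is a sketch using scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost/LightGBM, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.85], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

# Same fit/score interface as the earlier models, so it slots into
# the existing evaluation loop directly
gbm = GradientBoostingClassifier(random_state=7).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```

Swapping in the real XGBoost or LightGBM estimators later should require only changing the class, since both expose a scikit-learn-compatible API.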