An end-to-end Data Engineering + Data Science project
This project demonstrates how to build a decision-oriented data platform using a real-world open dataset.
Rather than focusing solely on predictive modeling, the project emphasizes the full lifecycle from raw data ingestion
to actionable business decisions under resource constraints.
The use case is customer churn prevention in a subscription-based telecom business.
Customer churn directly impacts recurring revenue.
While predictive models can estimate churn probability, business value is only realized when predictions are
translated into decisions.
Constraints:
- Retention actions (e.g. offers, calls) have a limited budget
- Not all high-risk customers are worth intervening on
- Data must be reliable, reproducible, and explainable
Goal
Design a system that:
- Produces stable, well-defined datasets
- Trains a churn prediction model
- Converts predictions into a resource-constrained intervention policy
- Evaluates expected business impact offline
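The step from predictions to a resource-constrained policy can be sketched as a simple greedy ranking. This is a minimal illustration, not the project's actual policy: the `Customer` fields, the `select_interventions` function, and the `success_rate` and `cost_per_action` parameters are all assumptions introduced here, and the four customers are made-up values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    customer_id: str
    churn_prob: float     # model-estimated churn probability (hypothetical field)
    monthly_value: float  # revenue at risk if the customer churns (hypothetical field)


def select_interventions(customers, budget, cost_per_action, success_rate):
    """Greedy policy: rank by expected net gain, act until the budget is spent.

    success_rate is an assumed probability that an action retains a customer
    who would otherwise churn; it is not estimated from data here.
    """
    scored = []
    for c in customers:
        expected_gain = c.churn_prob * success_rate * c.monthly_value - cost_per_action
        if expected_gain > 0:  # skip customers not worth the cost of acting
            scored.append((expected_gain, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    max_actions = int(budget // cost_per_action)
    return [c for _, c in scored[:max_actions]]


customers = [
    Customer("A", 0.90, 20.0),   # highest risk, but low value
    Customer("B", 0.60, 120.0),
    Customer("C", 0.30, 200.0),
    Customer("D", 0.80, 100.0),
]
chosen = select_interventions(
    customers, budget=20.0, cost_per_action=10.0, success_rate=0.5
)
print([c.customer_id for c in chosen])  # ['D', 'B']
```

Customer A has the highest churn probability but is never selected, because the expected gain does not cover the cost of the action. Ranking by expected gain rather than by raw churn probability is what separates the decision from the prediction.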
- Telco Customer Churn Dataset (open-source)
- Source: Kaggle / OpenML
- Granularity: one row per customer (snapshot)
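The one-row-per-customer granularity is the kind of property a data contract can check before anything reaches staging. The sketch below is a hedged illustration: the column names follow the commonly distributed CSV of this dataset (`customerID`, `tenure`, `MonthlyCharges`, `Churn`) and should be verified against the actual file, while `validate_snapshot` and the sample rows are hypothetical.

```python
import csv
import io

# Assumed subset of columns; check against the real file before relying on it.
REQUIRED_COLUMNS = {"customerID", "tenure", "MonthlyCharges", "Churn"}


def validate_snapshot(rows):
    """Return a list of contract violations (an empty list means clean)."""
    errors = []
    seen_ids = set()
    for i, row in enumerate(rows, start=1):
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        customer_id = row["customerID"]
        if customer_id in seen_ids:  # violates one row per customer
            errors.append(f"row {i}: duplicate customerID {customer_id}")
        seen_ids.add(customer_id)
        if row["Churn"] not in {"Yes", "No"}:
            errors.append(f"row {i}: unexpected Churn value {row['Churn']!r}")
    return errors


# Made-up sample with one duplicate ID and one invalid label.
sample = """customerID,tenure,MonthlyCharges,Churn
0001-AAAA,12,29.85,No
0002-BBBB,3,70.70,Yes
0001-AAAA,12,29.85,No
0003-CCCC,5,55.00,Maybe
"""
errors = validate_snapshot(csv.DictReader(io.StringIO(sample)))
for message in errors:
    print(message)
```

Returning a list of violations instead of raising on the first one makes it easier to report every problem in a single ingestion run.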
Telco-Customer-Churn/
├── data/
│ ├── raw/ # Original dataset (immutable)
│ ├── staging/ # Cleaned, typed data
│ ├── features/ # Business features
│ └── artifacts/ # Models, scores, decisions
│
├── src/
│ ├── ingestion/ # Raw → staging
│ ├── features/ # Feature construction
│ ├── datasets/ # Train / inference datasets
│ ├── models/ # Prediction models
│ ├── decision/ # Decision policies
│ └── evaluation/ # Offline evaluation
│
├── docs/
│ ├── data_contracts.md
│ ├── feature_definitions.md
│ ├── decision_policy.md
│ └── evaluation.md
│
├── pipelines/
│ └── run_all.sh
│
└── README.md
- Data Engineering
  - Schema definition and validation
  - Separation of raw, staging, and feature layers
  - Reproducible dataset construction
- Data Science
  - Baseline churn prediction model
  - Transparent evaluation (ROC-AUC, Precision@K)
- Decision Science
  - Budget-constrained intervention policy
  - Ranking-based decision making
  - Offline value estimation
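Precision@K fits a budget-constrained setting because it measures model quality only among the K customers that can actually be contacted. A minimal sketch follows, using a hypothetical `precision_at_k` helper and made-up labels and scores:

```python
def precision_at_k(y_true, scores, k):
    """Fraction of actual churners among the k highest-scored customers."""
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have the same length")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of customers")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# Made-up example: 1 = churned, 0 = stayed.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

p_at_3 = precision_at_k(y_true, scores, k=3)
base_rate = sum(y_true) / len(y_true)
print(f"Precision@3 = {p_at_3:.3f}, base rate = {base_rate:.3f}")
```

Comparing Precision@K to the base churn rate shows how much better the ranking is than contacting customers at random.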
- Prediction vs Decision
- Data contracts and schema stability
- Feature engineering with business meaning
- Resource-constrained decision policies
- Offline evaluation of strategies
This project is designed for learning and demonstration purposes.
The dataset is a static snapshot and does not include real intervention outcomes; therefore,
causal impact is approximated rather than identified.
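One way such an approximation might look is sketched below. It is an illustration under strong assumptions, not the project's evaluation code: `success_rate` (the chance that an action retains a customer who would otherwise churn) and `cost` are assumed constants, the `expected_value` helper is hypothetical, and all numbers are made up.

```python
def expected_value(targeted, churned, value, success_rate, cost):
    """Approximate net value of acting on the targeted customers.

    Only customers who actually churned can be 'saved'; every targeted
    customer incurs the action cost. success_rate is an assumption,
    not an estimate from intervention data.
    """
    total = 0.0
    for i in targeted:
        saved = success_rate * value[i] if churned[i] else 0.0
        total += saved - cost
    return total


# Made-up held-out data: observed churn labels, customer value, model scores.
churned = [1, 0, 1, 0, 0, 1]
value = [100.0, 80.0, 60.0, 90.0, 50.0, 40.0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
k = 3
n = len(scores)

# Policy: act on the k highest-scored customers.
top_k = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
policy_value = expected_value(top_k, churned, value, success_rate=0.5, cost=10.0)

# Baseline: expected value of picking k customers uniformly at random,
# which by linearity is k/n of the value of acting on everyone.
all_value = expected_value(range(n), churned, value, success_rate=0.5, cost=10.0)
random_value = all_value * k / n

print(policy_value, random_value)  # 50.0 20.0
```

Because the success rate is assumed rather than observed, the result is best read as a comparison between policies under the same assumptions, not as a measured causal effect.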