Binary classification model that predicts diabetes onset from the Pima Indians Diabetes Database using a tuned CART (Classification and Regression Trees) decision tree.
Given 8 diagnostic measurements, the model predicts whether a patient has diabetes (Outcome = 1) or not (Outcome = 0). The dataset contains 768 observations.
| Feature | Description |
|---|---|
| Pregnancies | Number of pregnancies |
| Glucose | Plasma glucose concentration |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skin fold thickness (mm) |
| Insulin | 2-Hour serum insulin (mu U/ml) |
| BMI | Body mass index |
| DiabetesPedigreeFunction | Diabetes pedigree function |
| Age | Age in years |
| Outcome | Target: 1 = diabetic, 0 = non-diabetic |
- EDA — Distribution analysis, target balance check, feature correlations
- Missing Value Handling — Zeros treated as missing; imputed with class-conditional medians
- Feature Engineering — BMI clinical categories (Underweight → Obesity 3), Insulin normality flag
- Outlier Handling — IQR-based winsorization
- Encoding — Label encoding for binary, rare encoding, one-hot encoding for multi-category
- Model — CART Decision Tree with GridSearchCV (5-fold CV) over max_depth and min_samples_split
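The missing-value and outlier steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `helpers/data_prep.py`: the function names, the list of columns where a zero counts as missing, and the 1.5 × IQR fences are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Assumption: columns where a zero is physiologically implausible.
# The project may use a different list.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def impute_class_medians(df, cols=ZERO_AS_MISSING, target="Outcome"):
    """Replace zeros with NaN, then fill each NaN with the median of its target class."""
    out = df.copy()
    out[cols] = out[cols].replace(0, np.nan)
    for col in cols:
        # transform("median") returns the per-class median aligned to every row
        class_median = out.groupby(target)[col].transform("median")
        out[col] = out[col].fillna(class_median)
    return out


def winsorize_iqr(df, cols, k=1.5):
    """Clip each column to [Q1 - k*IQR, Q3 + k*IQR] (hypothetical default k=1.5)."""
    out = df.copy()
    for col in cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out


# Tiny made-up frame, only to show the calls
demo = pd.DataFrame({"Glucose": [0, 100, 120, 0, 150, 170],
                     "Outcome": [0, 0, 0, 1, 1, 1]})
imputed = impute_class_medians(demo, cols=["Glucose"])
clipped = winsorize_iqr(pd.DataFrame({"x": [1, 2, 3, 4, 100]}), cols=["x"])
```

Because the medians are computed per target class, this kind of imputation uses label information; if the split into train and test happens afterwards, the reported scores may be optimistic.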
| Model | Accuracy | ROC-AUC |
|---|---|---|
| CART (default) | ~72% | — |
| CART (tuned via GridSearchCV) | ~75% | Computed via roc_auc_score |
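A tuned result like the one in the table could be obtained along these lines. The sketch runs on synthetic data from `make_classification` so that it is self-contained; the grid ranges, the 80/20 split, and the random seeds are assumptions for illustration and are not taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features (768 rows, 8 columns)
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Hypothetical search ranges for the two tuned hyperparameters
param_grid = {
    "max_depth": list(range(1, 11)),
    "min_samples_split": list(range(2, 21, 2)),
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_tree = search.best_estimator_
accuracy = accuracy_score(y_test, best_tree.predict(X_test))
# ROC-AUC needs scores or probabilities, not hard class labels
auc = roc_auc_score(y_test, best_tree.predict_proba(X_test)[:, 1])

print(search.best_params_, round(accuracy, 3), round(auc, 3))
```

Scoring the held-out split rather than the training data keeps the accuracy and ROC-AUC comparable between the default and tuned trees.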
Why CART? A single decision tree can be read as a sequence of if/then rules, which matters in healthcare applications where clinicians need to understand the reasoning behind each prediction.
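To make the interpretability point concrete, a fitted tree can be printed as plain-text rules. The sketch below uses scikit-learn's `export_text` instead of the pydotplus rendering used by the project, and fits a shallow tree on synthetic data, so the thresholds it prints carry no clinical meaning.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Synthetic data with the same number of columns as the real dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A depth-2 tree keeps the printed rule set short enough to read at a glance
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each printed branch is a threshold test on one feature, ending in a predicted class, which is the form a reviewer can check against domain knowledge.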
| Feature | Description |
|---|---|
| NewBMI | BMI bucketed into 6 clinical categories |
| New_Insulin | Insulin flagged as Normal (16–166) or Abnormal |
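The two engineered features could be derived roughly like this. The insulin range of 16 to 166 comes from the table above; the BMI cut points (18.5, 25, 30, 35, 40), the intermediate category labels, and the function name `add_features` are assumptions based on the commonly used clinical BMI classes, and the project's own thresholds may differ.

```python
import numpy as np
import pandas as pd

# Assumed cut points; intervals are left-closed, e.g. [18.5, 25) is "Normal"
BMI_BINS = [-np.inf, 18.5, 25, 30, 35, 40, np.inf]
BMI_LABELS = ["Underweight", "Normal", "Overweight",
              "Obesity 1", "Obesity 2", "Obesity 3"]


def add_features(df):
    """Add NewBMI (6 categories) and New_Insulin (Normal/Abnormal) columns."""
    out = df.copy()
    out["NewBMI"] = pd.cut(out["BMI"], bins=BMI_BINS, labels=BMI_LABELS,
                           right=False)
    # between() is inclusive on both ends by default: 16 <= Insulin <= 166
    out["New_Insulin"] = np.where(out["Insulin"].between(16, 166),
                                  "Normal", "Abnormal")
    return out


# Made-up rows, one per BMI category
demo = pd.DataFrame({"BMI": [17, 22, 27, 32, 37, 45],
                     "Insulin": [10, 16, 100, 166, 200, 50]})
featured = add_features(demo)
print(featured)
```

`NewBMI` comes out as a pandas categorical, so it can go straight into one-hot encoding, while `New_Insulin` is a two-valued column suited to label encoding.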
- Python 3.8+ — Core language
- Scikit-learn — DecisionTreeClassifier, GridSearchCV, metrics
- Pandas / NumPy — Data manipulation
- Seaborn / Matplotlib — Visualization
- pydotplus — Decision tree visualization
Diabetes_Prediction_ML_CART/
├── helpers/
│ ├── __init__.py
│ ├── data_prep.py # Outlier handling, imputation, encoding utilities
│ └── eda.py # EDA summary and visualization functions
├── Diabetes_prediction_CART.py # Main ML pipeline
├── requirements.txt
└── README.md
git clone https://github.com/eboekenh/Diabetes_Prediction_ML_CART.git
cd Diabetes_Prediction_ML_CART
pip install -r requirements.txt
Download diabetes.csv from the Kaggle dataset and place it in the project root.
python Diabetes_prediction_CART.py
License: MIT