🩺 Diabetes Prediction with CART Decision Tree

Binary classification model predicting diabetes onset using the Pima Indians Diabetes Database and a tuned CART (Classification and Regression Trees) decision tree.

Problem Statement

Given 8 diagnostic measurements, predict whether a patient has diabetes (Outcome = 1) or not (Outcome = 0). The dataset is the Pima Indians Diabetes Database containing 768 observations.

Dataset Features

| Feature | Description |
|---|---|
| Pregnancies | Number of pregnancies |
| Glucose | Plasma glucose concentration 2 hours into an oral glucose tolerance test |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skin fold thickness (mm) |
| Insulin | 2-hour serum insulin (µU/mL) |
| BMI | Body mass index (kg/m²) |
| DiabetesPedigreeFunction | Diabetes pedigree function (a score based on family history of diabetes) |
| Age | Age in years |
| Outcome | Target: 1 = diabetic, 0 = non-diabetic |

Approach

EDA: distribution analysis, target balance check, feature correlations
Missing Value Handling: zeros are treated as missing and imputed with class-conditional medians (the median of the feature within each Outcome class)
Feature Engineering: BMI clinical categories (Underweight → Obesity 3) and an Insulin normality flag
Outlier Handling: IQR-based winsorization (values beyond the IQR fences are clipped to the fences)
Encoding: label encoding for binary columns, rare-category encoding, and one-hot encoding for multi-category columns
Model: CART decision tree tuned with GridSearchCV (5-fold CV) over max_depth and min_samples_split
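
The imputation and outlier steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `helpers/data_prep.py`: the function names, the list of columns where zero counts as missing, and the 1.5 × IQR fence with 25th/75th percentiles are all assumptions, and the toy rows are made up.

```python
import numpy as np
import pandas as pd

# Assumption: columns where a zero is physiologically implausible.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def impute_zeros_by_class(df, columns=ZERO_AS_MISSING, target="Outcome"):
    """Turn zeros into NaN, then fill each NaN with the median of its Outcome class."""
    out = df.copy()
    out[columns] = out[columns].replace(0, np.nan)
    for col in columns:
        # transform("median") returns one value per row: the median of that row's class.
        class_median = out.groupby(target)[col].transform("median")
        out[col] = out[col].fillna(class_median)
    return out


def winsorize_iqr(series, k=1.5, q_low=0.25, q_high=0.75):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those fences."""
    q1, q3 = series.quantile(q_low), series.quantile(q_high)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up rows for illustration only (not taken from diabetes.csv).
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 0, 116],
    "BloodPressure": [72, 66, 64, 66, 40, 74],
    "SkinThickness": [35, 29, 0, 23, 35, 0],
    "Insulin": [0, 94, 0, 94, 168, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

clean = impute_zeros_by_class(toy)
clean["Insulin"] = winsorize_iqr(clean["Insulin"])
print(clean)
```

One design point worth keeping in mind: class-conditional medians use the target column, so if this approach is used, the medians would normally be computed on the training split only and cannot be applied as-is to unlabeled patients.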

Results

| Model | Accuracy | ROC-AUC |
|---|---|---|
| CART (default) | ~72% | not reported |
| CART (tuned via GridSearchCV) | ~75% | computed via `roc_auc_score` (value not reported here) |
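
A tuned tree like the one in the table could be obtained with a search along these lines. This is a sketch under stated assumptions: it runs on synthetic data from `make_classification` as a stand-in for `diabetes.csv`, and the grid ranges, the 80/20 split, and the random seeds are hypothetical, since only the tuned parameter names and the 5-fold setting are given above. The scores it prints will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the dataset's shape (768 rows, 8 features).
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical ranges for the two tuned hyperparameters.
param_grid = {
    "max_depth": list(range(1, 11)),
    "min_samples_split": list(range(2, 21, 2)),
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_tree = search.best_estimator_
y_pred = best_tree.predict(X_test)
# ROC-AUC needs scores, not hard labels: use the probability of class 1.
y_prob = best_tree.predict_proba(X_test)[:, 1]

print("Best params:", search.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))
```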

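Why CART? A single decision tree can be read as a sequence of if/then threshold rules, so every prediction can be traced along one root-to-leaf path. That transparency matters in healthcare applications, where clinicians need to understand the reasoning behind a prediction.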
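
To make the interpretability point concrete, a fitted tree can be printed as plain-text rules. The sketch below uses scikit-learn's `export_text` rather than the pydotplus rendering listed in the tech stack, and it fits on synthetic data, so the feature names are only labels borrowed from the dataset and the thresholds carry no clinical meaning. The depth limit of 3 is an arbitrary choice for readability.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

# Synthetic stand-in for diabetes.csv (same shape, unrelated values).
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)

# A shallow tree keeps the printed rule set short enough to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each indented line is one threshold test; leaves show the predicted class.
rules = export_text(tree, feature_names=FEATURES)
print(rules)
```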

Engineered Features

| Feature | Description |
|---|---|
| `NewBMI` | BMI bucketed into 6 clinical categories |
| `New_Insulin` | Insulin flagged as Normal (16–166) or Abnormal |
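
The two engineered columns could be derived along these lines. Treat this as a sketch: the BMI cut points are the commonly used WHO-style thresholds and are an assumption (only the number of categories and the end labels are given above), the intermediate label names are guesses, and the 16–166 insulin range is assumed to be inclusive at both ends.

```python
import numpy as np
import pandas as pd

# Assumed cut points; intervals are left-closed, e.g. [18.5, 25) is "Normal".
BMI_BINS = [0, 18.5, 25, 30, 35, 40, np.inf]
BMI_LABELS = [
    "Underweight", "Normal", "Overweight",
    "Obesity 1", "Obesity 2", "Obesity 3",
]


def add_engineered_features(df):
    """Return a copy of df with NewBMI and New_Insulin columns added."""
    out = df.copy()
    out["NewBMI"] = pd.cut(out["BMI"], bins=BMI_BINS, labels=BMI_LABELS, right=False)
    # between() is inclusive on both ends by default.
    out["New_Insulin"] = np.where(out["Insulin"].between(16, 166), "Normal", "Abnormal")
    return out


# Made-up values chosen to hit every category and both range boundaries.
toy = pd.DataFrame({
    "BMI": [17.0, 22.0, 27.5, 33.6, 38.0, 43.1],
    "Insulin": [15, 16, 100, 166, 167, 94],
})
toy = add_engineered_features(toy)
print(toy)
```

`NewBMI` comes back as an ordered categorical, which keeps the category order available for the one-hot encoding step.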

Tech Stack

Python 3.8+ — Core language
Scikit-learn — DecisionTreeClassifier, GridSearchCV, metrics
Pandas / NumPy — Data manipulation
Seaborn / Matplotlib — Visualization
pydotplus — Decision tree visualization

Project Structure

Diabetes_Prediction_ML_CART/
├── helpers/
│   ├── __init__.py
│   ├── data_prep.py          # Outlier handling, imputation, encoding utilities
│   └── eda.py                # EDA summary and visualization functions
├── Diabetes_prediction_CART.py  # Main ML pipeline
├── requirements.txt
└── README.md

Getting Started

git clone https://github.com/eboekenh/Diabetes_Prediction_ML_CART.git
cd Diabetes_Prediction_ML_CART
pip install -r requirements.txt

Download diabetes.csv from the Kaggle dataset and place it in the project root.

python Diabetes_prediction_CART.py

License

MIT
