nitinog10/Diabetes-Prediction

🩺 Diabetes Prediction Model

Kaggle Playground Series – Season 5, Episode 12 (S5E12)

This project builds a machine-learning classification model to predict the probability of diagnosed_diabetes using the Kaggle Playground Series – S5E12 dataset. The leaderboard target was an AUC-ROC ≥ 0.78940, pursued through a full end-to-end pipeline: EDA, preprocessing, model selection, hyperparameter tuning, and submission generation.

๐Ÿ“ Dataset

The dataset is synthetic: it was generated by a deep learning model trained on the Diabetes Health Indicators Dataset. It includes:

Demographic variables

Lifestyle factors

Medical history

Clinical measurements

Target variable: diagnosed_diabetes, a binary indicator (0/1) of a diabetes diagnosis.

Files used:

train.csv

test.csv

sample_submission.csv

🔧 Methodology

  1. Data Download & Setup

The download was first attempted through the Kaggle API, which failed with the following problems:

401 Client Error: Unauthorized

Invalid/missing kaggle.json

Competition rules not accepted

Solution: Manual upload of dataset files into Google Colab.

  2. Exploratory Data Analysis (EDA)

Performed the following:

✔ Basic Integrity Checks

Validated data types

Identified minimal missing values

✔ Visualizations

Histograms (numerical features)

Count plots (categorical features)

Boxplots comparing numerical features & target

Count plots with hue = diagnosed_diabetes

Correlation heatmap

✔ Key Findings

Strong associations with diabetes were found in:

age, bmi, waist_to_hip_ratio

systolic_bp, diastolic_bp

cholesterol_total, ldl_cholesterol, triglycerides

family_history_diabetes, hypertension_history, cardiovascular_history

Negative correlation:

hdl_cholesterol

Potential multicollinearity was also detected among numerical variables.
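One way to make the multicollinearity observation concrete is to list the numeric column pairs whose absolute Pearson correlation exceeds a cutoff. The sketch below is illustrative only: the helper name `high_correlation_pairs`, the 0.8 threshold, and the toy data are assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips self-pairs
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Toy data for illustration only (not the competition data)
rng = np.random.default_rng(0)
systolic = rng.normal(120, 15, 500)
toy = pd.DataFrame({
    "systolic_bp": systolic,
    "diastolic_bp": systolic * 0.6 + rng.normal(0, 2, 500),  # built to correlate
    "hdl_cholesterol": rng.normal(55, 10, 500),               # independent
})
pairs = high_correlation_pairs(toy, threshold=0.8)
```

Pairs flagged this way are candidates for dropping one member or combining both into a single engineered feature.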

🧹 Data Preprocessing

✔ Missing Values

Rows with isolated missing entries (one per column) were dropped from both the train and test sets.

✔ Type Conversion

Binary fields were converted from Boolean to integer for modeling: family_history_diabetes, hypertension_history, cardiovascular_history, and diagnosed_diabetes.

✔ Feature Engineering

Five new features were added:

BMI_Age_Interaction = bmi * age

WH_Ratio_Age_Interaction = waist_to_hip_ratio * age

BP_Interaction = systolic_bp * diastolic_bp

Cholesterol_Ratio = ldl_cholesterol / hdl_cholesterol (Safe division handling zero/NaN)

History_Sum = sum of all family/medical history binary indicators
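The five features above could be computed roughly as follows. This is a sketch, not the notebook's code: the function name is hypothetical, History_Sum is assumed to cover the three history columns named in this README, and the 0.0 fallback for an undefined cholesterol ratio is an assumption because the README does not say which value was used.

```python
import numpy as np
import pandas as pd

# Assumed to be the "family/medical history" indicators referred to above
HISTORY_COLS = ["family_history_diabetes", "hypertension_history", "cardiovascular_history"]


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the five engineered features appended."""
    out = df.copy()
    out["BMI_Age_Interaction"] = out["bmi"] * out["age"]
    out["WH_Ratio_Age_Interaction"] = out["waist_to_hip_ratio"] * out["age"]
    out["BP_Interaction"] = out["systolic_bp"] * out["diastolic_bp"]
    # Safe division: a zero HDL denominator becomes NaN, then NaN is replaced by 0.0.
    # The 0.0 fallback is an assumption, not a value stated in the README.
    hdl = out["hdl_cholesterol"].replace(0, np.nan)
    out["Cholesterol_Ratio"] = (out["ldl_cholesterol"] / hdl).fillna(0.0)
    out["History_Sum"] = out[HISTORY_COLS].astype(int).sum(axis=1)
    return out


# Toy rows for illustration only
toy = pd.DataFrame({
    "bmi": [30.0, 22.0], "age": [50, 40], "waist_to_hip_ratio": [0.9, 0.8],
    "systolic_bp": [130, 110], "diastolic_bp": [85, 70],
    "ldl_cholesterol": [120.0, 100.0], "hdl_cholesterol": [40.0, 0.0],
    "family_history_diabetes": [True, False],
    "hypertension_history": [1, 0],
    "cardiovascular_history": [0, 0],
})
features = add_engineered_features(toy)
```

Applying the same function to both train and test keeps the engineered columns identical across the two sets.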

✔ Scaling & Encoding

Using ColumnTransformer:

StandardScaler → numerical variables

OneHotEncoder → categorical variables

id column removed before transformation
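A minimal sketch of such a transformer with scikit-learn is shown below. The column detection by dtype, the `handle_unknown="ignore"` setting, the helper name, and the toy `gender` column are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(df: pd.DataFrame, drop_cols=("id", "diagnosed_diabetes")) -> ColumnTransformer:
    """Scale numeric columns and one-hot encode the rest, ignoring id and target."""
    features = df.drop(columns=[c for c in drop_cols if c in df.columns])
    numeric_cols = features.select_dtypes(include="number").columns.tolist()
    categorical_cols = features.select_dtypes(exclude="number").columns.tolist()
    return ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), numeric_cols),
            # handle_unknown="ignore" is an assumption: it keeps categories that
            # appear only at prediction time from raising an error.
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ],
        remainder="drop",  # columns not listed above (such as id) are dropped
    )


# Toy frame for illustration only
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "age": [30, 45, 60, 52],
    "bmi": [22.5, 31.0, 27.4, 29.9],
    "gender": ["F", "M", "F", "M"],  # hypothetical categorical column
    "diagnosed_diabetes": [0, 1, 1, 0],
})
preprocessor = build_preprocessor(toy)
X = preprocessor.fit_transform(toy)  # 2 scaled numeric + 2 one-hot columns
```

Fitting the transformer on the training data only and then calling `transform` on the test data keeps both sets on the same scale and category layout.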

🤖 Model Training & Selection

Baseline Model

Logistic Regression

Accuracy: 0.6623

AUC-ROC: 0.6904

Advanced Models Tested

| Model | Accuracy | AUC-ROC |
| --- | --- | --- |
| RandomForestClassifier | 0.6562 | 0.6829 |
| XGBClassifier | 0.6722 | 0.7053 |

XGBoost performed best, becoming the primary candidate for optimization.
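A comparison like this can be produced by fitting each candidate on the same split and scoring it on the same held-out rows. The sketch below assumes an 80/20 stratified split and uses synthetic data from `make_classification`, so its scores are unrelated to the figures reported here. XGBClassifier is left out so the example runs with scikit-learn alone; it follows the same `fit` / `predict_proba` interface and could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, X_val, y_train, y_val):
    """Fit the model and return accuracy and AUC-ROC on the validation split."""
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(y_val, model.predict(X_val)),
        "auc_roc": roc_auc_score(y_val, proba),
    }


# Synthetic stand-in data (assumption), not the competition data
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=0),
}
results = {name: evaluate(m, X_train, X_val, y_train, y_val) for name, m in candidates.items()}
```

AUC-ROC is computed from predicted probabilities rather than hard labels, which matches how the competition scores submissions.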

โš™๏ธ Hyperparameter Tuning

Approach: GridSearchCV (3-fold cross-validation) Parameters tuned:

n_estimators

max_depth

learning_rate

subsample

colsample_bytree

Best parameters identified:

{ 'colsample_bytree': 0.7, 'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 300, 'subsample': 0.9 }
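The search itself can be sketched with scikit-learn's GridSearchCV. Several things here are assumptions: the grid values are hypothetical because the README reports only the winning combination, the `roc_auc` scoring choice is inferred from the reported metric, and scikit-learn's GradientBoostingClassifier stands in for XGBClassifier so the example runs without xgboost installed. The stand-in has no `colsample_bytree` parameter, so that entry is omitted from the grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV


def tune(estimator, param_grid, X, y):
    """Run a 3-fold grid search scored by AUC-ROC; return best params and score."""
    search = GridSearchCV(estimator, param_grid, scoring="roc_auc", cv=3, n_jobs=1)
    search.fit(X, y)
    return search.best_params_, search.best_score_


# Hypothetical grid, kept small so the sketch runs quickly
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.9],
}

# Synthetic stand-in data (assumption), not the competition data
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
best_params, best_score = tune(GradientBoostingClassifier(random_state=0), param_grid, X, y)
```

With xgboost available, passing an XGBClassifier instance and adding `colsample_bytree` to the grid should work the same way, since that class follows the scikit-learn estimator interface.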

📈 Optimized Performance

Cross-validated AUC-ROC: 0.7164

📤 Prediction & Submission

✔ Final Model

Retrained optimized XGBoost on all processed training data.

✔ Predictions

Generated prediction probabilities for all test rows.

✔ Submission File (submission2.csv)

Contains 300,000 rows as required

Test rows dropped during preprocessing, and therefore missing from the processed test set, were filled with a default probability of 0.5

Columns:

id

diagnosed_diabetes (probability)
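One way to assemble such a file is to merge predictions onto the ids in sample_submission.csv, so that the row count and order come from the sample file and any id without a prediction receives the default. The function name and the toy values below are hypothetical; this is a sketch, not the notebook's code.

```python
import pandas as pd


def build_submission(sample_submission, test_ids, probabilities, fill_value=0.5):
    """Align predictions to the sample file's ids; pad missing ids with fill_value."""
    preds = pd.DataFrame({"id": list(test_ids), "diagnosed_diabetes": list(probabilities)})
    # A left merge keeps every id from the sample file, in the sample file's order
    submission = sample_submission[["id"]].merge(preds, on="id", how="left")
    n_padded = int(submission["diagnosed_diabetes"].isna().sum())
    submission["diagnosed_diabetes"] = submission["diagnosed_diabetes"].fillna(fill_value)
    return submission, n_padded


# Toy example: id 11 was dropped during preprocessing, so it has no prediction
sample = pd.DataFrame({"id": [10, 11, 12, 13], "diagnosed_diabetes": [0.5] * 4})
submission, n_padded = build_submission(sample, test_ids=[13, 10, 12], probabilities=[0.9, 0.2, 0.4])
```

Returning the number of padded rows makes it easy to see how much of the file carries the uninformative 0.5 value.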

⚠ Kaggle Public Score

0.57840, significantly lower than local validation results. Likely causes:

Validation–test distribution mismatch

Submission misalignment

Differences in feature preprocessing

Issues with missing-row probability padding
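Some of these causes can be ruled in or out before uploading by checking the file against sample_submission.csv. The checker below is a hypothetical helper; requiring the same id order as the sample file is a deliberately conservative assumption.

```python
import pandas as pd


def check_submission(submission, sample_submission, target="diagnosed_diabetes"):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if list(submission.columns) != ["id", target]:
        problems.append(f"columns are {list(submission.columns)}, expected ['id', '{target}']")
        return problems  # the remaining checks need both columns
    if len(submission) != len(sample_submission):
        problems.append(f"row count {len(submission)} != expected {len(sample_submission)}")
    elif not (submission["id"].to_numpy() == sample_submission["id"].to_numpy()).all():
        problems.append("ids differ from the sample file or are in a different order")
    values = submission[target]
    if values.isna().any():
        problems.append("target column contains NaN")
    if not values.dropna().between(0, 1).all():
        problems.append("probabilities fall outside [0, 1]")
    return problems


# Toy frames for illustration only
sample = pd.DataFrame({"id": [1, 2, 3], "diagnosed_diabetes": [0.5, 0.5, 0.5]})
good = pd.DataFrame({"id": [1, 2, 3], "diagnosed_diabetes": [0.1, 0.7, 0.5]})
shifted = pd.DataFrame({"id": [2, 3, 1], "diagnosed_diabetes": [0.1, 0.7, 0.5]})

good_problems = check_submission(good, sample)
shifted_problems = check_submission(shifted, sample)
```

A clean result here does not rule out a distribution mismatch or inconsistent preprocessing; it only confirms that ids and probabilities are structurally sound.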

🚧 Next Steps / Future Improvements

✔ Kaggle Submission Fixes

Resolve 401 authorization errors

Ensure:

API credentials correctly configured

Competition rules accepted

Proper dataset file structure

✔ Modeling Improvements

Try LightGBM, CatBoost, or small neural nets

More advanced feature engineering:

Polynomial features

Clinical domain-based ratios & interactions

Apply robust cross-validation:

Stratified K-Fold (recommended)

Ensemble & stacking approaches

Perform post-hoc error analysis to find major failure patterns
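The Stratified K-Fold suggestion could look like the sketch below, which keeps the class ratio the same in every fold and reports the mean and spread of AUC-ROC. Five folds, the fixed seed, the logistic regression model, and the synthetic data are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def stratified_auc(model, X, y, n_splits=5, seed=0):
    """Return mean and standard deviation of AUC-ROC over stratified folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    return scores.mean(), scores.std()


# Synthetic, mildly imbalanced stand-in data (assumption)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.7, 0.3], random_state=0
)
mean_auc, std_auc = stratified_auc(LogisticRegression(max_iter=1000), X, y)
```

A large standard deviation across folds is a sign that a single local score is not a reliable predictor of the leaderboard score.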

📦 Repository Structure

├── train.csv
├── test.csv
├── sample_submission.csv
├── diabetes_prediction.ipynb
├── submission2.csv
└── README.md

๐Ÿ Conclusion

This project implements a complete Kaggle workflow, from EDA through modeling and submission. Local AUC-ROC improved from the 0.6904 logistic regression baseline to 0.7164 after tuning, but the public leaderboard score of 0.57840 shows that validation and preprocessing consistency still need work.
