🩺 Diabetes Prediction Model: Kaggle Playground Series, Season 5, Episode 12 (S5E12)
This project builds a machine-learning classification model to predict the probability of diagnosed_diabetes using the Kaggle Playground Series S5E12 dataset. The target leaderboard goal was an AUC-ROC ≥ 0.78940, pursued with a full end-to-end pipeline: EDA, preprocessing, model selection, hyperparameter tuning, and submission generation.
📊 Dataset
The dataset is synthetic, generated by a deep learning model trained on the Diabetes Health Indicators Dataset. It includes:
Demographic variables
Lifestyle factors
Medical history
Clinical measurements
Target variable: diagnosed_diabetes, a binary indicator (0/1) for diabetes diagnosis.
Files used:
train.csv
test.csv
sample_submission.csv
🧠 Methodology
- Data Download & Setup
Download was first attempted via the Kaggle API, but it failed with:
401 Client Error: Unauthorized
Invalid/missing kaggle.json
Competition rules not accepted
Solution: Manual upload of dataset files into Google Colab.
- Exploratory Data Analysis (EDA)
Performed the following:
✅ Basic Integrity Checks
Validated data types
Identified minimal missing values
✅ Visualizations
Histograms (numerical features)
Count plots (categorical features)
Boxplots comparing numerical features & target
Count plots with hue = diagnosed_diabetes
Correlation heatmap
✅ Key Findings
Strong associations with diabetes were found in:
age, bmi, waist_to_hip_ratio
systolic_bp, diastolic_bp
cholesterol_total, ldl_cholesterol, triglycerides
family_history_diabetes, hypertension_history, cardiovascular_history
Negative correlation:
hdl_cholesterol
Potential multicollinearity was also detected among numerical variables.
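The target-correlation check behind these findings can be sketched with pandas. The toy frame below is a hypothetical stand-in for train.csv: the column names follow the dataset, but the values are random, so the printed correlations only illustrate the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for train.csv
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 80, 200),
    "bmi": rng.normal(27, 5, 200),
    "hdl_cholesterol": rng.normal(50, 10, 200),
})
# Synthetic target loosely driven by age and bmi
df["diagnosed_diabetes"] = (
    df["age"] + 2 * df["bmi"] + rng.normal(0, 20, 200) > 110
).astype(int)

# Correlation of each numeric feature with the target
corr = df.corr(numeric_only=True)["diagnosed_diabetes"].drop("diagnosed_diabetes")
print(corr.sort_values(ascending=False))
```

Pairwise feature-feature correlations from the same `corr` matrix are what flag the multicollinearity noted above.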
🧹 Data Preprocessing
✅ Missing Values
Rows with isolated missing entries (one per affected column) were dropped from both the train and test sets.
✅ Type Conversion
Binary fields converted to:
Boolean → Integer for modeling (family_history_diabetes, hypertension_history, cardiovascular_history, diagnosed_diabetes)
✅ Feature Engineering
Five new features were added:
BMI_Age_Interaction = bmi * age
WH_Ratio_Age_Interaction = waist_to_hip_ratio * age
BP_Interaction = systolic_bp * diastolic_bp
Cholesterol_Ratio = ldl_cholesterol / hdl_cholesterol (with safe handling of zero/NaN denominators)
History_Sum = sum of all family/medical history binary indicators
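The five features above can be sketched as follows; the one-row sample frame is hypothetical, and the safe-division strategy (zero HDL mapped to NaN, then filled with 0) is one reasonable reading of the handling described above.

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["BMI_Age_Interaction"] = out["bmi"] * out["age"]
    out["WH_Ratio_Age_Interaction"] = out["waist_to_hip_ratio"] * out["age"]
    out["BP_Interaction"] = out["systolic_bp"] * out["diastolic_bp"]
    # Safe division: a zero or missing HDL yields NaN, which is then filled with 0
    out["Cholesterol_Ratio"] = (
        out["ldl_cholesterol"] / out["hdl_cholesterol"].replace(0, np.nan)
    ).fillna(0)
    history_cols = [
        "family_history_diabetes", "hypertension_history", "cardiovascular_history",
    ]
    out["History_Sum"] = out[history_cols].sum(axis=1)
    return out

# Hypothetical single-row example
sample = pd.DataFrame({
    "age": [50], "bmi": [30.0], "waist_to_hip_ratio": [0.9],
    "systolic_bp": [120], "diastolic_bp": [80],
    "ldl_cholesterol": [100.0], "hdl_cholesterol": [50.0],
    "family_history_diabetes": [1], "hypertension_history": [0],
    "cardiovascular_history": [1],
})
print(add_features(sample)[["BMI_Age_Interaction", "Cholesterol_Ratio", "History_Sum"]])
```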
✅ Scaling & Encoding
Using ColumnTransformer:
StandardScaler → numerical variables
OneHotEncoder → categorical variables
id column removed before transformation
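A minimal sketch of this preprocessing step, assuming hypothetical column lists (`num_cols`/`cat_cols` would be derived from the real data) and a tiny made-up frame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column lists; the real ones come from the competition data
num_cols = ["age", "bmi"]
cat_cols = ["smoking_status"]

pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [30, 45, 60, 50],
    "bmi": [22.0, 28.0, 31.0, 26.0],
    "smoking_status": ["never", "former", "current", "never"],
})
# id is dropped before transformation, as in the pipeline above
X = pre.fit_transform(df.drop(columns=["id"]))
print(X.shape)  # 2 scaled numeric columns + 3 one-hot categories
```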
🤖 Model Training & Selection
Baseline Model
Logistic Regression
Accuracy: 0.6623
AUC-ROC: 0.6904
Advanced Models Tested

| Model | Accuracy | AUC-ROC |
| --- | --- | --- |
| RandomForestClassifier | 0.6562 | 0.6829 |
| XGBClassifier | 0.6722 | 0.7053 |
XGBoost performed best, becoming the primary candidate for optimization.
⚙️ Hyperparameter Tuning
Approach: GridSearchCV (3-fold cross-validation)
Parameters tuned:
n_estimators
max_depth
learning_rate
subsample
colsample_bytree
Best parameters identified:
{ 'colsample_bytree': 0.7, 'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 300, 'subsample': 0.9 }
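The tuning mechanics can be sketched as below. To keep the example self-contained without an xgboost install, scikit-learn's GradientBoostingClassifier stands in for XGBClassifier, the data is synthetic, and the grid is deliberately tiny; the project's actual grid covered the five parameters listed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Small illustrative grid over a subset of the tuned parameters
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold CV, as in the project
    scoring="roc_auc",    # optimize the competition metric
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```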
📈 Optimized Performance
Cross-validated AUC-ROC: 0.7164
🤖 Prediction & Submission
✅ Final Model
Retrained optimized XGBoost on all processed training data.
✅ Predictions
Generated prediction probabilities for all test rows.
✅ Submission File (submission2.csv)
Contains 300,000 rows as required
Rows missing from processed test set filled with default probability 0.5
Columns:
id
diagnosed_diabetes (probability)
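The padding logic can be sketched with a pandas reindex. The five ids below are hypothetical stand-ins for the 300,000 test rows; one id is missing from the predictions, as happens when preprocessing drops test rows.

```python
import pandas as pd

# Hypothetical full id list from sample_submission.csv
all_test_ids = pd.Series([0, 1, 2, 3, 4], name="id")

# Predictions exist only for rows that survived preprocessing (id 2 was dropped)
preds = pd.DataFrame({
    "id": [0, 1, 3, 4],
    "diagnosed_diabetes": [0.8, 0.1, 0.6, 0.3],
})

# Reindex to the full id list; dropped rows receive the 0.5 default probability
submission = (
    preds.set_index("id")
    .reindex(all_test_ids)
    .fillna(0.5)
    .reset_index()
)
print(submission)
```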
✅ Kaggle Public Score
0.57840, significantly lower than local validation results. Likely causes:
Validation/test distribution mismatch
Submission misalignment
Differences in feature preprocessing
Issues with missing-row probability padding
🧠 Next Steps / Future Improvements
✅ Kaggle Submission Fixes
Resolve 401 authorization errors
Ensure:
API credentials correctly configured
Competition rules accepted
Proper dataset file structure
✅ Modeling Improvements
Try LightGBM, CatBoost, or small neural nets
More advanced feature engineering:
Polynomial features
Clinical domain-based ratios & interactions
Apply robust cross-validation:
Stratified K-Fold (recommended)
Ensemble & stacking approaches
Perform post-hoc error analysis to find major failure patterns
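The recommended Stratified K-Fold validation might look like the sketch below, on synthetic data with an imbalanced class ratio; LogisticRegression is just a placeholder estimator.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the competition set
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.7, 0.3], random_state=0
)

# Stratified folds keep the class ratio stable across splits, which makes
# local AUC estimates more comparable to the leaderboard
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)
print(scores.mean().round(4), scores.std().round(4))
```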
📦 Repository Structure
├── train.csv
├── test.csv
├── sample_submission.csv
├── diabetes_prediction.ipynb
├── submission2.csv
└── README.md
📝 Conclusion
This project implements a complete Kaggle workflow, from EDA through modeling and submission. While local performance exceeded baseline expectations, the public leaderboard score revealed opportunities to improve validation strategy and preprocessing consistency.