This project implements a machine learning pipeline that predicts whether a person is diabetic. It is a binary classification problem based on clinical health indicators. Five algorithms (Random Forest, Decision Tree, SVC, GaussianNB, and Logistic Regression) were trained and tuned with GridSearchCV, and the best-performing model was saved together with its preprocessing steps as .pkl files for deployment. The system provides fast, interpretable predictions and demonstrates a practical ML workflow from data preprocessing to model persistence.
- End-to-end ML pipeline including preprocessing, model training, and evaluation
- Tested multiple classification models: Random Forest, Decision Tree, SVC, GaussianNB, Logistic Regression
- Hyperparameter optimization using GridSearchCV
- Preprocessing and model serialization using .pkl files
- Ready for deployment or integration in applications
Name: Diabetes Health Indicators Dataset
Source: Kaggle
Records: 100,000 medical records
Dataset link: https://www.kaggle.com/datasets/mohankrishnathalla/diabetes-health-indicators-dataset
Input features such as: age, gender, ethnicity, education_level, income_level, employment_status, smoking_status, alcohol_consumption_per_week, physical_activity_minutes_per_week, diet_score, sleep_hours_per_day, screen_time_hours_per_day, family_history_diabetes, hypertension_history, cardiovascular_history, bmi, waist_to_hip_ratio, systolic_bp, diastolic_bp, heart_rate, cholesterol_total, hdl_cholesterol, ldl_cholesterol, triglycerides, glucose_fasting, glucose_postprandial, insulin_level, hba1c, diabetes_risk_score, diabetes_stage, diagnosed_diabetes
Target: 0 = Low Risk of Diabetes, 1 = High Risk of Diabetes
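The column list above mixes numeric and categorical fields, so the two groups usually need different preprocessing. The following is a minimal sketch, not the project's actual code: it uses a tiny hand-made DataFrame with invented values, borrows a few column names from the list, and assumes `diagnosed_diabetes` is the target column. The choice of `SimpleImputer`, `StandardScaler`, and `OneHotEncoder` is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in frame; column names come from the dataset, values are invented.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 62],
    "bmi": [22.5, 31.0, 27.3, np.nan],
    "glucose_fasting": [90, 130, 105, 150],
    "gender": ["Female", "Male", "Female", "Male"],
    "smoking_status": ["Never", "Current", "Former", "Never"],
    "diagnosed_diabetes": [0, 1, 0, 1],
})

target = "diagnosed_diabetes"  # assumption: this column holds the 0/1 label
X = df.drop(columns=[target])
y = df[target]

# Split columns by type so each group gets suitable preprocessing.
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in X.columns if c not in numeric_cols]

# Numeric columns: fill missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_prepared = preprocessor.fit_transform(X)
print(X_prepared.shape)
```

Fitting the preprocessor on the training split only, and reusing it unchanged at prediction time, keeps the transformations consistent between training and deployment.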
Language: Python
Libraries: Pandas, Scikit-learn, Joblib/Pickle
Data preprocessing: missing value handling, scaling, and feature selection
Trained multiple classifiers:
- Random Forest Classifier
- Decision Tree Classifier
- Support Vector Classifier (SVC)
- Gaussian Naive Bayes (GaussianNB)
- Logistic Regression
Hyperparameter tuning with GridSearchCV to select the best parameters
Final model & preprocessing pipeline saved as .pkl for prediction
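The training steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic (`make_classification`), the parameter grids are small placeholders, accuracy is an assumed selection metric, and the file name `diabetes_model.pkl` is hypothetical. Scaling and the classifier are wrapped in one `Pipeline` so they are tuned and saved together.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (assumption: numeric features only).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate models with small illustrative grids (not the project's real grids).
candidates = {
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"clf__n_estimators": [50, 100]}),
    "decision_tree": (DecisionTreeClassifier(random_state=0),
                      {"clf__max_depth": [3, 5, None]}),
    "svc": (SVC(), {"clf__C": [0.1, 1.0]}),
    "gaussian_nb": (GaussianNB(), {"clf__var_smoothing": [1e-9, 1e-8]}),
    "logistic_regression": (LogisticRegression(max_iter=1000),
                            {"clf__C": [0.1, 1.0]}),
}

# Run one grid search per model; each search cross-validates scaler + classifier.
searches = {}
for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", estimator)])
    search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)
    searches[name] = search

# Keep the pipeline with the best cross-validated score.
best_name = max(searches, key=lambda n: searches[n].best_score_)
best_model = searches[best_name].best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(best_name, round(test_accuracy, 3))

# Persist the fitted pipeline (preprocessing + model) as a single .pkl file.
model_path = Path(tempfile.gettempdir()) / "diabetes_model.pkl"  # hypothetical name
joblib.dump(best_model, model_path)
```

Saving the whole pipeline rather than the bare classifier means the scaler fitted on the training data travels with the model, which avoids mismatched preprocessing at prediction time.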
Load saved preprocessing and model .pkl files
Input health parameters:
Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, Age
Preprocessing applied to input data (scaling, transformations)
Prediction generated by the trained model:
0 = Low Risk
1 = High Risk
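The prediction flow above might look like the sketch below. It is illustrative only: the file names `scaler.pkl` and `model.pkl` and the helper `predict_risk` are hypothetical, and the setup section trains stand-in artifacts on synthetic data purely so the example runs on its own. In the real project the two .pkl files would already exist.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

workdir = Path(tempfile.mkdtemp())
scaler_path = workdir / "scaler.pkl"  # hypothetical file name
model_path = workdir / "model.pkl"    # hypothetical file name

# Setup only: create stand-in artifacts from synthetic data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
scaler = StandardScaler().fit(X)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)
joblib.dump(scaler, scaler_path)
joblib.dump(model, model_path)


def predict_risk(features, scaler_file, model_file):
    """Return 0 (low risk) or 1 (high risk) for one record of numeric features."""
    loaded_scaler = joblib.load(scaler_file)
    loaded_model = joblib.load(model_file)
    # Reshape the single record to (1, n_features), as scikit-learn expects 2D input.
    row = np.asarray(features, dtype=float).reshape(1, -1)
    return int(loaded_model.predict(loaded_scaler.transform(row))[0])


label = predict_risk(X[0], scaler_path, model_path)
print({0: "Low Risk", 1: "High Risk"}[label])
```

The input record must contain the same features, in the same order, as the data used to fit the saved preprocessing step.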