
Kushagra524/Diabetes-Risk-Prediction-using-ML


Diabetes Prediction Model using Machine Learning

Overview

This project implements a machine learning pipeline that predicts whether a person is diabetic. It is a binary classification problem based on clinical health indicators. Several algorithms (Random Forest, Decision Tree, SVC, GaussianNB, and Logistic Regression) were trained and tuned with GridSearchCV, and the best-performing model was saved together with the preprocessing steps as .pkl files for deployment. The system provides fast and interpretable predictions and demonstrates a practical ML workflow from data preprocessing to model persistence.

Features

  1. End-to-end ML pipeline including preprocessing, model training, and evaluation

  2. Tested multiple classification models: Random Forest, Decision Tree, SVC, GaussianNB, Logistic Regression

  3. Hyperparameter optimization using GridSearchCV

  4. Preprocessing and model serialization using .pkl files

  5. Ready for deployment or integration in applications

Dataset

Name: Diabetes Health Indicators Dataset

Source: Kaggle

Records: 100,000 medical records

Dataset link: https://www.kaggle.com/datasets/mohankrishnathalla/diabetes-health-indicators-dataset

Input features such as: age, gender, ethnicity, education_level, income_level, employment_status, smoking_status, alcohol_consumption_per_week, physical_activity_minutes_per_week, diet_score, sleep_hours_per_day, screen_time_hours_per_day, family_history_diabetes, hypertension_history, cardiovascular_history, bmi, waist_to_hip_ratio, systolic_bp, diastolic_bp, heart_rate, cholesterol_total, hdl_cholesterol, ldl_cholesterol, triglycerides, glucose_fasting, glucose_postprandial, insulin_level, hba1c, diabetes_risk_score, diabetes_stage, diagnosed_diabetes

Target: 0 = Low Risk of Diabetes, 1 = High Risk of Diabetes
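
As a minimal sketch of how the columns above might be separated into features and target with Pandas: the snippet below uses a tiny made-up DataFrame instead of the Kaggle CSV, and it assumes that `diagnosed_diabetes` is the 0/1 target column. Neither the sample values nor that column choice is confirmed by this repository.

```python
import pandas as pd

# Tiny in-memory stand-in for the Kaggle CSV (values are invented for illustration).
df = pd.DataFrame({
    "age": [45, 60, 33, 52],
    "gender": ["Male", "Female", "Female", "Male"],
    "bmi": [27.1, 31.4, 22.8, 29.9],
    "glucose_fasting": [98, 132, 90, 121],
    "hba1c": [5.4, 7.1, 5.1, 6.6],
    "diagnosed_diabetes": [0, 1, 0, 1],
})

# Assumption: this column holds the binary label (0 = Low Risk, 1 = High Risk).
TARGET = "diagnosed_diabetes"

# Split the table into the feature matrix and the label vector.
X = df.drop(columns=[TARGET])
y = df[TARGET]

# One-hot encode the text column so that every feature is numeric.
X_encoded = pd.get_dummies(X, columns=["gender"])

print(X_encoded.columns.tolist())
```

If `diagnosed_diabetes` is indeed the target, columns derived from the diagnosis (for example `diabetes_stage`) would probably need to be dropped from the features as well, to avoid leaking the label into the inputs.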

Tech Stack

Language: Python

Libraries: Pandas, Scikit-learn, Joblib/Pickle

Model Training

Data preprocessing: missing value handling, scaling, and feature selection

Trained multiple classifiers:

  Random Forest Classifier

  Decision Tree Classifier

  Support Vector Classifier (SVC)

  Gaussian Naive Bayes (GaussianNB)

  Logistic Regression

Hyperparameter tuning with GridSearchCV to select the best parameters

Final model and preprocessing pipeline saved as .pkl files for prediction
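
The training steps above can be sketched roughly as follows. This is an illustration rather than the repository's actual training script: it uses synthetic data from `make_classification`, only two of the five classifiers, small made-up parameter grids, and a hypothetical output file name (`diabetes_model.pkl`). It also bundles scaling and the model into a single scikit-learn `Pipeline`, which is one possible design and not necessarily the one used here.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption: 8 numeric features).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate models with illustrative (not tuned) parameter grids.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}),
    "rf": (
        RandomForestClassifier(random_state=0),
        {"clf__n_estimators": [50, 100], "clf__max_depth": [None, 5]},
    ),
}

best_name, best_search = None, None
for name, (estimator, grid) in candidates.items():
    # Scaling lives inside the pipeline, so it is refit on each CV training fold.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", estimator)])
    search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)
    if best_search is None or search.best_score_ > best_search.best_score_:
        best_name, best_search = name, search

# Evaluate the winning pipeline once on the held-out test split.
test_accuracy = best_search.score(X_test, y_test)
print(best_name, best_search.best_params_, round(test_accuracy, 3))

# Persist preprocessing and model together (hypothetical file name).
model_path = os.path.join(tempfile.gettempdir(), "diabetes_model.pkl")
joblib.dump(best_search.best_estimator_, model_path)
```

Keeping the scaler inside the pipeline means a single file holds everything needed at prediction time. Saving the preprocessing object and the model as separate .pkl files works too, as long as both are loaded and applied in the same order.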

How Prediction Works

Load saved preprocessing and model .pkl files

Input health parameters:

Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, Age

Preprocessing applied to input data (scaling, transformations)

Prediction generated by the trained model:

    0 = Low Risk

    1 = High Risk

Output is returned immediately after running the Final Presentation file
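
A minimal sketch of the prediction flow described above, assuming the preprocessing step is a fitted scaler stored in one .pkl file and the model is stored in another. The file names `scaler.pkl` and `model.pkl`, the helper `predict_risk`, and the synthetic 8-feature data are all hypothetical. The first half of the snippet only creates stand-in artifacts so that the example runs on its own.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# --- Setup only: create stand-in artifacts so the example is runnable. ---
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
fitted_scaler = StandardScaler().fit(X)
fitted_model = LogisticRegression(max_iter=1000).fit(fitted_scaler.transform(X), y)

tmp_dir = tempfile.mkdtemp()
scaler_path = os.path.join(tmp_dir, "scaler.pkl")  # hypothetical file name
model_path = os.path.join(tmp_dir, "model.pkl")    # hypothetical file name
joblib.dump(fitted_scaler, scaler_path)
joblib.dump(fitted_model, model_path)


# --- Prediction flow: load, preprocess, predict, map the label to text. ---
def predict_risk(features, scaler_path, model_path):
    """Return (label, description) for one row of numeric features."""
    scaler = joblib.load(scaler_path)
    model = joblib.load(model_path)
    # The model expects a 2D array: one row, one column per feature.
    row = np.asarray(features, dtype=float).reshape(1, -1)
    label = int(model.predict(scaler.transform(row))[0])
    description = "High Risk" if label == 1 else "Low Risk"
    return label, description


label, description = predict_risk(X[0], scaler_path, model_path)
print(label, description)
```

The input row must contain the same features, in the same order, as the data used to fit the scaler and the model.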

Algorithms Used

Logistic Regression: Served as the baseline linear classifier for interpretability.


Random Forest: Performed well on non-linear patterns and provided high predictive stability.


Decision Tree: Used to understand feature splits and provide interpretability.


SVC: Tested with an RBF kernel to capture complex decision boundaries in the data.


GaussianNB: Provided a fast probabilistic benchmark model.
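
For a quick side-by-side comparison of the five algorithms listed above, something like the following sketch could be used. It runs on synthetic data with default hyperparameters (apart from the RBF kernel for SVC), so the scores it prints say nothing about the results obtained in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Scale-sensitive models (Logistic Regression, SVC) get a scaler in front.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVC (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "GaussianNB": GaussianNB(),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```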


