Diabetes Prediction Model using Machine Learning

Overview

This project implements a machine learning pipeline that predicts whether a person is diabetic. It is a binary classification problem over clinical health indicators. Five algorithms (Random Forest, Decision Tree, SVC, GaussianNB, and Logistic Regression) were trained and tuned with GridSearchCV, and the best-performing model and its preprocessing steps were saved as .pkl files for deployment. The system provides fast, interpretable predictions and demonstrates a practical ML workflow from data preprocessing to model persistence.

Features

End-to-end ML pipeline including preprocessing, model training, and evaluation
Tested multiple classification models: Random Forest, Decision Tree, SVC, GaussianNB, Logistic Regression
Hyperparameter optimization using GridSearchCV
Preprocessing and model serialization using .pkl files
Ready for deployment or integration in applications

Dataset

Name: Diabetes Health Indicators Dataset

Source: Kaggle

Records: 100,000 medical records

Dataset link: https://www.kaggle.com/datasets/mohankrishnathalla/diabetes-health-indicators-dataset

Input features include: age, gender, ethnicity, education_level, income_level, employment_status, smoking_status, alcohol_consumption_per_week, physical_activity_minutes_per_week, diet_score, sleep_hours_per_day, screen_time_hours_per_day, family_history_diabetes, hypertension_history, cardiovascular_history, bmi, waist_to_hip_ratio, systolic_bp, diastolic_bp, heart_rate, cholesterol_total, hdl_cholesterol, ldl_cholesterol, triglycerides, glucose_fasting, glucose_postprandial, insulin_level, hba1c, diabetes_risk_score, diabetes_stage, diagnosed_diabetes

Target: 0 = Low Risk of Diabetes, 1 = High Risk of Diabetes
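
As a rough illustration of how the features and the 0/1 target might be separated before training, here is a minimal sketch. The target column name `diagnosed_diabetes`, the helper name `split_dataset`, and the split ratio are assumptions for illustration, and the small frame is made-up data standing in for the Kaggle CSV.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_dataset(df, target="diagnosed_diabetes", test_size=0.25, seed=42):
    """Split a frame into train/test features and labels.

    `target` is an assumed column name; adjust it to the actual CSV.
    """
    X = df.drop(columns=[target])
    y = df[target]
    # stratify=y keeps the class ratio the same in both splits
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )


# Tiny made-up frame standing in for the real data
demo = pd.DataFrame({
    "age": [34, 51, 47, 62, 29, 58, 40, 66],
    "bmi": [22.1, 30.4, 27.8, 33.0, 21.5, 31.2, 25.0, 35.1],
    "diagnosed_diabetes": [0, 1, 0, 1, 0, 1, 0, 1],
})

X_train, X_test, y_train, y_test = split_dataset(demo)
print(X_train.shape, X_test.shape)
```

With the real data, `demo` would be replaced by something like `pd.read_csv("diabetes_dataset.csv")`, assuming the file sits next to the notebook.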

Tech Stack

Language: Python

Libraries: Pandas, Scikit-learn, Joblib/Pickle

Model Training

Data preprocessing: missing value handling, scaling, and feature selection

Trained multiple classifiers:

  Random Forest Classifier

  Decision Tree Classifier

  Support Vector Classifier (SVC)

  Gaussian Naive Bayes (GaussianNB)

  Logistic Regression

Hyperparameter tuning with GridSearchCV to select the best parameters

Final model & preprocessing pipeline saved as .pkl for prediction
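
The training steps above could be sketched roughly as follows. This is not the notebook's actual code: the data is synthetic, the parameter grid and step names are hypothetical, only one of the five classifiers is shown, and preprocessing and model are bundled into a single file for brevity rather than saved as separate .pkl files.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Kaggle dataset)
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Missing value handling -> scaling -> feature selection -> classifier.
# The imputer is a no-op on this synthetic data but shows where it would sit.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid; the "step__param" prefix routes each value to its step
param_grid = {
    "select__k": [5, 10],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))

# Persist the refit best pipeline; the file name is illustrative
out_path = os.path.join(tempfile.gettempdir(), "best_pipeline.pkl")
joblib.dump(search.best_estimator_, out_path)
```

Keeping the preprocessing inside the pipeline means each cross-validation fold fits the imputer, scaler, and selector on its own training portion only, which avoids leaking validation data into those steps.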

How Prediction Works

Load saved preprocessing and model .pkl files

Input health parameters:

Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, Age

Preprocessing applied to input data (scaling, transformations)

Prediction generated by the trained model:

    0 = Low Risk

    1 = High Risk

Output returned instantly after running the Final Presentation file (Final_Presentation.ipynb)
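
The prediction flow above might look something like this in code. The function name `predict_risk` is hypothetical, and the sketch assumes the saved preprocessor exposes `transform` and the saved model exposes `predict`, as scikit-learn objects do.

```python
import joblib


def predict_risk(records, model_path, preprocessor_path):
    """Return a list of (label, text) pairs for the given records.

    `records` must have the same columns, in the same order, that the
    preprocessor was fitted on.
    """
    # Load the two saved artifacts
    preprocessor = joblib.load(preprocessor_path)
    model = joblib.load(model_path)

    # Apply the same scaling/transformations used during training
    features = preprocessor.transform(records)
    labels = model.predict(features)

    names = {0: "Low Risk", 1: "High Risk"}
    return [(int(label), names[int(label)]) for label in labels]
```

A call could then look like `predict_risk(new_rows, "diabetes_model.pkl", "diabetes_preprocessor.pkl")`, where `new_rows` is a placeholder for the input data.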

Algorithms Used

Logistic Regression: Served as the baseline linear classifier for interpretability.

Random Forest: Performed well on non-linear patterns and provided high predictive stability.

Decision Tree: Used to understand feature splits and provide interpretability.

SVC: Tested with RBF kernel for capturing complex boundaries in the data.

GaussianNB: Provided a fast probabilistic benchmark model.
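
To make the comparison concrete, here is a minimal sketch of scoring the five algorithms side by side with cross-validation. It runs on synthetic data with near-default settings, so the resulting numbers say nothing about the project's actual results; the dictionary name `candidates` and all hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Kaggle dataset)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Scale-sensitive models get a StandardScaler in front of them
candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVC (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "GaussianNB": GaussianNB(),
}

# Mean 5-fold accuracy per model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {score:.3f}")
```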
