This repository serves a dual purpose, bridging the gap between academic research replication and practical software deployment:
- Research Phase (.ipynb): A strict replication of a specific medical study on diabetes prediction using Random Forest, SMOTE, and Global Scaling.
- Deployment Phase (.py): A modular web application built with Streamlit that serves the trained model to end-users in a user-friendly interface.
⚠️ NOTE: This is the Academic Replication version, which intentionally contains data leakage to reproduce a paper's results. For the industry-standard, robust version (Leakage-Free & Fine-Tuned), please visit: Diabetes Prediction Fine-Tuned Project
Please Read Before Reviewing the Research Notebook:
To strictly adhere to the cited reference paper's methodology and reproduce their reported metrics, the Jupyter Notebook (ml-prediction-diabetic-code.ipynb) follows a specific preprocessing workflow:
- Global Scaling: MinMaxScaler is applied to the entire dataset before splitting.
- Global SMOTE: Oversampling is applied to the entire dataset before splitting.
Methodology Note: I am fully aware that applying these techniques before the Train-Test split introduces Data Leakage and is not standard industry practice. However, this was done intentionally to reproduce the exact results reported in the academic paper.
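The effect of fitting the scaler on the full dataset can be illustrated with a minimal, dependency-free sketch. It is not code from the notebook: the helper names (`fit_min_max`, `apply_min_max`) and the toy values are hypothetical, and the helpers only mimic, for a single feature, the formula MinMaxScaler uses with its default range.

```python
"""Toy illustration of how fitting a min-max scaler before the split leaks test data.

All names and numbers here are illustrative assumptions, not code from the notebook.
"""


def fit_min_max(values):
    """Return the (min, max) pair that a min-max scaler would learn from `values`."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale `values` with (x - lo) / (hi - lo), the default min-max formula."""
    return [(v - lo) / (hi - lo) for v in values]


# One hypothetical feature, already divided into train and test rows.
train = [80.0, 100.0, 120.0]
test = [200.0]

# Leaky order (the order used for replication): statistics come from ALL rows,
# so the scaled training values are shaped by the test row's value of 200.
leaky_lo, leaky_hi = fit_min_max(train + test)
leaky_train = apply_min_max(train, leaky_lo, leaky_hi)

# Leak-free order: statistics come from the training rows only.
clean_lo, clean_hi = fit_min_max(train)
clean_train = apply_min_max(train, clean_lo, clean_hi)
clean_test = apply_min_max(test, clean_lo, clean_hi)

print(leaky_train)  # training values compressed by the test maximum
print(clean_train)
print(clean_test)   # may fall outside [0, 1] because 200 was unseen during fitting
```

The same reasoning applies to SMOTE: synthetic samples interpolated from rows that later land in the test set place near-copies of test data in the training set, which is why the reported metrics are optimistic.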
This project moves beyond a simple notebook by implementing a Modular Architecture for deployment. The logic is separated into distinct responsibilities:
├── app/                                    # 💻 APPLICATION SOURCE CODE
│   ├── app.py                              # Main Streamlit application
│   ├── model.py                            # Backend logic & inference
│   └── preprocess.py                       # Utils for input formatting
├── assets/                                 # 🖼️ STATIC ASSETS
│   └── diabetes_app_ui.png                 # App Screenshot
├── models/                                 # 📦 ARTIFACTS (Serialized Objects)
│   ├── scaler.joblib                       # Saved MinMaxScaler
│   └── model_rf.joblib                     # Saved Random Forest Model
├── notebooks/                              # 🔬 RESEARCH & EXPERIMENTATION
│   └── ml-prediction-diabetic-code.ipynb   # Replicated Research Study
└── requirements.txt                        # Dependency list
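The split of responsibilities in this layout can be sketched as follows. This is not the repository's actual code: the feature names, `format_input`, and `predict_label` are hypothetical, and the only assumption about the saved objects is that the scaler exposes `transform` and the model exposes `predict`, as scikit-learn estimators do. Stand-in classes replace the joblib artifacts so the sketch runs without them.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical feature order; a real app must use the column order seen in training.
FEATURE_ORDER: Sequence[str] = ("glucose", "bmi", "age")


def format_input(raw: Dict[str, Any]) -> List[float]:
    """preprocess.py role: turn form values into one numeric row in a fixed order."""
    missing = [name for name in FEATURE_ORDER if name not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return [float(raw[name]) for name in FEATURE_ORDER]


def predict_label(scaler, model, row: List[float]) -> int:
    """model.py role: scale one row, then ask the classifier for a class label."""
    scaled = scaler.transform([row])
    return int(model.predict(scaled)[0])


class IdentityScaler:
    """Stand-in for the saved scaler (assumption: only `transform` is needed)."""

    def transform(self, rows):
        return rows


class ThresholdModel:
    """Stand-in for the saved classifier: predicts 1 when the first feature exceeds 125."""

    def predict(self, rows):
        return [1 if r[0] > 125 else 0 for r in rows]


# app.py role: collect input, call the helpers, display the result.
row = format_input({"glucose": "140", "bmi": 31.2, "age": 45})
print(predict_label(IdentityScaler(), ThresholdModel(), row))
```

In the deployed app, the stand-ins would presumably be replaced by objects loaded with `joblib.load` from `models/scaler.joblib` and `models/model_rf.joblib`.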
- Frontend Framework: Streamlit
- Machine Learning Core: Scikit-Learn (Random Forest Classifier)
- Data Handling: Pandas, NumPy, Joblib
- Imbalanced Data: Imbalanced-learn (SMOTE)
- Environment: Python 3.9+
git clone https://github.com/viochris/Diabetes-prediction-project.git
cd Diabetes-prediction-project

Make sure you have the required libraries installed:

pip install -r requirements.txt

Execute the main application file from the root directory:

streamlit run app/app.py

Open your browser and navigate to:

http://localhost:8501
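Before launching, it can help to confirm that the serialized artifacts are present. The following is an optional pre-flight sketch, not part of the repository: `missing_artifacts` is a hypothetical helper, and it assumes the app reads the two files listed under `models/` in the project structure.

```python
from pathlib import Path
from typing import List

# Artifact paths as listed in the project structure, relative to the repository root.
REQUIRED_ARTIFACTS = ("models/scaler.joblib", "models/model_rf.joblib")


def missing_artifacts(root: Path) -> List[str]:
    """Return the required artifact paths that do not exist under `root`."""
    return [rel for rel in REQUIRED_ARTIFACTS if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the repository root so the relative paths resolve correctly.
    missing = missing_artifacts(Path.cwd())
    if missing:
        print("Missing artifacts:", ", ".join(missing))
    else:
        print("All artifacts found; run: streamlit run app/app.py")
```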
Based on the replication study conducted in the notebook (ml-prediction-diabetic-code.ipynb) using Random Forest with Global SMOTE:
- Algorithm: Random Forest Classifier
- Accuracy: ~83.67%
- Precision: ~84%
- Recall: ~84%
(Metrics are inflated due to the intentional data leakage required for paper replication)
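For readers unfamiliar with these metrics, the sketch below shows how accuracy, precision, and recall follow from confusion-matrix counts for the positive class. The function name and the counts are illustrative assumptions, not the notebook's confusion matrix, and the notebook may report averaged variants of precision and recall instead.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, and recall for the positive class.

    Assumes non-zero denominators (at least one predicted and one actual positive).
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted positives that are correct
        "recall": tp / (tp + fn),        # share of actual positives that are found
    }


# Toy counts chosen for illustration only; they are not results from this project.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(metrics)
```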
A user-friendly interface built with Streamlit allows real-time patient data input and instant prediction. A screenshot of the app is stored at assets/diabetes_app_ui.png.
Author: Silvio Christian, Joe

"Bridging the gap between Academic Research and Practical Deployment."
