|
1 | | -# UNDER CONSTRUCTION |
| 1 | +# Diabetes Risk Prediction Project |
2 | 2 |
|
3 | | ------ |
| 3 | +> **A full end-to-end machine learning and Flask web application that predicts diabetes risk and visualises explainability insights for individual or batch predictions.** |
| 4 | +> Built with **Python, scikit-learn, pandas, SHAP**, and **Flask**, this project demonstrates both **data science excellence** and **software engineering maturity** — from raw data ingestion to interactive model deployment. |
| 5 | +> |
| 6 | +> ⚙️ **Two integrated components:** |
| 7 | +> |
| 8 | +> 1. **End-to-End Machine Learning Pipeline** — data processing → training → evaluation → explainability. |
| 9 | +> 2. **Interactive Flask Dashboard** — real-time single & batch prediction app powered by the trained model. |
4 | 10 |
|
5 | | -# 🩺 Diabetes Risk Prediction Project: End-to-End Pipeline & Web Dashboard |
| 11 | +--- |
6 | 12 |
|
7 | | -This repository contains an end-to-end data science project focused on predicting the risk of diabetes from medical features. It includes the **full machine learning pipeline** (data loading, cleaning, training, evaluation, explainability) and a companion **Flask Web Dashboard** for live predictions. |
| 13 | +## Highlights |
8 | 14 |
|
9 | | -## 🎯 Objective |
| 15 | +* **Automated ML Pipeline:** Modular scripts for loading, preprocessing, EDA, training, evaluation, and model explainability. |
| 16 | +* **Interactive Web Dashboard:** Built with Flask + Bootstrap + Chart.js, for clinicians and analysts to interactively explore model predictions. |
| 17 | +* **Explainability:** Integrated SHAP/LIME interpretability tools, visualising local and global feature contributions. |
| 18 | +* **Production-ready structure:** Logging, testing, CI, and pre-commit hooks aligned with professional ML engineering standards. |
| 19 | +* **Collaborative foundation:** Code documented with Javadoc-style docstrings, pytest coverage, and a clean file architecture. |
10 | 20 |
|
11 | | -The primary goal is to **build and deploy a robust machine learning model** (Logistic Regression, Random Forest, etc.) that accurately predicts the onset of diabetes based on medical features (e.g., Age, Glucose, BMI). The secondary goal is to serve this model via a simple, functional web interface to assist healthcare professionals in identifying high-risk individuals for early intervention. |
| 21 | +--- |
12 | 22 |
|
13 | | -## 💻 Overview: Pipeline & Dashboard |
| 23 | +## Project Architecture |
14 | 24 |
|
15 | | -The project is split into two core components: |
| 25 | +``` |
| 26 | +Diabetes-Risk-Prediction-Project/ |
| 27 | +├── data/ # raw dataset (diabetes.csv) |
| 28 | +├── models/ # saved ML models (.joblib) |
| 29 | +├── reports/ # generated plots, explainability & metrics |
| 30 | +│ ├── explain/ # SHAP/LIME visual outputs |
| 31 | +│ ├── models/ # trained model artifacts |
| 32 | +│ └── figures/ # EDA & evaluation plots |
| 33 | +├── src/ |
| 34 | +│ ├── data_loading.py |
| 35 | +│ ├── data_processing.py |
| 36 | +│ ├── data_exploration.py |
| 37 | +│ ├── data_visualisation.py |
| 38 | +│ ├── statistical_analysis.py |
| 39 | +│ ├── model_training.py |
| 40 | +│ ├── model_evaluation.py |
| 41 | +│ └── dashboard/ # Flask dashboard app |
| 42 | +│ ├── app.py |
| 43 | +│ ├── predict.py |
| 44 | +│ ├── routes.py |
| 45 | +│ ├── templates/ |
| 46 | +│ │ └── index.html |
| 47 | +│ └── static/ |
| 48 | +├── tests/ # unit, integration, and dashboard tests |
| 49 | +├── main.py # unified pipeline runner |
| 50 | +├── requirements.txt |
| 51 | +├── pyproject.toml |
| 52 | +└── README.md |
| 53 | +``` |
16 | 54 |
|
17 | | -1. **Data Science Pipeline (`src/`):** A modular workflow for data preparation, model training, evaluation, and interpretability. The master script for this is `main.py`. |
18 | | -2. **Flask Dashboard (`src/dashboard/`):** A lightweight web application that loads a pre-trained model artifact and provides a user interface for real-time risk prediction. |
| 55 | +--- |
19 | 56 |
|
20 | | ------ |
| 57 | +## Part 1 — End-to-End Machine Learning Pipeline |
21 | 58 |
|
22 | | -## 📂 Repository Structure |
| 59 | +This component implements a **complete data science workflow** — from ingestion to explainability — using the Pima Indians Diabetes dataset. |
23 | 60 |
|
24 | | -``` |
25 | | -diabetes_risk_prediction_project/ |
26 | | -├── data/ |
27 | | -│ └── diabetes.csv |
28 | | -├── models/ |
29 | | -│ └── diabetes_prediction_model.joblib # Trained models saved here |
30 | | -├── reports/ |
31 | | -│ ├── bmi_distribution_by_outcome.png |
32 | | -│ ├── ... (logs, plots, evaluation results) |
33 | | -├── src/ |
34 | | -│ ├── data_loading.py |
35 | | -│ ├── data_processing.py |
36 | | -│ ├── data_exploration.py |
37 | | -│ ├── ... (other pipeline scripts) |
38 | | -│ └── dashboard/ |
39 | | -│ ├── app.py # Flask application entry |
40 | | -│ └── predict.py # Model prediction logic |
41 | | -├── .gitignore |
42 | | -├── LICENCE |
43 | | -├── **main.py** # Runs the end-to-end ML pipeline |
44 | | -└── requirements.txt |
45 | | -``` |
| 61 | +### ⚙️ Workflow Overview |
46 | 62 |
|
47 | | ------ |
| 63 | +1. **Data Loading:** |
| 64 | + Load `data/diabetes.csv` into pandas and validate structure. |
| 65 | + `python src/data_loading.py --data ./data/diabetes.csv` |
48 | 66 |
|
49 | | -## 🛠️ Getting Started |
50 | | - |
51 | | -### 1\. Initial Setup |
52 | | - |
53 | | -1. **Clone the repository:** |
54 | | - ```bash |
55 | | - git clone https://github.com/AAdewunmi/Diabetes-Risk-Prediction-Project.git |
56 | | - cd Diabetes-Risk-Prediction-Project |
57 | | - ``` |
58 | | -2. **Create and activate a virtual environment (macOS/Linux):** |
59 | | - ```bash |
60 | | - python3 -m venv venv |
61 | | - source venv/bin/activate # On Windows: venv\Scripts\activate |
62 | | - ``` |
63 | | -3. **Install dependencies:** |
64 | | - ```bash |
65 | | - pip install -r requirements.txt |
66 | | - # Include SHAP/LIME if you want model explainability support |
67 | | - pip install shap lime |
68 | | - ``` |
69 | | -4. **Place the Dataset:** |
70 | | - Download the `diabetes.csv` file (likely the Pima Indians Diabetes Dataset) and place it in the **`data/`** directory. |
71 | | - |
72 | | -### 2\. Running the Data Science Pipeline |
73 | | - |
74 | | -Execute the full pipeline to train a model and generate reports: |
| 67 | +2. **Data Preprocessing:** |
| 68 | + Handle missing values, normalize numerical features, and encode categorical variables. |
| 69 | + `python src/data_processing.py --data ./data/diabetes.csv --out reports` |
75 | 70 |
|
76 | | -```bash |
77 | | -python main.py |
78 | | -``` |
| 71 | +3. **Exploratory Data Analysis (EDA):** |
| 72 | + Generate descriptive statistics, correlations, and visualizations (BMI, glucose, etc.). |
| 73 | + `python src/data_exploration.py --data ./data/diabetes.csv --out reports` |
| 74 | + |
| 75 | +4. **Statistical Analysis:** |
| 76 | + Run hypothesis tests and feature significance analysis. |
| 77 | + `python src/statistical_analysis.py --data ./data/diabetes.csv --out reports` |
| 78 | + |
| 79 | +5. **Model Training:** |
| 80 | + Train Logistic Regression, Random Forest, Gradient Boosting, or XGBoost models. |
| 81 | + Save the best model to `reports/models/`. |
79 | 82 |
|
80 | | - * This script performs all steps: data loading, processing, EDA, training (default model is Logistic Regression), and evaluation. |
81 | | - * The trained model artifact (e.g., `logreg_best.joblib`) will be saved in the **`models/`** directory. |
82 | | - * Plots and log files will be saved in the **`reports/`** directory. |
| 83 | + ```bash |
| 84 | + python src/model_training.py --data ./data/diabetes.csv --model rf --out_dir reports |
| 85 | + ``` |
83 | 86 |
|
84 | | -#### Customizing Model Training |
| 87 | +6. **Model Evaluation:** |
| 88 | + Evaluate accuracy, ROC AUC, and confusion matrix; save plots. |
| 89 | + `python src/model_evaluation.py --model reports/models/rf_best.joblib --out reports` |
85 | 90 |
|
86 | | -You can specify a different model for training using the `--model` CLI flag in `src/model_training.py`. |
| 91 | +7. **Explainability & Feature Importance:** |
| 92 | + Generate SHAP plots and local explanations stored under `reports/explain/`. |
87 | 93 |
|
88 | | -| Option | Model | Artifact Saved As | |
89 | | -| :--- | :--- | :--- | |
90 | | -| `logreg` | Logistic Regression (Default baseline) | `logreg_best.joblib` | |
91 | | -| `rf` | Random Forest Classifier | `rf_best.joblib` | |
92 | | -| `gb` | Gradient Boosting | `gb_best.joblib` | |
93 | | -| `xgb` | XGBoost (requires `xgboost` package) | `xgb_best.joblib` | |
| 94 | +8. **Run Entire Pipeline Automatically:** |
94 | 95 |
|
95 | | -**Example (Training Gradient Boosting):** |
| 96 | + ```bash |
| 97 | + python main.py |
| 98 | + ``` |
| 99 | + |
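As a rough illustration of the evaluation step (step 6 in the workflow above), the three metrics it names can be computed with scikit-learn as follows. This is a minimal sketch and not the contents of `src/model_evaluation.py`: it assumes a binary classifier exposing `predict_proba`, and it uses synthetic data from `make_classification` as a stand-in for `data/diabetes.csv`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset: 8 numeric features, binary outcome.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Any fitted classifier with predict/predict_proba would work here.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard class labels
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the positive class

accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)      # ROC AUC needs scores, not labels
cm = confusion_matrix(y_test, y_pred)        # rows: true class, columns: predicted

print(f"accuracy={accuracy:.3f} roc_auc={roc_auc:.3f}")
print(cm)
```

ROC AUC is computed from the predicted probabilities rather than the hard labels, since it measures how well the model ranks positive cases above negative ones.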
| 100 | +--- |
| 101 | + |
| 102 | +## Part 2 — Flask Web Application (Dashboard) |
| 103 | + |
| 104 | +An interactive dashboard that loads the trained model from `reports/models/` and enables both **single** and **batch** predictions. |
| 105 | + |
| 106 | +### Quickstart (Local Run) |
96 | 107 |
|
97 | 108 | ```bash |
98 | | -python src/model_training.py --data data/diabetes.csv --model gb --out_dir reports |
| 109 | +# 1. From the repo root |
| 110 | +python -m pip install -r requirements.txt |
| 111 | + |
| 112 | +# 2. Run the Flask app |
| 113 | +PYTHONPATH=src python src/dashboard/app.py |
| 114 | + |
| 115 | +# 3. Visit |
| 116 | +#    http://127.0.0.1:5000 |
99 | 117 | ``` |
100 | 118 |
|
101 | | -### 3\. Running the Flask Dashboard (Web App) |
| 119 | +or explicitly specify a model path: |
102 | 120 |
|
103 | | -Once a model has been trained and saved in `models/`, you can launch the prediction dashboard: |
| 121 | +```bash |
| 122 | +PYTHONPATH=src python src/dashboard/app.py --model reports/models/rf_best.joblib |
| 123 | +``` |
104 | 124 |
|
105 | | -1. **Run locally (development):** |
106 | | - ```bash |
107 | | - # Ensure you are in the project root |
108 | | - python src/dashboard/app.py --host 0.0.0.0 --port 5000 |
109 | | - ``` |
110 | | -2. **Access the Dashboard:** Open your browser to **[http://127.0.0.1:5000](http://127.0.0.1:5000)**. |
| 125 | +### Features |
111 | 126 |
|
112 | | -<!-- end list --> |
| 127 | +| Panel | Description | |
| 128 | +| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | |
| 129 | +| **Quick Single Prediction** | Enter medical features manually → get predicted probability and SHAP explanation. | |
| 130 | +| **Batch CSV Upload** | Upload a `.csv` file with multiple patients → get batch summary, visualized histogram, and downloadable explainability artifacts. | |
| 131 | +| **Notes & Guidance** | Practical interpretation guide for clinicians and data scientists. | |
113 | 132 |
|
114 | | - * By default, the dashboard tries to load the first available model artifact in the `models/` directory. |
115 | | - * You can explicitly specify a model path: |
116 | | - ```bash |
117 | | - python src/dashboard/app.py --model models/gb_best.joblib |
118 | | - ``` |
| 133 | +All predictions, explanations, and generated files are timestamped and stored in `reports/explain/`. |
119 | 134 |
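The batch upload and timestamped outputs described above might look roughly like this on the server side. This is a hedged sketch using only the standard library, not the code in `src/dashboard/predict.py`. The column names are an assumption (the usual headers of the Pima Indians Diabetes CSV), and `parse_batch_csv` and `timestamped_name` are hypothetical helpers introduced for illustration.

```python
import csv
import io
from datetime import datetime, timezone

# Assumed feature columns (standard Pima Indians Diabetes dataset headers).
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]


def parse_batch_csv(text):
    """Return a list of float feature rows; raise ValueError on bad input."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, record in enumerate(reader, start=2):
        try:
            rows.append([float(record[c]) for c in EXPECTED_COLUMNS])
        except (TypeError, ValueError) as exc:
            raise ValueError(f"row {line_no}: non-numeric or missing value") from exc
    return rows


def timestamped_name(prefix, ext):
    """Build a file name such as 'batch_20250101T120000Z.csv' (UTC)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}_{stamp}.{ext}"


sample = (
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age\n"
    "6,148,72,35,0,33.6,0.627,50\n"
)
features = parse_batch_csv(sample)
print(features, timestamped_name("batch", "csv"))
```

Rejecting malformed uploads before they reach the model keeps prediction errors separate from input errors, which matters when the file comes from an untrusted browser form.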
|
120 | | ------ |
| 135 | +--- |
| 136 | + |
| 137 | +## Testing Strategy |
121 | 138 |
|
122 | | -## ⚙️ Data Science Workflow |
| 139 | +Run tests locally before pushing: |
123 | 140 |
|
124 | | -The project follows a structured and modular workflow for end-to-end machine learning pipeline development: |
| 141 | +```bash |
| 142 | +PYTHONPATH=src pytest -q |
| 143 | +``` |
125 | 144 |
|
126 | | -### 1\. **Data Loading** |
| 145 | +The repository includes: |
127 | 146 |
|
128 | | - * Source: [Kaggle – Diabetes Health Indicators Dataset (or similar)](https://www.kaggle.com/datasets/aaron7sun/diabetes-health-indicators-dataset) |
129 | | - * Loaded using `pandas` via `src/data_loading.py`. |
| 147 | +* **Unit tests:** For `ModelWrapper`, preprocessing, and data loaders. |
| 148 | +* **API tests:** Flask routes and endpoints (`/predict`, `/predict_batch`). |
| 149 | +* **Integration tests:** Pipeline execution to ensure end-to-end consistency. |
130 | 150 |
|
131 | | -### 2\. **Data Preprocessing** |
| 151 | +CI/CD integration (via GitHub Actions) ensures tests run automatically on every push. |
132 | 152 |
|
133 | | - * Missing values removed or imputed. |
134 | | - * Categorical variables encoded using one-hot encoding. |
135 | | - * Data normalization/standardization where appropriate. |
136 | | - * Implemented in `src/data_processing.py`. |
| 153 | +--- |
137 | 154 |
|
138 | | -### 3\. **Exploratory Data Analysis (EDA)** |
| 155 | +## Technologies Used |
139 | 156 |
|
140 | | - * Exploratory data analysis (EDA) using `pandas`, `matplotlib`, and `seaborn` in `src/data_exploration.py`. |
141 | | - * Extended statistical tests using `pandas`, `scipy`, and `numpy` in `src/statistical_analysis.py`. |
| 157 | +* **Language:** Python 3.11+ |
| 158 | +* **Core Libraries:** pandas, numpy, scikit-learn, joblib, shap, matplotlib, seaborn |
| 159 | +* **Web Framework:** Flask + Bootstrap + Chart.js |
| 160 | +* **Testing & Code Quality:** pytest, pre-commit, black, isort, ruff |
| 161 | +* **Tools:** VS Code, GitHub Actions CI, pre-commit hooks, reportlab for PDF export |
142 | 162 |
|
143 | | -### 4\. **Model Training** |
| 163 | +--- |
144 | 164 |
|
145 | | - * Data split into training and validation sets. |
146 | | - * Chosen classifier trained with cross-validation. |
147 | | - * Model saved using `joblib` in the `models/` directory. |
148 | | - * Handled in `src/model_training.py`. |
| 165 | +## Outputs |
149 | 166 |
|
150 | | -### 5\. **Model Evaluation** |
| 167 | +| Folder | Description | |
| 168 | +| ------------------ | --------------------------------------- | |
| 169 | +| `reports/models/` | Trained model artifacts (`.joblib`) | |
| 170 | +| `reports/explain/` | SHAP local & global explanations | |
| 171 | +| `reports/` | EDA visuals, evaluation plots, and logs | |
| 172 | +| `data/` | Input dataset (`diabetes.csv`) | |
| 173 | +| `tests/` | Pytest suite | |
151 | 174 |
|
152 | | - * Evaluated using classification report, ROC AUC score, and confusion matrix. |
153 | | - * Results printed and visualised in `src/model_evaluation.py`. |
| 175 | +--- |
154 | 176 |
|
155 | | -### 6\. **Model Explainability & Interpretability 💡** |
| 177 | +## Deployment Notes |
156 | 178 |
|
157 | | - * **Objective:** To understand **why** the model makes a particular prediction, crucial for healthcare applications. |
158 | | - * **Techniques:** **SHAP (SHapley Additive exPlanations)** and **LIME (Local Interpretable Model-agnostic Explanations)** are used to provide global feature importance and local, per-prediction reasoning. |
159 | | - * **Output:** Explanatory plots (force plots, summary plots, etc.) are generated and saved under `reports/`. |
| 179 | +For production: |
160 | 180 |
|
161 | | -### 7\. **Feature Importance Analysis** |
| 181 | +* Replace `app.secret_key` with an environment variable. |
| 182 | +* Serve via **Gunicorn** or **Waitress** instead of Flask’s dev server. |
| 183 | +* Mount static files via Nginx. |
| 184 | +* Optionally containerize using Docker with health checks. |
162 | 185 |
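For the first deployment note, one possible way to read the secret key from the environment is sketched below. The variable name `FLASK_SECRET_KEY` and the helper `load_secret_key` are assumptions made for illustration; the project may use different names.

```python
import os
import secrets


def load_secret_key(env_var="FLASK_SECRET_KEY"):
    """Return the secret key from the environment, or a random fallback.

    The fallback is only suitable for local development: it changes on
    every process start, so existing sessions are invalidated on restart.
    """
    key = os.environ.get(env_var)
    if key:
        return key
    return secrets.token_hex(32)  # 64 hexadecimal characters


# In the Flask app this would be assigned as: app.secret_key = load_secret_key()
print(len(load_secret_key()))
```

In production it may be safer to fail fast when the variable is unset rather than fall back to a random key, since several worker processes would otherwise each generate a different key.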
|
163 | | - * Extracted from the trained model (e.g., coefficients for LogReg, feature importances for RF/GB). |
164 | | - * Top features visualized using bar plots via `src/data_visualisation.py`. |
| 186 | +--- |
165 | 187 |
|
166 | | -### 8\. **Data Science (End-To-End) Pipeline 🔄** |
| 188 | +## Example Use Case |
167 | 189 |
|
168 | | - * **Master Script:** The entire sequence of steps is executed via the central script, **`main.py`**. |
169 | | - * **Automation:** Ensures **reproducibility** by running all components sequentially and generating logs. |
170 | | - * **Integration:** The final saved model artifact (`models/*.joblib`) serves as the input dependency for the Flask Dashboard. |
| 190 | +This dashboard enables **clinicians** or **data scientists** to: |
171 | 191 |
|
172 | | ------ |
| 192 | +* Instantly assess diabetes risk for new patients. |
| 193 | +* Interpret which medical features contribute most to the prediction. |
| 194 | +* Batch-evaluate risk profiles for large datasets. |
| 195 | +* Export explainability artifacts (HTML/PNG) for audit and reporting. |
173 | 196 |
|
174 | | -## 🧪 Testing Strategy |
| 197 | +--- |
175 | 198 |
|
176 | | -A combination of unit and API tests ensures code reliability: |
| 199 | +## Acknowledgements |
177 | 200 |
|
178 | | - * **Unit Tests:** For small, pure logic functions (e.g., data preprocessing helpers, model wrapper class in `src/dashboard/predict.py`). |
179 | | - * **API Tests:** Using the Flask `test_client` to verify that web application endpoints respond correctly and return valid predictions. |
180 | | - * **Running Tests:** |
181 | | - ```bash |
182 | | - pytest -q |
183 | | - ``` |
| 201 | +Special thanks to: |
184 | 202 |
|
185 | | -## 🔐 Best Practices & Security Notes (Dashboard) |
| 203 | +* **The National Institute of Diabetes and Digestive and Kidney Diseases** — for the original dataset. |
| 204 | +* **OpenAI’s ChatGPT (GPT-5)** — for advanced assistance in refactoring, debugging, and structuring production-ready code, documentation, and CI integration. |
| 205 | +* The open-source community for continuous innovation in Python, Flask, and ML tooling. |
186 | 206 |
|
187 | | - * **Secrets:** Replace the placeholder `app.secret_key` with a real secret key using an environment variable in production deployments. |
188 | | - * **Model Path Control:** Carefully validate the model file path to prevent arbitrary file system access. |
189 | | - * **Input Validation:** Implement robust validation (e.g., using `pydantic`) on all form inputs to ensure features fall within expected clinical and statistical ranges before passing them to the model. |
190 | | - * **Containerization:** A Dockerfile and healthcheck are recommended for production deployment environments (e.g., using Gunicorn/Waitress). |
| 207 | +--- |
191 | 208 |
|
192 | | -## 🤝 Collaboration and Contact |
| 209 | +## Author |
193 | 210 |
|
194 | | -The project is open to contributions. Feel free to **open an issue** or submit a **pull request** for bug fixes or new features. |
| 211 | +**Adrian Adewunmi** |
195 | 212 |
|
196 | | -| Role | Author | |
197 | | -| :--- | :--- | |
198 | | -| **Author** | AAdewunmi (via GitHub: `https://github.com/AAdewunmi`) | |
199 | | -| **Contact** | Open an issue on the GitHub repository for questions or suggestions. | |
| 213 | +[GitHub](https://github.com/AAdewunmi) |
200 | 214 |
|
201 | 215 | ----- |
202 | 216 |
|
|