Commit 06fad74

Update README.md
1 parent fee2687 commit 06fad74

1 file changed

+158
-144
lines changed

README.md

Lines changed: 158 additions & 144 deletions
@@ -1,202 +1,216 @@
1-
# UNDER CONSTRUCTION
1+
# Diabetes Risk Prediction Project
22

3-
-----
3+
> **A full end-to-end machine learning and Flask web application that predicts diabetes risk and visualises explainability insights for individual or batch predictions.**
4+
> Built with **Python, scikit-learn, pandas, SHAP**, and **Flask**, this project covers the full workflow, from raw data ingestion to interactive model deployment.
5+
>
6+
> ⚙️ **Two integrated components:**
7+
>
8+
> 1. **End-to-End Machine Learning Pipeline** — data processing → training → evaluation → explainability.
9+
> 2. **Interactive Flask Dashboard** — real-time single & batch prediction app powered by the trained model.
410
5-
# 🩺 Diabetes Risk Prediction Project: End-to-End Pipeline & Web Dashboard
11+
---
612

7-
This repository contains an end-to-end data science project focused on predicting the risk of diabetes from medical features. It includes the **full machine learning pipeline** (data loading, cleaning, training, evaluation, explainability) and a companion **Flask Web Dashboard** for live predictions.
13+
## Highlights
814

9-
## 🎯 Objective
15+
* **Automated ML Pipeline:** Modular scripts for loading, preprocessing, EDA, training, evaluation, and model explainability.
16+
* **Interactive Web Dashboard:** Built with Flask + Bootstrap + Chart.js, for clinicians and analysts to interactively explore model predictions.
17+
* **Explainability:** Integrated SHAP/LIME interpretability tools, visualising local and global feature contributions.
18+
* **Production-ready structure:** Logging, testing, CI, and pre-commit hooks aligned with professional ML engineering standards.
19+
* **Collaborative foundation:** Code documented with Javadoc-style docstrings, pytest coverage, and a clean file architecture.
1020

11-
The primary goal is to **build and deploy a robust machine learning model** (Logistic Regression, Random Forest, etc.) that accurately predicts the onset of diabetes based on medical features (e.g., Age, Glucose, BMI). The secondary goal is to serve this model via a simple, functional web interface to assist healthcare professionals in identifying high-risk individuals for early intervention.
21+
---
1222

13-
## 💻 Overview: Pipeline & Dashboard
23+
## Project Architecture
1424

15-
The project is split into two core components:
25+
```
26+
Diabetes-Risk-Prediction-Project/
27+
├── data/ # raw dataset (diabetes.csv)
28+
├── models/ # saved ML models (.joblib)
29+
├── reports/ # generated plots, explainability & metrics
30+
│ ├── explain/ # SHAP/LIME visual outputs
31+
│ ├── models/ # trained model artifacts
32+
│ └── figures/ # EDA & evaluation plots
33+
├── src/
34+
│ ├── data_loading.py
35+
│ ├── data_processing.py
36+
│ ├── data_exploration.py
37+
│ ├── data_visualisation.py
38+
│ ├── statistical_analysis.py
39+
│ ├── model_training.py
40+
│ ├── model_evaluation.py
41+
│ └── dashboard/ # Flask dashboard app
42+
│ ├── app.py
43+
│ ├── predict.py
44+
│ ├── routes.py
45+
│ ├── templates/
46+
│ │ └── index.html
47+
│ └── static/
48+
├── tests/ # unit, integration, and dashboard tests
49+
├── main.py # unified pipeline runner
50+
├── requirements.txt
51+
├── pyproject.toml
52+
└── README.md
53+
```
1654

17-
1. **Data Science Pipeline (`src/`):** A modular workflow for data preparation, model training, evaluation, and interpretability. The master script for this is `main.py`.
18-
2. **Flask Dashboard (`src/dashboard/`):** A lightweight web application that loads a pre-trained model artifact and provides a user interface for real-time risk prediction.
55+
---
1956

20-
-----
57+
## Part 1 — End-to-End Machine Learning Pipeline
2158

22-
## 📂 Repository Structure
59+
This component implements a **complete data science workflow** — from ingestion to explainability — using the Pima Indians Diabetes dataset.
2360

24-
```
25-
diabetes_risk_prediction_project/
26-
├── data/
27-
│   └── diabetes.csv
28-
├── models/
29-
│   └── diabetes_prediction_model.joblib # Trained models saved here
30-
├── reports/
31-
│   ├── bmi_distribution_by_outcome.png
32-
│   ├── ... (logs, plots, evaluation results)
33-
├── src/
34-
│   ├── data_loading.py
35-
│   ├── data_processing.py
36-
│   ├── data_exploration.py
37-
│   ├── ... (other pipeline scripts)
38-
│   └── dashboard/
39-
│       ├── app.py # Flask application entry
40-
│       └── predict.py # Model prediction logic
41-
├── .gitignore
42-
├── LICENCE
43-
├── **main.py** # Runs the end-to-end ML pipeline
44-
└── requirements.txt
45-
```
61+
### ⚙️ Workflow Overview
4662

47-
-----
63+
1. **Data Loading:**
64+
Load `data/diabetes.csv` into pandas and validate structure.
65+
`python src/data_loading.py --data ./data/diabetes.csv`
4866

49-
## 🛠️ Getting Started
50-
51-
### 1\. Initial Setup
52-
53-
1. **Clone the repository:**
54-
```bash
55-
git clone https://github.com/AAdewunmi/Diabetes-Risk-Prediction-Project.git
56-
cd Diabetes-Risk-Prediction-Project
57-
```
58-
2. **Create and activate a virtual environment (macOS/Linux):**
59-
```bash
60-
python3 -m venv venv
61-
source venv/bin/activate # On Windows: venv\Scripts\activate
62-
```
63-
3. **Install dependencies:**
64-
```bash
65-
pip install -r requirements.txt
66-
# Include SHAP/LIME if you want model explainability support
67-
pip install shap lime
68-
```
69-
4. **Place the Dataset:**
70-
Download the `diabetes.csv` file (likely the Pima Indians Diabetes Dataset) and place it in the **`data/`** directory.
71-
72-
### 2\. Running the Data Science Pipeline
73-
74-
Execute the full pipeline to train a model and generate reports:
67+
2. **Data Preprocessing:**
68+
Handle missing values, normalize numerical features, and encode categorical variables.
69+
`python src/data_processing.py --data ./data/diabetes.csv --out reports`
7570

76-
```bash
77-
python main.py
78-
```
71+
3. **Exploratory Data Analysis (EDA):**
72+
Generate descriptive statistics, correlations, and visualizations (BMI, glucose, etc.).
73+
`python src/data_exploration.py --data ./data/diabetes.csv --out reports`
74+
75+
4. **Statistical Analysis:**
76+
Run hypothesis tests and feature significance analysis.
77+
`python src/statistical_analysis.py --data ./data/diabetes.csv --out reports`
78+
79+
5. **Model Training:**
80+
Train Logistic Regression, Random Forest, Gradient Boosting, or XGBoost models.
81+
Save the best model to `reports/models/`.
7982

80-
* This script performs all steps: data loading, processing, EDA, training (default model is Logistic Regression), and evaluation.
81-
* The trained model artifact (e.g., `logreg_best.joblib`) will be saved in the **`models/`** directory.
82-
* Plots and log files will be saved in the **`reports/`** directory.
83+
```bash
84+
python src/model_training.py --data ./data/diabetes.csv --model rf --out_dir reports
85+
```
8386

84-
#### Customizing Model Training
87+
6. **Model Evaluation:**
88+
Evaluate accuracy, ROC AUC, and confusion matrix; save plots.
89+
`python src/model_evaluation.py --model reports/models/rf_best.joblib --out reports`
8590

86-
You can specify a different model for training using the `--model` CLI flag in `src/model_training.py`.
91+
7. **Explainability & Feature Importance:**
92+
Generate SHAP plots and local explanations stored under `reports/explain/`.
8793

88-
| Option | Model | Artifact Saved As |
89-
| :--- | :--- | :--- |
90-
| `logreg` | Logistic Regression (Default baseline) | `logreg_best.joblib` |
91-
| `rf` | Random Forest Classifier | `rf_best.joblib` |
92-
| `gb` | Gradient Boosting | `gb_best.joblib` |
93-
| `xgb` | XGBoost (requires `xgboost` package) | `xgb_best.joblib` |
94+
8. **Run Entire Pipeline Automatically:**
9495

95-
**Example (Training Gradient Boosting):**
96+
```bash
97+
python main.py
98+
```
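
The stage commands listed above can be chained from a small runner. The following is a minimal sketch, not the repository's actual `main.py`: `build_stage_commands` and `run_pipeline` are hypothetical helpers, and the flags are copied from the commands in this README on the assumption that they behave as shown.

```python
import subprocess
import sys
from typing import Callable, List


def build_stage_commands(data: str, out: str, model: str = "rf") -> List[List[str]]:
    """Return one argv list per pipeline stage, in execution order."""
    py = sys.executable  # reuse the interpreter that is running this script
    model_path = f"{out}/models/{model}_best.joblib"
    return [
        [py, "src/data_loading.py", "--data", data],
        [py, "src/data_processing.py", "--data", data, "--out", out],
        [py, "src/data_exploration.py", "--data", data, "--out", out],
        [py, "src/statistical_analysis.py", "--data", data, "--out", out],
        [py, "src/model_training.py", "--data", data, "--model", model, "--out_dir", out],
        [py, "src/model_evaluation.py", "--model", model_path, "--out", out],
    ]


def run_pipeline(commands: List[List[str]], runner: Callable = subprocess.run) -> int:
    """Run stages sequentially; stop at the first failure and return its exit code."""
    for argv in commands:
        result = runner(argv)
        if result.returncode != 0:
            return result.returncode
    return 0
```

Calling `run_pipeline(build_stage_commands("./data/diabetes.csv", "reports"))` from the repo root would run the stages in order. The `runner` parameter exists only so the sequencing logic can be tested without launching real processes.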
99+
100+
---
101+
102+
## Part 2 — Flask Web Application (Dashboard)
103+
104+
An interactive dashboard that loads the trained model from `reports/models/` and enables both **single** and **batch** predictions.
105+
106+
### Quickstart (Local Run)
96107

97108
```bash
98-
python src/model_training.py --data data/diabetes.csv --model gb --out_dir reports
109+
# 1. From the repo root
110+
python -m pip install -r requirements.txt
111+
112+
# 2. Run the Flask app
113+
PYTHONPATH=src python src/dashboard/app.py
114+
115+
# 3. Visit
116+
# http://127.0.0.1:5000 (open this URL in your browser)
99117
```
100118

101-
### 3\. Running the Flask Dashboard (Web App)
119+
or explicitly specify a model path:
102120

103-
Once a model has been trained and saved in `models/`, you can launch the prediction dashboard:
121+
```bash
122+
python src/dashboard/app.py --model reports/models/rf_best.joblib
123+
```
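
How the app chooses a model when `--model` is omitted is not spelled out in this section. Below is a minimal sketch of one plausible rule, assuming it falls back to the first `.joblib` file (sorted by name) under `reports/models/`. `resolve_model_path` is a hypothetical helper, not a function taken from `src/dashboard/`.

```python
from pathlib import Path
from typing import Optional


def resolve_model_path(explicit: Optional[str] = None,
                       models_dir: str = "reports/models") -> Path:
    """Return the model file to load, preferring an explicitly given path."""
    if explicit:
        path = Path(explicit)
        if not path.is_file():
            raise FileNotFoundError(f"Model file not found: {path}")
        return path
    # No explicit path: pick the alphabetically first artifact in models_dir.
    candidates = sorted(Path(models_dir).glob("*.joblib"))
    if not candidates:
        raise FileNotFoundError(f"No .joblib model found in {models_dir}")
    return candidates[0]
```

Failing fast with `FileNotFoundError` keeps a missing artifact from surfacing later as a confusing error on the first prediction request.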
104124

105-
1. **Run locally (development):**
106-
```bash
107-
# Ensure you are in the project root
108-
python src/dashboard/app.py --host 0.0.0.0 --port 5000
109-
```
110-
2. **Access the Dashboard:** Open your browser to **[http://127.0.0.1:5000](http://127.0.0.1:5000)**.
125+
### Features
111126

112-
<!-- end list -->
127+
| Panel | Description |
128+
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
129+
| **Quick Single Prediction** | Enter medical features manually → get predicted probability and SHAP explanation. |
130+
| **Batch CSV Upload** | Upload a `.csv` file with multiple patients → get batch summary, visualized histogram, and downloadable explainability artifacts. |
131+
| **Notes & Guidance** | Practical interpretation guide for clinicians and data scientists. |
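
Before scoring, a batch upload has to be checked against the columns the model expects. The following is a rough sketch using only the standard library. The column names are assumed from the public Pima Indians Diabetes dataset and `parse_batch_csv` is a hypothetical helper, so the dashboard's real validation may differ.

```python
import csv
import io
from typing import Dict, List

# Assumed feature columns (public Pima Indians Diabetes dataset, without the Outcome label).
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]


def parse_batch_csv(text: str) -> List[Dict[str, float]]:
    """Parse uploaded CSV text into rows of floats, rejecting malformed input."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in FEATURES if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    rows = []
    for row_no, record in enumerate(reader, start=1):
        try:
            # Short rows yield None values, which float() rejects with TypeError.
            rows.append({c: float(record[c]) for c in FEATURES})
        except (TypeError, ValueError):
            raise ValueError(f"Non-numeric or missing value in data row {row_no}") from None
    return rows
```

Extra columns in the upload are ignored here. Range checks (for example, rejecting a negative BMI) would be a natural next step but are left out of this sketch.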
113132

114-
* By default, the dashboard tries to load the first available model artifact in the `models/` directory.
115-
* You can explicitly specify a model path:
116-
```bash
117-
python src/dashboard/app.py --model models/gb_best.joblib
118-
```
133+
All predictions, explanations, and generated files are timestamped and stored in `reports/explain/`.
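
One way to produce timestamped output names is to embed a UTC timestamp in the filename. This is a sketch under assumptions: `artifact_path`, its `prefix` and `suffix` parameters, and the `YYYYMMDDTHHMMSSZ` format are illustrative, and the dashboard may use a different naming scheme.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def artifact_path(prefix: str, suffix: str, out_dir: str = "reports/explain",
                  now: Optional[datetime] = None) -> Path:
    """Build a path like <out_dir>/<prefix>_<UTC timestamp><suffix>.

    `now` is expected to be a timezone-aware UTC datetime; it defaults to
    the current UTC time and exists mainly to make the function testable.
    """
    moment = now or datetime.now(timezone.utc)
    stamp = moment.strftime("%Y%m%dT%H%M%SZ")
    return Path(out_dir) / f"{prefix}_{stamp}{suffix}"
```

A timestamp in this format sorts lexicographically in chronological order, which makes the most recent artifact easy to find in a directory listing.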
119134

120-
-----
135+
---
136+
137+
## Testing Strategy
121138

122-
## ⚙️ Data Science Workflow
139+
Run tests locally before pushing:
123140

124-
The project follows a structured and modular workflow for end-to-end machine learning pipeline development:
141+
```bash
142+
PYTHONPATH=src pytest -q
143+
```
125144

126-
### 1\. **Data Loading**
145+
The repository includes:
127146

128-
* Source: [Kaggle – Diabetes Health Indicators Dataset (or similar)](https://www.kaggle.com/datasets/aaron7sun/diabetes-health-indicators-dataset)
129-
* Loaded using `pandas` via `src/data_loading.py`.
147+
* **Unit tests:** For `ModelWrapper`, preprocessing, and data loaders.
148+
* **API tests:** Flask routes and endpoints (`/predict`, `/predict_batch`).
149+
* **Integration tests:** Pipeline execution to ensure end-to-end consistency.
130150

131-
### 2\. **Data Preprocessing**
151+
CI/CD integration (via GitHub Actions) ensures tests run automatically on every push.
132152

133-
* Missing values removed or imputed.
134-
* Categorical variables encoded using one-hot encoding.
135-
* Data normalization/standardization where appropriate.
136-
* Implemented in `src/data_processing.py`.
153+
---
137154

138-
### 3\. **Exploratory Data Analysis (EDA)**
155+
## Technologies Used
139156

140-
* Exploratory data analysis (EDA) using `pandas`, `matplotlib`, and `seaborn` in `src/data_exploration.py`.
141-
* Extended statistical tests using `pandas`, `scipy`, and `numpy` in `src/statistical_analysis.py`.
157+
* **Language:** Python 3.11+
158+
* **Core Libraries:** pandas, numpy, scikit-learn, joblib, shap, matplotlib, seaborn
159+
* **Web Framework:** Flask + Bootstrap + Chart.js
160+
* **Testing & Code Quality:** pytest, pre-commit, black, isort, ruff
161+
* **Tools:** VS Code, GitHub Actions CI, pre-commit hooks, reportlab for PDF export
142162

143-
### 4\. **Model Training**
163+
---
144164

145-
* Data split into training and validation sets.
146-
* Chosen classifier trained with cross-validation.
147-
* Model saved using `joblib` in the `models/` directory.
148-
* Handled in `src/model_training.py`.
165+
## Outputs
149166

150-
### 5\. **Model Evaluation**
167+
| Folder | Description |
168+
| ------------------ | --------------------------------------- |
169+
| `reports/models/` | Trained model artifacts (`.joblib`) |
170+
| `reports/explain/` | SHAP local & global explanations |
171+
| `reports/` | EDA visuals, evaluation plots, and logs |
172+
| `data/` | Input dataset (`diabetes.csv`) |
173+
| `tests/` | Pytest suite |
151174

152-
* Evaluated using classification report, ROC AUC score, and confusion matrix.
153-
* Results printed and visualised in `src/model_evaluation.py`.
175+
---
154176

155-
### 6\. **Model Explainability & Interpretability 💡**
177+
## Deployment Notes
156178

157-
* **Objective:** To understand **why** the model makes a particular prediction, crucial for healthcare applications.
158-
* **Techniques:** **SHAP (SHapley Additive exPlanations)** and **LIME (Local Interpretable Model-agnostic Explanations)** are used to provide global feature importance and local, per-prediction reasoning.
159-
* **Output:** Explanatory plots (force plots, summary plots, etc.) are generated and saved under `reports/`.
179+
For production:
160180

161-
### 7\. **Feature Importance Analysis**
181+
* Replace `app.secret_key` with an environment variable.
182+
* Serve via **Gunicorn** or **Waitress** instead of Flask’s dev server.
183+
* Mount static files via Nginx.
184+
* Optionally containerize using Docker with health checks.
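
For the first point, here is a minimal sketch of reading the key from the environment. The variable name `FLASK_SECRET_KEY` and the helper `load_secret_key` are assumptions for illustration; the returned value would be assigned to `app.secret_key` in `app.py`.

```python
import os
import secrets
from typing import Mapping, Optional


def load_secret_key(env: Optional[Mapping[str, str]] = None,
                    var: str = "FLASK_SECRET_KEY",
                    production: bool = True) -> str:
    """Return the secret key from the environment.

    In production a missing key is an error. In development a random
    per-process key is generated instead, so sessions reset on restart.
    """
    env = os.environ if env is None else env
    key = env.get(var, "").strip()
    if key:
        return key
    if production:
        raise RuntimeError(f"{var} must be set in production")
    return secrets.token_hex(32)
```

Raising in production rather than silently falling back avoids shipping a deployment whose session key changes on every restart.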
162185

163-
* Extracted from the trained model (e.g., coefficients for LogReg, feature importances for RF/GB).
164-
* Top features visualized using bar plots via `src/data_visualisation.py`.
186+
---
165187

166-
### 8\. **Data Science (End-To-End) Pipeline 🔄**
188+
## Example Use Case
167189

168-
* **Master Script:** The entire sequence of steps is executed via the central script, **`main.py`**.
169-
* **Automation:** Ensures **reproducibility** by running all components sequentially and generating logs.
170-
* **Integration:** The final saved model artifact (`models/*.joblib`) serves as the input dependency for the Flask Dashboard.
190+
This dashboard enables **clinicians** or **data scientists** to:
171191

172-
-----
192+
* Instantly assess diabetes risk for new patients.
193+
* Interpret which medical features contribute most to the prediction.
194+
* Batch-evaluate risk profiles for large datasets.
195+
* Export explainability artifacts (HTML/PNG) for audit and reporting.
173196

174-
## 🧪 Testing Strategy
197+
---
175198

176-
A combination of unit and API tests ensures code reliability:
199+
## Acknowledgements
177200

178-
* **Unit Tests:** For small, pure logic functions (e.g., data preprocessing helpers, model wrapper class in `src/dashboard/predict.py`).
179-
* **API Tests:** Using the Flask `test_client` to verify that web application endpoints respond correctly and return valid predictions.
180-
* **Running Tests:**
181-
```bash
182-
pytest -q
183-
```
201+
Special thanks to:
184202

185-
## 🔐 Best Practices & Security Notes (Dashboard)
203+
* **The National Institute of Diabetes and Digestive and Kidney Diseases** — for the original dataset.
204+
* **OpenAI’s ChatGPT (GPT-5)** — for advanced assistance in refactoring, debugging, and structuring production-ready code, documentation, and CI integration.
205+
* The open-source community for continuous innovation in Python, Flask, and ML tooling.
186206

187-
* **Secrets:** Replace the placeholder `app.secret_key` with a real secret key using an environment variable in production deployments.
188-
* **Model Path Control:** Carefully validate the model file path to prevent arbitrary file system access.
189-
* **Input Validation:** Implement robust validation (e.g., using `pydantic`) on all form inputs to ensure features fall within expected clinical and statistical ranges before passing them to the model.
190-
* **Containerization:** A Dockerfile and healthcheck are recommended for production deployment environments (e.g., using Gunicorn/Waitress).
207+
---
191208

192-
## 🤝 Collaboration and Contact
209+
## Author
193210

194-
The project is open to contributions. Feel free to **open an issue** or submit a **pull request** for bug fixes or new features.
211+
**Adrian Adewunmi**
195212

196-
| Role | Author |
197-
| :--- | :--- |
198-
| **Author** | AAdewunmi (via GitHub: `https://github.com/AAdewunmi`) |
199-
| **Contact** | Open an issue on the GitHub repository for questions or suggestions. |
213+
[GitHub](https://github.com/AAdewunmi)
200214

201215
-----
202216
