4 changes: 4 additions & 0 deletions Diabetes Prediction [END 2 END]/diabetes_pipeline/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
__pycache__/
*.pyc
model/*.pkl
.venv/
152 changes: 152 additions & 0 deletions Diabetes Prediction [END 2 END]/diabetes_pipeline/README.md
@@ -0,0 +1,152 @@
# Diabetes Prediction – Machine Learning Pipeline

> ⚠️ This repository is a **forked project**.
> The work below represents my **independent contribution and extension** to the original codebase.

This project implements a complete **end-to-end machine learning pipeline** for predicting diabetes using the Pima Indians Diabetes dataset.
The pipeline covers **data preprocessing, model training, evaluation, experimentation, and inference via CLI**.

---

## 📁 Project Structure
diabetes_pipeline/
│
├── dataset/
│   └── kaggle_diabetes.csv
│
├── model/
│   ├── diabetes_model.pkl
│   └── scaler.pkl
│
├── experiments/
│   └── experiment_runner.py
│
├── data_preprocessing.py
├── train.py
├── predict.py
├── evaluate.py
└── README.md

---

## 🚀 My Contributions

I independently designed and implemented the following components:

### 1. Data Preprocessing Pipeline
- Handled missing values in medical features:
  - `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`
  - Replaced invalid zeros with `NaN`
  - Applied **mean imputation** (`Glucose`, `BloodPressure`) and **median imputation** (`SkinThickness`, `Insulin`, `BMI`)
- Standardized features using `StandardScaler`
- Ensured consistent feature names across training and inference

📄 `data_preprocessing.py`
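
The zero-to-`NaN` replacement and imputation steps listed above can be sketched on a toy frame. This is a minimal illustration rather than the project's code: the four-row DataFrame and its values are made up, and only two of the five affected columns are shown.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); a 0 stands for a missing measurement.
df = pd.DataFrame({
    "Glucose": [120.0, 0.0, 100.0, 110.0],
    "Insulin": [80.0, 0.0, 0.0, 40.0],
})

# Step 1: mark invalid zeros as missing.
cols_with_zero = ["Glucose", "Insulin"]
df[cols_with_zero] = df[cols_with_zero].replace(0, np.nan)

# Step 2: impute, using the mean for Glucose and the median for Insulin.
df["Glucose"] = df["Glucose"].fillna(df["Glucose"].mean())
df["Insulin"] = df["Insulin"].fillna(df["Insulin"].median())

print(df)
```

The median is less sensitive to extreme values than the mean, which is a common reason to prefer it for skewed columns such as `Insulin`.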

---

### 2. Model Training
- Implemented a reproducible training pipeline
- Trained and persisted:
  - Random Forest classifier
  - Feature scaler
- Stored trained artifacts for reuse and deployment

📄 `train.py`

---

### 3. Model Evaluation
- Added evaluation logic with:
  - Accuracy
  - Precision, Recall, F1-score
- Verified generalization on the test set

📄 `evaluate.py`
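
For reference, the metrics above all follow from the four confusion-matrix counts. The sketch below computes them in plain Python for made-up labels; `binary_metrics` is a hypothetical helper written for illustration and is not part of `evaluate.py`, which uses scikit-learn's `accuracy_score` and `classification_report`.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 1 = diabetic, 0 = not diabetic.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can look good on an imbalanced dataset, which is why precision, recall and F1 for the positive class are reported alongside it.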

---

### 4. Experimentation Framework
- Benchmarked multiple ML models:
  - Logistic Regression
  - Decision Tree
  - Random Forest
  - Support Vector Machine (SVM)
- Reports accuracy and F1-score for each model and writes them to `results.csv`

📄 `experiments/experiment_runner.py`

#### Sample Results

| Model | Accuracy | F1 Score |
|----------------------|----------|----------|
| Logistic Regression | 0.7875 | 0.6320 |
| Decision Tree | 0.9875 | 0.9805 |
| Random Forest | 0.9950 | 0.9921 |
| SVM | 0.8450 | 0.7328 |

✔️ **Random Forest performs best on this dataset**

---

### 5. Command-Line Prediction Interface
- Built a CLI-based inference script
- Ensures:
  - Correct feature order
  - Feature-name alignment with trained scaler
- Predicts diabetes for a single patient input

📄 `predict.py`

Example:
```bash
python predict.py \
--pregnancies 2 \
--glucose 120 \
--bp 70 \
--skin 20 \
--insulin 80 \
--bmi 25 \
--dpf 0.5 \
--age 35
```


---

## 🛠️ Tech Stack

- Python 3.10+
- pandas
- numpy
- scikit-learn
- joblib

---

## 🧩 Notes

- Project is modular and deployment-ready
- Structured to support FastAPI / Flask integration
- Generated files (bytecode caches, model pickles, virtual environment) are excluded via `.gitignore`
- Suitable for internship-level ML engineering evaluation

---

## 👩‍💻 Author Contribution

**Contributor:** Tandrita Mukherjee

**Contribution Scope:**
- ML pipeline design
- Data preprocessing
- Model training & evaluation
- Experimentation framework
- CLI-based inference system

---

## 📌 Disclaimer

This repository is a fork of an existing project.
All enhancements, restructuring, and ML pipeline components listed above were implemented independently as part of my learning and internship preparation.
7 changes: 7 additions & 0 deletions Diabetes Prediction [END 2 END]/diabetes_pipeline/config.py
@@ -0,0 +1,7 @@
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent

MODEL_DIR = BASE_DIR / "model"
MODEL_PATH = MODEL_DIR / "diabetes_model.pkl"
SCALER_PATH = MODEL_DIR / "scaler.pkl"
@@ -0,0 +1,29 @@
# diabetes_pipeline/data_preprocessing.py

import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split

def load_and_preprocess(test_size=0.2, random_state=0):
    """Load the CSV, impute invalid zeros, and return a train/test split.

    Returns X_train, X_test, y_train, y_test. Features are not scaled here.
    """
    BASE_DIR = Path(__file__).resolve().parent
    csv_path = BASE_DIR / "dataset" / "kaggle_diabetes.csv"
    df = pd.read_csv(csv_path)

    df = df.rename(columns={'DiabetesPedigreeFunction': 'DPF'})

    # A zero in these columns is not a valid measurement, so treat it as missing
    cols_with_zero = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
    df[cols_with_zero] = df[cols_with_zero].replace(0, np.nan)

    # Mean imputation for Glucose and BloodPressure, median for the rest
    df['Glucose'] = df['Glucose'].fillna(df['Glucose'].mean())
    df['BloodPressure'] = df['BloodPressure'].fillna(df['BloodPressure'].mean())
    df['SkinThickness'] = df['SkinThickness'].fillna(df['SkinThickness'].median())
    df['Insulin'] = df['Insulin'].fillna(df['Insulin'].median())
    df['BMI'] = df['BMI'].fillna(df['BMI'].median())

    X = df.drop(columns='Outcome')
    y = df['Outcome']

    return train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
20 changes: 20 additions & 0 deletions Diabetes Prediction [END 2 END]/diabetes_pipeline/evaluate.py
@@ -0,0 +1,20 @@
import joblib
from sklearn.metrics import accuracy_score, classification_report
from data_preprocessing import load_and_preprocess
from config import MODEL_PATH, SCALER_PATH

# Load data (load_and_preprocess returns exactly four values)
X_train, X_test, y_train, y_test = load_and_preprocess()

# Load trained model and the scaler that train.py saves alongside it
model = joblib.load(MODEL_PATH)
scaler = joblib.load(SCALER_PATH)

# Predict on test features scaled the same way predict.py scales its input
y_pred = model.predict(scaler.transform(X_test))

# Metrics
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("\nClassification Report:\n", report)
@@ -0,0 +1,43 @@
# diabetes_pipeline/experiments/experiment_runner.py

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

from diabetes_pipeline.data_preprocessing import load_and_preprocess

X_train, X_test, y_train, y_test = load_and_preprocess()

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": SVC()
}

results = []

for name, model in models.items():
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("model", model)
    ])

    pipeline.fit(X_train, y_train)
    preds = pipeline.predict(X_test)

    results.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, preds),
        "F1 Score": f1_score(y_test, preds)
    })

df = pd.DataFrame(results)
print(df)

df.to_csv("diabetes_pipeline/experiments/results.csv", index=False)
@@ -0,0 +1,5 @@
Model,Accuracy,F1 Score
LogisticRegression,0.7875,0.6320346320346321
DecisionTree,0.9875,0.980544747081712
RandomForest,0.995,0.9921259842519685
SVM,0.845,0.7327586206896551
@@ -0,0 +1,4 @@
2025-12-28 11:48:56,518 - INFO - Training started
2025-12-28 11:48:56,641 - INFO - Model and scaler saved successfully
2025-12-28 11:49:14,730 - INFO - Training started
2025-12-28 11:49:14,821 - INFO - Model and scaler saved successfully
45 changes: 45 additions & 0 deletions Diabetes Prediction [END 2 END]/diabetes_pipeline/predict.py
@@ -0,0 +1,45 @@
import argparse
import joblib
import pandas as pd

MODEL_PATH = "model/diabetes_model.pkl"
SCALER_PATH = "model/scaler.pkl"

parser = argparse.ArgumentParser()
parser.add_argument("--pregnancies", type=int, required=True)
parser.add_argument("--glucose", type=float, required=True)
parser.add_argument("--bp", type=float, required=True)
parser.add_argument("--skin", type=float, required=True)
parser.add_argument("--insulin", type=float, required=True)
parser.add_argument("--bmi", type=float, required=True)
parser.add_argument("--dpf", type=float, required=True)
parser.add_argument("--age", type=int, required=True)

args = parser.parse_args()

# Load model & scaler
model = joblib.load(MODEL_PATH)
scaler = joblib.load(SCALER_PATH)

# IMPORTANT: feature names must match training
input_data = pd.DataFrame([{
    "Pregnancies": args.pregnancies,
    "Glucose": args.glucose,
    "BloodPressure": args.bp,
    "SkinThickness": args.skin,
    "Insulin": args.insulin,
    "BMI": args.bmi,
    "DPF": args.dpf,
    "Age": args.age
}])

# Scale & predict
input_scaled = scaler.transform(input_data)
prediction = model.predict(input_scaled)[0]

if prediction == 1:
    print("⚠️ Diabetes detected")
else:
    print("✅ No diabetes detected")


29 changes: 29 additions & 0 deletions Diabetes Prediction [END 2 END]/diabetes_pipeline/train.py
@@ -0,0 +1,29 @@
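import logging
from pathlib import Path

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from data_preprocessing import load_and_preprocess
from config import MODEL_PATH, SCALER_PATH, MODEL_DIR

# Logging setup (create the log directory first; basicConfig does not create it)
Path("logs").mkdir(exist_ok=True)
logging.basicConfig(
    filename="logs/training.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

logging.info("Training started")

# Load data (load_and_preprocess returns exactly four values)
X_train, X_test, y_train, y_test = load_and_preprocess()

# Fit the scaler on the training split only, then scale the training features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Train model
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train_scaled, y_train)

# Save artifacts
MODEL_DIR.mkdir(exist_ok=True)
joblib.dump(classifier, MODEL_PATH)
joblib.dump(scaler, SCALER_PATH)

logging.info("Model and scaler saved successfully")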