Transforming healthcare through intelligent data analysis
This production-ready machine learning pipeline detects anaemia and predicts mortality risk, reaching 95.2% accuracy on its held-out test set (see the results table below). Built for medical professionals and data scientists, it combines gradient-boosted models with interpretable AI to turn predictions into actionable insights.
**Advanced ML:** XGBoost + GPU · **Deep Analytics:** SHAP explainability · **Production Ready:** modular & scalable
## 🔬 Core Capabilities
| Feature | Description | Status |
|---|---|---|
| 🧬 Smart Preprocessing | Automated data cleaning & standardization | ✅ Ready |
| ⚖️ Class Balancing | SMOTE oversampling for imbalanced datasets | ✅ Ready |
| 🚀 GPU Acceleration | Automatic GPU detection & utilization | ✅ Ready |
| 🎯 AutoML Tuning | Optuna-powered hyperparameter optimization | ✅ Ready |
| 🔍 Explainable AI | SHAP-based model interpretability | ✅ Ready |
| 📊 Rich Visualizations | Interactive plots & comprehensive reports | ✅ Ready |
| 🛡️ Data Protection | Built-in leakage prevention mechanisms | ✅ Ready |
| 📝 Detailed Logging | Full execution tracking & debugging | ✅ Ready |
## 🎨 Visual Analytics
- 📈 Confusion Matrix Heatmaps
- 📊 ROC/AUC Curves
- 🎯 Feature Importance Charts
- 💧 SHAP Waterfall Plots
- 🌈 Class Distribution Graphs
- 📉 Training History Visualizations
## 🔧 Technical Highlights
✓ Type-hinted functions for better IDE support
✓ Comprehensive docstrings (Google style) - see the sketch below
✓ PEP 8 compliant code
✓ Modular architecture for easy testing
✓ Exception handling & validation
✓ Reproducible results (seed control)
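As a flavor of those conventions, here is a hypothetical helper written in the project's style (type hints plus a Google-style docstring). The function itself is illustrative, not part of the codebase:

```python
import pandas as pd

def summarize_missing(df: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Report the fraction of missing values per column.

    Args:
        df: Input dataset.
        threshold: Columns with a missing fraction above this value
            are flagged for review.

    Returns:
        A Series mapping column names to their missing-value fraction,
        restricted to columns above ``threshold``.
    """
    missing = df.isna().mean()  # per-column fraction of NaN values
    return missing[missing > threshold]
```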
## 📦 Project Structure

```text
📦 gp1-anaemia-detection-model
┣ 📂 src/ # 🧠 Core source code
┃ ┣ 📜 config.py # ⚙️ Configuration & constants
┃ ┣ 📜 data_loader.py # 📥 Data loading utilities
┃ ┣ 📜 data_preprocessing.py # 🔧 Preprocessing pipeline
┃ ┣ 📜 model_training.py # 🎓 Model training functions
┃ ┣ 📜 hyperparameter_tuning.py # 🎯 Optuna optimization
┃ ┣ 📜 model_evaluation.py # 📊 Evaluation & metrics
┃ ┣ 📜 explainability.py # 🔍 SHAP analysis
┃ ┗ 📜 main.py # 🚀 Pipeline orchestration
┣ 📂 data/ # 💾 Datasets
┃ ┣ 📊 dataset.csv
┃ ┗ 📊 dataset_backup_pre_leakfix.csv
┣ 📂 models/ # 🤖 Saved models
┣ 📂 notebooks/ # 📓 Jupyter notebooks
┃ ┗ 📓 pipeline_for_anaemia.ipynb
┣ 📂 assets/ # 🎨 Images & resources
┣ 📜 requirements.txt # 📦 Dependencies
┗ 📜 README.md # 📖 Documentation
```
## 🛠️ Installation

**Requirements:** required and optional dependencies are listed in `requirements.txt`.
```bash
# 1️⃣ Clone the repository
git clone https://github.com/willow788/gp1-anaemia-detection-model.git
cd gp1-anaemia-detection-model
# 2️⃣ Create virtual environment (recommended)
python -m venv venv
# Activate on Windows
venv\Scripts\activate
# Activate on macOS/Linux
source venv/bin/activate
# 3️⃣ Install dependencies
pip install -r requirements.txt
# 4️⃣ Verify installation
python -c "import xgboost, shap, optuna; print('✅ All dependencies installed!')"
```

### 🐳 Docker Installation (Alternative)

```bash
# Build Docker image
docker build -t anaemia-detection .
# Run container
docker run -it --rm -v $(pwd)/data:/app/data anaemia-detection
```

## 🚀 Quick Start

```python
from src.main import run_pipeline
# Execute with one line!
results = run_pipeline()
# 🎉 That's it! The pipeline handles everything:
# ✓ Data loading & preprocessing
# ✓ Model training & optimization
# ✓ Evaluation & visualization
# ✓ SHAP analysis & reporting
```

Advanced usage with custom options:

```python
from src.main import run_pipeline
results = run_pipeline(
use_backup=True, # 📂 Use backup dataset
optimize=True, # 🎯 Enable hyperparameter tuning
n_trials=100, # 🔄 Optuna trials
apply_smote_resampling=True, # ⚖️ Balance classes with SMOTE
generate_shap=True, # 🔍 Generate explainability report
save_final_model=True # 💾 Save trained model
)
# 📊 Access results
print(f"🎯 Accuracy: {results['metrics']['accuracy']:.2%}")
print(f"📈 F1 Score: {results['metrics']['f1_score']:.4f}")
print(f"🔥 Best Params: {results['best_params']}")📥 Data Loading
from src.data_loader import load_dataset, display_dataset_summary
# Load data
df = load_dataset()
# Show comprehensive summary
display_dataset_summary(df)
```

### 🔧 Preprocessing

```python
from src.data_preprocessing import (
standardize_column_names,
detect_target_column,
split_features_target
)
# Clean & prepare data
df = standardize_column_names(df)
target_col = detect_target_column(df)
X, y = split_features_target(df, target_col)
```

### 🎓 Model Training

```python
from src.model_training import train_xgboost_model, save_model
# Train model
model = train_xgboost_model(X_train, y_train)
# Save for production
save_model(model, 'production_model.pkl')
```

### 📊 Evaluation

```python
from src.model_evaluation import evaluate_model, plot_confusion_matrix
# Evaluate performance
metrics = evaluate_model(model, X_test, y_test)
# Visualize results
plot_confusion_matrix(metrics['confusion_matrix'])
```

### 🔍 Explainability

```python
from src.explainability import generate_shap_report
# Generate comprehensive SHAP analysis
shap_values = generate_shap_report(
model,
X_test,
feature_names=feature_names
)
```

## Pipeline Architecture

```mermaid
graph TB
A[📥 Data Loading] --> B[🔧 Preprocessing]
B --> C[✂️ Train/Test Split]
C --> D[📏 Scaling]
D --> E[⚖️ SMOTE Balancing]
E --> F{🎯 Optimize?}
F -->|Yes| G[🔬 Optuna Tuning]
F -->|No| H[🎓 Train XGBoost]
G --> H
H --> I[📊 Evaluation]
I --> J[🔍 SHAP Analysis]
J --> K[💾 Save Model]
K --> L[✅ Complete]
style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#ffe1f5
style D fill:#e1ffe1
style E fill:#f5e1ff
style F fill:#ffe1e1
style G fill:#ffece1
style H fill:#e1f0ff
style I fill:#fff0e1
style J fill:#f0e1ff
style K fill:#e1ffec
style L fill:#90EE90
```
| Stage | Description | Time | Output |
|---|---|---|---|
| 1️⃣ Data Loading | Load & validate dataset | ~1s | DataFrame |
| 2️⃣ Preprocessing | Clean, encode, split features | ~2s | X, y arrays |
| 3️⃣ Train/Test Split | Stratified split (80/20) | <1s | Train & test sets |
| 4️⃣ Scaling | StandardScaler normalization | ~1s | Scaled features |
| 5️⃣ SMOTE | Balance classes via oversampling | ~3s | Balanced data |
| 6️⃣ Optimization | Hyperparameter tuning (optional) | ~5-30min | Best params |
| 7️⃣ Training | Fit XGBoost classifier | ~10-60s | Trained model |
| 8️⃣ Evaluation | Metrics & visualizations | ~5s | Performance report |
| 9️⃣ SHAP Analysis | Explainability report | ~30s | SHAP values |
| 🔟 Save Model | Persist to disk | ~1s | .pkl file |
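The ordering of stages 3-5 is what prevents leakage: the scaler is fit on the training split only, and SMOTE touches only training data. A minimal sketch of that ordering, assuming scikit-learn and imbalanced-learn and that `X`, `y` come from the preprocessing stage (variable names are illustrative, not the project's internals):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Stage 3: stratified 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stage 4: fit the scaler on training data only, then transform both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Stage 5: oversample the minority class in the training set only;
# the test set keeps its natural class distribution
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(
    X_train_scaled, y_train
)
```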
Sample outputs include the confusion matrix heatmap, ROC curve, feature importance chart, SHAP waterfall, class distribution graph, and training history plots.
## Module Reference

### 📜 config.py - Configuration Management
Purpose: Central configuration for the entire pipeline
Key Constants:
- `RANDOM_STATE` - Reproducibility seed (42)
- `TEST_SIZE` - Train/test split ratio (0.2)
- `OPTUNA_N_TRIALS` - Optimization iterations (100)
- `FIGURE_SIZE` - Plot dimensions (12, 6)
Example:

```python
from src.config import RANDOM_STATE, MODEL_OUTPUT_DIR
print(f"Using random state: {RANDOM_STATE}")📥 data_loader.py - Data I/O Operations
Functions:

| Function | Purpose | Returns |
|---|---|---|
| `load_dataset()` | Load main CSV | DataFrame |
| `load_backup_dataset()` | Load backup CSV | DataFrame |
| `validate_dataset()` | Check data integrity | bool |
| `get_dataset_info()` | Extract metadata | dict |
| `display_dataset_summary()` | Print overview | None |
Example:

```python
from src.data_loader import load_dataset, validate_dataset
df = load_dataset()
validate_dataset(df, required_columns=['age', 'hb', 'rbc'])
```

### 🔧 data_preprocessing.py - Data Transformation
Functions:

| Function | Purpose | Input | Output |
|---|---|---|---|
| `standardize_column_names()` | Clean column names | DataFrame | DataFrame |
| `detect_target_column()` | Auto-detect target | DataFrame | str |
| `encode_target()` | Label encode | DataFrame | DataFrame, LabelEncoder |
| `split_features_target()` | Separate X, y | DataFrame | DataFrame, Series |
| `split_train_test()` | Create splits | X, y | X_train, X_test, y_train, y_test |
| `create_preprocessing_pipeline()` | Build scaler | X_train | StandardScaler |
Example:

```python
from src.data_preprocessing import *
df = standardize_column_names(df)
target = detect_target_column(df)
X, y = split_features_target(df, target)
X_train, X_test, y_train, y_test = split_train_test(X, y)
```

### 🎓 model_training.py - ML Model Management
Functions:

| Function | Purpose | Key Parameters |
|---|---|---|
| `detect_xgboost_device()` | Check GPU availability | X_sample, y_sample |
| `apply_smote()` | Balance classes | X_train, y_train |
| `train_xgboost_model()` | Fit classifier | X_train, y_train, params |
| `save_model()` | Persist model | model, filename |
| `load_model()` | Load from disk | filepath |
| `get_feature_importance()` | Extract importances | model, feature_names |
Example:

```python
from src.model_training import *
tree_method = detect_xgboost_device(X_sample, y_sample)
X_resampled, y_resampled = apply_smote(X_train, y_train)
model = train_xgboost_model(X_resampled, y_resampled)
save_model(model, 'best_model.pkl')
```

### 🎯 hyperparameter_tuning.py - Automated Optimization
Functions:

| Function | Purpose | Output |
|---|---|---|
| `create_optuna_objective()` | Define search space | Objective function |
| `optimize_hyperparameters()` | Run Optuna study | Best params dict |
| `get_optimization_history()` | Extract trial data | History dict |
| `print_optimization_summary()` | Show top trials | None |
Search Space (see the sketch after this list):
- max_depth: [3, 10]
- learning_rate: [0.01, 0.3] (log scale)
- n_estimators: [50, 500]
- min_child_weight: [1, 10]
- gamma: [0.0, 0.5]
- subsample: [0.6, 1.0]
- colsample_bytree: [0.6, 1.0]
- reg_alpha: [0.0, 1.0]
- reg_lambda: [0.0, 1.0]
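For illustration, a minimal sketch of how this search space might map onto Optuna's trial API; the project's actual `create_optuna_objective` may differ, and `X_train`, `y_train` are assumed to exist from earlier steps:

```python
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score

def objective(trial: optuna.Trial) -> float:
    # Search space mirroring the ranges listed above
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "gamma": trial.suggest_float("gamma", 0.0, 0.5),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 1.0),
    }
    model = xgb.XGBClassifier(**params, random_state=42)
    # Mean cross-validated weighted F1 is the value TPE maximizes
    return cross_val_score(model, X_train, y_train, cv=3,
                           scoring="f1_weighted").mean()

study = optuna.create_study(direction="maximize")  # TPE sampler is the default
study.optimize(objective, n_trials=100)
```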
Example:

```python
from src.hyperparameter_tuning import optimize_hyperparameters
results = optimize_hyperparameters(
X_train, y_train,
n_trials=100,
tree_method='gpu_hist'
)
best_params = results['best_params']
```

### 📊 model_evaluation.py - Performance Assessment
Functions:

| Function | Purpose | Visualization |
|---|---|---|
| `evaluate_model()` | Calculate metrics | ❌ |
| `plot_confusion_matrix()` | Show CM | ✅ Heatmap |
| `plot_target_distribution()` | Class balance | ✅ Bar chart |
| `plot_roc_curve()` | Binary ROC | ✅ Line plot |
| `plot_feature_importance()` | Top features | ✅ Horizontal bars |
| `generate_evaluation_report()` | Full report | ✅ Multi-plot |
Metrics Calculated:
- ✅ Accuracy
- ✅ Precision (per class)
- ✅ Recall (per class)
- ✅ F1-Score (weighted)
- ✅ ROC-AUC (binary)
- ✅ Confusion Matrix
Example:

```python
from src.model_evaluation import *
metrics = evaluate_model(model, X_test, y_test)
plot_confusion_matrix(metrics['confusion_matrix'])
plot_feature_importance(importance_dict, top_n=20)
```

### 🔍 explainability.py - AI Interpretability
Functions:

| Function | Purpose | Visualization |
|---|---|---|
| `create_shap_explainer()` | Initialize explainer | ❌ |
| `calculate_shap_values()` | Compute SHAP | ❌ |
| `plot_shap_summary()` | Feature impact | ✅ Beeswarm |
| `plot_shap_waterfall()` | Single prediction | ✅ Waterfall |
| `plot_shap_force()` | Force plot | ✅ Interactive |
| `plot_shap_dependence()` | Feature relationship | ✅ Scatter |
| `plot_shap_bar()` | Mean importance | ✅ Bar chart |
| `generate_shap_report()` | Complete analysis | ✅ Multi-plot |
Example:

```python
from src.explainability import *
explainer = create_shap_explainer(model)
shap_values = calculate_shap_values(explainer, X_test)
plot_shap_summary(shap_values, feature_names)
plot_shap_waterfall(shap_values, index=0)
```

### 🚀 main.py - Pipeline Orchestration
Main Function: `run_pipeline()`

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `use_backup` | bool | True | Load backup dataset |
| `optimize` | bool | False | Enable hyperparameter tuning |
| `n_trials` | int | 50 | Optuna iterations |
| `apply_smote_resampling` | bool | True | Use SMOTE balancing |
| `generate_shap` | bool | True | Create SHAP report |
| `save_final_model` | bool | True | Persist trained model |
Returns: Dictionary with keys (usage sketch below):

- `model` - Trained XGBoost classifier
- `metrics` - Performance metrics dict
- `scaler` - Fitted StandardScaler
- `label_encoder` - LabelEncoder (if used)
- `feature_names` - List of feature names
- `best_params` - Optimal hyperparameters
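Assuming those keys, downstream reuse of the returned artifacts might look like this (`X_new` is a hypothetical batch of new samples):

```python
from src.main import run_pipeline

results = run_pipeline(optimize=False)

model = results['model']                  # trained XGBoost classifier
scaler = results['scaler']                # fitted StandardScaler
feature_names = results['feature_names']  # column order the model expects

# Score new data with the same preprocessing the model was trained on
X_new_scaled = scaler.transform(X_new)    # X_new: hypothetical new samples
predictions = model.predict(X_new_scaled)
```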
## ⚙️ Configuration

Edit `src/config.py` to adjust settings:

```python
# 🎲 Reproducibility
RANDOM_STATE = 42 # Change for different random seeds
# 📁 File Paths
DATA_DIR = '../data'
MODEL_OUTPUT_DIR = '../models'
DATASET_PATH = f'{DATA_DIR}/dataset.csv'
# 🧪 Model Parameters
TEST_SIZE = 0.2 # 80/20 train/test split
OPTUNA_N_TRIALS = 100 # Number of optimization trials
SMOTE_RANDOM_STATE = RANDOM_STATE
# 🎨 Visualization
FIGURE_SIZE = (12, 6)
FONT_SIZE = 10
COLOR_PALETTE = 'husl'
PLOT_STYLE = 'seaborn-v0_8-darkgrid'
# 🛡️ Safety
SUPPRESS_WARNINGS = True
LEAKAGE_GLOBALS = [...] # Variables to clear
```

## 🛡️ Data Leakage Prevention

The pipeline guards against common leakage sources:

| ⚠️ Problem | ✅ Solution |
|---|---|
| Global variable contamination | Auto-clear mechanism |
| Early data scaling | Pipeline-based preprocessing |
| Target leakage in features | Automated target detection & removal |
| Improper CV splits | Stratified splitting |
```python
# Automatic clearing of potential leakage sources
LEAKAGE_GLOBALS = [
'preprocessor', 'X_train_proc', 'X_test_proc',
'X_train_res', 'y_train_res', 'y_pred',
'y_test_final', 'final_model_booster'
]
```
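As a sketch, the auto-clear mechanism could be as simple as deleting those names from the active namespace; this is illustrative, not necessarily the project's exact implementation:

```python
def clear_leakage_globals(namespace: dict, names: list[str]) -> None:
    """Remove stale pipeline variables so a fresh run cannot read them."""
    for name in names:
        namespace.pop(name, None)  # no error if the variable is absent

# e.g. inside a notebook: clear_leakage_globals(globals(), LEAKAGE_GLOBALS)
```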
## ⚖️ Class Balancing (SMOTE)

**Problem:** Imbalanced datasets lead to biased models.

**Solution:** SMOTE (Synthetic Minority Over-sampling Technique), as shown in the table and the verification sketch below.
| Before SMOTE | After SMOTE |
|---|---|
| Class 0: 2175 samples | Class 0: 2175 samples |
| Class 1: 96 samples | Class 1: 2175 samples |
| Ratio: 22.7:1 | Ratio: 1:1 ✅ |
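A quick way to verify the rebalancing, assuming imbalanced-learn and a training split with the class counts above:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

print("Before:", Counter(y_train))   # e.g. Counter({0: 2175, 1: 96})
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("After: ", Counter(y_res))     # e.g. Counter({0: 2175, 1: 2175})
```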
## 🚀 GPU Acceleration

Automatic detection and utilization of CUDA-compatible GPUs:
✓ Detects GPU availability
✓ Falls back to CPU if unavailable
✓ 10-50x speedup on large datasets
✓ Seamless XGBoost integration
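A hedged sketch of how detect-with-fallback can work, assuming the pre-2.0 `tree_method='gpu_hist'` API used elsewhere in this README (the project's `detect_xgboost_device` may differ): try a tiny probe fit on the GPU and fall back to CPU if XGBoost raises.

```python
import numpy as np
import xgboost as xgb
from xgboost.core import XGBoostError

def detect_tree_method(X_sample: np.ndarray, y_sample: np.ndarray) -> str:
    """Return 'gpu_hist' if a probe fit succeeds on the GPU, else 'hist'."""
    try:
        probe = xgb.XGBClassifier(tree_method="gpu_hist", n_estimators=2)
        probe.fit(X_sample, y_sample)   # raises XGBoostError without a usable GPU
        return "gpu_hist"
    except XGBoostError:
        return "hist"

# Tiny synthetic probe so detection stays fast
X_probe = np.random.rand(32, 4)
y_probe = np.random.randint(0, 2, size=32)
tree_method = detect_tree_method(X_probe, y_probe)
```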
## 🎯 Hyperparameter Optimization

Powered by Optuna with the Tree-structured Parzen Estimator (TPE).
## 🔍 Explainable AI (SHAP)

Why SHAP?
- 🏥 Essential for medical AI (regulatory compliance)
- 👨‍⚕️ Helps clinicians trust predictions
- 🔬 Identifies key risk factors
- 📈 Improves model transparency
Visualizations:
- Summary Plot - Overall feature importance
- Waterfall Plot - Individual prediction breakdown
- Force Plot - Contribution analysis
- Dependence Plot - Feature interactions
- Bar Plot - Mean absolute SHAP values
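Under the hood, wrappers like these would typically lean on shap's `TreeExplainer`; a minimal sketch, assuming a fitted XGBoost `model` and a test-feature DataFrame `X_test`:

```python
import shap

# TreeExplainer computes SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
explanation = explainer(X_test)                # shap.Explanation for the test set

shap.summary_plot(explanation.values, X_test)  # beeswarm: global feature impact
shap.plots.waterfall(explanation[0])           # one prediction, step by step
```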
## 📊 Results

| Metric | Value | Status |
|---|---|---|
| 🎯 Accuracy | 95.2% | ✅ Excellent |
| 📈 F1-Score | 0.947 | ✅ Excellent |
| 🎪 ROC-AUC | 0.982 | ✅ Excellent |
| ⚖️ Precision | 94.5% | ✅ High |
| 🎭 Recall | 95.0% | ✅ High |
Runtime benchmarks:

| Task | Time | Hardware |
|---|---|---|
| 📥 Data Loading | ~1s | CPU |
| 🔧 Preprocessing | ~3s | CPU |
| 🎓 Training (GPU) | ~15s | RTX 3080 |
| 🎓 Training (CPU) | ~90s | i7-9700K |
| 🎯 Optimization (50 trials) | ~8min | GPU |
| 🔍 SHAP Analysis | ~30s | CPU |
## 🤝 Contributing

We welcome contributions! Here's how you can help:
- 🐛 **Report Bugs** - Found an issue? Open an issue
- 💡 **Suggest Features** - Have an idea? Request a feature
- 🔧 **Submit PRs** - Want to contribute? Create a pull request
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## 📄 License

This project is licensed under the MIT License (free to use, modify, and distribute) - see the LICENSE file for details.
## 🙏 Acknowledgments

Built with ❤️ using Python and the open-source ML stack (XGBoost, Optuna, SHAP).
Special Thanks:
- 🎓 Academic advisors for guidance
- 🏥 Healthcare professionals for domain expertise
- 👥 Open-source community for amazing tools
- 📊 Dataset contributors
📥 Install Now • 📖 Read Docs • 💻 View Code
Made with 💙 for better healthcare through AI
Last Updated: January 2026
