StrokeRisk AI System

Project Phase: 5 (Full Lifecycle: Development $\rightarrow$ MLOps $\rightarrow$ Deployment)
Course: AI Development, Web Development and Cloud Computing
Lead Architect: Oluwafemi (Femi) James

Contributing Teams:

Phase 1 (Development): Group 4 (G4 Pulse) – Fuad, Preston, Marrium, Femi
Phase 2 (Maintenance & MLOps): Group 2 – Kevin, Shalin, Femi

Live App: StrokeRisk Tool
GitHub Repository: StrokeRisk_Tool

📌 Introduction

Stroke is one of the leading causes of death and long-term disability worldwide. Early detection of high-risk individuals can improve outcomes, reduce healthcare costs, and save lives.

The StrokeRisk Application is not just a predictive tool; it is a governed AI system. It uses machine learning to predict stroke risk from demographic and medical data, and it was developed following the CRISP-DM methodology. It also integrates a "Governance-as-Code" layer using MLflow so that every prediction is traceable, reproducible, and aligned with the applicable regulatory frameworks (FDA SaMD/PIPEDA).

Key Capabilities:

Accurate Predictions – Soft-Voting Ensemble of top-performing ML models.
Governance-as-Code – Immutable audit trails and reproducibility locks.
Interpretability – SHAP-based feature explanations for clinician trust.
Ethical Design – Fairness monitoring and Human-Centered UI.

1️⃣ Business Understanding

Problem Statement: Healthcare providers lack an efficient, proactive, and auditable way to identify individuals at high risk of stroke.

Value Proposition: Identify 80% of high-risk patients earlier than current methods, reducing stroke-related readmissions by 15%, while maintaining strict regulatory compliance through MLOps.

Stakeholders:

Stakeholder	Role	Interest
Healthcare Providers	Frontline users	Early detection, patient prioritization
Data Science Team	Model Governance	Accuracy, bias auditing, and version control
Patients	End beneficiaries	Preventative care access & data privacy
Auditors/Regulators	Compliance	Traceability of model decisions (PIPEDA/FDA)

2️⃣ Data Understanding & Preparation

Source: Public stroke dataset (stroke.csv) with 5,110 patient records.
Target Variable: stroke (Binary Classification).

Key Challenges & Solutions:

Class Imbalance: Only ~5% of records are stroke cases. We apply SMOTE (Synthetic Minority Over-sampling Technique) to bring the training set to a 50:50 class balance.
Missing Data: Missing BMI values are filled with the median BMI of the patient's Age and Gender group.
Data Integrity: Each run calls MLflow's log_input(dataset), which records a digest of the training data.
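The grouped median imputation described under Missing Data can be sketched in plain Python. This is a minimal illustration, not the project's preprocessing code: the 10-year age bands, the record layout, and the function names `age_band` and `impute_bmi` are assumptions made for the example, as is the fallback to the overall median when a group has no observed BMI.

```python
from collections import defaultdict
from statistics import median


def age_band(age, width=10):
    """Map an age to the lower bound of its band (assumed 10-year bands, e.g. 47 -> 40)."""
    return int(age // width) * width


def impute_bmi(records):
    """Return new records where missing BMI is the median of the same (age band, gender) group."""
    observed = defaultdict(list)
    for r in records:
        if r["bmi"] is not None:
            observed[(age_band(r["age"]), r["gender"])].append(r["bmi"])

    # Assumed fallback: overall median when a group has no observed BMI.
    overall = median(v for values in observed.values() for v in values)

    filled = []
    for r in records:
        if r["bmi"] is None:
            key = (age_band(r["age"]), r["gender"])
            value = median(observed[key]) if key in observed else overall
            r = {**r, "bmi": value}  # copy, so the input list is left untouched
        filled.append(r)
    return filled


# Hypothetical patients, not rows from stroke.csv.
patients = [
    {"age": 42, "gender": "Female", "bmi": 24.0},
    {"age": 47, "gender": "Female", "bmi": 30.0},
    {"age": 45, "gender": "Female", "bmi": None},
    {"age": 71, "gender": "Male", "bmi": 28.0},
    {"age": 78, "gender": "Male", "bmi": None},
]
imputed = impute_bmi(patients)
print([r["bmi"] for r in imputed])
```

To avoid leaking information, the group medians would normally be computed on the training split only and then reused for validation and production inputs.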

3️⃣ Phase 1: Modeling & The Ensemble Innovation

To avoid the "Accuracy Paradox" (with only ~5% positive cases, a model that always predicts "No Stroke" scores about 95% accuracy while detecting no strokes at all), we moved beyond single models.
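The paradox is easy to reproduce with a few lines of arithmetic. The sketch below uses a hypothetical cohort of 1,000 patients with a 5% stroke rate (not the actual dataset) and a classifier that always answers "No Stroke".

```python
def accuracy_and_recall(y_true, y_pred):
    """Compute accuracy and recall for binary labels (1 = stroke, 0 = no stroke)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical cohort: 50 stroke cases among 1,000 patients (5%).
y_true = [1] * 50 + [0] * 950
y_pred = [0] * 1000  # a "model" that always predicts No Stroke

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.95, recall=0.00
```

This is why the project tracks Recall rather than accuracy as its headline metric.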

Models Evaluated:

Random Forest (Baseline)
XGBoost (High Variance)
Extra Trees (Low Bias)

The Innovation: We developed a Soft-Voting Ensemble Model (v4.0) that aggregates the probability outputs of all three base models.

Result: The ensemble stabilized variance and reached 96.5% Recall, which keeps false negatives (missed diagnoses) to a minimum.
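As a rough illustration of what soft voting does, the sketch below averages the stroke probabilities of three base models for one patient and compares the outcome with a simple majority (hard) vote. The probabilities, the equal weights, and the 0.5 threshold are assumptions for the example; the deployed ensemble's actual weights and threshold are not stated here.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-model stroke probabilities for one patient."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    return sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)


def hard_vote(probabilities, threshold=0.5):
    """Majority vote: 1 if more than half of the models exceed the threshold."""
    votes = sum(1 for p in probabilities if p >= threshold)
    return int(votes > len(probabilities) / 2)


# Hypothetical outputs for a single patient.
model_probs = {"random_forest": 0.45, "xgboost": 0.48, "extra_trees": 0.90}
probs = list(model_probs.values())

score = soft_vote(probs)
soft_label = int(score >= 0.5)
hard_label = hard_vote(probs)
print(f"soft score={score:.2f}, soft label={soft_label}, hard label={hard_label}")
```

In this example two models sit just below the threshold, so a majority vote would miss the case, while the averaged probability still flags it. That behaviour is one reason soft voting can help Recall.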

4️⃣ Phase 2: MLOps & Governance-as-Code

This phase transformed the project from a "research notebook" into a "production system." We implemented an immutable audit trail using MLflow to satisfy PIPEDA & FDA SaMD reproducibility guidelines.

Core Governance Features:

Reproducibility: Enforced conda.yaml environment locking. This prevents "dependency drift," ensuring the model runs exactly the same way in Production as it did in Development.
Auditability: Every training run logs the following:
- Git Commit Hash: Links the model binary to the exact code version.
- Dataset Digest: Proves exactly which patient data was used.
- Parameters: Hyperparameters for Random Forest/XGBoost.
Gated Promotion: Implemented a strict Staging $\rightarrow$ Production workflow. Models cannot be deployed without passing specific validation thresholds (Recall > 95%) and receiving manual approval in the Model Registry.
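The gated promotion rule can be expressed as a small check. The sketch below is illustrative only: the `Candidate` class and `promotion_check` function are hypothetical names, and in the real system the approval step happens in the MLflow Model Registry rather than in a field on an object. Only the Recall > 95% threshold comes from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

RECALL_THRESHOLD = 0.95  # gate stated above: Recall > 95%


@dataclass
class Candidate:
    version: str
    recall: float                      # validation recall, 0.0 to 1.0
    approved_by: Optional[str] = None  # reviewer who signed off, if any


def promotion_check(candidate: Candidate) -> Tuple[bool, List[str]]:
    """Return (allowed, reasons) for a Staging -> Production request."""
    reasons = []
    if candidate.recall <= RECALL_THRESHOLD:
        reasons.append(f"recall {candidate.recall:.3f} is not above {RECALL_THRESHOLD}")
    if not candidate.approved_by:
        reasons.append("manual approval is missing")
    return (not reasons, reasons)


# Hypothetical candidates; the v3.0 recall value is made up for the example.
unapproved = Candidate(version="v4.0", recall=0.965)
approved = Candidate(version="v4.0", recall=0.965, approved_by="reviewer@example.org")
weak = Candidate(version="v3.0", recall=0.91, approved_by="reviewer@example.org")

for c in (unapproved, approved, weak):
    allowed, reasons = promotion_check(c)
    print(c.version, "PROMOTE" if allowed else "BLOCK", reasons)
```

Returning the list of reasons, rather than a bare boolean, makes a blocked promotion self-explanatory in the audit log.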

Component	Tool Used	Purpose
Tracking Server	MLflow	Centralized log of metrics and artifacts.
Model Registry	MLflow	Version control for AI models (v1.0 $\rightarrow$ v4.0).
Environment	Conda	Dependency isolation.
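To make the audit fields listed under Auditability concrete, the following sketch assembles a commit hash, a dataset digest, and a parameter set into one record using only the standard library. It is not the project's MLflow logging code: MLflow computes its own dataset digest, and the SHA-256 hash, the function names, the sample CSV bytes, and the hyperparameter values here are all assumptions for illustration.

```python
import hashlib
import json
import subprocess


def dataset_digest(data: bytes) -> str:
    """SHA-256 fingerprint of the raw training data (stand-in for MLflow's digest)."""
    return hashlib.sha256(data).hexdigest()


def current_commit() -> str:
    """Return the current Git commit hash, or 'unknown' outside a Git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def audit_record(data: bytes, params: dict) -> dict:
    """Bundle the three audit fields for one training run."""
    return {
        "git_commit": current_commit(),
        "dataset_digest": dataset_digest(data),
        "params": params,
    }


# Hypothetical data and hyperparameters, not taken from the repository.
csv_bytes = b"id,age,bmi,stroke\n1,67,36.6,1\n2,61,,0\n"
record = audit_record(csv_bytes, {"n_estimators": 200, "max_depth": 8})
print(json.dumps(record, indent=2))
```

Because the digest changes whenever a single byte of the data changes, two runs with the same digest, commit, and parameters can be treated as trained on identical inputs.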

5️⃣ Phase 5: Front-End & Deployment

Architecture: The system is deployed on Streamlit Cloud, serving the MLflow-registered model via a backend API.

Human-Centered Design (B=MAT):

Motivation: We build trust by visualizing why the AI made a decision using SHAP plots (e.g., "High Glucose increased risk by 15%").
Ability: The "Patient Data Entry" form is optimized for clinical workflows (under 30 seconds to complete).

Main UI Pages:

Patient Data Entry: Guided input form.
Risk Assessment: Real-time probability scoring.
Practitioner Profile: Overview of patient population statistics.
System Settings: Sensitivity thresholds configuration.

6️⃣ Potential Harms & Mitigation

Harm	Mitigation Strategy
Discriminatory Predictions	Subgroup Audits: We monitor Recall rates across Gender and Age groups to detect bias.
Automation Bias	Human-in-the-Loop: The UI explicitly states this is a "Decision Support Tool," not a diagnosis.
Model Drift	Retraining Protocol: Weekly monitoring of input data distributions using the Population Stability Index (PSI) and the Kolmogorov-Smirnov (KS) test.
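One of the drift checks named in the table, the Population Stability Index, can be computed as shown below. This is a minimal sketch under stated assumptions: the glucose values, the bin cut points, and the small floor used to avoid taking the log of zero are all illustrative, and the 0.2 alert level is a common rule of thumb rather than a threshold defined by this project.

```python
import math
from bisect import bisect_right


def bin_shares(values, cuts, floor=1e-6):
    """Share of values in each bin defined by sorted cut points (floored to avoid log(0))."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect_right(cuts, v)] += 1
    return [max(c / len(values), floor) for c in counts]


def psi(baseline, current, cuts):
    """Population Stability Index: sum over bins of (cur - base) * ln(cur / base)."""
    base = bin_shares(baseline, cuts)
    cur = bin_shares(current, cuts)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))


# Hypothetical average glucose readings: training baseline vs. a shifted recent week.
baseline = [80, 90, 100, 110, 120, 130, 140, 150]
recent = [v + 30 for v in baseline]
cuts = [100, 130]  # assumed bins: <100, 100 to <130, >=130

stable_score = psi(baseline, baseline, cuts)
drift_score = psi(baseline, recent, cuts)
print(f"stable PSI={stable_score:.3f}, drifted PSI={drift_score:.3f}")
if drift_score > 0.2:  # rule-of-thumb alert level, an assumption
    print("Distribution shift detected: review for retraining")
```

Running the same check per feature each week gives a simple trigger for the retraining protocol.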

📂 Documentation

🔬 Part 1: AI Development (G4 Pulse)

Focus: Data Science, Modeling, and Clinical Validation

Phase 1: Business Understanding & Objectives
Phase 2: Data Preparation & EDA Report
Phase 3 & 4: Model Creation, Training & Evaluation

🛡️ Part 2: MLOps & Governance (Group 2)

Focus: System Design, Lifecycle Management, and Compliance

Phase 1: Maintenance System Design & Planning
Phase 2: MLOps Development & Implementation
Phase 3: System Reflection & Finalization
Full Suite: Complete Documentation Suite

