Ashfinn/maternal-health

Maternal Health Risk Prediction

A machine learning project that predicts maternal health risk levels during pregnancy from six routinely measured health indicators. The goal is an accurate, interpretable model for assessing pregnancy risk factors.

🎯 Project Overview

The project classifies each record as Low Risk, Mid Risk, or High Risk based on key health metrics during pregnancy. The analysis covers data exploration, domain-informed feature engineering, and ensemble machine learning models intended to support healthcare decision-making.

📊 Dataset Information

Dataset Overview

  • Training samples: 811 maternal health records
  • Test samples: 203 records for validation
  • Original features: 6 core health indicators
  • Engineered features: 71 advanced medical features (including polynomials, interactions, and statistical transformations)
  • Target distribution: Slightly imbalanced (40.1% Low Risk, 33.2% Mid Risk, 26.8% High Risk)

Core Health Features

Feature          Description                     Unit      Clinical Significance
Age              Age of pregnant woman           years     Maternal age risk factors
SystolicBP       Upper blood pressure reading    mmHg      Hypertension indicator
DiastolicBP      Lower blood pressure reading    mmHg      Cardiovascular health
Blood glucose    Blood sugar concentration       mmol/L    Diabetes risk assessment
BodyTemp         Core body temperature           °C        Infection/fever detection
HeartRate        Cardiac rhythm                  bpm       Cardiovascular status

Advanced Feature Engineering (71 Features)

The feature engineering step derives the following groups of medical features:

🩺 Blood Pressure Analysis

  • Pulse pressure (SystolicBP - DiastolicBP)
  • Mean arterial pressure (DiastolicBP + (PulsePressure / 3))
  • Hypertension categories (Normal, Elevated, Stage 1, Stage 2)
  • Hypertension indicators (Stage1, Stage2, Crisis)
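The blood pressure features above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the function names are hypothetical, and the category cut-offs follow the commonly used ACC/AHA 2017 thresholds, which is an assumption because the README does not state the thresholds the project uses.

```python
def pulse_pressure(systolic: float, diastolic: float) -> float:
    """Pulse pressure in mmHg: systolic minus diastolic."""
    return systolic - diastolic


def mean_arterial_pressure(systolic: float, diastolic: float) -> float:
    """MAP approximated as diastolic plus one third of the pulse pressure."""
    return diastolic + pulse_pressure(systolic, diastolic) / 3


def hypertension_category(systolic: float, diastolic: float) -> str:
    """Assumed cut-offs (ACC/AHA 2017 style); not confirmed by the project."""
    if systolic >= 140 or diastolic >= 90:
        return "Stage 2"
    if systolic >= 130 or diastolic >= 80:
        return "Stage 1"
    if systolic >= 120:
        return "Elevated"
    return "Normal"


print(pulse_pressure(120, 80))         # 40
print(hypertension_category(125, 78))  # Elevated
```

Checking the most severe category first keeps each branch a single comparison and avoids overlapping ranges.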

🧬 Age-Based Risk Stratification

  • Age groups (Young <30, Adult 30-50, MiddleAge 50-65, Senior >=65)
  • Age risk categories (Low <40, Medium 40-65, High >=65)
  • Age interactions (Age * SystolicBP, Age * DiastolicBP, Age * Glucose, Age * HeartRate)
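As a rough sketch of the age features, the bins below use the thresholds listed above, treating each range as closed on the left and open on the right (an assumption, since the README does not say which side of a boundary a value falls on). The function names and the `Age_<feature>` column naming are hypothetical, apart from `Age_Glucose`, which appears in the feature importance table.

```python
def age_group(age: float) -> str:
    """Young <30, Adult 30-50, MiddleAge 50-65, Senior >=65."""
    if age < 30:
        return "Young"
    if age < 50:
        return "Adult"
    if age < 65:
        return "MiddleAge"
    return "Senior"


def age_risk(age: float) -> str:
    """Low <40, Medium 40-65, High >=65."""
    if age < 40:
        return "Low"
    if age < 65:
        return "Medium"
    return "High"


def age_interactions(record: dict) -> dict:
    """Multiply Age by each vital sign to form interaction features."""
    columns = {
        "SystolicBP": "SystolicBP",
        "DiastolicBP": "DiastolicBP",
        "Glucose": "Blood glucose",
        "HeartRate": "HeartRate",
    }
    return {f"Age_{name}": record["Age"] * record[col] for name, col in columns.items()}


sample = {"Age": 30, "SystolicBP": 120, "DiastolicBP": 80, "Blood glucose": 7.0, "HeartRate": 70}
print(age_group(sample["Age"]), age_risk(sample["Age"]))  # Adult Low
print(age_interactions(sample)["Age_Glucose"])            # 210.0
```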

🍬 Glucose Metabolism Features

  • Glucose categories (Normal <100, Prediabetes 100-126, Diabetes >=126)
  • Diabetes indicators (Normal, Prediabetes, Diabetes, Severe >=200)
  • Glucose interactions (SystolicBP * Glucose, DiastolicBP * Glucose)
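The glucose categories and indicator flags might look like the sketch below. The thresholds are applied exactly as listed above, in whatever unit the glucose column is stored in; no unit conversion is attempted. The function and flag names are hypothetical.

```python
def glucose_category(glucose: float) -> str:
    """Normal <100, Prediabetes 100-126, Diabetes >=126 (thresholds as listed)."""
    if glucose >= 126:
        return "Diabetes"
    if glucose >= 100:
        return "Prediabetes"
    return "Normal"


def diabetes_indicators(glucose: float) -> dict:
    """One 0/1 flag per category, plus a separate flag for severe values (>=200)."""
    category = glucose_category(glucose)
    return {
        "Is_Normal": int(category == "Normal"),
        "Is_Prediabetes": int(category == "Prediabetes"),
        "Is_Diabetes": int(category == "Diabetes"),
        "Is_Severe": int(glucose >= 200),
    }


print(glucose_category(110))  # Prediabetes
print(diabetes_indicators(210))
```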

❤️ Cardiovascular Indicators

  • Heart rate categories (age-adjusted: Low, Normal, High)
  • Abnormalities (Tachycardia >100, Bradycardia <60, Severe variants)
  • Composite scores (Cardiovascular Risk, Metabolic Risk, Health Risk Score)

🌡️ Temperature Analysis

  • Temperature categories (Normal 36-37.5, Fever >37.5, HighFever >38.5, Hypothermia <36)
  • Deviation metrics (absolute deviation from 37°C, squared deviation)
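A minimal sketch of the temperature features, assuming values in °C and the thresholds listed above. The function names are hypothetical, and the handling of exact boundary values (36 and 37.5 count as Normal) is an assumption.

```python
def temperature_category(temp_c: float) -> str:
    """Hypothermia <36, Normal 36-37.5, Fever >37.5, HighFever >38.5."""
    if temp_c < 36:
        return "Hypothermia"
    if temp_c > 38.5:
        return "HighFever"
    if temp_c > 37.5:
        return "Fever"
    return "Normal"


def temperature_deviation(temp_c: float, reference: float = 37.0) -> tuple:
    """Return (absolute deviation, squared deviation) from the reference."""
    deviation = abs(temp_c - reference)
    return deviation, deviation ** 2


print(temperature_category(38.0))   # Fever
print(temperature_deviation(38.0))  # (1.0, 1.0)
```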

📊 Statistical and Polynomial Features

  • Z-scores and percentiles for vital signs
  • Polynomial and other non-linear transformations (squared, cubed, sqrt, log) for key features (Age, SystolicBP, DiastolicBP, Blood Glucose)
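The statistical features can be illustrated with the standard library alone. This sketch assumes a population standard deviation for the z-score, a "fraction of values less than or equal to x" definition for the percentile, and a natural logarithm for the log transform; the README does not specify any of these, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def z_scores(values: list) -> list:
    """Standardize values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]  # assumes values are not all equal


def percentile_ranks(values: list) -> list:
    """Fraction of observations less than or equal to each value."""
    n = len(values)
    return [sum(other <= v for other in values) / n for v in values]


def power_transforms(x: float) -> dict:
    """Squared, cubed, square-root and natural-log transforms of a positive value."""
    return {"squared": x ** 2, "cubed": x ** 3, "sqrt": math.sqrt(x), "log": math.log(x)}


systolic = [110.0, 120.0, 130.0]   # hypothetical readings
print(z_scores(systolic)[1])       # 0.0
print(percentile_ranks(systolic)[-1])  # 1.0
print(power_transforms(4.0)["sqrt"])   # 2.0
```

To avoid leakage, the mean, standard deviation, and ranking would normally be computed on training data only and then reused for the test set.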

Target Classification

  • 0: Low Risk - Normal pregnancy parameters
  • 1: Mid Risk - Moderate risk factors present
  • 2: High Risk - Multiple risk factors requiring immediate attention

📁 Project Structure

maternal-health/
├── data/
│   ├── raw/                 # Original data
│   │   ├── train.csv
│   │   ├── test.csv
│   │   └── metaData.csv
│   └── processed/           # Cleaned data
│       ├── train_processed.csv
│       └── test_processed.csv
├── results/
│   ├── predictions/
│   │   └── prediction.csv
│   └── visualizations/
│       └── feature_importance.png
├── notebooks/
│   ├── notebook.ipynb
│   └── exploratory_analysis.ipynb
├── requirements.txt
├── .gitignore
└── README.md

🔬 Methodology

1. 🔍 Comprehensive Data Exploration

  • Statistical Analysis: Distribution analysis, outlier detection, correlation mapping
  • Medical Domain Insights: Clinical threshold identification, risk pattern recognition
  • Target Balance Assessment: 40.1% Low Risk, 33.2% Mid Risk, 26.8% High Risk
  • Feature Interaction Discovery: Critical health indicator relationships

2. 🧬 Advanced Feature Engineering Pipeline

Domain-informed feature engineering expands the 6 original health indicators into 71 features, grouped as described under "Advanced Feature Engineering" above.

3. 🤖 Machine Learning Pipeline

Algorithm Selection

  • Ensemble Methods: Random Forest, Extra Trees, Gradient Boosting, AdaBoost, Bagging, Voting, Stacking
  • Advanced Boosting: XGBoost, LightGBM, CatBoost
  • Linear Models: Logistic Regression, Ridge Classifier
  • Support Vector Machines: SVC (RBF)
  • Instance-Based: K-Nearest Neighbors
  • Probabilistic: Gaussian Naive Bayes
  • Tree-Based: Decision Tree

Imbalanced Learning Techniques

  • Oversampling: SMOTE, ADASYN, BorderlineSMOTE
  • Combined Over- and Under-Sampling: SMOTEENN, SMOTETomek
  • Custom Methods: MCT (Minority Cloning Technique), CBSO (Cluster-Based Synthetic Oversampling), Simplified MCT
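To show the basic idea behind rebalancing, here is a sketch of plain random oversampling, which duplicates minority-class rows until every class matches the largest one. It is deliberately simpler than SMOTE (which interpolates new synthetic points) and is not the project's MCT or CBSO implementation; the function name and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows: list, labels: list, seed: int = 42) -> tuple:
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


rows = [[1], [2], [3], [4], [5], [6]]
labels = [0, 0, 0, 1, 1, 2]
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Resampling is usually applied to the training portion of each fold only, so that validation scores are measured on the original class distribution.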

4. 🎯 Model Optimization

Hyperparameter Tuning

  • Optuna: Bayesian optimization with 50 trials
  • Cross-Validation: Stratified 5-fold validation
  • Best Tuned Model: CatBoost (CV accuracy: 0.8492) with parameters: iterations=883, learning_rate=0.243, depth=8, l2_leaf_reg=9.92, border_count=67
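Stratified k-fold validation keeps the class proportions roughly equal in every fold, which matters with the slightly imbalanced target here. The sketch below shows the idea in pure Python by dealing each class's indices round-robin across folds; it is an illustration only, and in practice scikit-learn's `StratifiedKFold` would be the usual choice. The function name is hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels: list, n_splits: int = 5, seed: int = 42) -> list:
    """Assign sample indices to folds so each fold mirrors the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's samples one at a time across the folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


labels = [0] * 10 + [1] * 5
for fold in stratified_folds(labels):
    print(sorted(fold), [labels[i] for i in sorted(fold)])
```

Each fold serves once as the validation set while the remaining four are used for training, and the reported CV accuracy is the mean over the five runs.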

Feature Selection Strategy

  • Statistical Tests: ANOVA F-test, Mutual Information (univariate scoring, e.g. via SelectKBest)
  • Model-Based: Recursive Feature Elimination (RFE)
  • Dimensionality Reduction: PCA
  • Medical Relevance: Domain-inspired feature validation
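As a worked example of the ANOVA F-test used for feature scoring, the sketch below computes the one-way F statistic for a single feature split by risk class: the between-class mean square divided by the within-class mean square. A larger value means the feature separates the classes better. The function name and the toy values are hypothetical.

```python
from statistics import mean


def anova_f(groups: list) -> float:
    """One-way ANOVA F statistic for one feature, given one list of values per class."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    group_means = [mean(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy glucose-like values for the Low, Mid and High risk classes.
low, mid, high = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]
print(round(anova_f([low, mid, high]), 2))  # 13.0
```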

🚀 Getting Started

🔧 System Requirements

# Core Dependencies
pandas>=1.3.0          # Data manipulation and analysis
numpy>=1.21.0           # Numerical computing
matplotlib>=3.5.0       # Data visualization
seaborn>=0.11.0         # Statistical visualization
scikit-learn>=1.0.0     # Machine learning toolkit

# Advanced ML Libraries
xgboost>=1.5.0          # Gradient boosting framework
lightgbm>=3.3.0         # Light gradient boosting
catboost>=1.0.0         # CatBoost gradient boosting
optuna>=2.10.0          # Hyperparameter optimization
scipy>=1.7.0            # Scientific computing
imbalanced-learn>=0.8.0 # Sampling techniques

📓 Execution Guide

Step 1: Environment Setup

# Install required packages
pip install pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm catboost optuna scipy imbalanced-learn

# Clone/download project
git clone [repository-url]
cd maternal-health

Step 2: Run the Pipeline

# Open the solution notebook
jupyter notebook notebooks/notebook.ipynb

Step 3: Execute Analysis

  1. Data Loading: Load and preprocess data
  2. Feature Engineering: Generate advanced features
  3. Model Training: Train and optimize models
  4. Evaluation: Assess performance
  5. Prediction: Generate submissions

🎯 Implementation Details

Reproducibility Settings

# Fixed random seeds
import numpy as np

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

Expected Runtime

  • Full Pipeline: ~18-20 minutes (on standard hardware)
  • Feature Engineering: ~1-2 minutes
  • Model Training & Optimization: ~15-18 minutes
  • Final Predictions: ~1 minute

📈 Results & Performance

📊 Model Performance Metrics

Cross-Validation Results (Top Models)

Model                    CV Accuracy    Std Dev
XGBoost                  0.8335         0.0302
LightGBM                 0.8310         0.0332
Random Forest            0.8286         0.0270
CatBoost                 0.8249         0.0350
Extra Trees              0.8236         0.0275

Optimized Model

  • Best single model: CatBoost (CV Accuracy: 0.8492)
  • Ensemble: Voting (CV Accuracy: 0.8503 ± 0.0127)
  • Stacking: 0.8441 ± 0.0191
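A voting ensemble combines the predictions of several base models. The sketch below shows soft voting, which averages the per-class probabilities and picks the class with the highest mean. Whether the project uses soft or hard voting is not stated, so this is an assumption, and the function name and probability values are hypothetical.

```python
def soft_vote(probabilities: list) -> tuple:
    """Average [p_low, p_mid, p_high] over models; return (predicted class, averages)."""
    n_models = len(probabilities)
    averaged = [sum(column) / n_models for column in zip(*probabilities)]
    predicted = max(range(len(averaged)), key=averaged.__getitem__)
    return predicted, averaged


# One probability vector per base model for a single record.
model_outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
predicted, averaged = soft_vote(model_outputs)
print(predicted)  # 1 (Mid Risk)
```

Here the first model alone would predict Low Risk, but the averaged probabilities favor Mid Risk, which is how an ensemble can correct an individual model's error.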

Feature Importance (Top 15 from Final Model)

Feature                      Importance
SystolicBP_Glucose           0.0622        # BP-Glucose interaction
DiastolicBP_Glucose          0.0607        # Diastolic-Glucose interaction
Blood glucose_cubed          0.0473        # Non-linear glucose
Blood glucose_zscore         0.0469        # Standardized glucose
Blood glucose_log            0.0444        # Log-transformed glucose
Blood glucose_sqrt           0.0441        # Square root glucose
Blood glucose                0.0397        # Raw glucose
Blood glucose_squared        0.0372        # Squared glucose
Metabolic_Risk               0.0335        # Composite metabolic score
Blood glucose_percentile     0.0333        # Glucose ranking
Age_Glucose                  0.0320        # Age-glucose interaction
SystolicBP_cubed             0.0297        # Cubed systolic
SystolicBP_log               0.0293        # Log systolic
BP_HeartRate                 0.0291        # BP-heart interaction
SystolicBP_squared           0.0279        # Squared systolic

🧠 Key Medical Insights

  • Glucose Dominance: Glucose-related features (raw, transformed, interactions) dominate top importance, highlighting diabetes risk in pregnancy
  • BP Interactions: Strong predictive power from BP-glucose and BP-heart interactions
  • Non-Linear Effects: Polynomial and other non-linear transformations (squared, cubed, log, sqrt) capture complex health relationships
  • Composite Scores: Risk scores (Metabolic, Cardiovascular) provide holistic health assessment

🏥 Healthcare Applications

🩺 Real-World Benefits

  • Early Detection: Identify at-risk pregnancies using routine measurements
  • Resource Allocation: Prioritize high-risk cases in resource-limited settings
  • Preventive Care: Guide interventions based on risk factors
  • Clinical Support: Augment physician decision-making with ML insights

🌍 Potential Impact

  • Global Health: Applicable in developing regions with basic health monitoring
  • Public Health: Enable population-level maternal risk screening
  • Research: Foundation for advanced maternal health analytics

📋 Technical Details

Data Processing

  • Cleaning: Drop rows with NaN values and duplicate rows; convert columns to numeric types
  • Scaling: StandardScaler/RobustScaler where applicable
  • Encoding: Label encoding for target

Evaluation Metrics

  • Primary: Accuracy
  • Cross-validation: Stratified K-Fold (n=5)
  • Additional: Classification report, confusion matrix
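For reference, accuracy and the confusion matrix can be computed by hand as in the sketch below, using the 0/1/2 class encoding from the Target Classification section. The function names and labels are hypothetical; scikit-learn's `accuracy_score` and `confusion_matrix` provide the same results in practice.

```python
def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true: list, y_pred: list, n_classes: int = 3) -> list:
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(accuracy(y_true, y_pred))
for row in confusion_matrix(y_true, y_pred):
    print(row)
```

The bottom-left cell counts High Risk cases predicted as Low Risk, which is the most costly error in a screening context and is hidden by accuracy alone.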

Visualization

  • Uses Matplotlib and Seaborn for distribution, correlation, and feature-importance plots

📊 Outputs

  • Predictions: submission_final_ensemble.csv (Id, RiskLevel)
  • Models: Trained ensembles and base models
  • Analysis: Feature importance, performance rankings

🤝 Contributing

Contributions welcome in:

  • Model enhancements
  • Additional medical features
  • Performance optimizations
  • Clinical validations

📄 Citation

@misc{maternal_health_prediction_2025,
    title={Maternal Health Risk Prediction},
    author={Obidur Rahman},
    year={2025},
    url={https://github.com/ashfinn/maternal-health}
}

📜 License

MIT License - Free for use, modification, and distribution with attribution.

Disclaimer: This is for educational/research purposes. Consult healthcare professionals for medical decisions.
