A machine learning project that predicts maternal health risk during pregnancy from routine health indicators, with an emphasis on accuracy and interpretability.
The model classifies each record as Low Risk, Mid Risk, or High Risk based on six core health metrics. The analysis covers data exploration, feature engineering, and ensemble machine learning models designed to support healthcare decision-making.
- Training samples: 811 maternal health records
- Test samples: 203 records for validation
- Original features: 6 core health indicators
- Engineered features: 71 advanced medical features (including polynomials, interactions, and statistical transformations)
- Target distribution: Slightly imbalanced (40.1% Low Risk, 33.2% Mid Risk, 26.8% High Risk)
| Feature | Description | Unit | Clinical Significance |
|---|---|---|---|
| Age | Age of pregnant woman | years | Maternal age risk factors |
| SystolicBP | Upper blood pressure reading | mmHg | Hypertension indicator |
| DiastolicBP | Lower blood pressure reading | mmHg | Cardiovascular health |
| Blood glucose | Blood sugar concentration | mmol/L | Diabetes risk assessment |
| BodyTemp | Core body temperature | °C | Infection/fever detection |
| HeartRate | Cardiac rhythm | bpm | Cardiovascular status |
The project derives the following features from clinical domain knowledge:
- Pulse pressure (SystolicBP - DiastolicBP)
- Mean arterial pressure (DiastolicBP + PulsePressure / 3)
- Hypertension categories (Normal, Elevated, Stage 1, Stage 2)
- Hypertension indicators (Stage1, Stage2, Crisis)
- Age groups (Young <30, Adult 30-50, MiddleAge 50-65, Senior >=65)
- Age risk categories (Low <40, Medium 40-65, High >=65)
- Age interactions (Age * SystolicBP, Age * DiastolicBP, Age * Glucose, Age * HeartRate)
- Glucose categories (Normal <100, Prediabetes 100-126, Diabetes >=126)
- Diabetes indicators (Normal, Prediabetes, Diabetes, Severe >=200)
- Glucose interactions (SystolicBP * Glucose, DiastolicBP * Glucose)
- Heart rate categories (age-adjusted: Low, Normal, High)
- Abnormalities (Tachycardia >100, Bradycardia <60, Severe variants)
- Composite scores (Cardiovascular Risk, Metabolic Risk, Health Risk Score)
- Temperature categories (Normal 36-37.5, Fever >37.5, HighFever >38.5, Hypothermia <36)
- Deviation metrics (absolute deviation from 37°C, squared deviation)
- Z-scores and percentiles for vital signs
- Polynomial and other non-linear transformations (squared, cubed, sqrt, log) for key features (Age, SystolicBP, DiastolicBP, Blood Glucose)
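As a minimal sketch of how a few of the features listed above could be computed with pandas, the snippet below derives pulse pressure, mean arterial pressure, two glucose interactions, the temperature deviations and categories, and some transformations. The function name `add_basic_features` and the column names `MAP`, `Temp_AbsDev`, `Temp_SqDev`, and `Temp_Category` are illustrative assumptions. The use of `log1p` and of sample statistics for the z-scores is also assumed, since the project's exact choices are not stated here.

```python
import numpy as np
import pandas as pd


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a handful of derived features (illustrative)."""
    out = df.copy()

    # Blood pressure features
    out["PulsePressure"] = out["SystolicBP"] - out["DiastolicBP"]
    out["MAP"] = out["DiastolicBP"] + out["PulsePressure"] / 3  # assumed column name

    # Interaction features (names follow the feature importance listing)
    out["SystolicBP_Glucose"] = out["SystolicBP"] * out["Blood glucose"]
    out["Age_Glucose"] = out["Age"] * out["Blood glucose"]

    # Temperature deviation from 37 °C (assumed column names)
    out["Temp_AbsDev"] = (out["BodyTemp"] - 37.0).abs()
    out["Temp_SqDev"] = (out["BodyTemp"] - 37.0) ** 2

    # Temperature categories; conditions are checked in order, first match wins
    temp = out["BodyTemp"]
    out["Temp_Category"] = np.select(
        [temp < 36.0, temp > 38.5, temp > 37.5],
        ["Hypothermia", "HighFever", "Fever"],
        default="Normal",
    )

    # Non-linear transformations and z-scores for key features
    for col in ["Age", "SystolicBP", "DiastolicBP", "Blood glucose"]:
        out[f"{col}_squared"] = out[col] ** 2
        out[f"{col}_log"] = np.log1p(out[col])  # log1p is an assumption
        out[f"{col}_zscore"] = (out[col] - out[col].mean()) / out[col].std()

    return out


# Two made-up records, for illustration only
demo = pd.DataFrame(
    {
        "Age": [25, 35],
        "SystolicBP": [120, 140],
        "DiastolicBP": [80, 90],
        "Blood glucose": [7.0, 13.0],
        "BodyTemp": [36.8, 38.0],
        "HeartRate": [70, 88],
    }
)
features = add_basic_features(demo)
print(features[["PulsePressure", "MAP", "Temp_Category"]])
```

In a real pipeline the z-score mean and standard deviation would normally be estimated on the training set and reused for the test set, so that no test information leaks into the features.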
- 0: Low Risk - Normal pregnancy parameters
- 1: Mid Risk - Moderate risk factors present
- 2: High Risk - Multiple risk factors requiring immediate attention
maternal-health/
├── data/
│   ├── raw/                  # Original data
│   │   ├── train.csv
│   │   ├── test.csv
│   │   └── metaData.csv
│   └── processed/            # Cleaned data
│       ├── train_processed.csv
│       └── test_processed.csv
├── results/
│   ├── predictions/
│   │   └── prediction.csv
│   └── visualizations/
│       └── feature_importance.png
├── notebooks/
│   ├── notebook.ipynb
│   └── exploratory_analysis.ipynb
├── requirements.txt
├── .gitignore
└── README.md
- Statistical Analysis: Distribution analysis, outlier detection, correlation mapping
- Medical Domain Insights: Clinical threshold identification, risk pattern recognition
- Target Balance Assessment: 40.1% Low Risk, 33.2% Mid Risk, 26.8% High Risk
- Feature Interaction Discovery: Critical health indicator relationships
Implemented comprehensive feature engineering incorporating medical knowledge, resulting in 71 features from 6 originals.
- Ensemble Methods: Random Forest, Extra Trees, Gradient Boosting, AdaBoost, Bagging, Voting, Stacking
- Advanced Boosting: XGBoost, LightGBM, CatBoost
- Linear Models: Logistic Regression, Ridge Classifier
- Support Vector Machines: SVC (RBF)
- Instance-Based: K-Nearest Neighbors
- Probabilistic: Gaussian Naive Bayes
- Tree-Based: Decision Tree
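To illustrate how some of the model families above can be combined, here is a small sketch of a soft-voting ensemble built with scikit-learn. The data is synthetic (`make_classification`), and the choice of three base models, their hyperparameters, and soft voting are assumptions made for the example, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

# Synthetic 3-class data standing in for the maternal health features
X, y = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=5,
    n_classes=3,
    random_state=RANDOM_STATE,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

# Soft voting averages the predicted class probabilities of the base models
voting = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=RANDOM_STATE)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=RANDOM_STATE)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=RANDOM_STATE)),
    ],
    voting="soft",
)
voting.fit(X_train, y_train)

test_accuracy = voting.score(X_test, y_test)
print(f"Held-out accuracy on synthetic data: {test_accuracy:.3f}")
```

Soft voting requires every base estimator to implement `predict_proba`; with models that only expose hard labels, `voting="hard"` would be the alternative.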
- Oversampling: SMOTE, ADASYN, BorderlineSMOTE, SMOTEENN, SMOTETomek
- Custom Methods: MCT (Minority Cloning Technique), CBSO (Cluster-Based Synthetic Oversampling), Simplified MCT
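The idea behind oversampling can be shown with a deliberately simple stand-in: random oversampling with replacement. This is not SMOTE and not one of the custom methods listed above; the function name `random_oversample` is hypothetical and the data is synthetic. In practice, the samplers in imbalanced-learn (for example `SMOTE`) offer a `fit_resample(X, y)` method that plays the same role.

```python
import numpy as np
from sklearn.utils import resample


def random_oversample(X, y, random_state=42):
    """Duplicate minority-class rows until every class matches the largest one."""
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()

    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        X_cls = X[y == cls]
        if count < target:
            # Draw the missing rows with replacement from the same class
            extra = resample(
                X_cls,
                replace=True,
                n_samples=int(target - count),
                random_state=random_state,
            )
            X_cls = np.vstack([X_cls, extra])
        X_parts.append(X_cls)
        y_parts.append(np.full(len(X_cls), cls))

    return np.vstack(X_parts), np.concatenate(y_parts)


# Synthetic imbalanced data: 5 / 3 / 2 samples per class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 3))
y_demo = np.array([0] * 5 + [1] * 3 + [2] * 2)

X_bal, y_bal = random_oversample(X_demo, y_demo)
print(np.bincount(y_bal))
```

Whichever sampler is used, it should be fitted on the training folds only, so that duplicated or synthetic samples never appear in the validation data.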
- Optuna: Bayesian optimization with 50 trials
- Cross-Validation: Stratified 5-fold validation
- Best single model after tuning: CatBoost (CV accuracy: 0.8492) with parameters: iterations=883, learning_rate=0.243, depth=8, l2_leaf_reg=9.92, border_count=67
- Statistical Tests: ANOVA F-test, Mutual Information
- Model-Based: Recursive Feature Elimination (RFE), SelectKBest
- Dimensionality Reduction: PCA
- Medical Relevance: Domain-inspired feature validation
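The statistical and model-based selection methods listed above can be sketched with scikit-learn as follows. The data is synthetic, and the number of features to keep (8), the random forest used inside RFE, and the step size are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif, mutual_info_classif

RANDOM_STATE = 42

# Synthetic 3-class data with 20 candidate features
X, y = make_classification(
    n_samples=300,
    n_features=20,
    n_informative=5,
    n_redundant=2,
    n_classes=3,
    random_state=RANDOM_STATE,
)

# ANOVA F-test: keep the 8 features with the highest F-scores
anova = SelectKBest(score_func=f_classif, k=8).fit(X, y)
anova_idx = anova.get_support(indices=True)

# Mutual information: score every feature, then rank
mi_scores = mutual_info_classif(X, y, random_state=RANDOM_STATE)
top_mi_idx = np.argsort(mi_scores)[::-1][:8]

# Recursive Feature Elimination: drop 2 features per round until 8 remain
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=RANDOM_STATE),
    n_features_to_select=8,
    step=2,
).fit(X, y)
rfe_idx = np.flatnonzero(rfe.support_)

print("ANOVA:", anova_idx)
print("Mutual information:", np.sort(top_mi_idx))
print("RFE:", rfe_idx)
```

Comparing the three index sets is one simple way to see which features are selected consistently across methods.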
# Core Dependencies
pandas>=1.3.0 # Data manipulation and analysis
numpy>=1.21.0 # Numerical computing
matplotlib>=3.5.0 # Data visualization
seaborn>=0.11.0 # Statistical visualization
scikit-learn>=1.0.0 # Machine learning toolkit
# Advanced ML Libraries
xgboost>=1.5.0 # Gradient boosting framework
lightgbm>=3.3.0 # Light gradient boosting
catboost>=1.0.0 # CatBoost gradient boosting
optuna>=2.10.0 # Hyperparameter optimization
scipy>=1.7.0 # Scientific computing
imbalanced-learn>=0.8.0 # Sampling techniques

# Install required packages
pip install pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm catboost optuna scipy imbalanced-learn
# Clone/download project
git clone [repository-url]
cd maternal-health-prediction
# Open the solution notebook
jupyter notebook notebook.ipynb

- Data Loading: Load and preprocess data
- Feature Engineering: Generate advanced features
- Model Training: Train and optimize models
- Evaluation: Assess performance
- Prediction: Generate submissions
import numpy as np

# Fixed random seeds
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

- Full Pipeline: ~18-20 minutes (on standard hardware)
- Feature Engineering: ~1-2 minutes
- Model Training & Optimization: ~15-18 minutes
- Final Predictions: ~1 minute
| Model | CV Accuracy | Std Dev |
|---|---|---|
| XGBoost | 0.8335 | 0.0302 |
| LightGBM | 0.8310 | 0.0332 |
| Random Forest | 0.8286 | 0.0270 |
| CatBoost | 0.8249 | 0.0350 |
| Extra Trees | 0.8236 | 0.0275 |
- Best single model after tuning: CatBoost (CV Accuracy: 0.8492)
- Ensemble: Voting (CV Accuracy: 0.8503 ± 0.0127)
- Stacking: 0.8441 ± 0.0191
| Feature | Importance | Description |
|---|---|---|
| SystolicBP_Glucose | 0.0622 | BP-Glucose interaction |
| DiastolicBP_Glucose | 0.0607 | Diastolic-Glucose interaction |
| Blood glucose_cubed | 0.0473 | Non-linear glucose |
| Blood glucose_zscore | 0.0469 | Standardized glucose |
| Blood glucose_log | 0.0444 | Log-transformed glucose |
| Blood glucose_sqrt | 0.0441 | Square root glucose |
| Blood glucose | 0.0397 | Raw glucose |
| Blood glucose_squared | 0.0372 | Squared glucose |
| Metabolic_Risk | 0.0335 | Composite metabolic score |
| Blood glucose_percentile | 0.0333 | Glucose ranking |
| Age_Glucose | 0.0320 | Age-glucose interaction |
| SystolicBP_cubed | 0.0297 | Cubed systolic |
| SystolicBP_log | 0.0293 | Log systolic |
| BP_HeartRate | 0.0291 | BP-heart interaction |
| SystolicBP_squared | 0.0279 | Squared systolic |
- Glucose Dominance: Glucose-related features (raw, transformed, interactions) dominate top importance, highlighting diabetes risk in pregnancy
- BP Interactions: Strong predictive power from BP-glucose and BP-heart interactions
- Non-Linear Effects: Non-linear transformations (squared, cubed, log, sqrt) capture complex health relationships
- Composite Scores: Risk scores (Metabolic, Cardiovascular) provide holistic health assessment
- Early Detection: Identify at-risk pregnancies using routine measurements
- Resource Allocation: Prioritize high-risk cases in resource-limited settings
- Preventive Care: Guide interventions based on risk factors
- Clinical Support: Augment physician decision-making with ML insights
- Global Health: Applicable in developing regions with basic health monitoring
- Public Health: Enable population-level maternal risk screening
- Research: Foundation for advanced maternal health analytics
- Handling: Drop NaN/duplicates, numeric conversion
- Scaling: StandardScaler/RobustScaler where applicable
- Encoding: Label encoding for target
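The preprocessing steps above might look roughly like the sketch below. The helper name `preprocess`, the target column name `RiskLevel` (taken from the prediction file header), and the raw label spellings in `RISK_TO_LABEL` are assumptions; the integer codes follow the 0/1/2 mapping given earlier. Only `StandardScaler` is shown, and the demo rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed raw label spellings, mapped to the documented codes 0 / 1 / 2
RISK_TO_LABEL = {"low risk": 0, "mid risk": 1, "high risk": 2}


def preprocess(df, feature_cols, target_col="RiskLevel"):
    """Convert features to numeric, drop NaN/duplicate rows, encode the target."""
    out = df.copy()

    # Numeric conversion: unparseable values become NaN
    for col in feature_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")

    # Drop incomplete rows, then exact duplicates
    out = out.dropna(subset=feature_cols + [target_col]).drop_duplicates()

    # Label encoding; astype raises if a label is missing from the mapping
    y = (
        out[target_col]
        .str.strip()
        .str.lower()
        .map(RISK_TO_LABEL)
        .astype("int64")
    )
    return out[feature_cols], y


# Made-up rows: one duplicate and one unparseable age
raw = pd.DataFrame(
    {
        "Age": ["25", "35", "35", "bad"],
        "SystolicBP": [120, 140, 140, 130],
        "RiskLevel": ["low risk", "high risk", "high risk", "mid risk"],
    }
)
X_clean, y_clean = preprocess(raw, ["Age", "SystolicBP"])

# Scaling: fit on training data only, then reuse the scaler for test data
scaler = StandardScaler().fit(X_clean)
X_scaled = scaler.transform(X_clean)
print(X_scaled, y_clean.tolist())
```

Tree-based models such as Random Forest or CatBoost are insensitive to monotonic feature scaling, so the scaler matters mainly for the linear, SVM, and nearest-neighbor models.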
- Primary: Accuracy
- Cross-validation: Stratified K-Fold (n=5)
- Additional: Classification report, confusion matrix
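A compact sketch of this evaluation setup, using out-of-fold predictions from stratified 5-fold cross-validation, is shown below. The data is synthetic and the random forest is only a placeholder model; the class names match the three risk levels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

RANDOM_STATE = 42

# Synthetic 3-class data standing in for the engineered features
X, y = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=5,
    n_classes=3,
    random_state=RANDOM_STATE,
)

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
model = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)

# Each sample is predicted by a model that never saw it during training
y_pred = cross_val_predict(model, X, y, cv=cv)

acc = accuracy_score(y, y_pred)
cm = confusion_matrix(y, y_pred, labels=[0, 1, 2])

print(f"Out-of-fold accuracy: {acc:.4f}")
print(cm)
print(classification_report(y, y_pred, target_names=["Low Risk", "Mid Risk", "High Risk"]))
```

For a risk screening task, the High Risk row of the confusion matrix is worth reading alongside accuracy, since it shows how many high-risk cases were assigned to a lower category.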
- Uses Matplotlib and Seaborn for distributions, correlations, importance plots
- Predictions: submission_final_ensemble.csv (Id, RiskLevel)
- Models: Trained ensembles and base models
- Analysis: Feature importance, performance rankings
Contributions welcome in:
- Model enhancements
- Additional medical features
- Performance optimizations
- Clinical validations
@misc{maternal_health_prediction_2025,
title={Maternal Health Risk Prediction},
author={Obidur Rahman},
year={2025},
url={https://github.com/ashfinn/maternal-health}
}

MIT License - Free for use, modification, and distribution with attribution.
Disclaimer: This is for educational/research purposes. Consult healthcare professionals for medical decisions.