Predict patient survival (mortality) using Electronic Health Records (EHR) with machine learning.
This project uses multiple ML models, handles class imbalance, and interprets predictions with SHAP.
- Dataset/: Original dataset (https://www.kaggle.com/competitions/patient-survival-prediction/data)
- Notebooks/: EDA, preprocessing, feature engineering, and model building in Colab
- Source: Patient Survival Prediction (Kaggle competition)
- Rows: 91,713
- Columns: 85
- Features: Demographics, vitals, lab results, ICU scores (APACHE)
- Target: survived (0 = Died, 1 = Survived)
- Class distribution: Imbalanced; handled using SMOTE and class-weighted models

Data Cleaning & Preprocessing
- Handle missing values using mean/median/mode
- Remove duplicates
- Encode categorical variables
- Scale numeric features
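
The cleaning steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the notebook code: the column names and the tiny DataFrame are made up, and median/most-frequent imputation is one possible reading of "mean/median/mode".

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame standing in for the EHR table (column names are hypothetical).
df = pd.DataFrame({
    "age": [65.0, np.nan, 48.0, 65.0, 71.0],
    "heart_rate": [88.0, 120.0, np.nan, 88.0, 95.0],
    "gender": ["M", "F", np.nan, "M", "F"],
    "survived": [1, 0, 1, 1, 1],
})

# Remove exact duplicate rows (the fourth row repeats the first).
df = df.drop_duplicates().reset_index(drop=True)

X = df.drop(columns="survived")
y = df["survived"]

numeric_cols = ["age", "heart_rate"]
categorical_cols = ["gender"]

preprocess = ColumnTransformer(
    [
        # Numeric columns: fill gaps with the median, then standardize.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Categorical columns: fill gaps with the mode, then one-hot encode.
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_clean = preprocess.fit_transform(X)
print(X_clean.shape)
```

On real data the transformer would normally be fitted on the training split only and then applied to the test split, so that imputation and scaling statistics do not leak from test rows.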

Feature Selection
- ANOVA F-test, Mutual Information, Lasso
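
A rough sketch of how the three selectors could be combined, using scikit-learn on synthetic data. The choice of `k`, the Lasso `alpha`, and the "at least two votes" rule are assumptions for illustration; the notebooks may combine the methods differently. Lasso here treats the 0/1 target as numeric, and an L1-penalized logistic regression is a common alternative.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    SelectFromModel,
    SelectKBest,
    f_classif,
    mutual_info_classif,
)
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the preprocessed EHR features.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=2,
    weights=[0.1, 0.9], random_state=0,
)
X = StandardScaler().fit_transform(X)

k = 8  # number of features each filter keeps (arbitrary for this sketch)

# ANOVA F-test: ranks features by between-class vs. within-class variance.
anova = SelectKBest(f_classif, k=k).fit(X, y)

# Mutual information: also captures non-linear dependence on the target.
mi = SelectKBest(partial(mutual_info_classif, random_state=0), k=k).fit(X, y)

# Lasso: keeps features whose L1-penalized coefficient is non-zero.
lasso = SelectFromModel(Lasso(alpha=0.01)).fit(X, y)

# Count how many methods selected each feature and keep those with two or more votes.
votes = (
    anova.get_support().astype(int)
    + mi.get_support().astype(int)
    + lasso.get_support().astype(int)
)
keep = np.flatnonzero(votes >= 2)
print("selected feature indices:", keep)
```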

Modeling
- Logistic Regression (with Polynomial Features)
- Random Forest (class_weight='balanced')
- XGBoost
- SVM, MLP Classifier
- Train/Test split with stratification
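
As a minimal sketch of the setup above, the snippet below does a stratified split and fits two of the listed models with scikit-learn. The synthetic data, the 80/20 split, and all hyperparameters are assumptions rather than the project's settings. XGBoost (`xgboost.XGBClassifier`), `SVC`, and `MLPClassifier` would slot into the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, imbalanced stand-in for the selected features.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.1, 0.9], random_state=42,
)

# stratify=y keeps the class ratio the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    # Degree-2 polynomial terms let the linear model capture interactions.
    "logreg_poly": make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=2, include_bias=False),
        LogisticRegression(max_iter=1000),
    ),
    # class_weight="balanced" reweights classes inversely to their frequency.
    "random_forest": RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=42
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))
```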

Class Imbalance Handling
- SMOTE oversampling on training data
- Class-weight balancing in models
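
To show what SMOTE does, here is a simplified NumPy sketch of its core idea: new minority rows are created by interpolating between a minority row and one of its nearest minority neighbours. It is not the imbalanced-learn implementation (which the project presumably uses, though that is an assumption), and `smote_sketch` is a hypothetical helper. It assumes the minority class has more than `k` rows and that class 0 (Died) is the minority.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)  # random minority rows
    picks = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picks] - X_min[base])


# Made-up training split: 20 rows of class 0 and 180 rows of class 1.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 4))
y_train = np.array([0] * 20 + [1] * 180)

X_min = X_train[y_train == 0]
X_new = smote_sketch(X_min, n_new=160)

X_bal = np.vstack([X_train, X_new])
y_bal = np.concatenate([y_train, np.zeros(len(X_new), dtype=int)])
print(np.bincount(y_bal))  # [180 180]
```

Oversampling is applied to the training split only, as stated above; the test split keeps its original class ratio so that the reported metrics reflect real-world prevalence.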

Model Evaluation
- Accuracy, F1-score, Recall, Precision
- ROC-AUC and confusion matrix
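
The metrics above can be computed with `sklearn.metrics`. The labels and probabilities below are invented purely to make the arithmetic easy to follow; they are not results from this project.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical test set: 8 survivors (1) and 2 deaths (0).
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.8, 0.95, 0.7, 0.6, 0.85, 0.4, 0.75, 0.3, 0.65])
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 decision threshold

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[1 1]
           #  [1 7]]

accuracy = accuracy_score(y_true, y_pred)              # 0.8
precision = precision_score(y_true, y_pred)            # 0.875
recall = recall_score(y_true, y_pred)                  # 0.875
f1 = f1_score(y_true, y_pred)                          # 0.875
died_recall = recall_score(y_true, y_pred, pos_label=0)  # 0.5
auc = roc_auc_score(y_true, y_prob)                    # 0.875 (uses probabilities)

print(accuracy, precision, recall, f1, died_recall, auc)
```

In this toy case F1 for the majority class (Survived) is higher than accuracy, while only half of the deaths are caught. On imbalanced data it is worth reporting recall for the Died class alongside the headline numbers.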

Interpretability
- SHAP TreeExplainer on XGBoost for feature importance and individual predictions
- Top predictive features: ICU death probability, SpO₂ min, temperature min, Glasgow Coma Scale, ventilated status
- Best performing model: XGBoost (Accuracy: ~0.88, F1-score: ~0.93, ROC-AUC: ~0.88)
- SHAP insights:
  - High ICU mortality probability → increases predicted death
  - Low oxygen saturation → higher mortality risk
  - Ventilated patients → higher predicted death risk
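
To make the SHAP statements above concrete, the sketch below computes exact Shapley values by brute force for a toy risk function. It is only an illustration of what a per-feature attribution means: the project uses SHAP's TreeExplainer on XGBoost, which computes such attributions efficiently for tree models, whereas `toy_risk`, the patient row, and the baseline row here are all hypothetical.

```python
from itertools import combinations
from math import factorial


def toy_risk(row):
    """Hypothetical death-risk score; not the trained XGBoost model."""
    icu_death_prob, spo2_min, ventilated = row
    risk = 0.5 * icu_death_prob + 0.01 * (100 - spo2_min) + 0.1 * ventilated
    if ventilated and spo2_min < 90:
        risk += 0.05  # interaction between ventilation and low oxygen saturation
    return risk


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one row; absent features take the baseline value."""
    n = len(x)

    def value(subset):
        return predict([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


names = ["icu_death_prob", "spo2_min", "ventilated"]
patient = [0.6, 85, 1]    # high ICU death probability, low SpO2, ventilated
baseline = [0.1, 97, 0]   # reference patient

phi = shapley_values(toy_risk, patient, baseline)
for name, contribution in zip(names, phi):
    print(f"{name}: {contribution:+.3f}")
```

Each contribution is positive for this patient, meaning every feature pushes the score toward death relative to the baseline, and the contributions add up to the difference between the patient's score and the baseline score.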
- Clone the repo:

      git clone https://github.com/Srikeerthiraja/patient_survival_ml.git
      cd patient_survival_ml

