This repository contains a specialized Machine Learning pipeline designed to predict clinical outcomes (Severity & MACE) for patients with Hypertrophic Cardiomyopathy (HCM).
⚠️ Important: This repository is an exploratory, research-oriented project developed within the CardI-HACK challenge framework. Despite advanced models and feature engineering strategies, predictive performance remains modest (≈10–15% accuracy). This outcome reflects the intrinsic difficulty of predicting complex cardiovascular outcomes from baseline clinical and genetic data under strict feature constraints; the project is not a production-ready clinical solution.
The primary goal of this project is to document the modeling process, highlight real-world limitations of medical ML, and analyze why naive or even advanced approaches may fail in such settings.
This project is not intended for clinical decision-making or medical use.
The solution utilizes XGBoost for binary classification (Severity) and LightGBM for multiclass risk prediction (MACE), enhanced by genetic feature engineering and post-processing optimization.
- Challenge: CardI-HACK: Artificial Intelligence for Hypertrophic Cardiomyopathy
- Platform: Trustii.io
- Goal: To develop a multimodal AI solution capable of predicting disease severity and major adverse cardiac events (MACE) using clinical data and genetic variants (SNPs).
Key Constraints & Challenges:
- High Dimensionality: >100,000 Genetic Variants (SNPs) vs. small sample size.
- Feature Limit: Models are restricted to using a maximum of 100 SNPs.
- Class Imbalance: Significant imbalance in outcome classes (especially MACE Class 2).
- Metric: Evaluation is based strictly on the Quadratic Weighted Kappa (QWK) score.
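Because QWK penalizes a prediction by the squared distance between the predicted and true class, confusing class 0 with class 2 costs four times as much as confusing adjacent classes. A minimal sketch of the metric, assuming scikit-learn is available; the helper name `quadratic_weighted_kappa` and the toy labels are illustrative and not taken from the repository:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score


def quadratic_weighted_kappa(y_true, y_pred):
    """Cohen's kappa with quadratic weights (hypothetical helper name)."""
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")


# Toy labels for a 3-class ordinal outcome (illustrative data only).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_near = np.array([0, 0, 1, 2, 2, 2])  # one error, off by one class
y_far = np.array([0, 0, 1, 1, 2, 0])   # one error, off by two classes

qwk_perfect = quadratic_weighted_kappa(y_true, y_true)
qwk_near = quadratic_weighted_kappa(y_true, y_near)
qwk_far = quadratic_weighted_kappa(y_true, y_far)
print(qwk_perfect, qwk_near, qwk_far)
```

Both imperfect predictions contain exactly one mistake, but the off-by-two error lowers the score more than the off-by-one error.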
- Severity Pipeline (Binary):
  - Model: XGBoost Classifier.
  - Technique: Isotonic Calibration (`CalibratedClassifierCV`) to ensure probability reliability.
  - Imbalance Handling: Custom sample weighting (Class 0 multiplier: 1.5).
  - Validation: Stratified K-Fold (k=5).
- MACE Pipeline (Multiclass):
  - Model: LightGBM (Objective: `multiclass`).
  - Feature Selection: Model-based SNP Selection (Top 100 SNPs).
  - Optimization: Threshold optimization algorithm to maximize the Quadratic Weighted Kappa (QWK) score.
  - Safety Constraint: Ensures priority SNPs (SNP1-75) are included.
- Advanced Preprocessing:
  - MICE Imputation for handling missing clinical values.
  - Genetic Engineering: Interaction terms (Pathogene × SNP) and Genetic Load calculation.
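The preprocessing steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: it uses scikit-learn's `IterativeImputer` (a MICE-inspired imputer) as a stand-in for the MICE step, assumes `Pathogene` is a 0/1 carrier flag, assumes SNPs are encoded as allele counts (0/1/2), and defines genetic load as the per-patient sum of SNP values. The column names `age` and `bmi` and both function names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer


def impute_clinical(df, clinical_cols, seed=0):
    """Fill missing clinical values with a MICE-style iterative imputer."""
    imputer = IterativeImputer(max_iter=10, random_state=seed)
    out = df.copy()
    out[clinical_cols] = imputer.fit_transform(df[clinical_cols])
    return out


def add_genetic_features(df, snp_cols, carrier_col="Pathogene"):
    """Add an assumed genetic load (sum of SNP values) and carrier x SNP terms."""
    out = df.copy()
    out["genetic_load"] = df[snp_cols].sum(axis=1)
    # Multiply every SNP column by the carrier flag, row by row.
    interactions = df[snp_cols].mul(df[carrier_col], axis=0)
    interactions = interactions.add_prefix(f"{carrier_col}_x_")
    return pd.concat([out, interactions], axis=1)


# Tiny illustrative table; values are made up.
demo = pd.DataFrame({
    "age": [50.0, 61.0, np.nan, 45.0],
    "bmi": [24.0, np.nan, 30.5, 27.1],
    "Pathogene": [1, 0, 1, 0],
    "SNP1": [0, 1, 2, 1],
    "SNP2": [1, 0, 1, 2],
})
demo = impute_clinical(demo, ["age", "bmi"])
demo = add_genetic_features(demo, ["SNP1", "SNP2"])
print(demo.columns.tolist())
```

Imputing only the clinical columns keeps the SNP encoding untouched, so the interaction terms and the load are computed from the original genotype values.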
├── config/ # YAML configuration files for models
├── data/ # Data storage (Not included in repo)
│ ├── raw/ # Place cardihack_final_train.csv and cardihack_final_test.csv here
│ └── processed/ # Intermediate files
├── scripts/ # Executable scripts (setup, training)
├── src/
│ ├── features/ # SNP selection & Genetic engineering
│ ├── models/ # XGBoost & LightGBM wrapper classes
│ ├── pipelines/ # Full execution pipelines (Severity/MACE)
│ ├── preprocessing/ # Missing value handling (MICE)
│ └── utils/ # Metrics (QWK) and Config
├── setup_project.py # To setup project files
└── requirements.txt # Project dependencies

git clone https://github.com/vuralogzhn/cardihack-project.git
cd cardihack-project

pip install -r requirements.txt

Since Git does not track empty folders, run this script to create the necessary data and models directories:

python scripts/setup_project.py

The dataset is private and not included in this repository.
- Download `cardihack_final_train.csv` and `cardihack_final_test.csv` from the competition source: https://app.trustii.io/datasets/1548
- Place them into the `data/raw/` folder created in the previous step.
Once the environment is set up and data is placed, run the main prediction script. This will:
- Train both Severity and MACE models.
- Perform threshold optimization.
- Generate the final submission CSV file.
python scripts/predict_submission.py

We define Severity as a binary outcome. The pipeline uses XGBoost with a low learning rate (0.05) and moderate depth (4). To address class imbalance without overfitting, we apply class-specific sample weights during training and calibrate the final probabilities using isotonic regression.
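The weighting and calibration steps might look like the sketch below. To keep the example dependent on scikit-learn only, it uses `GradientBoostingClassifier` as a stand-in for XGBoost; the function name `fit_calibrated_severity`, the synthetic data, and the choice to run the 5 stratified folds inside `CalibratedClassifierCV` are assumptions, not the repository's implementation.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold


def fit_calibrated_severity(estimator, X, y, class0_multiplier=1.5, seed=0):
    """Fit a class-weighted, isotonic-calibrated binary classifier."""
    # Upweight class 0 samples by the multiplier; class 1 keeps weight 1.0.
    sample_weight = np.where(y == 0, class0_multiplier, 1.0)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = CalibratedClassifierCV(estimator, method="isotonic", cv=cv)
    model.fit(X, y, sample_weight=sample_weight)
    return model


# Synthetic stand-in data (the real dataset is private).
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.3, 0.7], random_state=0
)
# Stand-in base learner using the learning rate and depth quoted above.
base = GradientBoostingClassifier(learning_rate=0.05, max_depth=4, random_state=0)
model = fit_calibrated_severity(base, X, y)
proba = model.predict_proba(X)[:, 1]
print(proba[:5])
```

Since `xgboost.XGBClassifier` follows the scikit-learn estimator interface, passing `XGBClassifier(learning_rate=0.05, max_depth=4)` as `estimator` should give a setup closer to the one described.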
MACE (Major Adverse Cardiac Events) is a 3-class problem (0, 1, 2).
- SNP Selection: We filter the more than 100,000 genetic markers down to the 100 most predictive SNPs using a model-based selector.
- Prediction: A LightGBM model predicts class probabilities.
- Optimization: Instead of standard `argmax`, we optimize the decision thresholds (e.g., probability cutoffs) to directly maximize the competition metric (Quadratic Weighted Kappa, i.e. Cohen's kappa with quadratic weights).
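One common way to implement this kind of threshold optimization is sketched below; it is not necessarily the approach used in this repository. The sketch assumes the class probabilities are collapsed into an expected class score, that two ordered cut points split this score into classes 0, 1 and 2, and that the cut points are found by a simple grid search. All function names and the synthetic probabilities are hypothetical.

```python
import itertools

import numpy as np
from sklearn.metrics import cohen_kappa_score


def expected_class(proba):
    """Collapse class probabilities into one ordinal score in [0, n_classes - 1]."""
    return proba @ np.arange(proba.shape[1])


def apply_thresholds(score, thresholds):
    """Map scores to classes: below t1 -> 0, between t1 and t2 -> 1, else 2."""
    return np.digitize(score, thresholds)


def qwk(y_true, y_pred):
    return cohen_kappa_score(y_true, y_pred, labels=[0, 1, 2], weights="quadratic")


def optimize_thresholds(proba, y_true, grid=None):
    """Grid-search two ordered cut points that maximize QWK."""
    score = expected_class(proba)
    if grid is None:
        grid = np.linspace(0.1, 1.9, 37)
    # Start from the midpoint cut points so the result is never worse than them.
    best_t = (0.5, 1.5)
    best_score = qwk(y_true, apply_thresholds(score, best_t))
    for t1, t2 in itertools.combinations(grid, 2):  # guarantees t1 < t2
        candidate = qwk(y_true, apply_thresholds(score, [t1, t2]))
        if candidate > best_score:
            best_t, best_score = (float(t1), float(t2)), candidate
    return best_t, best_score


# Synthetic, imbalanced 3-class probabilities (illustrative only).
rng = np.random.default_rng(0)
y_true = rng.choice(3, size=200, p=[0.6, 0.3, 0.1])
logits = rng.normal(size=(200, 3))
logits[np.arange(200), y_true] += 1.0
proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

baseline_qwk = qwk(y_true, apply_thresholds(expected_class(proba), [0.5, 1.5]))
argmax_qwk = qwk(y_true, proba.argmax(axis=1))
best_t, best_qwk = optimize_thresholds(proba, y_true)
print(best_t, best_qwk, baseline_qwk, argmax_qwk)
```

Thresholds tuned on the same rows used for scoring will look optimistic, so in practice they are usually fitted on out-of-fold predictions and then applied unchanged to the test set.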