🧬 Cardi-HACK: Hypertrophic Cardiomyopathy Outcome Prediction

This repository contains a specialized Machine Learning pipeline designed to predict clinical outcomes (Severity & MACE) for patients with Hypertrophic Cardiomyopathy (HCM).

⚠️ Important: This repository is an exploratory and research-oriented project developed within the CardI-HACK challenge framework. Despite implementing advanced models and feature engineering strategies, predictive performance remains modest (≈10–15% accuracy).

This outcome reflects the intrinsic difficulty of predicting complex cardiovascular outcomes from baseline clinical and genetic data under strict feature constraints. The repository should be read as a record of that difficulty, not as a production-ready clinical solution.

The primary goal of this project is to document the modeling process, highlight real-world limitations of medical ML, and analyze why naive or even advanced approaches may fail in such settings.

This project is not intended for clinical decision-making or medical use.

The solution utilizes XGBoost for binary classification (Severity) and LightGBM for multiclass risk prediction (MACE), enhanced by genetic feature engineering and post-processing optimization.

🏆 Competition Context

Challenge: CardI-HACK: Artificial Intelligence for Hypertrophic Cardiomyopathy
Platform: Trustii.io
Goal: To develop a multimodal AI solution capable of predicting disease severity and major adverse cardiac events (MACE) using clinical data and genetic variants (SNPs).

Key Constraints & Challenges:

High Dimensionality: >100,000 Genetic Variants (SNPs) vs. small sample size.
Feature Limit: Models are restricted to using a maximum of 100 SNPs.
Class Imbalance: Significant imbalance in outcome classes (especially MACE Class 2).
Metric: Evaluation is based strictly on the Quadratic Weighted Kappa (QWK) score.
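
Because every modeling choice is judged by QWK, it helps to see how the score is computed. The snippet below is a minimal NumPy sketch of the standard definition (Cohen's kappa with quadratic disagreement weights). The function name `quadratic_weighted_kappa` is illustrative and is not taken from this repository's `src/utils` module.

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic disagreement weights."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # Observed confusion matrix: rows are true classes, columns are predictions.
    observed = np.zeros((n_classes, n_classes), dtype=float)
    np.add.at(observed, (y_true, y_pred), 1.0)

    # Expected counts if true and predicted labels were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic penalty: w[i, j] = (i - j)^2 / (n_classes - 1)^2.
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2

    denom = (weights * expected).sum()
    if denom == 0:  # every label falls in one class; kappa is undefined
        return 0.0
    return 1.0 - (weights * observed).sum() / denom
```

Confusing class 0 with class 2 is penalized four times as heavily as confusing adjacent classes, which is why the ordering of the MACE classes matters. scikit-learn exposes the same quantity as `cohen_kappa_score(y_true, y_pred, weights="quadratic")`.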

🚀 Key Features

Severity Pipeline (Binary):
- Model: XGBoost Classifier.
- Technique: Isotonic Calibration (CalibratedClassifierCV) to ensure probability reliability.
- Imbalance Handling: Custom sample weighting (Class 0 multiplier: 1.5).
- Validation: Stratified K-Fold (k=5).
MACE Pipeline (Multiclass):
- Model: LightGBM (Objective: multiclass).
- Feature Selection: Model-based SNP Selection (Top 100 SNPs).
- Optimization: Threshold Optimization algorithm to maximize Quadratic Weighted Kappa (QWK) score.
- Safety Constraint: Ensures priority SNPs (SNP1-75) are included.
Advanced Preprocessing:
- MICE Imputation for handling missing clinical values.
- Genetic Feature Engineering: Interaction terms (Pathogene × SNP) and Genetic Load calculation.
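
The genetic features listed above can be illustrated with a small NumPy sketch. It rests on assumptions that this README does not confirm: SNPs are encoded as allele counts (0/1/2), `Pathogene` is a binary carrier flag, and Genetic Load is the per-patient sum of allele counts. The function name `add_genetic_features` is hypothetical.

```python
import numpy as np


def add_genetic_features(snps, pathogene):
    """Return (interactions, genetic_load) for one cohort.

    snps:      (n_patients, n_snps) allele counts, assumed to be 0/1/2.
    pathogene: (n_patients,) flag, assumed to be 1 for carriers and 0 otherwise.
    """
    snps = np.asarray(snps, dtype=float)
    pathogene = np.asarray(pathogene, dtype=float)

    # Interaction terms: each SNP column multiplied by the carrier flag,
    # so a SNP only contributes to the interaction for carriers.
    interactions = snps * pathogene[:, None]

    # Genetic load (assumed definition): total allele count per patient,
    # treating missing genotypes (NaN) as zero.
    genetic_load = np.nansum(snps, axis=1)
    return interactions, genetic_load


# Toy cohort: three patients, three SNPs.
snps = np.array([[0, 1, 2], [2, 2, 0], [1, 0, 0]])
pathogene = np.array([1, 0, 1])
interactions, load = add_genetic_features(snps, pathogene)
```

In this toy cohort the second patient is not a carrier, so that row of `interactions` is all zeros while the genetic load is still computed from the raw genotypes.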

📂 Project Structure

├── config/             # YAML configuration files for models
├── data/               # Data storage (Not included in repo)
│   ├── raw/            # Place cardihack_final_train.csv and cardihack_final_test.csv here
│   └── processed/      # Intermediate files
├── scripts/            # Executable scripts (setup, training)
├── src/
│   ├── features/       # SNP selection & genetic feature engineering
│   ├── models/         # XGBoost & LightGBM wrapper classes
│   ├── pipelines/      # Full execution pipelines (Severity/MACE)
│   ├── preprocessing/  # Missing value handling (MICE)
│   └── utils/          # Metrics (QWK) and Config
├── setup_project.py    # Sets up the project directories
└── requirements.txt    # Project dependencies

🛠️ Installation & Setup

1. Clone the Repository

git clone https://github.com/vuralogzhn/cardihack-project.git
cd cardihack-project

2. Install Dependencies

pip install -r requirements.txt

3. Initialize Project Structure

Since Git does not track empty folders, run this script to create the necessary data and models directories:

python scripts/setup_project.py

4. ⚠️ Download Data (Important)

The dataset is private and not included in this repository.

Download cardihack_final_train.csv and cardihack_final_test.csv from the competition source: https://app.trustii.io/datasets/1548
Place them into the data/raw/ folder created in the previous step.

5. 🏃‍♂️ Run Pipeline & Generate Submission

Once the environment is set up and data is placed, run the main prediction script. This will:

Train both Severity and MACE models.
Perform threshold optimization.
Generate the final submission CSV file.

python scripts/predict_submission.py

📊 Methodology

Severity Prediction

We define Severity as a binary outcome. The pipeline uses XGBoost with a low learning rate (0.05) and moderate depth (4). To address class imbalance without over-fitting, we apply specific sample weights during training and calibrate the final probabilities using Isotonic Regression.
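
The following is a rough sketch of that recipe, not the repository's actual pipeline code. To keep it dependent on scikit-learn only, `GradientBoostingClassifier` stands in for XGBoost; replacing it with `xgboost.XGBClassifier(learning_rate=0.05, max_depth=4)` should work the same way, since that class follows the scikit-learn estimator interface. The synthetic data, `n_estimators=100`, and the use of the 5-fold split as the calibration folds are all assumptions.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced stand-in data; the real pipeline reads the competition CSVs.
X, y = make_classification(
    n_samples=400, n_features=20, weights=[0.3, 0.7], random_state=0
)

# Custom sample weights: class 0 rows count 1.5x, all other rows 1.0x.
sample_weight = np.where(y == 0, 1.5, 1.0)

# Stand-in booster with the learning rate and depth quoted above.
base = GradientBoostingClassifier(
    learning_rate=0.05, max_depth=4, n_estimators=100, random_state=0
)

# Isotonic calibration fitted on stratified 5-fold splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = CalibratedClassifierCV(base, method="isotonic", cv=cv)
model.fit(X, y, sample_weight=sample_weight)

# Calibrated probability of the positive (severe) class.
proba = model.predict_proba(X)[:, 1]
```

Isotonic regression is non-parametric and can overfit when the calibration folds are small, so sigmoid calibration (`method="sigmoid"`) is a reasonable alternative to compare against on a dataset of this size.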

MACE Prediction

MACE (Major Adverse Cardiac Events) is a 3-class problem (0, 1, 2).

SNP Selection: We filter the more than 100,000 genetic markers down to the 100 most predictive SNPs using a model-based selector.
Prediction: A LightGBM model predicts class probabilities.
Optimization: Instead of taking the argmax of the predicted probabilities, we tune the decision thresholds (probability cutoffs) to directly maximize the competition metric, Quadratic Weighted Kappa (Cohen's kappa with quadratic weights).
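
The selection and thresholding steps above can be sketched as follows. This README does not specify how the thresholds are applied, so the sketch makes its own assumptions: class probabilities are collapsed into an expected class score, and two cutoffs on that score are found by grid search. The names `select_snps`, `qwk`, and `optimize_thresholds` are hypothetical, and the importance scores are assumed to come from an already fitted model.

```python
import itertools

import numpy as np


def select_snps(importances, names, priority, k=100):
    """Keep all priority SNPs, then fill up to k with the most important rest."""
    priority_set = set(priority)
    chosen = [n for n in names if n in priority_set][:k]
    # Indices of all SNPs sorted by descending importance.
    order = np.argsort(-np.asarray(importances, dtype=float), kind="stable")
    for i in order:
        if len(chosen) >= k:
            break
        if names[i] not in priority_set:
            chosen.append(names[i])
    return chosen


def qwk(y_true, y_pred, n_classes=3):
    """Quadratic Weighted Kappa computed from the confusion matrix."""
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (np.asarray(y_true), np.asarray(y_pred)), 1.0)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    denom = (weights * expected).sum()
    return 0.0 if denom == 0 else 1.0 - (weights * observed).sum() / denom


def optimize_thresholds(proba, y_true, grid=None):
    """Grid-search two cutoffs t1 < t2 on the expected class score."""
    proba = np.asarray(proba, dtype=float)
    score = proba @ np.arange(proba.shape[1])  # expected class, in [0, 2]
    grid = np.linspace(0.1, 1.9, 37) if grid is None else grid
    best_kappa, best_cutoffs = -np.inf, None
    for t1, t2 in itertools.combinations(grid, 2):  # sorted grid gives t1 < t2
        pred = np.digitize(score, [t1, t2])  # 0 below t1, 1 between, 2 from t2 up
        kappa = qwk(y_true, pred)
        if kappa > best_kappa:
            best_kappa, best_cutoffs = kappa, (float(t1), float(t2))
    return best_kappa, best_cutoffs


# Toy example: argmax mislabels two samples, tuned cutoffs recover them.
y_true = np.array([0, 0, 1, 1, 2, 2])
proba = np.array([
    [0.8, 0.2, 0.0], [0.7, 0.3, 0.0],
    [0.5, 0.4, 0.1], [0.4, 0.5, 0.1],
    [0.3, 0.4, 0.3], [0.2, 0.5, 0.3],
])
argmax_kappa = qwk(y_true, proba.argmax(axis=1))
best_kappa, (t1, t2) = optimize_thresholds(proba, y_true)
```

Cutoffs tuned on the same data used to report the score will look optimistic. Fitting them on out-of-fold predictions is the safer choice, particularly with a rare class such as MACE Class 2.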
