vuralogzhn/cardihack-project

🧬 Cardi-HACK: Hypertrophic Cardiomyopathy Outcome Prediction

This repository contains a specialized Machine Learning pipeline designed to predict clinical outcomes (Severity & MACE) for patients with Hypertrophic Cardiomyopathy (HCM).

⚠️ Important: This repository is an exploratory and research-oriented project developed within the CardI-HACK challenge framework. Despite implementing advanced models and feature engineering strategies, predictive performance remains modest (≈10–15% accuracy).

This outcome reflects the intrinsic difficulty of predicting complex cardiovascular outcomes from baseline clinical and genetic data under strict feature constraints. The pipeline should be read as a research exercise, not as a production-ready clinical solution.

The primary goal of this project is to document the modeling process, highlight real-world limitations of medical ML, and analyze why naive or even advanced approaches may fail in such settings.

This project is not intended for clinical decision-making or medical use.

The solution utilizes XGBoost for binary classification (Severity) and LightGBM for multiclass risk prediction (MACE), enhanced by genetic feature engineering and post-processing optimization.

🏆 Competition Context

  • Challenge: CardI-HACK: Artificial Intelligence for Hypertrophic Cardiomyopathy
  • Platform: Trustii.io
  • Goal: To develop a multimodal AI solution capable of predicting disease severity and major adverse cardiac events (MACE) using clinical data and genetic variants (SNPs).

Key Constraints & Challenges:

  • High Dimensionality: >100,000 Genetic Variants (SNPs) vs. small sample size.
  • Feature Limit: Models are restricted to using a maximum of 100 SNPs.
  • Class Imbalance: Significant imbalance in outcome classes (especially MACE Class 2).
  • Metric: Evaluation is based strictly on the Quadratic Weighted Kappa (QWK) score.
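For reference, the Quadratic Weighted Kappa for N ordered classes is commonly written as below. The notation is the standard textbook one, not taken from the challenge documentation: O is the observed confusion matrix between true and predicted labels, and E is the matrix expected under chance agreement, built from the row and column totals of O.

```latex
\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}},
\qquad
w_{ij} = \frac{(i - j)^2}{(N - 1)^2},
\qquad
E_{ij} = \frac{\left(\sum_{k} O_{ik}\right)\left(\sum_{k} O_{kj}\right)}{\sum_{k,l} O_{kl}}
```

With the three MACE classes (N = 3), confusing class 0 with class 2 carries weight 1, while confusing adjacent classes carries weight 1/4, so distant mistakes are penalized four times as heavily.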

🚀 Key Features

  • Severity Pipeline (Binary):

    • Model: XGBoost Classifier.
    • Technique: Isotonic Calibration (CalibratedClassifierCV) to ensure probability reliability.
    • Imbalance Handling: Custom sample weighting (Class 0 multiplier: 1.5).
    • Validation: Stratified K-Fold (k=5).
  • MACE Pipeline (Multiclass):

    • Model: LightGBM (Objective: multiclass).
    • Feature Selection: Model-based SNP Selection (Top 100 SNPs).
    • Optimization: Threshold Optimization algorithm to maximize Quadratic Weighted Kappa (QWK) score.
    • Safety Constraint: Ensures priority SNPs (SNP1-75) are included.
  • Advanced Preprocessing:

    • MICE Imputation for handling missing clinical values.
    • Genetic Feature Engineering: Interaction terms (Pathogene × SNP) and Genetic Load calculation.
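The SNP selection constraint and the genetic features listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's code: the function names, the `SNP1` .. `SNP75` column naming, the 0/1/2 allele coding, and the definition of genetic load as a simple sum are all assumptions made for the example.

```python
def select_snps(importances, max_snps=100, priority=None):
    """Pick up to `max_snps` SNP names: priority SNPs first, then by importance.

    `importances` maps SNP name -> importance score from any fitted model.
    """
    if priority is None:
        # Assumption: priority SNPs are the columns named SNP1 .. SNP75.
        priority = [f"SNP{i}" for i in range(1, 76)]
    selected = [s for s in priority if s in importances][:max_snps]
    chosen = set(selected)
    # Fill the remaining slots with the highest-importance non-priority SNPs.
    for snp in sorted(importances, key=importances.get, reverse=True):
        if len(selected) >= max_snps:
            break
        if snp not in chosen:
            selected.append(snp)
            chosen.add(snp)
    return selected


def genetic_features(row, snps, pathogene_col="Pathogene"):
    """Return genetic load and Pathogene x SNP interactions for one patient.

    Assumption: SNPs are coded as 0/1/2 allele counts and Pathogene as 0/1.
    """
    features = {"genetic_load": sum(row[s] for s in snps)}
    for s in snps:
        features[f"{pathogene_col}_x_{s}"] = row[pathogene_col] * row[s]
    return features


# Toy usage with made-up importances: higher SNP index = more important.
importances = {f"SNP{i}": float(i) for i in range(1, 501)}
top = select_snps(importances)
patient = {"Pathogene": 1, "SNP1": 2, "SNP2": 0, "SNP3": 1}
print(len(top), genetic_features(patient, ["SNP1", "SNP2", "SNP3"]))
```

Putting the priority SNPs in first guarantees they survive the 100-SNP cap regardless of their importance scores; the remaining 25 slots go to whatever the model ranks highest.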

📂 Project Structure

├── config/             # YAML configuration files for models
├── data/               # Data storage (Not included in repo)
│   ├── raw/            # Place cardihack_final_train.csv and cardihack_final_test.csv here
│   └── processed/      # Intermediate files
├── scripts/            # Executable scripts (setup, training)
├── src/
│   ├── features/       # SNP selection & genetic feature engineering
│   ├── models/         # XGBoost & LightGBM wrapper classes
│   ├── pipelines/      # Full execution pipelines (Severity/MACE)
│   ├── preprocessing/  # Missing value handling (MICE)
│   └── utils/          # Metrics (QWK) and Config
├── setup_project.py    # To setup project files
└── requirements.txt    # Project dependencies

🛠️ Installation & Setup

1. Clone the Repository

git clone https://github.com/vuralogzhn/cardihack-project.git
cd cardihack-project

2. Install Dependencies

pip install -r requirements.txt

3. Initialize Project Structure

Since Git does not track empty folders, run this script to create the necessary data and models directories:

python scripts/setup_project.py

4. ⚠️ Download Data (Important)

The dataset is private and not included in this repository.

  1. Download cardihack_final_train.csv and cardihack_final_test.csv from the competition source: https://app.trustii.io/datasets/1548
  2. Place them into the data/raw/ folder created in the previous step.
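Before running the pipeline, it can help to confirm that both files landed in the right place. The snippet below is a small standalone check, not part of the repository; the helper name `missing_raw_files` is hypothetical, while the file names and the `data/raw/` folder come from the steps above.

```python
from pathlib import Path

REQUIRED_FILES = ("cardihack_final_train.csv", "cardihack_final_test.csv")


def missing_raw_files(raw_dir="data/raw"):
    """Return the required CSV names that are not present in `raw_dir`."""
    raw = Path(raw_dir)
    return [name for name in REQUIRED_FILES if not (raw / name).is_file()]


if __name__ == "__main__":
    missing = missing_raw_files()
    if missing:
        print("Missing from data/raw/:", ", ".join(missing))
    else:
        print("Both competition files are in place.")
```

Run it from the repository root so that the relative `data/raw` path resolves correctly.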

5. 🏃‍♂️ Run Pipeline & Generate Submission

Once the environment is set up and data is placed, run the main prediction script. This will:

  1. Train both Severity and MACE models.
  2. Perform threshold optimization.
  3. Generate the final submission CSV file.

python scripts/predict_submission.py

📊 Methodology

Severity Prediction

We define Severity as a binary outcome. The pipeline uses XGBoost with a low learning rate (0.05) and moderate depth (4). To address class imbalance without overfitting, we weight training samples (Class 0 weights are multiplied by 1.5) and calibrate the final probabilities using isotonic regression.
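The two ideas in this paragraph, class-dependent sample weights and isotonic calibration, can be illustrated without any ML library. The sketch below is not the pipeline's code (which relies on XGBoost and CalibratedClassifierCV); it assumes a base weight of 1.0 for every sample, and `isotonic_fit` is a plain pool-adjacent-violators routine, the classic algorithm for fitting a non-decreasing mapping from raw scores to calibrated probabilities. Both function names are hypothetical.

```python
def sample_weights(labels, class0_multiplier=1.5):
    """Weight each sample; class 0 is up-weighted (base weight 1.0 is assumed)."""
    return [class0_multiplier if y == 0 else 1.0 for y in labels]


def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: fit a non-decreasing step function.

    Returns the scores in ascending order and the calibrated value for each.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    blocks = []  # each block is [sum_of_labels, count]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return xs, fitted


# Toy usage: raw model scores and the true binary labels.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
y = [0, 1, 0, 0, 1, 1]
print(sample_weights(y))
print(isotonic_fit(raw_scores, y))
```

In the toy data the scores 0.2, 0.3 and 0.4 violate monotonicity (labels 1, 0, 0), so they are pooled into a single block and share one calibrated value.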

MACE Prediction

MACE (Major Adverse Cardiac Events) is a 3-class problem (0, 1, 2).

  1. SNP Selection: We filter more than 100,000 genetic markers down to the top 100 most predictive SNPs using a model-based selector.
  2. Prediction: A LightGBM model predicts class probabilities.
  3. Optimization: Instead of taking the standard argmax, we optimize the decision thresholds (e.g., probability cutoffs) to directly maximize the competition metric, Quadratic Weighted Kappa (QWK).
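The threshold optimization step can be sketched as follows. The source does not specify how the cutoffs are applied, so this example makes an assumption: each patient's class probabilities are collapsed into an expected class value (p1 + 2·p2), and two cutoffs on that value are grid-searched to maximize QWK. All function names are hypothetical, and the QWK routine is a plain-Python implementation written for this example.

```python
from itertools import combinations


def quadratic_weighted_kappa(y_true, y_pred, n_classes=3):
    """Cohen's kappa with quadratic weights for labels 0 .. n_classes-1."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * true_hist[i] * pred_hist[j] / n
    # den == 0 only in the degenerate single-class case; report 0.0 there.
    return 1.0 - num / den if den else 0.0


def apply_thresholds(probas, t1, t2):
    """Map (p0, p1, p2) rows to labels via cutoffs on the expected class."""
    labels = []
    for _p0, p1, p2 in probas:
        score = p1 + 2 * p2  # expected class value, between 0 and 2
        labels.append(0 if score < t1 else 1 if score < t2 else 2)
    return labels


def optimize_thresholds(probas, y_true, step=0.05):
    """Grid-search cutoffs t1 < t2; returns (best_qwk, t1, t2)."""
    grid = [round(k * step, 10) for k in range(1, round(2 / step))]
    best = (-1.0, 0.5, 1.5)
    for t1, t2 in combinations(grid, 2):  # grid is ascending, so t1 < t2
        qwk = quadratic_weighted_kappa(y_true, apply_thresholds(probas, t1, t2))
        if qwk > best[0]:
            best = (qwk, t1, t2)
    return best


# Toy usage: argmax would label nearly every row as class 0 here.
probas = [(0.80, 0.15, 0.05), (0.70, 0.20, 0.10), (0.50, 0.30, 0.20),
          (0.45, 0.35, 0.20), (0.40, 0.30, 0.30), (0.35, 0.30, 0.35)]
y_true = [0, 0, 1, 1, 2, 2]
print(optimize_thresholds(probas, y_true))
```

To avoid an optimistic estimate, the cutoffs would normally be tuned on out-of-fold predictions and then applied unchanged to the test set.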
