In this analysis, we explored several classification models to predict whether a patient is at risk of heart failure from clinical data and lifestyle factors. After evaluating multiple models through cross-validation, we selected Logistic Regression as our final model because it performed best overall across classification metrics. The model showed promising results on the unseen test set, with an accuracy of 86% and F1-scores of 0.88 for the positive class (at risk) and 0.84 for the negative class (not at risk). Of the 276 observations in the test set, the model correctly identified 144 at-risk cases and 97 not-at-risk cases, with 23 false positives (cases predicted as at risk when there is no risk) and 12 false negatives (cases predicted as not at risk when there is risk). Although these scores are encouraging for a first iteration, there is room to improve by tuning the hyperparameters and the model's decision threshold to reduce false negatives, which are critical in medical applications. Overall, this model shows potential to support clinical professionals in assessing patients during screening.
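As a quick sanity check on the figures above, the following minimal sketch derives accuracy, recall, and per-class F1 from the four confusion-matrix counts, and illustrates how lowering a decision threshold can reduce false negatives. The function names, the probability values, and the labels in the threshold example are hypothetical and are not taken from the project's code or data. Values computed this way can be compared with the reported scores; any small gap may come from rounding or from a different run of the pipeline.

```python
from typing import Dict, Sequence


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Derive standard classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision_pos": tp / (tp + fp),
        "recall_pos": tp / (tp + fn),
        # F1 = 2*TP / (2*TP + FP + FN); for the negative class, TN plays the role of TP.
        "f1_pos": 2 * tp / (2 * tp + fp + fn),
        "f1_neg": 2 * tn / (2 * tn + fp + fn),
    }


def count_false_negatives(
    y_true: Sequence[int], proba: Sequence[float], threshold: float
) -> int:
    """Count at-risk cases (label 1) predicted as not at risk (proba < threshold)."""
    return sum(1 for y, p in zip(y_true, proba) if y == 1 and p < threshold)


# Counts reported for the test set in this README.
metrics = metrics_from_counts(tp=144, fp=23, fn=12, tn=97)

# Hypothetical labels and predicted probabilities, for illustration only.
example_labels = [1, 1, 1, 0, 0, 1]
example_proba = [0.90, 0.45, 0.35, 0.40, 0.20, 0.60]

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
    for threshold in (0.5, 0.3):
        fn_count = count_false_negatives(example_labels, example_proba, threshold)
        print(f"threshold={threshold}: false negatives={fn_count}")
```

Lowering the threshold trades false negatives for false positives, so any change would need to be evaluated against both error types.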
The dataset used in this project is pulled from a repository of the University of Minho, Portugal. It was created by Federico Soriano Palacios (2021) and integrates five heart-related datasets combined over 11 common features that can be used to predict possible heart disease. The five datasets are part of the “Heart Disease” dataset (Janosi et al., 1989) in the UCI Machine Learning Repository, and were originally sourced from the Hungarian Institute of Cardiology, the University Hospital of Zurich, the University Hospital of Basel, the V.A. Medical Center of Long Beach, and the Cleveland Clinic Foundation. Each row of the dataset contains 11 features that describe the patient’s age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting ECG result, maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise relative to rest, and slope of the peak exercise ST segment, plus a target column indicating the presence or absence of heart disease.
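The structure described above (11 features plus one target) can be checked against a CSV header before running the analysis. The sketch below is illustrative only: the column names are an assumption based on the commonly distributed version of this dataset and should be verified against the header of the downloaded heart.csv, and `check_header` is a hypothetical helper, not part of this repository.

```python
import csv
import io
from typing import List

# Assumed column names; verify against the actual header of data/raw/heart.csv.
EXPECTED_FEATURES: List[str] = [
    "Age", "Sex", "ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
    "RestingECG", "MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope",
]
TARGET = "HeartDisease"


def check_header(csv_text: str) -> List[str]:
    """Return the expected columns that are missing from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    expected = EXPECTED_FEATURES + [TARGET]
    return [column for column in expected if column not in header]


# Header-only example text, built from the assumed names (no real data rows).
sample_header = ",".join(EXPECTED_FEATURES + [TARGET]) + "\n"

if __name__ == "__main__":
    missing = check_header(sample_header)
    print("missing columns:", missing)
```

To check the real file, the same function could be given the contents of the downloaded CSV, for example by reading data/raw/heart.csv as text.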
The final report can be found here.
All library dependencies are specified in environment.yml for conda or conda-lock.yml for reproducible builds.
git clone https://github.com/EricYangg/Heart-Failure-Classification.git
cd Heart-Failure-Classification
To reproduce the analysis, first set up the environment using one of the following three options:
Option 1: Use conda-lock
- Create the environment from the lock file by running the following command from the root of this repository:
conda-lock install --name heart-failure-classification conda-lock.yml
- Switch to the project's environment by running the following line from the terminal:
conda activate heart-failure-classification
Option 2: Use environment.yml
conda env create -f environment.yml
conda activate heart-failure-classification
Option 3: Use docker
- Install and launch Docker on your computer.
- Run the Docker container: Navigate to the root of this project on your computer using the command line and enter the following command to launch the container:
docker-compose up
- Access Jupyter Lab: In the terminal, look for a URL that starts with
http://127.0.0.1:8888/lab?token= (for an example, see the highlighted text in the terminal below). Copy and paste that URL into your browser to open Jupyter Lab.
With the environment set up, run the analysis using one of the following two options:
Option 1: Use Makefile
- Clean the project: Open a terminal, navigate to the root of this project, and run:
make clean
This removes any previously generated files to start with a clean environment.
- Run the analysis:
make all
This builds the project and runs the entire analysis workflow. The Makefile defines the complete analysis pipeline and the execution order of all scripts. Users can review it to understand how each stage of the workflow (data processing, modeling, and reporting) fits together.
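To make the idea of a fixed execution order concrete, here is a minimal Python sketch that runs the stages one after another and stops at the first failure. It is not the project's Makefile: the commands are copied from the manual option in this README, the ordering is assumed from the numeric script prefixes, and `run_pipeline` is a hypothetical helper. Unlike make, this sketch does not track which outputs are already up to date.

```python
import shlex
import subprocess
from typing import List

# Commands copied from the "Run scripts manually" option in this README.
# The order is an assumption; the Makefile remains the source of truth.
STAGES: List[List[str]] = [
    ["python", "scripts/01_download_data.py",
     "--url=https://epl.di.uminho.pt/~jcr/AULAS/ATP2021/datasets/heart.csv",
     "--write_to=data/raw"],
    ["python", "scripts/02_validate_n_split.py",
     "--logs-to=logs", "--raw-data=data/raw/heart.csv",
     "--data-to=data/validated", "--seed=123"],
    ["python", "scripts/03_eda_validate.py",
     "--training-data=data/validated/heart_train.csv",
     "--test-data=data/validated/heart_test.csv",
     "--plot-to=results/figures", "--data-to=data/validated"],
    ["python", "scripts/04_preprocessor.py",
     "--training-data=data/validated/heart_train.csv",
     "--preprocessor-to=results/models", "--seed=123"],
    ["python", "scripts/05_fit_heart_disease_model.py",
     "--x-train-data=data/validated/X_train.csv",
     "--y-train-data=data/validated/y_train.csv",
     "--x-test-data=data/validated/X_test.csv",
     "--y-test-data=data/validated/y_test.csv",
     "--preprocessor=results/models/heart_preprocessor.pickle",
     "--pipeline-to=results/models", "--results-to=results/tables",
     "--figures-to=results/figures", "--seed=123", "--cv-folds=5"],
    ["quarto", "render", "reports/heart_disease_analysis.qmd", "--to", "html"],
    ["quarto", "render", "reports/heart_disease_analysis.qmd", "--to", "pdf"],
]


def run_pipeline(stages: List[List[str]], dry_run: bool = True) -> List[str]:
    """Run each stage in order; with dry_run=True, only print the commands."""
    executed = []
    for command in stages:
        printable = shlex.join(command)
        print(printable)
        if not dry_run:
            # check=True raises CalledProcessError, halting later stages.
            subprocess.run(command, check=True)
        executed.append(printable)
    return executed


if __name__ == "__main__":
    run_pipeline(STAGES, dry_run=True)
```

Calling `run_pipeline(STAGES, dry_run=False)` from the project root would execute the stages for real, assuming the environment is active.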
Option 2: Run scripts manually
- To run the analysis manually, open a terminal and run the following commands:
python scripts/01_download_data.py \
--url="https://epl.di.uminho.pt/~jcr/AULAS/ATP2021/datasets/heart.csv" \
--write_to=data/raw
python scripts/02_validate_n_split.py \
--logs-to=logs \
--raw-data=data/raw/heart.csv \
--data-to=data/validated \
--seed=123
python scripts/03_eda_validate.py \
--training-data=data/validated/heart_train.csv \
--test-data=data/validated/heart_test.csv \
--plot-to=results/figures \
--data-to=data/validated
python scripts/04_preprocessor.py \
--training-data=data/validated/heart_train.csv \
--preprocessor-to=results/models \
--seed=123
python scripts/05_fit_heart_disease_model.py \
--x-train-data=data/validated/X_train.csv \
--y-train-data=data/validated/y_train.csv \
--x-test-data=data/validated/X_test.csv \
--y-test-data=data/validated/y_test.csv \
--preprocessor=results/models/heart_preprocessor.pickle \
--pipeline-to=results/models \
--results-to=results/tables \
--figures-to=results/figures \
--seed=123 \
--cv-folds=5
quarto render reports/heart_disease_analysis.qmd --to html
quarto render reports/heart_disease_analysis.qmd --to pdf
- To verify that each function works as expected, unit tests are provided as Python scripts. To run these tests, go to the project root directory in the terminal and run the following command:
pytest tests/
- To shut down the container and clean up the resources, press Ctrl+C in the terminal where the container is running, and then run:
docker compose rm
- Affiliation: University of British Columbia
- Email: omar.ramos19@gmail.com
- GitHub: @mayitoxix
- Affiliation: University of British Columbia
- Email: marasanchezrom@gmail.com
- GitHub: @mara-sanchez1
- Affiliation: University of British Columbia
- Email: eric99yang@gmail.com
- GitHub: @EricYangg
The reproducible data science workflow implemented in this project was greatly inspired by Dr. Tiffany Timbers's DSCI 522 course.
The Heart Failure Classification project is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. Check out the license file for more information. If re-using or re-mixing, please provide attribution and a link to this webpage. The software code contained within this repository is licensed under the MIT license. Check out the license file for more information.
Dua, D., & Graff, C. (2017). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml
Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1989). Heart Disease [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X.
Savarese, G., Lund, L. H., & Becher, P. M. (2023). Global burden of heart failure: A comprehensive and updated review of epidemiology. Cardiovascular Research, 118(17), 3272–3287. https://doi.org/10.1093/cvr/cvac013 (PubMed: https://pubmed.ncbi.nlm.nih.gov/35150240/)
Barnett, M. P., Koppes, L. L. J., & … [et al.]. (2020). Cardiovascular risk factors: It’s time to focus on variability! Frontiers in Cardiovascular Medicine, 7, Article 80. https://doi.org/10.3389/fcvm.2020.00080 (PMC published version: https://pmc.ncbi.nlm.nih.gov/articles/PMC7379092/)