A machine learning pipeline that predicts heart disease risk from clinical features using statistical modeling techniques.
This project focuses on data distribution analysis, preprocessing, and classification performance evaluation, combining exploratory analysis with supervised learning.
The objective is to analyze cardiovascular health indicators and build a predictive model that identifies the presence of heart disease.
The workflow includes:
- ✔️ Numerical feature distribution analysis
- ✔️ Data preprocessing & feature preparation
- ✔️ Logistic Regression modeling
- ✔️ Performance evaluation using confusion matrix
heart_disease_classification/
│
├── heart_disase_classification.ipynb
│
│
└── README.md
Understanding the distribution of medical features is critical before training predictive models.
Observed Patterns:
- Age and Max Heart Rate follow near-normal distributions.
- Cholesterol shows wider variance and potential outliers.
- Oldpeak is heavily right-skewed, which suggests a transformation or scaling step may be needed before modeling.
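The patterns above can be checked numerically as well as visually. The following is a minimal sketch, not the notebook's actual code: it uses synthetic stand-in data, and the column names (`age`, `max_heart_rate`, `cholesterol`, `oldpeak`), the distribution parameters, and the helper `iqr_outlier_count` are all assumptions made for illustration. It computes per-column skewness, counts outliers with the 1.5 × IQR rule, and shows how `log1p` can reduce right skew.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data. Column names and distribution parameters are
# assumptions for illustration, not values from the project's dataset.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.normal(54, 9, n),                # roughly symmetric
    "max_heart_rate": rng.normal(150, 23, n),   # roughly symmetric
    "cholesterol": rng.normal(245, 50, n),
    "oldpeak": rng.exponential(1.0, n),         # right-skewed by construction
})

# Sample skewness per column: values near 0 suggest symmetry,
# large positive values indicate a long right tail.
skewness = df.skew()


def iqr_outlier_count(series: pd.Series) -> int:
    """Count values lying more than 1.5 * IQR beyond the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(mask.sum())


outliers = {col: iqr_outlier_count(df[col]) for col in df.columns}

# log1p is one common way to compress the right tail of non-negative values.
oldpeak_log_skew = np.log1p(df["oldpeak"]).skew()

print(skewness.round(2))
print(outliers)
print(f"oldpeak skewness after log1p: {oldpeak_log_skew:.2f}")
```

On the real dataset, the same calls would be applied to the DataFrame loaded in the notebook instead of the synthetic one.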
The confusion matrix below shows the performance of the baseline classification model.
Interpretation:
- The model correctly identifies a large share of the positive heart disease cases.
- Some false positives and false negatives remain, which leaves room for improvement with more advanced models.
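For readers who want to reproduce this kind of evaluation, here is a minimal sketch of a Logistic Regression baseline and its confusion matrix. It is not the notebook's code: the data comes from scikit-learn's `make_classification` as a stand-in for the clinical features, and the split ratio, `max_iter`, and random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the heart disease
# dataset (an assumption; swap in the real features and target).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=42
)

# Stratified split keeps the class balance similar in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline model without feature scaling or tuning.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# For binary labels {0, 1}, scikit-learn lays the matrix out as
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)        # share of actual positives that were caught
precision = tp / (tp + fp)     # share of predicted positives that were right

print(cm)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In a screening context, false negatives (missed disease cases) are usually the costlier error, so recall is worth reading alongside the raw matrix.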
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn
- Jupyter Notebook
git clone https://github.com/your-username/heart_disease_classification.git
cd heart_disease_classification
Install dependencies:
pip install pandas numpy matplotlib seaborn scikit-learn
Run:
heart_disase_classification.ipynb
- Feature scaling experiments
- Hyperparameter tuning
- Tree-based models (Random Forest / XGBoost)
- ROC-AUC & Precision-Recall analysis
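As a starting point for several of these items, the sketch below combines feature scaling, a small hyperparameter search, and ROC-AUC / Precision-Recall scoring. It is an illustration under stated assumptions rather than planned project code: the data is synthetic, and the grid of `C` values, the 5-fold cross-validation, and the step names `scale` and `clf` are arbitrary choices. Tree-based models are not covered here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; replace with the real dataset).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is fit only on training folds,
# which avoids leaking test-fold statistics during cross-validation.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid over the inverse regularization strength C.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

# Threshold-independent metrics use predicted probabilities, not labels.
proba = search.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, proba)
avg_precision = average_precision_score(y_test, proba)

print("best C:", search.best_params_["clf__C"])
print(f"ROC-AUC={roc_auc:.3f} average precision={avg_precision:.3f}")
```

Average precision summarizes the Precision-Recall curve in a single number and tends to be more informative than ROC-AUC when the positive class is rare.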
Arzu Selda Avcı
Computer Engineering, Final Year
Data Science & AI Enthusiast