Stroke Prediction

Description

Data science and health care are increasingly intertwined. This project offers a glimpse of how well machine learning models can predict stroke risk. The files in this repo contain the work, modules, and report that walk through the data science pipeline, resulting in a classification model that predicts stroke risk with 74% recall on the test set.

Goals

The project aims to create a model that identifies individuals at high risk of stroke based on their health and demographic data.

Initial Questions

What does stroke look like in the dataset?
Is there a relationship between stroke and age?
Is there a relationship between stroke and gender?
Is there a relationship between blood sugar level and stroke?

Plan

Acquire data
Prepare, clean, & split data
Explore the data to find drivers and answer initial questions
Create a model
Evaluate
Conclude with recommendations and next steps

Data Dictionary

Feature	Definition
id	unique identifier
gender	"Male", "Female" or "Other"
age	age of the patient
hypertension	0 if the patient doesn't have hypertension, 1 if the patient has hypertension
heart_disease	0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
ever_married	"No" or "Yes"
work_type	"children", "Govt_job", "Never_worked", "Private" or "Self-employed"
Residence_type	"Rural" or "Urban"
avg_glucose_level	average glucose level in blood
bmi	body mass index
smoking_status	"formerly smoked", "never smoked", "smokes" or "Unknown"
stroke	1 if the patient had a stroke or 0 if not
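
The data dictionary can also be expressed as a quick sanity check on incoming records. The sketch below is illustrative only and is not part of the repo's modules: `validate_row`, `ALLOWED`, and `NUMERIC` are hypothetical names, and the `"Govt_job"` spelling is assumed to match the value used in the Kaggle dataset.

```python
# Allowed values per column, taken from the data dictionary.
# "Govt_job" is an assumed spelling; adjust it if your copy of the data differs.
ALLOWED = {
    "gender": {"Male", "Female", "Other"},
    "hypertension": {0, 1},
    "heart_disease": {0, 1},
    "ever_married": {"No", "Yes"},
    "work_type": {"children", "Govt_job", "Never_worked", "Private", "Self-employed"},
    "Residence_type": {"Rural", "Urban"},
    "smoking_status": {"formerly smoked", "never smoked", "smokes", "Unknown"},
    "stroke": {0, 1},
}
NUMERIC = ("age", "avg_glucose_level", "bmi")


def validate_row(row):
    """Return a list of problems found in one record (an empty list means OK)."""
    problems = []
    for column, allowed in ALLOWED.items():
        if row.get(column) not in allowed:
            problems.append(f"{column}: unexpected value {row.get(column)!r}")
    for column in NUMERIC:
        value = row.get(column)
        # bool is a subclass of int, so it is rejected explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{column}: expected a number, got {value!r}")
    return problems


example = {
    "id": 1, "gender": "Female", "age": 61, "hypertension": 0,
    "heart_disease": 0, "ever_married": "Yes", "work_type": "Self-employed",
    "Residence_type": "Rural", "avg_glucose_level": 202.21, "bmi": 28.1,
    "smoking_status": "never smoked", "stroke": 1,
}
print(validate_row(example))
```

The example record is made up for illustration. A check like this would flag any non-numeric entries in the numeric columns, which is one way to spot values that need cleaning during the prepare step.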

Steps to Reproduce

Clone this repo
Go to https://kaggle.com/me/account (sign in if required).
Scroll down to the "API" section and click "Create New API Token". This downloads a file named kaggle.json
Move the kaggle.json file to the same directory/folder as the 'final_report.ipynb' notebook
Run the 'final_report.ipynb' notebook
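
Before running the notebook, it can help to confirm that the token file is where the steps above put it. This is a minimal sketch, not something shipped in the repo: `check_kaggle_token` is a hypothetical helper, and it only assumes that kaggle.json is a JSON object with `username` and `key` fields, which is the layout of the token Kaggle generates.

```python
import json
from pathlib import Path


def check_kaggle_token(directory="."):
    """Return True if kaggle.json exists in `directory` and has the expected keys."""
    token_path = Path(directory) / "kaggle.json"
    if not token_path.is_file():
        return False
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return False
    # A Kaggle API token is a JSON object with "username" and "key" entries.
    return isinstance(token, dict) and {"username", "key"} <= token.keys()


if __name__ == "__main__":
    if check_kaggle_token():
        print("kaggle.json found and looks valid")
    else:
        print("kaggle.json is missing or malformed; repeat the token steps")
```

Run it from the directory that contains 'final_report.ipynb'. The check does not contact Kaggle, so a passing result only means the file is present and well-formed.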

Takeaways

Strokes represented roughly 5% of the data, which influenced the decision to oversample to accommodate the imbalanced dataset
Demographically, only age had a significant relationship to stroke; the hypothesis that gender and stroke are independent could not be rejected
Average blood sugar level was found to have a statistically significant relationship to stroke
On the test set, the VotingClassifier model achieved 74% recall and 66% accuracy.
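
To make the class-imbalance and recall points concrete, here is a small standard-library sketch. It is not the code used in this project: the report does not say which oversampling technique was applied, so plain random oversampling is used here as a stand-in, and `random_oversample` and `recall_and_accuracy` are hypothetical helper names. The toy data (5 positives in 100 rows) only mirrors the roughly 5% stroke rate.

```python
import random


def random_oversample(rows, label_key="stroke", seed=0):
    """Duplicate minority-class rows at random until both classes are the same size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted((positives, negatives), key=len)
    if not minority:
        raise ValueError("need at least one row of each class")
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def recall_and_accuracy(y_true, y_pred):
    """Recall = TP / (TP + FN); accuracy = correct predictions / all predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return recall, correct / len(y_true)


# Toy labels: 5 strokes among 100 patients.
y_true = [1] * 5 + [0] * 95

# A model that never predicts stroke looks accurate but finds no one at risk.
print(recall_and_accuracy(y_true, [0] * 100))  # (0.0, 0.95)

# Oversampling balances the classes in the (training) data.
rows = [{"stroke": label} for label in y_true]
balanced = random_oversample(rows)
print(sum(r["stroke"] for r in balanced), len(balanced))  # 95 190
```

The first result shows why accuracy alone can mislead on data like this, and why a screening model might reasonably trade some accuracy for higher recall. Oversampling is normally applied to the training split only, so that validation and test sets keep the original class balance.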

Recommendations

Acquire more health and demographic data
Increase the robustness of the smoking_status data
Use this model as a preliminary screening tool to assess stroke risk

