- INTRODUCTION
- DATA SCIENCE
- BENEFITS OF DATA SCIENCE
- BIG DATA
- CHALLENGES OF BIG DATA
- STROKE
- LITERATURE REVIEW
- STATISTICAL ANALYSIS OF STROKE
- RACE AND ETHNICITY STATISTICS
- AGE STATISTICS
- AIM AND OBJECTIVES
- STATEMENT OF PROBLEM
- METHODOLOGY
- EXPLORATORY DATA ANALYSIS
- DATASET FEATURES AND VARIABLE TYPES
- DATA PRE-PROCESSING
- DATA VISUALIZATION
- APPLICATION OF MACHINE LEARNING ALGORITHM FOR PREDICTION
- MODEL SELECTION AND APPLICATION
- XGBOOST PERFORMANCE AND PROJECT CHALLENGES
- LIMITATIONS AND CHALLENGES
- RECOMMENDATIONS
- CONCLUSION
Data science can be defined as the process by which data informatics, statistics, algorithms, data analysis, and other related methods are unified to analyse, interpret, and understand data (Hayashi, 1998). It draws on several fields, such as statistics, scientific methods, data analysis, and artificial intelligence, to extract value from data, and it overlaps closely with artificial intelligence.
Data science has become one of the most in-demand skills in every sector, particularly healthcare, where its application enables high-quality analysis and interpretation of data. Its adoption in the health sector is growing fast, and it is beneficial in the following ways:
- Data science applications make it possible to detect the symptoms of disease at a very early stage.
- Doctors can now monitor patients' health remotely, thanks to the technologies and tools developed through data science.
- Data science has provided a deeper understanding of genetic disorders.
- Pharmaceutical companies use data science to support drug discovery and manufacturing.
- Data science is widely used in medical imaging, for example MRI, X-ray, and ultrasound.
Big data, according to McKinsey, refers to datasets so large that they cannot be captured, stored, managed, or analysed by typical database software tools (Manyika, 2011). Big data is also commonly characterised by the 5 V's: volume, velocity, variety, veracity, and value.
- Capturing clean, complete, accurate, and well-formatted data is a major challenge.
- Storage capacity is often insufficient for such large datasets.
- Devices must be secured against hacking and malware.
A stroke is caused by bleeding or a blood clot in the brain, which can produce permanent damage with serious effects on mobility, cognition, sight, or communication. Stroke is a medical emergency that can lead to long-term neurological damage and even death. Strokes are classified into two types:
- Ischemic (embolic)
- Haemorrhagic
- An ischemic embolic stroke happens when a blood clot from the heart travels through the bloodstream into the narrower arteries of the brain.
- A haemorrhagic stroke happens when an arterial blood vessel in the brain ruptures or leaks. Some symptoms occur before a stroke, such as sudden numbness on one side or part of the body, sudden confusion, and difficulty speaking (Centers for Disease Control and Prevention, 2020). Often, other health issues surface before a person has a stroke, but we fail to take notice of them. It is therefore essential to understand the risk factors behind stroke so that appropriate and timely treatment can prevent it.
To gain deeper insights into the "Predicting Stroke Using Machine Learning" dataset, I reviewed previous studies that utilized similar stroke datasets. This investigation allowed me to evaluate the performance of various models, ultimately guiding me in selecting the one most likely to yield optimal predictive results based on established research.
In 2020, stroke was responsible for about 1 in 6 deaths from cardiovascular disease. In the United States, someone dies of a stroke every 3.5 minutes, and someone has a stroke every 40 seconds. More than 795,000 people in the United States have a stroke every year, and ischemic strokes account for 87% of all strokes. Stroke is a leading cause of long-term disability (Tsao, 2022, p. 145), reducing mobility especially among survivors aged 65 years and above.
The stroke death rate has been declining since 2013, yet according to the Centers for Disease Control and Prevention, Black people are nearly twice as likely to have a stroke as Caucasians.
Stroke can happen at any age, but the risk increases with age; in 2014, 38% of people hospitalized for stroke were less than 65 years old (Jackson, 2019). In 2017, Chantamit-o-pas predicted stroke using deep learning with a neural network. The models used for comparison of prediction accuracy were Naïve Bayes and SVM: Naïve Bayes was used for discrete predictors, while the support vector machine was used for linear performance. The results showed that deep learning performed well in stroke detection. Another machine learning approach to stroke risk prediction was carried out by Yu Sao et al., using models such as SVM, decision trees, nearest neighbours, and a multi-layer perceptron. SVM was highly recommended because its evaluation metrics were the most accurate, and the authors concluded that machine learning models achieved strong accuracies in predicting stroke from this kind of dataset.
The goal of this project is to forecast the likelihood of a stroke using key patient variables. To achieve this, we will:
- Leverage Data Mining: Apply data mining techniques to identify patients who are at risk of stroke.
- Analyze Key Variables: Examine various patient attributes to detect individuals who have a higher propensity for developing stroke.
- Develop a Predictive Model: Create and validate a machine learning model that accurately predicts stroke risk based on the identified factors.
Stroke remains one of the leading causes of disability and death worldwide, highlighting the need for effective early detection methods. Despite the availability of comprehensive stroke datasets, current approaches often fall short in accurately predicting stroke risk due to the complex interplay of various patient factors. This project addresses this gap by developing a machine learning model that utilizes critical features from a stroke dataset to predict the likelihood of stroke occurrence. The aim is to provide healthcare professionals with a reliable, data-driven tool for early identification of high-risk patients, ultimately improving preventative care and patient outcomes.
In this section, we discuss the methods used, including the source of the dataset, its attributes, the data pre-processing steps, the algorithms, and the evaluation metrics.
The dataset was extracted from https://www.kaggle.com. It contains 5,110 observations and 12 columns. The target is a discrete output, making this a classification problem: the model predicts whether or not an individual is at risk of having a stroke. Details of the dataset are shown below.
| S/No | Feature | Description | Type of variable |
|------|---------|-------------|------------------|
| 1 | Gender | Male = 2,115; Female = 2,994; Other = 1 | Independent / categorical |
| 2 | Age | The age attained by the individual, in years | Independent / ordinal |
| 3 | Hypertension | Impact of blood pressure on stroke | Independent / nominal |
| 4 | Heart disease | Status of the heart condition: 1 = has heart disease, 0 = no heart disease | Independent / categorical |
| 5 | Ever married | Effect of marital status on stroke prediction: married = 1, not married = 0 | Independent / categorical |
| 6 | Work type | Whether the nature of the job affects the tendency of having a stroke: private = 0, self-employed = 1, government job = 2, children = 3, never worked = 4 | Independent / categorical |
| 7 | Residence type | Effect of residence on stroke: rural = 0, urban = 1 (rural residents live in villages or less developed areas, while urban residents live in cities) | Independent / categorical |
| 8 | Avg glucose level | Effect of the average glucose level on stroke | Independent / continuous |
| 9 | Smoking status | Effect of smoking status on the possibility of having a stroke: formerly smoked = 0, never smoked = 1, smokes = 2, unknown = 3 | Independent / categorical |
| 10 | Stroke | The target (y/output) variable of this project; it depends on all the other variables | Dependent / categorical |
| 11 | BMI | Effect of body mass index on stroke prediction | Independent / continuous |
Data pre-processing is a fundamental data mining technique that transforms raw, unstructured data into a clean, consistent, and understandable format. Real-world datasets often contain errors due to inconsistencies, missing values, and incomplete trends. Pre-processing addresses these issues, ensuring that the data is reliable and ready for effective analysis and predictive modeling.
Next, we check for missing values. Since datasets are often messy and incomplete, the data must be cleaned to produce a quality outcome for prediction. Missing values can be fixed in two ways:
- If removing a row does not compromise our overall outcome, we can simply drop it from the dataset.
- Alternatively, missing values can be imputed with a statistical measure of the column, such as the mean, median, or mode, and these values used to fill in the gaps.
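The two approaches above can be sketched in pandas. This is a minimal illustration on a small synthetic frame standing in for the Kaggle stroke data (the column names `age` and `bmi` match the dataset; the values are made up):

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the stroke dataset.
df = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, np.nan, 32.5, np.nan],
})

# Step 1: count missing values per column.
print(df.isnull().sum())

# Option A: drop rows containing missing values.
dropped = df.dropna()

# Option B: impute missing BMI values with the column mean.
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())
print(df["bmi"].tolist())
```

Mean imputation keeps all 5,110 observations, which matters here because stroke cases are already rare in the data.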
It is essential to encode categorical variables into numerical values. This conversion ensures that all features are in a consistent format, which is critical for accurate matching and visualization outcomes.
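A minimal sketch of this encoding step, using scikit-learn's `LabelEncoder` on two illustrative columns from the dataset (the values shown are assumptions, not the full data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
})

# Encode each categorical column into integer codes
# (LabelEncoder assigns codes in alphabetical order of the categories).
for col in ["gender", "ever_married"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```

One-hot encoding (`pd.get_dummies`) is an alternative when the categories have no natural order, at the cost of extra columns.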
When the data was rechecked for missing values, some gender values were also found to be missing.
The ID feature is not needed for predicting stroke, so it was dropped from the dataset.
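Dropping the identifier column is a one-liner in pandas (shown here on a tiny stand-in frame; `id` is the column name used in the Kaggle dataset):

```python
import pandas as pd

df = pd.DataFrame({"id": [9046, 51676], "age": [67, 61], "stroke": [1, 1]})

# The id column carries no predictive signal, so drop it.
df = df.drop(columns=["id"])
print(list(df.columns))
```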
After addressing missing values and encoding categorical data into numerical values, we can proceed to visualize the key variables that contribute to stroke prediction. By comparing these features with the target variable (Stroke), we can uncover meaningful patterns and relationships. The visualizations below, along with their brief explanations, illustrate these comparisons and support our predictive analysis.
A count plot of work type against stroke incidence shows that the largest number of stroke cases occurs among individuals in private employment, followed by the self-employed. Government employees and children show counts similar to the self-employed group, while individuals who have never worked show a comparatively lower count.
The observed minimum and maximum values of 55.12 and 271.74 reveal a wide spread in this column's data, indicating the need for standardization; such large variation could otherwise reduce the accuracy of the predictions.
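Standardization rescales a column to zero mean and unit variance. A minimal sketch with scikit-learn's `StandardScaler`, using a few illustrative glucose values spanning roughly the 55.12–271.74 range mentioned above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values standing in for the avg_glucose_level column.
glucose = np.array([[55.12], [105.92], [171.23], [271.74]])

scaler = StandardScaler()
scaled = scaler.fit_transform(glucose)

# After standardization the column has mean 0 and standard deviation 1.
print(scaled.mean(), scaled.std())
```

In practice the scaler is fit on the training split only and then applied to the test split, to avoid information leakage.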
The visualization below shows that stroke risk increases with age, with the highest risk around age 80.
The BMI variable exhibits high positive skewness, indicating significant asymmetry in its distribution. This suggests that normalizing the dataset is necessary to improve the accuracy of our predictions.
The average glucose level distribution also exhibits a positive skew, and the column needs to be normalized to ensure accurate and reliable predictions.
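The skewness described above can be measured and reduced numerically. This sketch computes the Fisher-Pearson skewness coefficient on synthetic, positively skewed values standing in for `avg_glucose_level` (the lognormal parameters are assumptions, not fitted to the real data), then applies a log transform:

```python
import numpy as np

def sample_skewness(x):
    """Fisher-Pearson coefficient of skewness: E[(x - mean)^3] / std^3."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return (centred ** 3).mean() / (x.std() ** 3)

# Synthetic positively skewed values standing in for avg_glucose_level.
rng = np.random.default_rng(0)
glucose_like = rng.lognormal(mean=4.5, sigma=0.4, size=1000)

print(sample_skewness(glucose_like))            # clearly positive
print(sample_skewness(np.log1p(glucose_like)))  # much closer to zero
```

A log (or similar) transform pulls in the long right tail, which tends to help models that assume roughly symmetric feature distributions.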
Heatmaps provide a visual representation of data using color gradients, simplifying the understanding of complex datasets. Correlation maps, specifically using Pearson's method, can be displayed as heatmaps to effectively visualize and analyze correlations within the data.
The visualization reveals correlations between several variables. Specifically, "work type" and "BMI" exhibit a negative correlation, while "stroke" and "age" show a positive correlation. Other variables also demonstrate varying degrees and directions of correlation.
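The input to such a heatmap is simply the Pearson correlation matrix. A minimal sketch with pandas, on a few illustrative rows standing in for the stroke data (the values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [67, 61, 80, 49, 79, 81],
    "avg_glucose_level": [228.69, 202.21, 105.92, 171.23, 174.12, 186.21],
    "stroke": [1, 1, 1, 1, 1, 0],
})

# Pearson correlation matrix; seaborn's sns.heatmap(corr, annot=True)
# would render this as the colour-coded grid described above.
corr = df.corr(method="pearson")
print(corr)
```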
The scatterplot indicates a positive correlation between age and glucose level, suggesting that glucose levels tend to increase with age.
As observed in the visualizations, the target variable exhibits a significant class imbalance. This imbalance can negatively impact model performance, leading to biased predictions. To mitigate this issue, the Synthetic Minority Over-sampling Technique (SMOTE) will be employed. SMOTE addresses class imbalance by generating synthetic samples for the minority class. The algorithm randomly selects a point from the minority class and identifies its K-nearest neighbors. Synthetic samples are then created along the line segments joining the selected point and its neighbors. The following code demonstrates the implementation of SMOTE to balance the dataset.
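The SMOTE idea described above can be sketched as follows. This is a deliberately simplified, numpy-only illustration of the neighbour-interpolation step; a real project would use `SMOTE().fit_resample(X, y)` from the imbalanced-learn library instead:

```python
import numpy as np

def smote_sketch(X_minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority samples along segments to nearest neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))
        point = X_minority[i]
        # Distances to every minority point; take the k nearest, excluding self.
        dists = np.linalg.norm(X_minority - point, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        neighbour = X_minority[rng.choice(neighbours)]
        # The new sample lies somewhere on the segment between the two points.
        gap = rng.random()
        synthetic.append(point + gap * (neighbour - point))
    return np.array(synthetic)

# Toy 2-D minority-class points (illustrative values only).
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.4]])
new_samples = smote_sketch(minority, n_synthetic=4)
print(new_samples.shape)
```

Because each synthetic point is a convex combination of two real minority points, the new samples stay inside the region the minority class already occupies.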
The dataset is partitioned into training and testing sets to evaluate model performance and ensure generalization to unseen data. The feature matrix (independent variables) is split into X_train (training set) and X_test (testing set). Correspondingly, the dependent variable is divided into y_train (training labels) and y_test (testing labels). This separation allows the model to learn patterns from the training data (X_train, y_train) and then assess its predictive capabilities on the held-out test data (X_test, y_test). This process is crucial for assessing model accuracy and preventing overfitting. The code for splitting the data, including the specified percentages for the training and testing sets, is shown below.
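A minimal sketch of this split with scikit-learn's `train_test_split`, on a toy feature matrix (the 80/20 ratio shown is a common default and an assumption here, not necessarily the exact percentages used in the project):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and binary labels standing in for the stroke data.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# 80% training, 20% testing; stratify keeps the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

`stratify=y` is worth keeping for a rare-event target like stroke, so the test set is not left with almost no positive cases.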
For stroke prediction, two classification models, Logistic Regression and XGBoost, were selected for comparative analysis to determine which provides higher accuracy and better predictive performance.
Logistic Regression is a suitable model for predicting binary outcomes, where the dependent variable has two possible values (e.g., presence or absence of stroke). It estimates the probability of the outcome based on a set of independent variables.
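A minimal sketch of fitting and querying such a model with scikit-learn, on a toy one-dimensional feature (standing in for age; the values are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy single feature (e.g. age) with a clear decision boundary.
X = np.array([[25], [30], [35], [40], [65], [70], [75], [80]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict returns the class; predict_proba returns the estimated probability.
print(model.predict([[28], [78]]))
print(model.predict_proba([[78]])[0, 1])
```

The probability output is what makes logistic regression attractive clinically: a threshold other than 0.5 can be chosen to trade sensitivity against specificity.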
The model achieved an F1-score of 83%. Furthermore, an Area Under the Curve (AUC) score exceeding 80% suggests good predictive capability.
XGBoost is a versatile and powerful machine learning model suitable for both classification and regression tasks. It is known for its effectiveness and often achieves high performance.
XGBoost demonstrated superior performance, as evidenced by its F1-score and ROC curve. With an F1-score of approximately 95%, XGBoost significantly outperformed Logistic Regression, making it the preferred model for this prediction task.
Model performance was compared using the F1-score, on which XGBoost scored highest at approximately 95%.
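The evaluation workflow can be sketched as below. Since the xgboost package may not be available everywhere, scikit-learn's `GradientBoostingClassifier` is used here as a stand-in for `xgboost.XGBClassifier`; the synthetic imbalanced data and all parameters are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary data standing in for the stroke dataset.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Gradient boosting, used here as a stand-in for xgboost.XGBClassifier.
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# F1-score summarises precision and recall; the confusion matrix shows
# exactly where the errors fall, which matters for an imbalanced target.
print(f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

On imbalanced data the confusion matrix is the more honest view: a model can reach high accuracy while missing most of the rare stroke cases.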
This project encountered several challenges:
- Time Constraints: The limited timeframe restricted the extent of data exploration and model development.
- Missing Data: A significant amount of missing data in the BMI column posed a challenge, which was addressed using mean imputation.
- Class Imbalance: The imbalanced target variable necessitated the use of SMOTE to improve model performance.
- Tableau Connectivity and Display Issues: Difficulties connecting the dataset to Tableau and subsequent display problems, including unexpected characters and file generation, hindered data visualization efforts.
For future stroke prediction projects, it is recommended to explore a wider range of models to optimize predictive accuracy. Longer project timelines are essential, and relying on multiple evaluation metrics is crucial for a comprehensive assessment of model performance.
This stroke dataset prediction project involved several key steps: data preprocessing, feature selection (removing the ID column), handling missing values (mean imputation), standardization, and addressing the class imbalance using SMOTE. XGBoost achieved the highest performance, with an F1-score of approximately 95%. Given the initial class imbalance, a confusion matrix was used to provide a more detailed understanding of the model's accuracy, further validating XGBoost's performance. Hyperparameter tuning was employed to optimize the XGBoost model and enhance its predictive capabilities. Data visualization was performed in both Python and Tableau, utilizing various dependent and independent variables to gain insights into the dataset and validate initial assumptions.