Problem definition: Diabetes prediction from PIMA dataset
Objective: Explore and analyze the dataset, then build models to predict diabetes. Multiple models will be compared and the best-performing one selected for prediction. The focus will be mainly on DecisionTree, RandomForest, XGBoost and LogisticRegression (referred to below as the Logistic Classifier).
Skills: Data extraction, EDA, visualization, transformation, model building and evaluation.
Tools: Pandas, Scikit-learn, Matplotlib, Seaborn and NumPy
About the Dataset: The objective of the dataset is to determine whether or not a person has diabetes, based on the attributes provided. All records are from females at least 21 years old of Pima Indian heritage. The attributes of this dataset are:
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1); 268 of 768 are 1, the others are 0
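
The class counts quoted for Outcome imply a moderately imbalanced target. As a small worked calculation using only those two numbers (the variable names are illustrative), the positive rate and the accuracy of a trivial "always predict 0" model can be checked like this:

```python
# Class counts taken from the Outcome description above.
total = 768
positives = 268
negatives = total - positives  # 500 records with Outcome = 0

# Share of records labeled diabetic (Outcome = 1).
positive_rate = positives / total

# Accuracy of a trivial model that always predicts the majority class (0).
majority_baseline = negatives / total

print(f"positive rate: {positive_rate:.3f}")              # 0.349
print(f"majority-class baseline: {majority_baseline:.3f}")  # 0.651
```

Any model whose accuracy is close to the majority-class baseline is not learning much, which is one reason accuracy alone is a weak yardstick for this dataset.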
- Data Extraction
- Exploratory Data Analysis
- Model Building
- Model Evaluation
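
The four steps above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the notebook's actual code: the data is a synthetic stand-in generated with `make_classification` (same shape and rough class ratio as the PIMA data), the hyperparameters are arbitrary, and XGBoost is left out so the sketch depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Data extraction (assumption): synthetic stand-in with 768 rows, 8 features
# and roughly 35% positives, instead of the real PIMA CSV file.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=0,
    weights=[0.65, 0.35], random_state=42,
)

# Hold out a stratified test set so both classes keep their proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Model building: hyperparameters here are illustrative, not tuned.
models = {
    "DecisionTree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Model evaluation: accuracy, precision and recall on the held-out set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Because the data is synthetic, the printed scores say nothing about the real dataset; the sketch only shows how the steps connect.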
- Gathering a high-level understanding of the factors affecting diabetes.
- Reading and evaluating the columns and their correlation with the outcome to be predicted.
- Highlighting the best ways to assess the model predictions, and which metrics matter most for this particular use case.
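
To make the assessment metrics concrete, here is a small sketch that computes accuracy, precision and recall directly from true and predicted labels. The function name and the label lists are hypothetical and chosen only for illustration; they are not results from the dataset.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = diabetic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # diabetic, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # diabetic, missed

    accuracy = (tp + tn) / len(pairs)
    # Precision: of the cases flagged as diabetic, how many really are.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly diabetic cases, how many were caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)  # {'accuracy': 0.8, 'precision': 0.75, 'recall': 0.75}
```

Recall is the metric that drops when a diabetic patient is missed (a false negative), while precision drops when a healthy patient is flagged (a false positive).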
Judging by accuracy, precision and recall, Decision Tree and Logistic Classifier are a better fit for this dataset than Random Forest and XGBoost. One possible reason is that the dataset is quite small, so simpler models may perform better than ensemble methods.
Comparing Decision Tree and Logistic Classifier further, Decision Tree is the better choice: although the Logistic Classifier has the higher accuracy, the Decision Tree's recall is 74%, which is 10% higher than that of the Logistic Classifier.
In a healthcare setting such as diabetes detection, missing a positive case (a false negative) is more dangerous than raising a false alarm, so high recall is prioritized. Hence we conclude that Decision Tree is the best model for this particular use case.