DiabetesDetection-ML

Description

Problem definition: Diabetes prediction from the PIMA dataset

Objective: Explore and analyze the dataset, then build models to predict diabetes. Multiple models are compared and the best-performing one is selected. The focus is mainly on Decision Tree, Random Forest, XGBoost and Logistic Classifier (logistic regression).

Skills: Data extraction, EDA, visualization, transformation, model building and evaluation.

Tools: Pandas, Scikit-learn, Matplotlib, Seaborn and Numpy

About

About the Dataset: The objective of the dataset is to determine whether a person has diabetes based on the attributes provided. All records are from females at least 21 years old of Pima Indian heritage. The attributes of this dataset are:

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1); 268 of 768 are 1, the others are 0.
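
The class counts in the Outcome row imply a moderately imbalanced target. As a quick arithmetic check that uses only those stated figures, the positive rate and the accuracy of a trivial model that always predicts 0 can be worked out as below. The variable names are illustrative and not taken from the notebooks.

```python
# Class counts taken from the Outcome description above.
total = 768
positives = 268
negatives = total - positives  # 500

positive_rate = positives / total
# A model that always predicts 0 is correct on every negative case
# and misses every diabetic case (recall of 0).
majority_baseline_accuracy = negatives / total

print(f"Positive rate: {positive_rate:.1%}")                           # 34.9%
print(f"Always-predict-0 accuracy: {majority_baseline_accuracy:.1%}")  # 65.1%
```

Any model worth keeping should beat this baseline accuracy, and accuracy alone says nothing about how many diabetic cases are missed.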

Topics covered

Data Extraction
Exploratory Data Analysis
Model Building
Model Evaluation
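
The model building and evaluation steps might look roughly like the sketch below. It is not the code from the notebooks: it assumes a synthetic stand-in dataset from `make_classification` (768 rows, 8 features, roughly 35% positives) instead of the real CSV, the hyperparameters are arbitrary, and XGBoost is left out so the example depends only on Scikit-learn. If the `xgboost` package is installed, an `XGBClassifier` could be added to the `models` dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic data shaped like the PIMA dataset, not the real records.
X, y = make_classification(
    n_samples=768,
    n_features=8,
    n_informative=5,
    n_redundant=1,
    weights=[0.65, 0.35],
    random_state=42,
)

# Stratify so the train and test sets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameters here are illustrative, not tuned.
models = {
    "DecisionTree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

The numbers printed for the synthetic data say nothing about the real dataset; the point is only the shape of the comparison loop.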

Challenges

Gathering a high-level understanding of the factors affecting diabetes.
Reading and evaluating the columns and their correlation plot against the outcome to be predicted.
Identifying the best ways to assess model predictions: which metrics matter most for this particular use case?

Conclusion

Based on the accuracy, precision and recall from the model evaluation, we can conclude that for this dataset Decision Tree and Logistic Classifier are a better fit than Random Forest and XGBoost. This is possibly because the dataset is fairly small, so simpler models perform better than ensemble methods.

Between Decision Tree and Logistic Classifier, Decision Tree is the better choice: although the Logistic Classifier has a higher accuracy, the Decision Tree's recall is 74%, which is 10% higher than that of the Logistic Classifier.

In healthcare applications such as diabetes detection, missing a positive case (a false negative) is more dangerous than raising a false alarm, so high recall is prioritized. Hence we conclude that Decision Tree provides the best model for this particular use case.
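
To make the accuracy versus recall trade-off concrete, the sketch below computes the three metrics from confusion-matrix counts. The counts are hypothetical, chosen only to illustrate how a model with lower accuracy can still miss fewer diabetic cases; they are not the results from this project, and the function name is an assumption for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that the model caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# HYPOTHETICAL counts: 100 test cases, 35 of them diabetic.
model_a = classification_metrics(tp=26, fp=14, fn=9, tn=51)   # misses 9 cases
model_b = classification_metrics(tp=22, fp=6, fn=13, tn=59)   # misses 13 cases

for name, scores in (("Model A", model_a), ("Model B", model_b)):
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

In this made-up example Model B is more accurate overall, but Model A has the higher recall and leaves fewer diabetic patients undetected, which is the same reasoning used above to prefer the Decision Tree.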
