to assist in identifying features to take note of when trying to improve adoption rates

Kayden-lolasery/Animal-adoption-EDA-ML---Capstone-project

Animal-adoption-EDA-ML---Capstone-project

Problem statement

  • Shelters need to maximise adoption rates so that they are not completely dependent on government funding, and to improve the quality of life of animals and pet owners.
  • The analysis is based on our label, which has three classes: Adopted, Adopted and Returned, and Not Adopted.

Goal

  • To assist shelters in identifying which features to pay attention to when trying to improve adoption rates.

EDA (exploratory data analysis): interesting points

Side note: The most notable issue with this dataset is that it has no numerical/continuous data for me to manipulate, so I must convert everything into numbers.

  • Black cats and dogs have the highest abandonment and adoption rates.
  • Cats and dogs are the animals most commonly abandoned.
  • The adoption rate is not time-sensitive.
  • The most abandoned breed is the domestic shorthair, in both cats and dogs (so much so that it can be captured in a new column of its own).

Data pre-processing

Data cleaning

Missing values

  • Missing values in some columns are meaningful: a missing deceaseddate, returneddate or identichip indicates that the animal is alive, not returned or not chipped, respectively.
  • Rows with any other missing values are dropped entirely.
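The two rules above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: only the column names deceaseddate, returneddate and identichip come from the text, while the breedname column, the flag names and the toy rows are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the three "meaningful" column names come from the text.
df = pd.DataFrame({
    "deceaseddate": [np.nan, "2019-03-01", np.nan, np.nan],
    "returneddate": [np.nan, np.nan, "2019-05-10", np.nan],
    "identichip":   ["0A1B2C", np.nan, "9Z8Y7X", "1122AA"],
    "breedname":    ["Domestic Short Hair", "Labrador", np.nan, "Beagle"],
})

meaningful = ["deceaseddate", "returneddate", "identichip"]

# A missing value in these columns carries information, so turn it into a 0/1 flag.
df["is_deceased"] = df["deceaseddate"].notna().astype(int)
df["is_returned"] = df["returneddate"].notna().astype(int)
df["is_chipped"] = df["identichip"].notna().astype(int)

# A missing value anywhere else means the whole row is dropped.
other_cols = [c for c in df.columns if c not in meaningful]
clean = df.dropna(subset=other_cols).reset_index(drop=True)

print(clean[["is_deceased", "is_returned", "is_chipped", "breedname"]])
```

Creating the flags before dropping rows keeps the information that a blank date or chip number encodes, instead of treating it as a defect.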

Data engineering

  • Data needs to be numeric for the machine to read, and these numbers should also make sense in the real world.
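As a rough sketch of what "numbers that make sense" can look like, the example below encodes a binary category as a 0/1 flag and an unordered category as one-hot columns. The column names sexname and speciesname and their values are hypothetical; the encodings actually used in the project may differ.

```python
import pandas as pd

# Hypothetical columns and categories, for illustration only.
df = pd.DataFrame({
    "sexname": ["Male", "Female", "Female"],
    "speciesname": ["Cat", "Dog", "Rabbit"],
})

# Binary category: a 0/1 flag keeps a clear real-world meaning.
df["is_male"] = (df["sexname"] == "Male").astype(int)

# Unordered category: one-hot columns avoid implying that Cat < Dog < Rabbit.
encoded = pd.get_dummies(
    df.drop(columns="sexname"), columns=["speciesname"], dtype=int
)

print(encoded)
```

Assigning arbitrary integers (Cat=0, Dog=1, Rabbit=2) would also produce numbers, but distance-based models would then read an ordering into them that does not exist.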

Feature engineering

Total time spent in shelter

  • Total time spent in the shelter is computed from the intake date and the movement date.
  • Negative durations appear on same-day records where the newer and older dates are reversed. This is easily solved with the abs() function.
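A minimal sketch of this feature with pandas is shown below. The column names intakedate and movementdate and the sample timestamps are assumptions based on the description; the output column name is also hypothetical.

```python
import pandas as pd

# Hypothetical sample; the second row has its two timestamps reversed on the same day.
df = pd.DataFrame({
    "intakedate":   ["2018-01-01 09:00", "2018-02-10 15:30", "2018-03-05 12:00"],
    "movementdate": ["2018-01-15 09:00", "2018-02-10 08:00", "2018-04-04 12:00"],
})
for col in ["intakedate", "movementdate"]:
    df[col] = pd.to_datetime(df[col])

# Difference between movement and intake; can be negative when the dates are swapped.
delta = df["movementdate"] - df["intakedate"]

# Take the absolute value, then express the duration in days.
df["time_in_shelter_days"] = delta.abs().dt.total_seconds() / 86400

print(df)
```

Taking the absolute value only makes sense when a negative duration is known to be a recording-order problem, as described above; it would hide genuine data errors otherwise.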

Adoption rate

  • Obtained by dividing the number of animals adopted so far by the total number of animals so far. The caveat is that the first few entries are very skewed and jumpy on the graph.
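One way to compute such a running adoption rate is sketched below. The adopted flag, the movementdate column and the four sample rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: adopted is 1 if the animal was adopted, else 0.
df = pd.DataFrame({
    "movementdate": pd.to_datetime(
        ["2018-01-03", "2018-01-01", "2018-01-02", "2018-01-04"]
    ),
    "adopted": [1, 1, 0, 1],
}).sort_values("movementdate").reset_index(drop=True)

# Running number of adoptions divided by running number of animals seen.
adopted_so_far = df["adopted"].cumsum()
animals_so_far = pd.Series(range(1, len(df) + 1), index=df.index)
df["adoption_rate"] = adopted_so_far / animals_so_far

print(df)
```

The first rows show why the early part of the curve is jumpy: with one or two animals in the denominator, a single record moves the rate by 50 percentage points.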

Example of features with data set errors

animalage

  • The dataset column is riddled with:
    • strings in which days, weeks and months need to be converted into years
      • regular expressions are used to achieve this
    • outlier errors that need to be corrected so the values make sense. Data at the extreme ends can affect certain ML models
      • it is better to keep the data's standard deviation relatively low
      • here I chose to replace the outlier ages (100-year-old cats/dogs) with the average maximum age of a cat/dog, as found on Google
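The conversion and the outlier replacement could look roughly like the sketch below. The text format of animalage ("2 years 6 months", "8 weeks"), the speciesname column and especially the cap values in MAX_AGE_YEARS are assumptions: the caps are placeholders, not the figures used in the project.

```python
import re
import pandas as pd

# Assumed format: "<number> <unit>" repeated, e.g. "2 years 6 months" or "8 weeks".
DAYS_PER_UNIT = {"year": 365.25, "month": 30.4375, "week": 7.0, "day": 1.0}
AGE_PART = re.compile(r"(\d+)\s*(year|month|week|day)s?", re.IGNORECASE)

def age_to_years(text):
    """Convert an age string into a float number of years (NaN if unparseable)."""
    parts = AGE_PART.findall(str(text))
    if not parts:
        return float("nan")
    days = sum(int(n) * DAYS_PER_UNIT[unit.lower()] for n, unit in parts)
    return days / 365.25

# Placeholder caps per species; NOT the researched values from the project.
MAX_AGE_YEARS = {"Cat": 20.0, "Dog": 15.0}

df = pd.DataFrame({
    "speciesname": ["Cat", "Dog", "Cat"],
    "animalage": ["2 years 6 months", "8 weeks", "100 years 0 months"],
})
df["age_years"] = df["animalage"].map(age_to_years)

# Replace ages above the species cap with the cap; species without a cap are left as-is.
caps = df["speciesname"].map(MAX_AGE_YEARS)
df["age_years"] = df["age_years"].mask(df["age_years"] > caps, caps)

print(df)
```

Replacing the outlier with a cap rather than dropping the row keeps the rest of that animal's record available for training.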

Before and after data-preprocessing


Label manipulation before training models

  • SMOTETomek was chosen: after SMOTE oversampling, it removes Tomek-link nearest-neighbour pairs so that the resampled data is not too cluttered.

ML models used

Various machine learning algorithms and libraries were used to find the best model.
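Comparing candidate models can be sketched with scikit-learn cross-validation as below. The candidate list (a decision tree and an SVM), the macro F1 scoring and the synthetic data are assumptions; the project may have compared more models or used other metrics.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the pre-processed dataset.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Hypothetical shortlist of candidates; the SVM is scaled because it is distance-based.
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC(random_state=0)),
}

# Mean macro-averaged F1 over 5 folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)

print(scores, "best:", best)
```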

ML EDA

  • MIP plot: the top 4 features that affected the model the most were:
    • returned reason
    • time spent in shelter
    • chipped or not
    • animal's age
  • Ultimately, the support vector machine (SVM) model performed best:
    • 100% recall on the Adopted and Not Adopted labels
    • 100% precision on the Adopted and Returned label
    • ROC AUC, one-vs-one (OvO; shown as micro on the graph): 98%
    • Log loss is very low at 0.119; for comparison, a log loss of 1.1 is what I get when I randomly choose and label the data without any model.
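The metrics listed above can be computed with scikit-learn along the lines of the sketch below, which uses synthetic three-class data and an assumed SVC pipeline rather than the project's trained model. It also shows where a baseline of about 1.1 comes from: a predictor that assigns probability 1/3 to each of three classes has a log loss of ln 3 ≈ 1.0986.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for the real features and label.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# probability=True lets the SVM output class probabilities for AUC and log loss.
model = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)

# Per-class precision and recall, one-vs-one ROC AUC, and log loss.
report = classification_report(y_test, model.predict(X_test))
auc_ovo = roc_auc_score(y_test, proba, multi_class="ovo")
model_loss = log_loss(y_test, proba, labels=[0, 1, 2])

# Baseline: always predict 1/3 for each class, which gives ln(3).
uniform = np.full((len(y_test), 3), 1 / 3)
baseline_loss = log_loss(y_test, uniform, labels=[0, 1, 2])

print(report)
print("ROC AUC (OvO):", auc_ovo)
print("log loss:", model_loss, "uniform baseline:", baseline_loss)
```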
