EDA: Some insights

EDA checklist explained

  • Q1. What question are you trying to answer (or which hypothesis are you trying to disprove)? [Start with the simplest hypothesis.]
  • Q2. What kind of data do you have? [Numerical, categorical, or other, and how should each type be handled?]
  • (Understanding the data)
  • Q3. What is missing from the data, and how will you deal with it? [Fill with the average or another chosen value, drop the entire column if it is not important, etc.]
  • (Missing Values)
  • Q4. What are the potential outliers, and why should you pay attention to them? [What are they? Do we need them? Are they hurting the model?]
  • (Plot the distribution of features.)
  • Q5. How can you add, remove, or change features to get more out of the data? [Rule of thumb: more data is usually better.]
  • (Feature Engineering. This also includes converting categorical data to numerical data.)
    1. Feature contribution: a way to figure out how much each feature influences the model.
    2. Relationships between variables and correlation between features.
    3. The matplotlib and seaborn libraries.
    4. Histograms and scatter plots. A histogram shows the distribution of a single variable; a scatter plot shows the relationship between two or more variables. (Seaborn's sns.distplot draws a histogram with a density curve on top; it is deprecated in recent seaborn releases in favor of sns.histplot and sns.displot.)
    5. Heatmap (in the seaborn library): drawn over a correlation matrix, it shows the numerical correlation value for each pair of variables.
  • Principal Component Analysis (PCA) reduces the number of features. Plotting the variance explained by each component gives an idea of how many components are really needed to represent the dataset. - Ananya
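
The checklist above can be sketched end to end on a toy dataset. This is only an illustration under stated assumptions: the column names (`height_cm`, `weight_kg`, `age`, `city`) and all values are made up, median filling and the 1.5 × IQR rule are just one common choice each for Q3 and Q4, and the explained variance is computed with a plain NumPy SVD rather than a dedicated PCA library.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: every column name and value here is invented.
rng = np.random.default_rng(0)
n = 200
height = rng.normal(170, 10, n)
df = pd.DataFrame({
    "height_cm": height,
    "weight_kg": 0.9 * height - 85 + rng.normal(0, 5, n),
    "age": rng.integers(18, 65, n).astype(float),
    "city": rng.choice(["A", "B", "C"], n),
})
df.loc[df.index % 25 == 0, "age"] = np.nan  # inject some missing values
df.loc[3, "weight_kg"] = 400.0              # inject one obvious outlier

# Q2: what kind of data do we have?
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Q3: what is missing? Count the gaps, then fill them with the median.
missing_counts = df.isna().sum()
df["age"] = df["age"].fillna(df["age"].median())

# Q4: flag potential outliers with the 1.5 * IQR rule (one common heuristic).
q1, q3 = df["weight_kg"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["weight_kg"] < q1 - 1.5 * iqr) | (df["weight_kg"] > q3 + 1.5 * iqr)
clean = df[~outlier_mask].copy()

# Q5: feature engineering, here one-hot encoding the categorical column.
encoded = pd.get_dummies(clean, columns=categorical_cols, dtype=float)

# Correlation matrix of the numeric features (the input a heatmap would show).
corr = clean[numeric_cols].corr()

# PCA via SVD on standardized features: share of variance per component.
X = clean[numeric_cols].to_numpy()
X = (X - X.mean(axis=0)) / X.std(axis=0)
_, singular_values, _ = np.linalg.svd(X, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

print(missing_counts)
print(corr.round(2))
print(cumulative.round(3))
```

To visualize the result, `corr` could be passed to seaborn's `sns.heatmap(corr, annot=True)`, and `cumulative` plotted with matplotlib to see how many components are needed to reach a chosen share of the variance. With scikit-learn installed, `PCA().fit(X).explained_variance_ratio_` should give the same ratios as the SVD step.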