Exploratory data analysis is basically to analyse the data, aggregate the findings and present it to stakeholders. If the findings are convincing enough then the next step would be to develop a model around the data etc. Exploratory data analysis involves the following steps:
- Data Cleaning - Here we may have to perform Null value analysis and apply mean, mode and median. For numerical data we have to use mean, median and for categorical variable we have to use mode. Post null value imputation we have to check the data for invalid data and check for outliers.
- Data Analysis - For categorical variables we can use univariate, bivariate and multivariate analysis. For numerical values, with one attribute we can get some summary analysis and for more than one attribute we can run correlation analysis between attributes.
- Derived Metrics - Create new columns for example Age as age is a continuous variable hence it is difficult to figure out any details so better to create bins for example 0-20, 21-40 and so on.
- Load all the required python libaries such as pandas, numpy, matplotlib.
- Load the data using pandas read_csv method
- View the data using pandas head and tail methods
- Use pandas method such as shape to get an overall idea about the number of columns and rows, columns.values will give an idea about all the various columns, dtypes will give an idea about data types associated with each column
- Next use the describe method to describe the data and get details such as mean, median, 25%, 50%, 75% etc and make a note of some of these findings
- Next use value_counts on dependent variable to get an idea about the percentage of success and failure
- Next use the info method to check if there are any null values in the data.
- If there are missing values then the following approach can be used
- If there are very few missing values then replace the missing values with mean, median or mode
- If the missing values are quite high then remove the complete column
- Now the next step is to analyze the values and see if anyone of them requires the data type to be changed for example changing a string column to integer
- Next step is to check if we need to convert some of the data into bins, for example some variables such as age can be grouped into bins as it may be too much to create individual bars for each age
- We can also remove some columns which actually will not add any value into our investigation for example customerId, customer name may not add any value
- This completes the Data Cleaning Process
- Now the next step is to run Univariate Analysis, basically for every other variable try to get an idea. For this we can use countplot and plot the graphs for all the categorical variables against dependent variable
- Now we can run outlier analysis to figure out what percentage of records are outlier, typically if the outlier are more than 0.3% then it means that the distribution is not a Normal distribution hence it is worth to check the distribution which can be done by plotting a KDEPlot. A KDEPlot would give details about skeweness of the data. And then we can apply startegies such as log, double log, inverse etc to reduce the skewness of the data.
- Now the next step is to run numerical analysis for variables which are numerical in nature, for numerical analysis we also have to make couple of changes
- We have to transform dependent variable into numerical
- Apply one-hot encoding to create dummy variables
- Now we can plot a correlation bar plot between dependent variable and all the predictor or independent variables. This will give us lot more insight about positive as well negative cases for example why a customer is churning and why a customer is not churning.
- Next we can plot a heatmap between all the variables, this will give us detail of correlation across all variables. please note point #16 talks about correlation between dependent and independent variable however heatmap is to get correlation details between all the variables and not just predictor and dependent variables
- Next we will apply Bivariate, trivariate and multivariate analysis to get more insights on the data for example.