
Step 2 (Data preprocessing)

Maryam Ahmadi J edited this page Dec 17, 2022 · 10 revisions

Data

One final dataset, "joined_team_batting_pitching_boxscore_diff", was extracted and passed to the Python script.

  • Handling missing values:

Mistake

At first, when calculating the rolling sum for each feature, I used the IFNULL(var_name,0) function, which changed all null values to 0. I then noticed that this introduces bias into the data, because it writes a static value into every missing cell without any solid rationale.
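The effect can be illustrated with a minimal pandas sketch. This is not the original SQL query: the series values and the window size of 3 are made up for illustration, and fillna(0) stands in for IFNULL(var_name,0).

```python
import numpy as np
import pandas as pd

# Hypothetical per-game values for one feature, with one missing game.
s = pd.Series([2.0, np.nan, 4.0, 1.0, 3.0])

# Zero-filling first (the pandas analogue of IFNULL(var_name,0)):
# every window that contains the gap still returns a number.
filled = s.fillna(0).rolling(window=3).sum()

# Leaving the gap in place: with the default min_periods (= window size),
# any window that contains the NaN stays NaN, so the gap remains visible.
kept = s.rolling(window=3).sum()

print(pd.DataFrame({"zero_filled": filled, "nan_kept": kept}))
```

In the zero-filled column, the windows that overlap the missing game look like ordinary low totals, which is the bias described above.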

Correction

I stopped using this function and instead looked into how many missing values my data actually had.

Empty fields

When checking for missing values, I noticed that some cells contained empty strings, which Python does not recognize as NULLs. Using the df.mask(df=="") method replaced these cells with NaN, which let me calculate the total number of missing values.
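A minimal sketch of this step, assuming a pandas DataFrame. The toy column names below are hypothetical and only stand in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real columns come from the joined dataset.
df = pd.DataFrame({
    "batting_hits_diff": [1.0, np.nan, 3.0],
    "starting_pitcher": ["a", "", "c"],
})

# Cells equal to "" are replaced with NaN; all other cells are unchanged.
df_clean = df.mask(df == "")

# Missing values per column, and the overall total.
missing_per_column = df_clean.isna().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print(total_missing)
```

Without the mask step, isna() would count only the NaN in the numeric column and miss the empty string.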

Total number of missing values

Two variables and their derivatives had more than 2000 missing values:

  1. pitching_go_to_ao
  2. batting_go_to_fo

These variables (columns) were removed from the analysis. The remaining 219 missing values were handled by removing the affected rows, since they made up only a small fraction of the data.
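The two removal steps can be sketched as follows. The toy data and the cutoff of 2 are assumptions chosen to fit a four-row example; on the real dataset the cutoff would be the 2000 mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one mostly-empty column and one sparse gap.
df = pd.DataFrame({
    "pitching_go_to_ao": [np.nan, np.nan, np.nan, 1.2],
    "batting_hits": [1.0, 2.0, np.nan, 4.0],
    "batting_runs": [0.0, 1.0, 2.0, 3.0],
})

max_missing = 2  # assumption for this toy example (2000 on the real data)

# Step 1: drop columns whose missing count exceeds the cutoff.
missing_counts = df.isna().sum()
cols_to_drop = missing_counts[missing_counts > max_missing].index.tolist()

# Step 2: drop the few remaining rows that still contain a missing value.
df_reduced = df.drop(columns=cols_to_drop).dropna()
print(cols_to_drop)
print(df_reduced.shape)
```

Dropping the sparse columns first matters: calling dropna() before that would discard every row in which those columns are empty.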

Changing response variable to numeric

This was done because many machine learning algorithms require numeric input and cannot handle categorical variables directly. The response is coded as 1 when the winner is the home team and 0 otherwise.
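A minimal sketch of the encoding. The page does not show how the winner is stored in the raw data, so the "winner" column and its "home"/"away" labels are assumptions, as is the name of the new column.

```python
import pandas as pd

# Hypothetical raw labels; the actual coding in the dataset may differ.
df = pd.DataFrame({"winner": ["home", "away", "home"]})

# 1 when the home team won, 0 otherwise.
df["home_team_wins"] = (df["winner"] == "home").astype(int)
print(df)
```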

Removing unnecessary variables

team_id and game_id were also removed from the dataset, as they did not provide any additional information to the model. Note: they could of course be included, but I think that converting them to categorical variables and then one-hot encoding them would introduce too many variables into the model.
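To make the concern about one-hot encoding concrete, here is a small sketch on made-up data. The ID values and the batting_hits_diff column are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with the two identifier columns.
df = pd.DataFrame({
    "game_id": [101, 102, 103, 104],
    "team_id": [7, 8, 7, 9],
    "batting_hits_diff": [1.0, -2.0, 0.0, 3.0],
})

# One-hot encoding adds one column per distinct ID value:
# 4 game IDs + 3 team IDs + 1 original feature.
encoded = pd.get_dummies(df, columns=["game_id", "team_id"])
n_encoded_columns = encoded.shape[1]

# Dropping the identifiers instead keeps only the real feature.
features = df.drop(columns=["team_id", "game_id"])
print(n_encoded_columns, list(features.columns))
```

Since game_id is unique per game, encoding it would add roughly one column per row of the dataset.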

Removing highly correlated variables

Because using highly correlated variables increases the risk of model overfitting, I used the upper triangular part of the correlation matrix to determine which variables were highly correlated, with 0.7 as the correlation threshold. The remaining variables were then analyzed further to find the best features to include in the model.
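One common way to implement this filter is sketched below. It is not necessarily the exact script used here: the toy columns are made up, and taking the absolute value of the correlations and dropping the later column of each correlated pair are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features: "b" is nearly a copy of "a", "c" is unrelated.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],
    "c": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
})

threshold = 0.7

# Absolute Pearson correlations (assumption: the sign is ignored).
corr = df.corr().abs()

# Keep only the strict upper triangle so each pair is checked once
# and the diagonal of ones is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop every column that is highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_filtered = df.drop(columns=to_drop)
print(to_drop, list(df_filtered.columns))
```

With this approach, which member of a correlated pair survives depends on column order, so the order of the columns is worth fixing before running the filter.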
