Step 2 (Data Preprocessing)
One final dataset called "joined_team_batting_pitching_boxscore_diff" was extracted and passed to the Python script.
- Handling missing values:
At first, when calculating the rolling sum for each feature, I used the IFNULL(var_name, 0) function, which replaced every NULL value with 0. I noticed, however, that this introduces bias by filling each missing cell with a static value without any solid rationale.
I decided to stop using this function and instead look at how many missing values the data actually contained.
While checking for missing values, I noticed that some cells were empty strings and therefore not recognized as NULLs by Python. Using the df.mask(df == "") method converted these cells to NaN so the total number of missing values could be counted.
Two variables and their derivatives had more than 2000 missing values:
- pitching_go_to_ao
- batting_go_to_fo

These variables (columns) were removed from the analysis. The remaining 219 rows with missing values were dropped because they made up only a small fraction of the dataset (see the sketch below).
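As a rough illustration of this step, the pandas sketch below masks empty strings as NaN, counts missing values, drops the two high-missing columns (and columns derived from them), and removes the remaining incomplete rows. The file name and the substring matching used to find derived columns are assumptions, not taken from the original pipeline.

```python
import pandas as pd

# Minimal sketch of the missing-value handling; the file name is an assumption.
df = pd.read_csv("joined_team_batting_pitching_boxscore_diff.csv")

# Empty strings are not recognized as NULLs, so convert them to NaN first.
df = df.mask(df == "")

# Count missing values per column to decide which columns to drop.
print(df.isna().sum().sort_values(ascending=False).head(10))

# Drop the two variables with >2000 missing values, plus columns derived from them
# (substring matching is an assumption about how the derived columns are named).
high_missing = [c for c in df.columns
                if "pitching_go_to_ao" in c or "batting_go_to_fo" in c]
df = df.drop(columns=high_missing)

# The remaining incomplete rows are few relative to the dataset, so drop them.
df = df.dropna()
```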
The target variable was encoded numerically (1 if the winner is the home team, 0 otherwise) because machine learning algorithms cannot handle categorical variables.
team_id and game_id were also removed from the dataset, as they did not provide any additional information to the model. [Note] They could of course be included, but converting them to categorical variables and then one-hot encoding them to numeric would introduce too many variables into the model.
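Continuing from the DataFrame `df` above, a hedged sketch of the target encoding and ID-column removal might look like the following; the "winner" column and its "home" label are illustrative assumptions about the schema, not the actual column names.

```python
# Encode the target numerically: 1 if the home team won, else 0.
# ("winner" and the "home" label are assumed names, used only for illustration.)
df["home_team_won"] = (df["winner"] == "home").astype(int)
df = df.drop(columns=["winner"])

# team_id and game_id add no predictive information, so remove them as well.
df = df.drop(columns=["team_id", "game_id"])
```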
Since using highly correlated features increases the risk of model overfitting, I used the upper triangular part of the correlation matrix to determine which variables were highly correlated, with 0.7 as the correlation threshold. The remaining variables were further analyzed to find the best features to include in the model.
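One way to implement this filter with pandas and NumPy is sketched below: take the absolute correlation matrix, keep only its upper triangle (excluding the diagonal), and drop every column whose correlation with an earlier column exceeds the 0.7 threshold. The helper name is mine, not part of the project.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Drop columns whose absolute correlation with an earlier column exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Usage on the preprocessed DataFrame from the previous steps:
# df_reduced = drop_highly_correlated(df, threshold=0.7)
```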