Step 3 (Explanatory Analyses)
My dataset spans 1944 days, so I split it into train and test sets using a train ratio of 0.8 (1555 days) and a test ratio of 0.2. I derived the split point from the minimum and maximum dates in the data; the resulting cut-off date was 2011-06-02.
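A date-based split like the one described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the row layout, and the start date are hypothetical, and it assumes one row per calendar day, which may not hold for the real game data.

```python
from datetime import date, timedelta


def date_threshold(min_date, max_date, train_ratio=0.8):
    """Return the first date that belongs to the test set."""
    n_days = (max_date - min_date).days + 1  # inclusive span in days
    train_days = int(n_days * train_ratio)   # whole days assigned to train
    return min_date + timedelta(days=train_days)


def split_by_date(rows, threshold):
    """Split (game_date, record) rows: train is strictly before the threshold."""
    train = [row for row in rows if row[0] < threshold]
    test = [row for row in rows if row[0] >= threshold]
    return train, test


# Hypothetical data: one row per day over a 1944-day span.
start = date(2007, 1, 1)  # assumed start date, for illustration only
rows = [(start + timedelta(days=i), {"day_index": i}) for i in range(1944)]

threshold = date_threshold(rows[0][0], rows[-1][0])
train, test = split_by_date(rows, threshold)
print(len(train), len(test))
```

Splitting on a date rather than shuffling rows keeps every test game later than every training game, so the model is never evaluated on games that precede its training data.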
The random forest variable importance analysis listed the following 21 features as the most significant: 'pitching_baseonball_or_walk_home_diff', 'batting_baseonball_or_walk_diff', 'batting_single_diff', 'batting_average_batting_home', 'pitching_so_to_hr_away', 'batting_w_to_sr_home', 'batting_average_batting_away', 'batting_double_diff', 'batting_w_to_sr_away', 'pitching_so_to_hr_home', 'batting_ab_to_hr_away', 'batting_ab_to_hr_home', 'batting_homerun_diff', 'pitching_homerun_diff', 'home_run_to_single_double_triple_ratio_home', 'batting_flyout_or_airout_diff', 'batting_hit_by_pitch_diff', 'pitching_flyout_or_airout_diff', 'pitching_triple_diff', 'pitching_groundout', 'batting_triple_diff'.
The table with detailed variable importance scores is saved as "random_forest_variable_importance.html".
-
These features were further investigated using mean-of-response plots, regression analysis, and brute-force tables. By weighted mean-of-response value, the away team's batting average was one of the most predictive variables; its predictor-response plot also showed a lower mean value for the winning team. The mean-of-response plot also showed a pattern in the mean response in the fifth bin of the predictor. The pitching_so_to_hr_home variable had a weighted mean-of-response value of 0.006. I included all variables with a mean-of-response value greater than 0.001.
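For reference, one common way to compute a weighted mean-of-response score is sketched below. The exact formula used in this project is not stated, so the definition here is an assumption: the predictor is cut into equal-width bins, and each bin contributes its squared difference from the overall mean response, weighted by the bin's share of the rows. The function name and the number of bins are hypothetical.

```python
def weighted_mean_of_response_diff(predictor, response, n_bins=10):
    """Population-weighted squared difference between bin means and the overall mean."""
    lo, hi = min(predictor), max(predictor)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant predictor: every row falls into bin 0

    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(predictor, response):
        idx = min(int((x - lo) / width), n_bins - 1)  # clamp the max value into the last bin
        sums[idx] += y
        counts[idx] += 1

    total = len(response)
    pop_mean = sum(response) / total

    score = 0.0
    for bin_sum, bin_count in zip(sums, counts):
        if bin_count == 0:
            continue  # empty bins carry no weight
        bin_mean = bin_sum / bin_count
        score += (bin_count / total) * (bin_mean - pop_mean) ** 2
    return score


# Toy example: the response is 0 for low predictor values and 1 for high ones.
score = weighted_mean_of_response_diff([0, 1, 2, 3], [0, 0, 1, 1], n_bins=2)
print(score)
```

Under this definition a score of zero means every bin has the same mean response as the whole dataset, so a cut-off such as 0.001 filters out predictors whose bins barely move the response.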
-
Looking at the pairwise relationship plots (brute-force values), pitching_homerun_diff and pitching_flyout_or_airout_diff appear to be fairly correlated (corr = 0.63). Their brute-force plot shows that when both variables are zero, the home team has a greater chance of winning. Features such as pitching at-bat home, pitching at-bat away, pitching hit away, and pitching single away had weighted brute-force values greater than 0.001. I included all variables with a brute-force value greater than 0.002. (All plots can be found in the output folder.)
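The pairwise checks above can be illustrated with a short sketch. It assumes the correlation is a Pearson coefficient and that the brute-force value is the two-predictor analogue of the weighted mean-of-response score (a grid of bins instead of a single row of bins); neither detail is confirmed here, and all names and data below are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def bin_index(value, lo, hi, n_bins):
    """Equal-width bin index, with the maximum value clamped into the last bin."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)


def brute_force_weighted(x1, x2, response, n_bins=5):
    """Weighted squared difference between each grid cell's mean response and the overall mean."""
    lo1, hi1 = min(x1), max(x1)
    lo2, hi2 = min(x2), max(x2)

    cells = {}  # (bin of x1, bin of x2) -> (sum of response, row count)
    for a, b, y in zip(x1, x2, response):
        key = (bin_index(a, lo1, hi1, n_bins), bin_index(b, lo2, hi2, n_bins))
        cell_sum, cell_count = cells.get(key, (0.0, 0))
        cells[key] = (cell_sum + y, cell_count + 1)

    total = len(response)
    pop_mean = sum(response) / total
    return sum(
        (count / total) * (cell_sum / count - pop_mean) ** 2
        for cell_sum, count in cells.values()
    )


# Toy example: the home team wins (1) only when both predictors are zero.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
home_win = [1, 0, 0, 0]
print(pearson(x1, x2), brute_force_weighted(x1, x2, home_win, n_bins=2))
```

In the toy data the two predictors are uncorrelated, yet their combination still separates wins from losses, which is the kind of interaction a brute-force table is meant to surface.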