Step 1 (Feature extraction)
The goal of this project is to predict whether the home team wins. I used the following tables:
- game (to grab date of each game)
- team_batting_counts (to grab batting statistics for each team)
- team_pitching_counts (to grab pitching statistics for each team)
- boxscore (to grab outcome label won/lost)
All feature extraction was done in MySQL (see batting_average.sql in the HW folder for the code). First, the game, team_pitching_counts and team_batting_counts tables were indexed to make the join queries faster. Then, the local_date column of the game table was joined onto team_pitching_counts and team_batting_counts using the game_id column. The following columns were selected for further feature building:
- team_batting_counts
  - Single
  - Double
  - Triple
  - atBat
  - Home_Run
  - Hit
  - Walk
  - Ground_Out
  - Flyout
  - Strikeout
  - Hit_by_Pitch
- team_pitching_counts
  - Single
  - Double
  - Triple
  - atBat
  - Home_Run
  - Hit
  - Walk
  - Ground_Out
  - Flyout
  - Strikeout
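The indexing and date join described above can be sketched roughly as follows. This is only an illustration, not the contents of batting_average.sql: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, the `team_id` column and the index names are assumptions, and the rows are made-up sample data.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE game (game_id INTEGER, local_date TEXT);
    CREATE TABLE team_batting_counts (
        game_id INTEGER, team_id INTEGER, Hit INTEGER, atBat INTEGER
    );
    -- Index the join key on both tables so the join can look rows up by game_id.
    CREATE INDEX idx_game_game_id ON game (game_id);
    CREATE INDEX idx_batting_game_id ON team_batting_counts (game_id);
    -- Made-up sample rows.
    INSERT INTO game VALUES (1, '2011-04-01'), (2, '2011-04-02');
    INSERT INTO team_batting_counts VALUES
        (1, 10, 8, 33), (1, 20, 5, 30), (2, 10, 11, 36);
""")

# Attach each game's local_date to the per-team batting rows via game_id.
rows = conn.execute("""
    SELECT b.game_id, b.team_id, g.local_date, b.Hit, b.atBat
    FROM team_batting_counts AS b
    JOIN game AS g ON g.game_id = b.game_id
    ORDER BY b.game_id, b.team_id
""").fetchall()

for row in rows:
    print(row)
```

The same pattern would apply to team_pitching_counts, since it is joined on the same game_id column.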
The columns above were then used to build the following features.
(Batting) (all values are rolling sums over the last 100 days)
- Single
- Double
- Triple
- atBat
- Home_Run
- Hit
- Walk
- Ground_Out
- Flyout
- Strikeout
- Hit_by_Pitch
- atBat / Home_Run
- Walk / Strikeout
- Ground_Out / Flyout
- atBat / Hit
- Home_Run / Hit
- Hit + Walk + Hit_by_Pitch
- Single + Double + Triple + Home_Run
- Home_Run / (Single + Double + Triple + Home_Run) (** this is not a standard statistic; I made it up)
(Pitching) (all values are rolling sums over the last 100 days)
- Single
- Double
- Triple
- atBat
- Home_Run
- Hit
- Walk
- Ground_Out
- Flyout
- Strikeout
- Ground_Out / Flyout
- Strikeout / Home_Run
- atBat / Home_Run
- atBat / Hit
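As a rough illustration of the rolling features listed above, here is a minimal pure-Python sketch. It rests on several assumptions that the text does not confirm: the window covers the 100 days before a game and excludes the game itself, ratio features are computed from the rolling sums, a zero denominator yields `None`, and rows carry a hypothetical `team_id` key. The sample rows are made up, and only three of the columns are shown.

```python
from datetime import date, timedelta

STATS = ("atBat", "Hit", "Home_Run")  # a subset of the columns, for brevity


def rolling_sums(rows, window_days=100):
    """Sum each stat over the same team's games in the window_days days
    before each row's date (the row's own game is excluded: an assumption)."""
    out = []
    for row in rows:
        start = row["local_date"] - timedelta(days=window_days)
        window = [
            r for r in rows
            if r["team_id"] == row["team_id"]
            and start <= r["local_date"] < row["local_date"]
        ]
        sums = {s: sum(r[s] for r in window) for s in STATS}
        out.append({"team_id": row["team_id"],
                    "local_date": row["local_date"], **sums})
    return out


def safe_ratio(num, den):
    """Return num / den, or None when the denominator is zero."""
    return num / den if den else None


# Made-up sample rows for one team.
rows = [
    {"team_id": 10, "local_date": date(2011, 4, 1), "atBat": 33, "Hit": 8, "Home_Run": 1},
    {"team_id": 10, "local_date": date(2011, 4, 2), "atBat": 36, "Hit": 11, "Home_Run": 2},
    {"team_id": 10, "local_date": date(2011, 8, 1), "atBat": 30, "Hit": 6, "Home_Run": 0},
]

features = rolling_sums(rows)
for f in features:
    # Ratio features built from the rolling sums.
    f["atBat_per_Home_Run"] = safe_ratio(f["atBat"], f["Home_Run"])
    f["Home_Run_per_Hit"] = safe_ratio(f["Home_Run"], f["Hit"])
```

The first game of a team has no history, so its sums are zero and its ratios are `None`. The same happens for the August game here, because both April games fall outside its 100-day window.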
When these two tables were joined with the boxscore table, the features were split into home and away features based on the home_away column. Each feature therefore has two values, marked with the suffix _home or _away on the variable name. In addition, the difference between the _home and _away values was calculated for each feature and marked with the suffix _diff.
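The home/away split and the _diff columns could look something like the sketch below. The `"H"` and `"A"` codes for the home_away column, the direction of the difference (home minus away), and the sample numbers are all assumptions made for illustration; the helper `home_away_features` is hypothetical and not part of the project code.

```python
def home_away_features(team_rows, feature_names):
    """Turn the two per-team rows of one game into a single row with
    _home, _away and _diff columns for every feature."""
    by_side = {r["home_away"]: r for r in team_rows}
    home, away = by_side["H"], by_side["A"]  # "H"/"A" codes are assumed
    out = {"game_id": home["game_id"]}
    for name in feature_names:
        out[name + "_home"] = home[name]
        out[name + "_away"] = away[name]
        # Assumed direction of the difference: home minus away.
        out[name + "_diff"] = home[name] - away[name]
    return out


# Made-up rolling sums for the two teams in one game.
team_rows = [
    {"game_id": 1, "home_away": "H", "Hit": 820, "Walk": 310},
    {"game_id": 1, "home_away": "A", "Hit": 790, "Walk": 335},
]

wide = home_away_features(team_rows, ["Hit", "Walk"])
print(wide)
```

Each game ends up as one row, which is convenient when the won/lost label from boxscore is attached per game.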