Step 1 (Feature extraction)

Maryam Ahmadi J edited this page Dec 17, 2022 · 9 revisions

Tables used

The goal of this project is to predict whether the home team wins. I used the following tables:

  1. game (to grab date of each game)
  2. team_batting_counts (to grab batting statistics for each team)
  3. team_pitching_counts (to grab pitching statistics for each team)
  4. boxscore (to grab outcome label won/lost)

Variables used

All feature extraction was done in MySQL (see batting_average.sql in the HW folder for the code). First, the game, team_pitching_counts, and team_batting_counts tables were indexed to make the join queries faster. Then the local_date column of the game table was joined to team_pitching_counts and team_batting_counts on the game_id column. The following columns were selected for further feature building:

  1. team_batting_counts
  • Single
  • Double
  • Triple
  • atBat
  • Home_Run
  • Hit
  • Walk
  • Ground_Out
  • Flyout
  • Strikeout
  • Hit_by_Pitch
  2. team_pitching_counts
  • Single
  • Double
  • Triple
  • atBat
  • Home_Run
  • Hit
  • Walk
  • Ground_Out
  • Flyout
  • Strikeout
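
The indexing and date join described above can be sketched as follows. This is a minimal illustration, not the code in batting_average.sql: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, the tables are cut down to a few columns, and the team_id column, the index names, and all sample values are assumptions made for the example.

```python
import sqlite3


def build_demo_db():
    """Create tiny stand-in tables (hypothetical schema and sample values)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE game (game_id INTEGER, local_date TEXT);
        CREATE TABLE team_batting_counts (
            game_id INTEGER, team_id INTEGER, Hit INTEGER, atBat INTEGER
        );
        -- Index the join key on both tables so the join can use it.
        CREATE INDEX game_id_idx ON game (game_id);
        CREATE INDEX batting_game_id_idx ON team_batting_counts (game_id);
        """
    )
    conn.executemany(
        "INSERT INTO game VALUES (?, ?)",
        [(1, "2011-04-01"), (2, "2011-04-02")],
    )
    conn.executemany(
        "INSERT INTO team_batting_counts VALUES (?, ?, ?, ?)",
        [(1, 10, 8, 33), (1, 20, 5, 30), (2, 10, 11, 36)],
    )
    conn.commit()
    return conn


def attach_game_dates(conn):
    """Attach each game's local_date to the batting rows via game_id."""
    query = """
        SELECT b.game_id, b.team_id, g.local_date, b.Hit, b.atBat
        FROM team_batting_counts AS b
        JOIN game AS g ON g.game_id = b.game_id
        ORDER BY g.local_date, b.game_id, b.team_id
    """
    return conn.execute(query).fetchall()


rows = attach_game_dates(build_demo_db())
for row in rows:
    print(row)
```

The same join would be repeated for team_pitching_counts, so that every per-team row carries the date needed for the rolling windows.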

Variables built

The columns above were then used to build the following features.

(Batting) (all values are rolling sums over the last 100 days)

  • Single
  • Double
  • Triple
  • atBat
  • Home_Run
  • Hit
  • Walk
  • Ground_Out
  • Flyout
  • Strikeout
  • Hit_by_Pitch
  • atBat / Home_Run
  • Walk / Strikeout
  • Ground_Out / Flyout
  • atBat / Hit
  • Home_Run / Hit
  • Hit + Walk + Hit_by_Pitch
  • Single + Double + Triple + Home_Run
  • Home_Run / (Single + Double + Triple + Home_Run) (** This is a custom feature that I made up, not a standard baseball statistic)

(Pitching) (all values are rolling sums over the last 100 days)

  • Single
  • Double
  • Triple
  • atBat
  • Home_Run
  • Hit
  • Walk
  • Ground_Out
  • Flyout
  • Strikeout
  • Ground_Out / Flyout
  • Strikeout / Home_Run
  • atBat / Home_Run
  • atBat / Hit
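
The rolling sums and ratio features listed above (batting and pitching alike) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the window is taken to cover games in the 100 days before the current game and to exclude the current game itself, a zero denominator gives None, the feature names atBat_per_Home_Run and atBat_per_Hit are hypothetical, and the sample games are invented. The actual SQL may define these details differently.

```python
from datetime import date, timedelta

STATS = ("atBat", "Home_Run", "Hit")  # a subset of the counts listed above


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def rolling_features(team_games, window_days=100):
    """Build rolling sums and ratio features for one team's games.

    Assumption: the window covers games played in the `window_days` days
    before the current game and excludes the current game itself.
    """
    ordered = sorted(team_games, key=lambda g: g["local_date"])
    features = []
    for current in ordered:
        window_start = current["local_date"] - timedelta(days=window_days)
        window = [
            g for g in ordered
            if window_start <= g["local_date"] < current["local_date"]
        ]
        sums = {stat: sum(g[stat] for g in window) for stat in STATS}
        # Ratios are computed from the rolling sums, not from single games.
        sums["atBat_per_Home_Run"] = safe_ratio(sums["atBat"], sums["Home_Run"])
        sums["atBat_per_Hit"] = safe_ratio(sums["atBat"], sums["Hit"])
        features.append({"local_date": current["local_date"], **sums})
    return features


# Invented sample games for a single team.
games = [
    {"local_date": date(2011, 4, 1), "atBat": 30, "Home_Run": 1, "Hit": 8},
    {"local_date": date(2011, 4, 5), "atBat": 34, "Home_Run": 2, "Hit": 10},
    {"local_date": date(2011, 4, 10), "atBat": 36, "Home_Run": 1, "Hit": 9},
    {"local_date": date(2011, 8, 1), "atBat": 32, "Home_Run": 0, "Hit": 7},
]
features = rolling_features(games)
```

Excluding the current game keeps the game's own outcome out of its features. If the current game should be counted, the upper bound of the window comparison would change from `<` to `<=`.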

_home, _away, _diff variable extensions

When these two tables were joined with the boxscore table, the features were split into home and away features based on the home_away column. Each feature therefore has two values, marked with _home or _away at the end of the variable name. In addition, the difference between the _home and _away values was calculated for each feature and marked with _diff at the end of the variable name.
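
The naming scheme can be sketched as follows. This is a minimal illustration with assumptions: the home_away column is taken to hold "H" for the home team and "A" for the away team, the difference is taken as home minus away (the direction is not stated above), and the feature values are invented.

```python
def combine_game_rows(team_rows):
    """Merge the two per-team feature rows of one game into a single row.

    Assumptions: home_away is "H" or "A", and _diff is home minus away.
    """
    by_side = {row["home_away"]: row for row in team_rows}
    home, away = by_side["H"], by_side["A"]
    combined = {}
    for name, home_value in home.items():
        if name == "home_away":
            continue  # the split column itself is not a feature
        away_value = away[name]
        combined[f"{name}_home"] = home_value
        combined[f"{name}_away"] = away_value
        # Leave the difference empty when either side is missing.
        if home_value is None or away_value is None:
            combined[f"{name}_diff"] = None
        else:
            combined[f"{name}_diff"] = home_value - away_value
    return combined


# Invented rolling-sum values for the two teams in one game.
game_rows = [
    {"home_away": "H", "Hit": 120, "Walk": 40},
    {"home_away": "A", "Hit": 100, "Walk": 45},
]
combined = combine_game_rows(game_rows)
print(combined)
```

Each input feature thus produces three columns in the final table, for example Hit_home, Hit_away, and Hit_diff.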
