A machine learning project that predicts total corners in football matches using historical data from 8 European leagues.
Goal: predict total corners per match with MAE < 2.0 (an average error of fewer than 2 corners) to support betting analysis.
- Data: pandas, numpy, soccerdata
- ML: XGBoost, scikit-learn
- Tracking: MLflow
- Statistics: scipy (Poisson distribution)
- Visualization: matplotlib, plotly

- Platform: FBref.com (via the soccerdata library)
- Leagues: Premier League, La Liga, Bundesliga, Ligue 1, Serie A, Eredivisie, Primeira Liga, Pro League
- Seasons: 2018-2025 (years where advanced data is available)
- Total Matches: ~21,000
- Shooting: xG, shots, shots on target, shot distance, shot-creating actions
- Passing: corners, crosses, passes (total attempts, progressive, final third, long passes), assists
- Defense: tackles (total, final third), blocks, interceptions, clearances
- Possession: touches, carries (progressive, final third, penalty area), possession %
- Goalkeeping: save %
FBref.com → Download stats → Merge leagues → Clean data → CSV
Output: dataset_cleaned.csv
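
The merge and clean steps of the pipeline above can be sketched with pandas. This is only an illustration under assumptions: the function name `merge_and_clean`, the column names (`team`, `corners`, `xg`), and the cleaning rules (drop exact duplicates and rows without a corner count) are hypothetical, and the tiny frames stand in for the stats that the project downloads from FBref.

```python
import pandas as pd


def merge_and_clean(league_frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-league match tables, tag each row with its league, and clean."""
    tagged = [df.assign(league=name) for name, df in league_frames.items()]
    merged = pd.concat(tagged, ignore_index=True)
    # Assumed cleaning rules: drop exact duplicate rows and rows that
    # have no corner count (the prediction target).
    cleaned = merged.drop_duplicates().dropna(subset=["corners"])
    return cleaned.reset_index(drop=True)


# Made-up frames standing in for the downloaded per-league stats.
premier = pd.DataFrame(
    {"team": ["Arsenal", "Chelsea"], "corners": [7, None], "xg": [1.8, 1.1]}
)
la_liga = pd.DataFrame(
    {"team": ["Barcelona", "Barcelona"], "corners": [6, 6], "xg": [2.1, 2.1]}
)

dataset = merge_and_clean({"Premier League": premier, "La Liga": la_liga})

# to_csv returns the CSV text when no path is given; pass a path such as
# "dataset_cleaned.csv" to write the file to disk instead.
csv_text = dataset.to_csv(index=False)
```

Tagging each row with its league before concatenating keeps the league available for later steps such as per-league averages and one-hot encoding.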
Basic features used (9):
Each feature is expressed relative to the average of the team's own league. For example: relative average corners = team average corners - league average corners.
- Average corners
- Variance of corners
- Average xG
- Average SCA (shot-creating actions)
- Average crosses
- Average possession
- Average attempts in the final third
- Average GF (goals for)
- Average GA (goals against)

Advanced key engineered features used (15):
SHOTS
- shot accuracy
- xg shot
- possession_shot
PASSES
- progressive_pass_ratio
- final_third_involvement
- assist_sca
- creative_efficiency
DEFENSE
- interception_tackle
- clearance_ratio
- high press intensity
POSSESSION
- progressive_carry_ratio
- carry_pass_balance
- transition_index
ATTACK
- offensive index
- attacking presence

Other features (11):
POINTS PER GAME
- average points per game, home team
- average points per game, away team
- difference in points per game
LEAGUES ONE HOT ENCODING
- premier league
- ligue 1
- bundesliga
- la liga
- eredivisie
- serie a
- primeira liga
- pro league

| Category | Features | Examples |
|---|---|---|
| Home Team Averages | 96 | Form, General (Home/Away) - Basic + Advanced features |
| Away Team Averages | 96 | Form, General (Home/Away) - Basic + Advanced features |
| Head-to-Head Averages | 48 | Last 3 matches (Home/Away) - Basic + Advanced features |
| Points Per Game Features | 3 | Points per game: home, away, and difference |
| League Encoding | 8 | One-hot encoded leagues |
| Team-Against Averages | 18 | Basic features recorded against each team |
Output: dataset_processed.csv
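
The league-relative processing described for the basic features (team average minus league average) can be sketched with a pandas group-wise transform. The function name `add_league_relative`, the `_rel` suffix, and the column names are hypothetical. The sketch also assumes that the league average is the mean of the team averages in that league, which the project may compute differently.

```python
import pandas as pd


def add_league_relative(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Add `<col>_rel` columns holding each value minus its league mean."""
    out = df.copy()
    for col in columns:
        # transform("mean") broadcasts the per-league mean back to every row.
        league_mean = out.groupby("league")[col].transform("mean")
        out[f"{col}_rel"] = out[col] - league_mean
    return out


# Made-up team averages for two leagues.
teams = pd.DataFrame(
    {
        "team": ["A", "B", "C", "D"],
        "league": ["ENG", "ENG", "ESP", "ESP"],
        "avg_corners": [6.0, 4.0, 5.5, 4.5],
    }
)

relative = add_league_relative(teams, ["avg_corners"])
```

Centering on the league mean makes values comparable across leagues with different corner rates: a positive value means the team is above its league's average.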
Why XGBoost?
- Handles non-linear relationships
- Works well with 80+ features
- Built-in regularization (L1/L2 penalties, row and column subsampling) helps limit overfitting
- Fast training/prediction
Total: 21,000 matches
├── Train (70%): 14,700 matches
├── Validation (15%): 3,150 matches
└── Test (15%): 3,150 matches
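
A 70/15/15 split like the one above can be sketched with NumPy alone. The function name `split_indices` and the seed are hypothetical, and the random shuffle is an assumption: the source does not say whether the project splits randomly or chronologically, and for match data a time-ordered split may be preferable to avoid leaking future form into training.

```python
import numpy as np


def split_indices(n_rows: int, train: float = 0.70, val: float = 0.15, seed: int = 42):
    """Return shuffled row indices for the train, validation, and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = int(round(n_rows * train))
    n_val = int(round(n_rows * val))
    # Whatever remains after train and validation becomes the test set.
    return (
        order[:n_train],
        order[n_train:n_train + n_val],
        order[n_train + n_val:],
    )


train_idx, val_idx, test_idx = split_indices(21_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 14700 3150 3150
```

scikit-learn's `train_test_split`, applied twice, would be an equivalent alternative.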
Hyperparameters (found via GridSearchCV):
{
    'n_estimators': 200,
    'max_depth': 4,
    'learning_rate': 0.03,
    'reg_alpha': 3.0,
    'reg_lambda': 5.0,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'colsample_bylevel': 0.6,
    'best_gamma': 1.0
}
| Set | MAE | R² | RMSE |
|---|---|---|---|
| Train | 1.78 | 0.49 | 2.23 |
| Validation | 1.95 | 0.38 | 2.45 |
| Test | 1.93 | 0.39 | 2.42 |
Test MAE = 1.93: predictions are off by 1.93 corners on average.
The model currently overfits slightly (train MAE 1.78 vs. validation MAE 1.95); I am still improving the data and model configuration.
Errors < 1 corner: 46%
Errors < 1.5 corners: 55%
Errors < 2 corners: 68%
Errors < 3 corners: 82%
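
The metrics in the table (MAE, RMSE, R²) and the error shares listed above can be computed as in the following NumPy sketch. The function name `regression_report`, the key names, and the sample values are hypothetical; they are not the project's data or results.

```python
import numpy as np


def regression_report(y_true, y_pred, thresholds=(1.0, 1.5, 2.0, 3.0)):
    """Compute MAE, RMSE, R² and the share of absolute errors below each threshold."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    report = {
        "mae": float(abs_err.mean()),
        "rmse": float(np.sqrt(np.mean((y_true - y_pred) ** 2))),
        "r2": float(1.0 - ss_res / ss_tot),
    }
    for t in thresholds:
        # Fraction of matches whose absolute error is strictly below t corners.
        report[f"share_err_lt_{t}"] = float(np.mean(abs_err < t))
    return report


# Made-up corner totals, for illustration only.
actual = [8, 11, 9, 13]
predicted = [9.5, 10.5, 9.0, 10.0]
report = regression_report(actual, predicted)
```

Running the same function on the train, validation, and test sets makes the gap between them, and therefore the degree of overfitting, directly comparable.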
| Feature | Importance | Description |
|---|---|---|
| lst_team1_home_avg_ck | 0.0842 | Home team avg corners at home |
| lst_team2_away_avg_ck | 0.0795 | Away team avg corners away |
| lst_team1_home_xg | 0.0623 | Home team expected goals |
| lst_h2h_avg_ck | 0.0581 | Head-to-head avg corners |
| lst_team1_home_sh | 0.0534 | Home team shots |
| lst_team2_away_xg | 0.0489 | Away team expected goals |
predict_corners(
local="Barcelona",
visitante="Real Madrid",
jornada=15,
temporada="2526",
league_code="ESP"
)

Barcelona vs Real Madrid
Season 2526 | Round 15
PREDICTION: 10.3 corners
Most probable: 10 corners (12.5%)
80% confidence: 7-13 corners
OVER/UNDER PROBABILITIES:
Over 8.5: 72.3% @1.38 - HIGH
Over 9.5: 58.1% @1.72 - MEDIUM
Over 10.5: 43.2% @2.31 - LOW
RELIABILITY: VERY HIGH (Score: 71/100)
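
The probabilities in the sample output can be illustrated with a plain Poisson model whose mean is the predicted corner count. This is a sketch under assumptions, not the project's exact method: the helper names are hypothetical, and the odds after `@` are read as fair decimal odds (1 / probability), which is consistent with the sample figures. With a mean of 10.3, a plain Poisson gives 10 as the most probable total at about 12.5%, matching the sample, but its over/under figures do not exactly reproduce those shown, so the project presumably applies further adjustments.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(total == k) under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def prob_over(line: float, lam: float) -> float:
    """P(total > line) for a half-line such as 9.5, i.e. 1 - P(total <= 9)."""
    max_under = math.floor(line)
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(max_under + 1))


def fair_odds(prob: float) -> float:
    """Decimal odds with no bookmaker margin."""
    return 1.0 / prob


lam = 10.3  # predicted total corners
# Most probable total, searched over a range wide enough for corner counts.
mode = max(range(40), key=lambda k: poisson_pmf(k, lam))

for line in (8.5, 9.5, 10.5):
    p = prob_over(line, lam)
    print(f"Over {line}: {p:.1%} @{fair_odds(p):.2f}")
```

`scipy.stats.poisson` offers the same quantities through its `pmf` and `sf` methods; the standard-library version here only keeps the sketch dependency-free.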
Measures team consistency:
Score = (100 - CV) × 0.4 +
        consistency × 0.3 +
        trend_stability × 0.3
- Score > 65: VERY HIGH
- Score > 50: HIGH
- Score > 35: MEDIUM
- Score ≤ 35: LOW
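
The reliability formula and its thresholds translate directly into code. The sketch below assumes that CV is the coefficient of variation expressed as a percentage and that `consistency` and `trend_stability` are already on a 0-100 scale; the source does not define how these inputs are computed, and the function names and input values are hypothetical.

```python
def reliability_score(cv: float, consistency: float, trend_stability: float) -> float:
    """Weighted score: 40% inverse variability, 30% consistency, 30% trend stability."""
    return (100 - cv) * 0.4 + consistency * 0.3 + trend_stability * 0.3


def reliability_label(score: float) -> str:
    """Map a score to the reliability bands listed above."""
    if score > 65:
        return "VERY HIGH"
    if score > 50:
        return "HIGH"
    if score > 35:
        return "MEDIUM"
    return "LOW"


# Made-up inputs: 25% coefficient of variation, consistency 70, trend stability 66.
score = reliability_score(cv=25.0, consistency=70.0, trend_stability=66.0)
label = reliability_label(score)
print(f"{label} (Score: {score:.0f}/100)")
```

Because CV enters as `100 - CV`, a team whose corner counts vary less from match to match scores higher.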
futbol_corners_forecast/
│
├── config/
│   └── model_config.json            # Best hyperparameters
│
├── dataset/
│   ├── cleaned/
│   │   └── dataset_cleaned.csv      # Raw processed data
│   └── processed/
│       └── dataset_processed.csv    # ML-ready features
│
├── models/
│   ├── xgboost_corners_*.pkl        # Trained model
│   ├── scaler_corners_*.pkl         # Feature scaler
│   └── feature_importance_*.csv     # Feature rankings
│
├── mlruns/                          # MLflow experiments
│
├── src/
│   ├── models/
│   │   ├── train_model.py           # Training pipeline
│   │   └── test_model.py            # Prediction system
│   │
│   └── process_data/
│       ├── generate_dataset.py      # Data collection
│       └── process_dataset.py       # Feature engineering
│
├── EDA.ipynb                        # Exploratory analysis
└── README.md
- Consistent teams → Better predictions (MAE ~1.9)
- Top leagues → More data = Better accuracy
- Mid-season matches → More historical data
- Matches where teams had low variance and few anomalies
- Inconsistent teams → Higher error (MAE ~2.3)
- Early season → Limited historical data
- uncertainty
Educational purposes only. Not financial advice.



