A comprehensive machine learning project that predicts the market value of football players using various performance metrics, player attributes, and statistical analysis. The project implements multiple regression algorithms with hyperparameter tuning to achieve high accuracy in player valuation predictions.
This project leverages machine learning to predict football player market values, which is crucial for:
- Transfer Market Analysis - Evaluate fair transfer fees and identify market opportunities
- Contract Negotiations - Determine appropriate compensation structures
- Investment Decisions - Spot undervalued or overvalued players
- Strategic Planning - Support club management and scouting decisions
Our best single model, Gradient Boosting with selected features, achieves:
- R² Score: 0.988 (98.8% variance explained)
- Mean Absolute Error: €232K
- Root Mean Square Error: €463K
- Mean Absolute Percentage Error: 6.81%
- Multiple ML Algorithms - Linear Regression, Random Forest, Gradient Boosting, Polynomial Regression, Lasso, Ridge
- Hyperparameter Tuning - Automated parameter optimization using RandomizedSearchCV
- Feature Engineering - Advanced feature selection and importance analysis
- Ensemble Methods - Voting and Stacking regressors for improved performance
- Model Persistence - Save and load trained models
- Comprehensive Evaluation - Multiple metrics for robust assessment
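As a rough illustration of the model persistence feature, the sketch below saves a fitted regressor with joblib and loads it back. The synthetic data and the file name `example_model.joblib` are assumptions made for the example; they are not taken from the project notebooks.

```python
import tempfile
from pathlib import Path

from joblib import dump, load
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data; the real project trains on the scraped player dataset.
X, y = make_regression(n_samples=100, n_features=8, random_state=42)
model = GradientBoostingRegressor(random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_model.joblib"  # hypothetical file name
    dump(model, path)       # serialize the fitted model to disk
    restored = load(path)   # read it back into memory

# The restored model should reproduce the original predictions exactly.
same = bool((model.predict(X) == restored.predict(X)).all())
print("Predictions identical after reload:", same)
```

A model saved this way generally needs to be loaded with a compatible scikit-learn version.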
Predicting-Market-Value-Footballers/
├── Web Scaping/
│   └── web_scraping.ipynb             # Data collection notebook
├── data/
│   ├── out.csv                        # Processed dataset
│   ├── players_all.csv                # Complete player data
│   └── test.csv                       # Test dataset
├── data_preparation/
│   └── data_prepare.ipynb             # Data preprocessing
├── main.ipynb                         # Main analysis and modeling
├── gradient_boosting_with_most_im...  # Best model file
└── README.md
- Web scraping from SoFIFA.com - Automated collection of 60+ player attributes across multiple pages
- Comprehensive dataset - Player statistics including ratings, physical attributes, skills, and market values
- Data cleaning and feature engineering - Processing scraped data into ML-ready format
- Handling missing values and outliers - Data quality assurance and preprocessing
- Feature scaling and normalization - Preparation for machine learning algorithms
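One cleaning step that scraped football data usually needs is converting currency strings into numbers. The sketch below assumes the raw values look like `€1.2M` or `€450K`; that format, and the helper name `parse_money`, are assumptions for illustration and may not match what the preprocessing notebook actually does.

```python
import re
from typing import Optional

# Multipliers for the suffixes assumed to appear in the scraped strings.
_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000}
_MONEY_RE = re.compile(r"^€?\s*([0-9]+(?:\.[0-9]+)?)\s*([KM]?)$")


def parse_money(text: Optional[str]) -> Optional[float]:
    """Convert a string such as '€1.2M' or '€450K' to a float in euros.

    Returns None for missing or unparseable input so that the caller
    can decide how to handle missing values.
    """
    if text is None:
        return None
    match = _MONEY_RE.match(text.strip())
    if match is None:
        return None
    number, suffix = match.groups()
    return float(number) * _SUFFIX[suffix]


for raw in ["€1.2M", "€450K", "€0", "n/a"]:
    print(raw, "->", parse_money(raw))
```

With pandas, a helper like this could be applied column-wise, for example `df["new_wages"] = df["Wage"].map(parse_money)`, where `Wage` is a hypothetical name for the raw column.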
The project implements and compares multiple algorithms:
- Linear Regression - R²: 0.891, MAE: €848K
- Lasso Regression - R²: 0.868, MAE: €916K
- Ridge Regression - R²: 0.868, MAE: €916K
- Polynomial Regression - R²: 0.918, MAE: €758K
- Random Forest - R²: 0.976, MAE: €302K
- Gradient Boosting - R²: 0.979, MAE: €339K
- Tuned Random Forest - R²: 0.961, MAE: €419K
- Tuned Gradient Boosting - R²: 0.983, MAE: €260K
- Voting Regressor - R²: 0.978, MAE: €306K
- Stacking Regressor - R²: 0.988, MAE: €227K
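The two ensemble entries above combine several base models. The sketch below shows one way to build a voting and a stacking regressor with scikit-learn. The choice of base models, the `Ridge` meta-learner, and the synthetic data are assumptions for the example, not the configuration used in `main.ipynb`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
    VotingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real project uses the scraped player dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

base_models = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=50, random_state=42)),
]

# Voting: average the predictions of the base models.
voting = VotingRegressor(estimators=base_models)

# Stacking: fit a meta-learner (Ridge here, an assumption) on
# cross-validated predictions of the base models.
stacking = StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=5)

scores = {}
for name, model in [("voting", voting), ("stacking", stacking)]:
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R2 = {scores[name]:.3f}")
```

Both ensembles clone the base estimators when fitting, so the same `base_models` list can be reused safely.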
The most important features for predicting market value are:
- Age - Player's current age
- Overall Rating - FIFA overall skill rating
- Potential - Maximum potential rating
- Best Overall - Peak overall rating achieved
- Growth - Difference between potential and current rating
- Dribbling/Reflexes - Technical skills
- Wages - Current salary information
- Release Clause - Contract release clause value
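To make the feature list more concrete, the sketch below derives `Growth` as potential minus current rating and ranks features by the importances of a fitted Gradient Boosting model. The data is randomly generated and the toy target formula is an assumption, so the resulting ranking says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in for the player table; all values are made up.
df = pd.DataFrame({
    "Age": rng.integers(17, 38, size=n),
    "Overall rating": rng.integers(55, 92, size=n),
})
# In this toy data, potential is never below the current rating.
df["Potential"] = df["Overall rating"] + rng.integers(0, 10, size=n)
# Growth as described above: potential minus current rating.
df["Growth"] = df["Potential"] - df["Overall rating"]

# Toy target driven by the overall rating plus noise (assumption for the demo).
y = 50_000 * (df["Overall rating"] - 50) ** 2 + rng.normal(0, 1e5, size=n)

model = GradientBoostingRegressor(random_state=42).fit(df, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(
    model.feature_importances_, index=df.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be misleading when features are strongly correlated, as `Potential` and `Overall rating` are here, so permutation importance is a common cross-check.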
- RandomizedSearchCV with 5-fold cross-validation
- 50 parameter combinations tested for each model
- Optimized parameters for Random Forest and Gradient Boosting
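The search described above could be set up roughly as follows. The defaults match the stated 50 sampled combinations and 5-fold cross-validation, but the parameter ranges, the scoring metric, and the function name `tune_gradient_boosting` are assumptions, since the README does not list them. The demo call uses a much smaller `n_iter` only to keep the example fast.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV


def tune_gradient_boosting(X, y, n_iter=50, cv=5, random_state=42):
    """Run a randomized search over an illustrative Gradient Boosting space."""
    # Search space is an assumption; the actual ranges are not documented.
    param_distributions = {
        "n_estimators": randint(100, 600),       # integers in [100, 600)
        "learning_rate": uniform(0.01, 0.19),    # floats in [0.01, 0.20]
        "max_depth": randint(2, 8),
        "min_samples_split": randint(2, 11),
        "min_samples_leaf": randint(1, 5),
    }
    search = RandomizedSearchCV(
        GradientBoostingRegressor(random_state=random_state),
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=cv,
        scoring="neg_mean_absolute_error",  # assumed metric
        random_state=random_state,
    )
    return search.fit(X, y)


# Small synthetic demo; n_iter is reduced from 50 to 3 for speed.
X, y = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=42)
search = tune_gradient_boosting(X, y, n_iter=3)
print(search.best_params_)
```

After fitting, `search.best_estimator_` holds a model refit on the full training data with the best parameters found.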
Gradient Boosting with Selected Features:
GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=1,
    random_state=42
)

- Open the main notebook

jupyter notebook main.ipynb

- Execute cells sequentially to:
  - Load and explore the dataset
  - Train multiple models
  - Compare performance metrics
  - Analyze feature importance
  - Generate predictions
from joblib import load
import pandas as pd

# Load the best model
model = load('gradient_boosting_with_most_important_features_best_model.joblib')

# Prepare your data (ensure same features as training).
# Uncomment and fill in before running:
# new_player_data = pd.DataFrame({...})

# Make predictions
predicted_value = model.predict(new_player_data)
print(f"Predicted market value: €{predicted_value[0]:,.0f}")

required_features = [
    'Age',
    'Overall rating',
    'Potential',
    'Best overall',
    'Growth',
    'Dribbling / Reflexes',
    'new_wages',
    'new_release_clause'
]

| Model | R² Score | MAE (€) | RMSE (€) | MAPE (%) |
|---|---|---|---|---|
| Linear Regression | 0.891 | 848K | 1,348K | 72.0 |
| Random Forest | 0.976 | 302K | 641K | 7.6 |
| Gradient Boosting | 0.979 | 339K | 602K | 13.3 |
| Tuned Gradient Boosting | 0.983 | 260K | 539K | 7.4 |
| GB with Selected Features | 0.988 | 232K | 463K | 6.8 |
| Stacking Regressor | 0.988 | 227K | 456K | 6.7 |
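For reference, the four metrics in the table can be computed with scikit-learn as sketched below. The helper name `regression_report` and the three toy values are assumptions for illustration; they are not project results.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)


def regression_report(y_true, y_pred):
    """Return R², MAE, RMSE and MAPE (in percent) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        # RMSE is the square root of the mean squared error.
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        # scikit-learn returns MAPE as a fraction; scale to percent.
        "mape_pct": 100.0 * mean_absolute_percentage_error(y_true, y_pred),
    }


# Toy market values in euros (made up).
y_true = [1_000_000, 2_000_000, 4_000_000]
y_pred = [1_100_000, 1_900_000, 4_200_000]
report = regression_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:,.4f}")
```

MAPE weights errors relative to each player's true value, so it can be large even when MAE is moderate if many players have low market values.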
- Automated data collection from football databases
- Player statistics, ratings, and market values
- Real-time data updates
- Data cleaning and validation
- Feature engineering and selection
- Train/test split preparation
- Comprehensive model comparison
- Hyperparameter optimization
- Performance evaluation and validation
- Deep Learning Models - Neural networks for non-linear patterns
- Time Series Analysis - Player value trends over time
- Real-time Predictions - API integration for live data
- Additional Features - Injury history, team performance metrics
- Multi-class Prediction - Position-specific value models
For questions or collaboration opportunities, please open an issue in the repository.