This project presents a solution for the 2025 Kaggle Playground Series competition. The goal was to predict workout calorie expenditure using physiological data. The final model is a stacked ensemble of gradient boosting models enhanced with advanced feature engineering and a final bias-correction layer.
It follows a sophisticated, end-to-end machine learning pipeline to deliver highly accurate predictions:
- Enriches Raw Data: A custom transformation pipeline creates new, insightful features from base metrics, such as Body Mass Index (BMI), age-squared, and advanced heart rate statistics.
- Optimizes Multiple Models: It uses the Optuna framework to systematically find the best hyperparameters for three powerful gradient boosting models: LightGBM, XGBoost, and CatBoost.
- Combines Predictions with Stacking: A `StackingRegressor` acts as a "manager" model. It takes the predictions from the individual base models as input and learns how to combine them into a more accurate and robust final prediction.
- Refines the Final Output: A simple, final linear regression model is trained to correct any systematic bias in the ensemble's predictions, providing a last-mile performance boost.
- Core Stack: Python, NumPy, Pandas
- Machine Learning: Scikit-learn (for Pipelines, Stacking, and Preprocessing)
- Gradient Boosting: LightGBM, XGBoost, CatBoost
- Hyperparameter Tuning: Optuna
- Utilities: Joblib (for saving model artifacts), Logging
The foundation of the model's success lies in its comprehensive feature engineering, all encapsulated within a robust scikit-learn Pipeline to prevent data leakage.
- Custom Transformer: A dedicated class, `CustomFeatureEngineer`, handles all new feature creation.
- Key Engineered Features:
  - BMI Features: `BMI`, `BMI_sq`, and `BMI_cat` (underweight, normal, etc.).
  - Age-Based Features: `Age_sq` and `Age_decade` to capture non-linear relationships.
  - Heart Rate Analytics: `HR_max_est`, `HR_reserve`, and `HR_pct_max` to contextualize heart rate.
  - Interaction Terms: `Weight * Duration` to model combined effects.
- Preprocessing: The pipeline automatically handles missing-value imputation, one-hot encoding for categorical data, and `StandardScaler` scaling for numeric features.
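A minimal sketch of what such a transformer and pipeline could look like. The class name `CustomFeatureEngineer` comes from the README; the exact column names (`Sex`, `Age`, `Height`, `Weight`, `Duration`, `Heart_Rate`, `Body_Temp`), BMI bin edges, and the `220 - Age` max-heart-rate estimate are assumptions, not the project's actual code:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class CustomFeatureEngineer(BaseEstimator, TransformerMixin):
    """Adds BMI, age, heart-rate, and interaction features (illustrative)."""

    def fit(self, X, y=None):
        return self  # stateless: every feature derives row-wise from X

    def transform(self, X):
        X = X.copy()
        X["BMI"] = X["Weight"] / (X["Height"] / 100) ** 2
        X["BMI_sq"] = X["BMI"] ** 2
        X["BMI_cat"] = pd.cut(                      # assumed WHO-style bins
            X["BMI"], bins=[0, 18.5, 25, 30, np.inf],
            labels=["underweight", "normal", "overweight", "obese"])
        X["Age_sq"] = X["Age"] ** 2
        X["Age_decade"] = (X["Age"] // 10).astype(int)
        X["HR_max_est"] = 220 - X["Age"]            # common rough estimate
        X["HR_reserve"] = X["HR_max_est"] - X["Heart_Rate"]
        X["HR_pct_max"] = X["Heart_Rate"] / X["HR_max_est"]
        X["Weight_x_Duration"] = X["Weight"] * X["Duration"]
        return X

numeric = ["Age", "Height", "Weight", "Duration", "Heart_Rate", "Body_Temp",
           "BMI", "BMI_sq", "Age_sq", "Age_decade",
           "HR_max_est", "HR_reserve", "HR_pct_max", "Weight_x_Duration"]
categorical = ["Sex", "BMI_cat"]

# Imputation, one-hot encoding, and scaling, all inside the pipeline so the
# statistics are learned on training folds only (no leakage).
preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]),
     categorical),
])

features = Pipeline([("engineer", CustomFeatureEngineer()),
                     ("preprocess", preprocessor)])
```

Because the engineering step sits inside the `Pipeline`, calling `features.fit_transform(train_df)` and `features.transform(test_df)` keeps train and test handling identical.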
The core of the solution is a stacked ensemble that leverages multiple models.
- Base Models:
  - `Ridge`: A simple linear baseline.
  - `LightGBM`: Optimized with Optuna.
  - `XGBoost`: Optimized with Optuna.
  - `CatBoost`: Optimized with Optuna.
- Hyperparameter Tuning: Each boosting model was tuned using Optuna over 20-50 trials with 5-fold cross-validation to find the most effective parameters.
- Meta-Model: A `Ridge` regressor serves as the final estimator, learning the optimal weights to assign to each base model's predictions.
A final `LinearRegression` model was trained on the out-of-fold predictions from the validation set. This final step acts as a fine-tuning mechanism, correcting small, systematic errors and keeping the predictions well calibrated.
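The idea can be sketched as a one-feature linear regression that maps the ensemble's validation predictions back onto the true targets, then rescales the test predictions. Function and variable names here are illustrative, not the project's:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_bias_corrector(val_preds, val_targets):
    """Learn y ~ a * pred + b to remove systematic over/under-prediction."""
    corrector = LinearRegression()
    corrector.fit(np.asarray(val_preds).reshape(-1, 1), val_targets)
    return corrector

def apply_correction(corrector, test_preds):
    corrected = corrector.predict(np.asarray(test_preds).reshape(-1, 1))
    return np.clip(corrected, 0, None)  # calorie counts can't be negative
```

If the ensemble has, say, a consistent 10% under-prediction plus a constant offset, the fitted slope and intercept absorb exactly that bias without otherwise changing the ranking of the predictions.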
The model finished in the top 9% of the leaderboard, demonstrating its accuracy and effectiveness.
| Metric | Value |
|---|---|
| 🏁 Final Rank | 370/4316 (Top 9%) |
| 🎯 Final Score (RMSLE) | 0.05866 (winning score: 0.05841) |
- Advanced Feature Engineering for Tabular Data
- End-to-End Machine Learning Pipelines (`scikit-learn`)
- Hyperparameter Optimization (Optuna)
- Advanced Ensemble Methods (Stacking)
- Model Evaluation, Calibration, and Refinement
- Best Practices in Reproducible ML (Pipelines, Artifact Saving)
- Kaggle Playground Series
- This competition is part of Kaggle's monthly series designed for practicing and honing machine learning skills on approachable, real-world datasets.
👋 Hi! I'm Ludovic Malot, a French engineer with a passion for building effective and creative machine learning solutions. This project was a fantastic opportunity to apply and refine advanced ensembling techniques in a highly competitive environment.
Feel free to connect with me on LinkedIn or drop a ⭐ if you found this project insightful!