End-to-End MLB Moneyline Betting System
- Overview
- Project Structure
- Data Sources
- ETL Pipeline
- Database Schema
- Data Validation & Quality Checks
- Modeling Pipeline
- How to Run
- Environment Setup
- Future Work
- Acknowledgments
- License
- What does the project do? This project aims to build a production-level (or close to production-level) system to aid in MLB moneyline betting. The goal is a fully self-contained pipeline covering data storage, ETL/ELT, modeling, backtesting, monitoring, and deployment. The initial version will be completed within a span of roughly two months, so while I don't expect the modeling to be optimal, I hope to get something "good enough" while standing up the deployment side. Additional model refinement can come later.
While this is a personal project, which I plan to test out with "skin in the game", I suspect that the insights from this work will be useful for teaching (I teach "Data Science for Sports" at the GW School of Business).
Another aim of this project is to gain additional experience in Docker, ETL/ELT, and deployment to supplement my current role as a data scientist.
end-to-end-mlb-betting/
├── data/
│   ├── raw_games.csv
│   ├── raw_team_stats.csv
│   └── raw_player_stats.csv
├── dags/
│   └── airflow_dag.py
├── db/
│   ├── schema.sql
│   └── data_validation_checks.sql
├── docker/
├── etl/
│   ├── utils.py
│   ├── extract_games.py
│   ├── extract_team_stats.py
│   ├── extract_player_stats.py
│   ├── extract_odds.py
│   ├── load_to_db.py
│   ├── transform_games_clean.py
│   └── update_all_data.py
├── models/
│   ├── evaluate.py
│   └── train_model.py
├── serving/
│   ├── api.py
│   └── Dockerfile
├── tests/
│   └── test_features.py
├── tracking/
│   └── mlflow_config
└── validate/
    └── run_data_checks.py
- MLB Stats API
- See documentation here: ...
- Break into steps: extract, transform, load
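As a sketch of the transform step, the snippet below flattens one raw game record (shaped like a response from the MLB Stats API schedule endpoint) into a tidy row. The function name `transform_game` and the exact field layout are illustrative assumptions, not the project's actual code; check the real API response before relying on them.

```python
# Hypothetical transform step: flatten a nested schedule-API game record
# into one flat row suitable for loading into the games table.
def transform_game(raw: dict) -> dict:
    """Flatten a nested game record into a flat dict (one row)."""
    home = raw["teams"]["home"]
    away = raw["teams"]["away"]
    return {
        "game_id": raw["gamePk"],
        "game_date": raw["officialDate"],
        "home_team": home["team"]["name"],
        "away_team": away["team"]["name"],
        "home_score": home.get("score"),
        "away_score": away.get("score"),
        "home_win": home.get("isWinner"),
    }

# Example input shaped like a schedule-endpoint game record (illustrative).
raw_game = {
    "gamePk": 745804,
    "officialDate": "2024-06-01",
    "teams": {
        "home": {"team": {"name": "Washington Nationals"}, "score": 5, "isWinner": True},
        "away": {"team": {"name": "Cleveland Guardians"}, "score": 4, "isWinner": False},
    },
}
row = transform_game(raw_game)
```

Keeping the transform a pure function of one record makes it easy to unit test in `tests/` without touching the network or the database.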
- Note that to get updated data, run
python -m src.etl.update_all_data
- Briefly describe each stage, CLI args, file outputs.
Diagrams and/or descriptions of the core tables: games, team stats, player stats, etc.
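The authoritative schema lives in `db/schema.sql`; the sketch below only illustrates the general shape a `games` table might take, using the standard-library `sqlite3` module so it runs anywhere. Every column name here is an assumption for illustration, not the project's actual DDL.

```python
import sqlite3

# Illustrative DDL for a games table; column names are assumptions,
# not the contents of db/schema.sql.
GAMES_DDL = """
CREATE TABLE IF NOT EXISTS games (
    game_id     INTEGER PRIMARY KEY,
    game_date   TEXT NOT NULL,
    home_team   TEXT NOT NULL,
    away_team   TEXT NOT NULL,
    home_score  INTEGER,
    away_score  INTEGER,
    home_win    INTEGER  -- 1 if the home team won, 0 otherwise
);
"""

# Create the table in an in-memory database and round-trip one row.
conn = sqlite3.connect(":memory:")
conn.execute(GAMES_DDL)
conn.execute(
    "INSERT INTO games VALUES (?, ?, ?, ?, ?, ?, ?)",
    (745804, "2024-06-01", "WSH", "CLE", 5, 4, 1),
)
row = conn.execute("SELECT home_team, home_win FROM games").fetchone()
```

A `game_id` primary key also gives the loader a natural upsert/dedup key when re-running extracts.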
Describe the validation logic, examples of checks, and how to run them
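As one hedged example of what `validate/run_data_checks.py` might do, the sketch below runs a few row-level quality checks in plain Python. The specific rules (unique `game_id`, non-negative scores, distinct teams) are assumptions chosen for illustration.

```python
# Hypothetical data-quality checks over extracted game rows.
def check_games(rows: list[dict]) -> list[str]:
    """Return a list of human-readable data-quality failures (empty = clean)."""
    failures = []
    ids = [r["game_id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("duplicate game_id values")
    for r in rows:
        if r["home_score"] is not None and r["home_score"] < 0:
            failures.append(f"negative score in game {r['game_id']}")
        if r["home_team"] == r["away_team"]:
            failures.append(f"team plays itself in game {r['game_id']}")
    return failures

good = [{"game_id": 1, "home_team": "WSH", "away_team": "CLE", "home_score": 5}]
bad = good + [{"game_id": 1, "home_team": "NYY", "away_team": "NYY", "home_score": -2}]
```

Returning failure messages rather than raising immediately lets a scheduled run report every problem in one pass.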
Overview of modeling workflow: features, target variable, cross-validation approach
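Because bets are placed on future games, cross-validation should be chronological: each test window must come strictly after its training window, so no future information leaks into the features. The helper below is a minimal sketch of such a split; the fold counts and sizing are illustrative assumptions, not the project's actual settings.

```python
# Sketch of time-ordered cross-validation: assuming games are sorted by
# date, each test fold comes strictly after its training fold.
def chronological_folds(n_games: int, n_folds: int = 3):
    """Yield (train_idx, test_idx) pairs with expanding training windows."""
    fold_size = n_games // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = list(range(0, k * fold_size))
        test = list(range(k * fold_size, (k + 1) * fold_size))
        yield train, test

folds = list(chronological_folds(100, n_folds=3))
```

The same idea is available off the shelf as scikit-learn's `TimeSeriesSplit` if that dependency is already in the stack.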
Commands to run:
- historical extract
- load to DB
- validation
- modeling
Include CLI examples and notes.
Required packages, Python version, virtualenv/conda, Docker (if used)
Ideas for extending the pipeline or improving the model (e.g., lineup changes, player absences)
Credits to data providers, libraries, or collaborators
If public, state license type.