A production-ready data pipeline that ingests Formula 1 race data from the OpenF1 API into Databricks using Lakeflow Spark Declarative Pipelines with Autoloader.
This project provides a complete end-to-end pipeline for F1 data:
- Ingestion: Fetch data from OpenF1 API and stage to Unity Catalog volumes
- Bronze Layer: Lakeflow Autoloader streams JSON files into raw Delta tables
- Silver Layer: Clean, validate, and transform data with proper types and data quality checks
- Gold Layer: Aggregate data for analytics and dashboards
- Databricks workspace with Unity Catalog enabled
- Databricks CLI installed and configured
- Python 3.8+ (for local development)
-- Run in Databricks SQL or notebook
CREATE CATALOG IF NOT EXISTS jai_patel_f1_data;
CREATE SCHEMA IF NOT EXISTS jai_patel_f1_data.racing_stats;
CREATE VOLUME IF NOT EXISTS jai_patel_f1_data.racing_stats.pipeline_storage;

Or use the provided script:

# In Databricks, run: setup/setup_catalog.sql

# Clone the repository
git clone https://github.com/Jaipats/Formula1_Databricks.git
cd Formula1_Databricks
# Deploy using Databricks CLI
bash deploy/databricks_cli_deploy.sh

- Open the Databricks workspace
- Navigate to: /Workspace/Users/YOUR_EMAIL/Formula1_Databricks/notebooks/
- Open: 01_ingest_f1_data.py
- Update the workspace path (line ~36)
- Attach to a cluster and run all cells
Output: JSON files in /Volumes/{catalog}/{schema}/pipeline_storage/staging/
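The ingestion notebook stages newline-delimited JSON under the volume's staging area. A minimal sketch of that write pattern, assuming a per-endpoint, per-session file layout (the helper names and exact layout are illustrative, not the project's actual `utils/volume_writer.py` code):

```python
import json
from pathlib import Path

def staging_path(catalog, schema, endpoint, session_key):
    # Hypothetical layout: one file per endpoint per session under staging/
    return (f"/Volumes/{catalog}/{schema}/pipeline_storage/"
            f"staging/{endpoint}/session_{session_key}.json")

def write_batch(records, path):
    """Write one JSON object per line (a shape Autoloader reads as JSON)."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    with p.open("w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return len(records)
```

For example, `staging_path("jai_patel_f1_data", "racing_stats", "laps", 9158)` yields a path under `/Volumes/.../staging/laps/`.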
- Go to Workflows → Lakeflow Pipelines
- Create a new pipeline:
  - Name: f1_data_pipeline
  - Storage: /Volumes/jai_patel_f1_data/racing_stats/pipeline_storage
  - Configuration: { "catalog": "jai_patel_f1_data", "schema": "racing_stats" }
  - Libraries: add the notebooks from the dlt/ folder
- Click Start
Result: Bronze, Silver, and Gold tables created automatically!
-- Bronze (raw data)
SELECT * FROM jai_patel_f1_data.racing_stats.bronze_meetings;
-- Silver (cleaned data)
SELECT * FROM jai_patel_f1_data.racing_stats.silver_meetings;
-- Gold (analytics)
SELECT * FROM jai_patel_f1_data.racing_stats.gold_race_summary;

After your data is loaded, set up the interactive F1 analytics Databricks App:
# Set environment variables
export DATABRICKS_HOST='your-workspace.cloud.databricks.com'
export DATABRICKS_TOKEN='your-personal-access-token'
export DATABRICKS_HTTP_PATH='/sql/1.0/warehouses/your-warehouse-id'
# Run the app locally
cd apps
streamlit run app.py

Features:
- Overview dashboard with season statistics
- Driver performance analysis with comparison mode
- Team analytics (Race sessions only)
- Detailed race analysis with multiple charts
- Tire strategy analysis with team filtering
For Production: Deploy as a Databricks App using apps/app.yaml configuration.
See apps/app.py for details and deployment instructions.
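The app relies on the three environment variables exported above before opening a SQL connection. A sketch of that validation step (the function name and error wording are illustrative; the real `apps/app.py` may differ):

```python
import os

def sql_connection_kwargs(env=None):
    """Check the three variables the exports above set, then return
    keyword arguments shaped for a Databricks SQL warehouse connection."""
    env = os.environ if env is None else env
    required = ("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_HTTP_PATH")
    missing = [k for k in required if not env.get(k)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    host = env["DATABRICKS_HOST"]
    if host.startswith("https://"):
        host = host[len("https://"):]  # connector expects a bare hostname
    return {
        "server_hostname": host,
        "http_path": env["DATABRICKS_HTTP_PATH"],
        "access_token": env["DATABRICKS_TOKEN"],
    }
```

With the databricks-sql-connector package installed, the returned kwargs can be passed to `databricks.sql.connect(**kwargs)`.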
After data is loaded and verified, create a Genie Space for natural language queries:
Option 1: Using Databricks Notebook
- Upload notebooks/create_genie_space.py to your workspace
- Run all cells
- Get instant access link to your Genie Space
Option 2: Using CLI Script
export DATABRICKS_HOST='your-workspace.cloud.databricks.com'
export DATABRICKS_TOKEN='your-token'
cd deploy
./create_genie_space.sh

What is Genie? Genie is an AI-powered analytics tool that lets you ask questions in natural language:
- "Show me the top 10 fastest laps from 2025"
- "Compare Red Bull and Mercedes pit stop performance"
- "What tire compounds were used most in Monaco?"
The Genie Space includes 19 tables (13 silver + 6 gold) covering all F1 data.
Full Guide: See GENIE_SPACE_GUIDE.md for complete documentation and example questions.
Formula1_Databricks/
├── notebooks/
│   ├── 01_ingest_f1_data.py               # API ingestion to volumes
│   ├── 02_explore_data.py                 # Data exploration
│   └── create_genie_space.py              # Create Genie Space (interactive)
├── dlt/
│   ├── f1_volume_to_bronze_autoloader.py  # Autoloader → Bronze
│   ├── f1_bronze_to_silver.py             # Bronze → Silver
│   ├── f1_gold_aggregations.py            # Silver → Gold
│   └── pipeline_config.json               # Lakeflow pipeline config
├── apps/
│   ├── app.py                             # Streamlit Databricks App
│   ├── app.yaml                           # Databricks Apps config
│   ├── requirements.txt                   # Python dependencies
│   └── test_connection.py                 # Connection diagnostic tool
├── config/
│   ├── pipeline_config.yaml               # Pipeline configuration
│   └── settings.py                        # Config loader
├── utils/
│   ├── api_client.py                      # OpenF1 API client
│   ├── data_fetcher.py                    # Data fetching logic
│   └── volume_writer.py                   # Volume writing utilities
├── setup/
│   └── setup_catalog.sql                  # Unity Catalog setup
├── deploy/
│   ├── databricks_cli_deploy.sh           # Deployment script
│   ├── create_genie_space.py              # Create Genie Space (CLI Python)
│   └── create_genie_space.sh              # Create Genie Space (CLI Shell)
├── dashboards/
│   └── f1_race_analytics.sql              # Sample dashboard queries
└── docs/
    ├── GENIE_SPACE_GUIDE.md               # Complete Genie Space guide
    ├── DLT_AUTOLOADER_GUIDE.md            # Lakeflow Autoloader guide
    └── QUICK_START.md                     # Quick start guide
- Writes data incrementally to volumes
- Processes data by session to avoid memory issues
- Handles large datasets without crashes
- Incremental processing: Only processes new files
- Automatic schema evolution: Handles schema changes gracefully
- Exactly-once semantics: No duplicates
- Production-ready: Fault tolerant with checkpointing
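These guarantees come from Autoloader's `cloudFiles` source combined with streaming checkpoints. As a minimal sketch, a bronze table definition using Autoloader can look like the following. This runs only inside a Databricks Lakeflow (DLT) pipeline, where `spark` is provided; the table name and path mirror the examples above but may differ from the actual `dlt/` notebooks:

```python
import dlt
from pyspark.sql.functions import current_timestamp

STAGING = "/Volumes/jai_patel_f1_data/racing_stats/pipeline_storage/staging"

@dlt.table(name="bronze_meetings", comment="Raw meetings JSON via Autoloader")
def bronze_meetings():
    return (
        spark.readStream.format("cloudFiles")           # Autoloader source
        .option("cloudFiles.format", "json")            # staged files are JSON
        .option("cloudFiles.inferColumnTypes", "true")  # schema inference/evolution
        .load(f"{STAGING}/meetings/")
        .withColumn("_ingested_at", current_timestamp())
    )
```

Checkpointing and exactly-once tracking of processed files are handled by the pipeline runtime; no manual bookkeeping is needed in the table definition.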
- Fetches multiple endpoints simultaneously
- 50-60% faster than sequential fetching
- Configurable worker threads
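Conceptually, the parallel fetch is a thread pool over endpoint names. A small self-contained sketch (the real `utils/data_fetcher.py` logic may differ; `max_workers` mirrors the setting in `pipeline_config.yaml`):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_endpoints_parallel(fetch_fn, endpoints, max_workers=3):
    """Fetch several API endpoints concurrently, collecting results by name.
    fetch_fn(endpoint) performs the actual HTTP call."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_fn, ep): ep for ep in endpoints}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```

Because API calls are I/O-bound, threads (rather than processes) are enough to overlap the network waits, which is where the 50-60% speedup comes from.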
- 429 (Rate Limit): Exponential backoff with Retry-After header support
- 422 (Data Not Available): Gracefully skips and continues
- Timeouts: Automatic retry with configurable attempts
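The 429 and 422 paths above can be sketched as a small retry loop. The helper name and response shape here are illustrative, not the project's actual `api_client.py` code:

```python
import time

def request_with_retry(do_request, retry_attempts=5, base_delay=2):
    """do_request() returns a response object with .status_code and .headers."""
    for attempt in range(retry_attempts):
        resp = do_request()
        if resp.status_code == 429:
            # Respect Retry-After when present, else exponential backoff
            delay = float(resp.headers.get("Retry-After", base_delay * 2 ** attempt))
            time.sleep(delay)
            continue
        if resp.status_code == 422:
            return None  # data not available for this session: skip and move on
        return resp
    raise RuntimeError("Rate limited after all retry attempts")
```

Returning `None` on 422 lets the caller skip that endpoint/session and continue, matching the "gracefully skips" behaviour described above.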
- Bronze: Raw data from API
- Silver: Cleaned and validated data
- Gold: Aggregated analytics tables
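To illustrate the bronze-to-silver step, here is a pure-Python sketch of the kind of type casting and quality filtering applied to a lap record. Field names follow OpenF1's laps endpoint; the actual logic in `dlt/f1_bronze_to_silver.py` is PySpark and differs in form:

```python
def clean_lap_record(raw):
    """Cast raw string/JSON fields to proper types; drop records that
    fail basic quality checks (missing fields, non-positive lap times)."""
    try:
        lap = {
            "driver_number": int(raw["driver_number"]),
            "lap_number": int(raw["lap_number"]),
            "lap_duration": float(raw["lap_duration"]),
        }
    except (KeyError, TypeError, ValueError):
        return None  # fails expectations: excluded from silver
    if lap["lap_duration"] <= 0:
        return None
    return lap
```

In the pipeline itself, the same intent is expressed declaratively with data quality expectations rather than per-record Python.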
All data from OpenF1 API:
- Meetings: Race weekends
- Sessions: Practice, Qualifying, Race, Sprint
- Drivers: Driver information per session
- Laps: Lap timing and sector times
- Car Data: Telemetry (speed, RPM, throttle, brake, gear)
- Position: Driver positions throughout session
- Pit Stops: Pit stop timing
- Stints: Tyre strategies
- Weather: Track conditions
- Race Control: Flags and messages
- Team Radio: Radio communications
- Intervals: Time gaps between drivers
- Overtakes: Overtake events
- Session Results: Final results
- Starting Grid: Starting positions
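Each of these maps onto an OpenF1 REST endpoint that accepts query-string filters. A small sketch of URL construction against the public API (illustrative, not the project's `utils/api_client.py`):

```python
from urllib.parse import urlencode

OPENF1_BASE = "https://api.openf1.org/v1"

def build_url(endpoint, **filters):
    """Compose an OpenF1 query URL, e.g. all laps for one session."""
    qs = urlencode(filters)
    return f"{OPENF1_BASE}/{endpoint}" + (f"?{qs}" if qs else "")
```

For example, `build_url("laps", session_key=9158, driver_number=1)` targets one driver's laps in one session; fetching the result is then a plain HTTP GET returning JSON.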
Edit config/pipeline_config.yaml:
# Unity Catalog
unity_catalog:
  catalog: "jai_patel_f1_data"
  schema: "racing_stats"

# Data Configuration
data:
  target_year: 2025          # Year to fetch
  batch_size: 1000           # Records per batch

  # Car data filters (reduce payload size)
  car_data_filters:
    speed_gte: 200           # Minimum speed (km/h)
    sample_drivers: true     # Only first 5 drivers per session

# API Configuration
api:
  rate_limit_delay: 2        # Seconds between calls
  retry_attempts: 5          # Number of retries
  parallel_endpoints: true   # Enable parallel fetching
  max_workers: 3             # Parallel threads

- QUICK_START.md - 5-minute getting started guide
- DLT_AUTOLOADER_GUIDE.md - Complete Lakeflow Autoloader documentation
- HOW_TO_RUN.md - Detailed deployment instructions
- DATABRICKS_NOTEBOOK_SETUP.md - Notebook best practices
- API_DATA_AVAILABILITY.md - API data availability info
- PARALLEL_PROCESSING_GUIDE.md - Parallel API calls guide
- RATE_LIMIT_HANDLING.md - Rate limit handling details
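The `pipeline_config.yaml` shown above is typically loaded with `yaml.safe_load` into a nested dict. A dotted-key accessor like the following keeps call sites tidy (illustrative; `config/settings.py` may expose a different interface):

```python
def get_setting(config, dotted_key, default=None):
    """Walk a nested config dict using a dotted key,
    e.g. get_setting(cfg, "api.max_workers", 3)."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node
```

Missing keys fall back to the supplied default instead of raising, so optional settings (like the car data filters) can be omitted from the YAML.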
Error: ModuleNotFoundError: No module named 'config'
Solution: Update the workspace path in the notebook (line ~36):
sys.path.append('/Workspace/Users/YOUR_EMAIL@databricks.com/Formula1_Databricks')

Error: "Fatal error: The Python kernel is unresponsive"
Cause: Imports before dbutils.library.restartPython()
Solution: See DATABRICKS_NOTEBOOK_SETUP.md
Error: "No sessions found"
Cause: Wrong parameter order in get_sessions()
Solution: Already fixed in latest version. Pull from GitHub.
Solution: Disable auto-format in your IDE. See DISABLE_AUTO_FORMAT.md
┌────────────────────────────────────────────────────────────┐
│ 1. API Ingestion (Manual)                                  │
│    Run: notebooks/01_ingest_f1_data.py                     │
│    Output: JSON files in volumes                           │
└─────────────────────────────┬──────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ 2. Lakeflow Pipeline (Automatic)                           │
│    Autoloader → Bronze → Silver → Gold                     │
│    Triggered: Workflows → Lakeflow Pipelines               │
└─────────────────────────────┬──────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ 3. Analytics & Dashboards                                  │
│    Query Gold tables                                       │
│    Build dashboards in Databricks                          │
└────────────────────────────────────────────────────────────┘
- Ingestion: ~20-30 minutes for full 2025 season
- Lakeflow Pipeline: ~10-15 minutes for Bronze → Silver → Gold
- Parallel API calls: 50-60% faster than sequential
- Memory usage: < 2GB (incremental writing)
- Fork the repository
- Create a feature branch
- Make your changes
- Test in Databricks
- Submit a pull request
This project is open source and available under the MIT License.
- OpenF1 API for providing F1 data
- Databricks for the amazing platform
- F1 community for inspiration
For questions or issues:
- Create an issue on GitHub
- Check the documentation files
- Review troubleshooting guides
Happy Racing! 🏎️💨