Scrapes real-time retail prices to estimate inflation before official CPI releases. The idea is simple: if you track enough product prices daily, you can see inflation trends 2-3 weeks before the BLS publishes its numbers.
- Scrape prices from Amazon and Walmart (mock implementations — real scraping would need proxy rotation and careful rate limiting to avoid getting blocked)
- ETL pipeline cleans, validates, and stores the data as parquet files
- Nowcast model computes weighted price indices using CPI basket weights
- Forecast module runs ARIMA/SARIMA for short-term predictions
- Airflow DAG orchestrates the whole thing on a daily schedule
- Streamlit dashboard for visualization
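The nowcast step can be sketched as an elementary index per category: a Jevons (geometric-mean) index of price relatives, the standard choice at the item level. The helper name and dict-based interface here are illustrative, not the repo's actual API:

```python
from math import exp, log

def category_index(base_prices, current_prices):
    """Jevons elementary index: geometric mean of price relatives, scaled to 100.
    base_prices / current_prices are dicts keyed by product id (hypothetical shape).
    Only products present in both periods are used, so out-of-stock items drop out."""
    common = base_prices.keys() & current_prices.keys()
    if not common:
        raise ValueError("no overlapping products between periods")
    mean_log = sum(log(current_prices[p] / base_prices[p]) for p in common) / len(common)
    return 100.0 * exp(mean_log)

# e.g. milk up 4%, eggs up 5% -> category index ~104.5
idx = category_index({"milk": 3.50, "eggs": 2.00}, {"milk": 3.64, "eggs": 2.10})
```

The geometric mean is preferred over an arithmetic mean of relatives because it is symmetric in price increases and decreases and less sensitive to single large relatives.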
The backtesting framework is implemented but results depend heavily on data source quality. With the mock scrapers (hardcoded product catalogs), the pipeline runs end-to-end and produces forecasts, but the metrics aren't meaningful — you're basically forecasting data you generated yourself.
With real price feeds, the literature suggests web-scraped price indices can lead official CPI by 2-3 weeks with reasonable accuracy. The value here is the pipeline architecture, not the mock data results.
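The backtest itself is a walk-forward loop: fit on everything up to day t, predict day t, score, advance. A minimal sketch with a naive last-value baseline (function names are illustrative, not the repo's):

```python
def expanding_window_backtest(series, fit_forecast, min_train=24):
    """Walk forward through the series: for each t, call fit_forecast on the
    history series[:t], compare the one-step forecast to series[t], and
    return the mean absolute error over all evaluation points."""
    errors = []
    for t in range(min_train, len(series)):
        pred = fit_forecast(series[:t])
        errors.append(abs(pred - series[t]))
    return sum(errors) / len(errors)

# Naive "last value" forecaster as a baseline; any model that can't beat
# this on real data isn't adding value.
mae = expanding_window_backtest(list(range(30)), lambda h: h[-1], min_train=24)
```

With mock data the baseline and the model both score near zero, which is exactly why the metrics aren't meaningful until real feeds are plugged in.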
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Run the ETL pipeline
python -m src.pipeline.etl --output ./data/processed

# Compute nowcast
python -m src.models.nowcast --data-path ./data/processed

# Dashboard
streamlit run streamlit_app/app.py

# Airflow (copy DAG first)
cp dags/scraping_dag.py $AIRFLOW_HOME/dags/
```

Tracking 8 categories with BLS-approximate weights:
| Category | Weight |
|---|---|
| Grocery | 14.3% |
| Housing | 42.4% |
| Transportation | 16.0% |
| Medical | 8.5% |
| Education | 6.2% |
| Recreation | 5.4% |
| Apparel | 2.6% |
| Other | 4.6% |
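These weights plug directly into the headline aggregation. A sketch that also renormalizes over whichever categories are present, so a single failed scraper doesn't silently deflate the index (the constant and function names are illustrative):

```python
CPI_WEIGHTS = {  # BLS-approximate weights from the table above, as fractions
    "grocery": 0.143, "housing": 0.424, "transportation": 0.160,
    "medical": 0.085, "education": 0.062, "recreation": 0.054,
    "apparel": 0.026, "other": 0.046,
}

def headline_index(category_indices):
    """Weighted average of category indices. Weights are renormalized over
    the categories actually present, so missing data shifts weight to the
    remaining categories instead of dragging the index toward zero."""
    present = {c: w for c, w in CPI_WEIGHTS.items() if c in category_indices}
    if not present:
        raise ValueError("no recognized categories")
    total = sum(present.values())
    return sum(category_indices[c] * w / total for c, w in present.items())
```

Renormalization is a judgment call: an alternative is to carry forward the last observed index for a missing category rather than reweighting.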
- Scraping is legally tricky: Amazon and Walmart both restrict automated scraping in their ToS. I ended up using mock data for the portfolio version — in a real setting you'd need to negotiate data access or use an API
- Seasonal adjustment is harder than it looks: Tried X-13ARIMA-SEATS but it needs a lot of data to work well. Fell back to simpler decomposition for now
- Data quality is the real bottleneck: Spent more time on validation and outlier detection than on the actual models. Price data from scraping is noisy — products go out of stock, prices spike during sales, units change
- ARIMA order selection: Grid search over (p, d, q) is slow and the AIC-optimal model isn't always the best for forecasting. Would use `auto_arima` from `pmdarima` next time
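On the data-quality point: a rolling-median filter catches most sale spikes and unit-change jumps without assuming normality. A sketch with illustrative window and threshold values:

```python
from statistics import median

def flag_price_outliers(prices, window=7, threshold=0.5):
    """Flag observations that deviate from the median of the previous
    `window` prices by more than `threshold` as a fraction of that median.
    The first `window` points are never flagged (no history yet).
    Window and threshold here are illustrative, not tuned values."""
    flags = [False] * len(prices)
    for i in range(window, len(prices)):
        med = median(prices[i - window:i])
        if med > 0 and abs(prices[i] - med) / med > threshold:
            flags[i] = True
    return flags

# A week of stable prices, then a sale spike, then back to normal:
flags = flag_price_outliers([10.0] * 7 + [25.0, 10.0])
```

The median (rather than the mean) keeps a single earlier spike from contaminating the baseline for subsequent points.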
- Use a proper price aggregator API instead of scraping (BLS actually publishes microdata, just with a lag)
- Try a state-space model or dynamic factor model instead of just ARIMA
- The Laspeyres index calculation is simplified — should handle substitution bias and quality adjustment
- Great Expectations integration is mostly scaffolded, not fully wired up
- The Airflow DAG works but the alerting is just a print statement
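On the substitution-bias point above: a superlative index such as the Törnqvist averages expenditure shares across the two periods, so it partially captures consumers substituting away from items whose prices rose. A sketch, assuming per-item expenditure shares were available (which scraped data alone does not provide):

```python
from math import exp, log

def tornqvist_index(p0, p1, s0, s1):
    """Tornqvist index, scaled to 100: geometric mean of price relatives
    weighted by the average of base- and current-period expenditure shares.
    p0/p1 are prices and s0/s1 expenditure shares, dicts keyed by item;
    each period's shares should sum to 1. Interface is hypothetical."""
    return 100.0 * exp(sum(
        0.5 * (s0[i] + s1[i]) * log(p1[i] / p0[i]) for i in p0
    ))
```

With constant, equal shares this reduces to the Jevons geometric mean; the benefit appears only when shares actually shift between periods.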
This is a proof of concept. Don't use it for actual trading decisions. Always refer to official CPI releases from the BLS for real inflation data.