This repo is a Code the Dream-friendly scaffold for a batch ETL data pipeline that:
- Geocodes global cities → lat/lon
- Pulls OpenWeather Air Pollution (Historical) data for the last 72 hours
- Transforms raw JSON into a tidy gold dataset
- Loads to Parquet (and optionally Postgres)
- Serves a simple Streamlit dashboard that reads the gold dataset
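The transform step (raw JSON → tidy gold rows) can be sketched roughly as below. The function name and output row schema are illustrative, not the repo's actual API; the input shape follows OpenWeather's documented Air Pollution response (`{"coord": ..., "list": [...]}`).

```python
from datetime import datetime, timezone

def flatten_air_pollution(city: str, raw: dict) -> list[dict]:
    """Flatten one OpenWeather Air Pollution history response into tidy rows.

    `raw` follows the documented shape:
    {"coord": {"lat": ..., "lon": ...},
     "list": [{"dt": ..., "main": {"aqi": ...}, "components": {...}}]}
    """
    rows = []
    for entry in raw.get("list", []):
        row = {
            "city": city,
            "lat": raw["coord"]["lat"],
            "lon": raw["coord"]["lon"],
            # "dt" is Unix time in UTC
            "observed_at": datetime.fromtimestamp(entry["dt"], tz=timezone.utc),
            "aqi": entry["main"]["aqi"],
        }
        # pollutant concentrations: co, no, no2, o3, so2, pm2_5, pm10, nh3
        row.update(entry.get("components", {}))
        rows.append(row)
    return rows
```

One row per hourly observation keeps the gold dataset easy to filter and plot in the dashboard.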
- Create env file:

  ```shell
  cp .env.example .env
  # Set OPENWEATHER_API_KEY in .env
  ```

- Run everything:

  ```shell
  docker compose up --build
  ```

- Dashboard: http://localhost:8501
- Gold output: `./data/gold/air_pollution_gold.parquet`
- Raw cache: `./data/raw/openweather/...`
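Under the hood, the extract step requests the OpenWeather Air Pollution history endpoint with a Unix-time window. A minimal sketch of building that request URL for the last N hours (the helper name is illustrative; the endpoint and `lat`/`lon`/`start`/`end`/`appid` parameters follow OpenWeather's public docs):

```python
import time
from urllib.parse import urlencode

def build_history_url(lat: float, lon: float, api_key: str,
                      history_hours: int = 72) -> str:
    """Build an OpenWeather Air Pollution history URL for the last N hours."""
    end = int(time.time())                 # now, in Unix seconds
    start = end - history_hours * 3600     # N hours back
    params = urlencode({
        "lat": lat,
        "lon": lon,
        "start": start,
        "end": end,
        "appid": api_key,
    })
    return f"http://api.openweathermap.org/data/2.5/air_pollution/history?{params}"
```

Caching the raw response under `./data/raw/openweather/` (as the layout above suggests) avoids re-fetching the same window on pipeline reruns.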
Local setup (without Docker):

```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Set OPENWEATHER_API_KEY and local paths (see .env.example)
python services/pipeline/run_pipeline.py --source openweather --history-hours 72
streamlit run services/dashboard/app/Home.py
```

Edit `configs/cities.csv`:
```csv
city,country_code,state
Toronto,CA,
Paris,FR,
Lagos,NG,
Sydney,AU,NSW
```

For global cities, `country_code` is required to disambiguate.
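The rows above feed the geocoding step. A minimal sketch of turning `cities.csv` rows into OpenWeather Geocoding API `q` strings (the function name is hypothetical; the API expects `city,state,country`, with an empty state omitted):

```python
import csv
import io

def geocode_queries(cities_csv_text: str) -> list[str]:
    """Turn configs/cities.csv content into Geocoding API `q` strings."""
    queries = []
    for row in csv.DictReader(io.StringIO(cities_csv_text)):
        parts = [row["city"]]
        if row.get("state"):          # state/province is optional
            parts.append(row["state"])
        parts.append(row["country_code"])
        queries.append(",".join(parts))
    return queries

# With the sample file above this yields, e.g.,
# "Toronto,CA" and "Sydney,NSW,AU".
```

Note that `state` sits between city and country in the query even though it is the last CSV column.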
- `docs/architecture.md`: component and sequence diagrams (PlantUML)
- `docs/data_flow_diagram.md`: data flow diagram (PlantUML)