This project builds an end-to-end data platform for air quality data using open smart city datasets.
It covers the full pipeline:
- Data ingestion from CrateDB (FIWARE-based open data APIs)
- Standardization into a unified schema
- Data quality-aware transformation
- Analytical tables for cross-city comparison
The goal is not just to collect data, but to evaluate whether open air quality data is fit for analysis and comparison across cities.
We selected:
- APBA
- Torre Pacheco
These two cities exhibit different data characteristics:
- Different sensor coverage density
- Different pollutant availability
- Different temporal continuity
This allows us to test:
- Data completeness
- Cross-city comparability
- Stability of ingestion pipelines
The selected time range:
2023-07 to 2025-09
Reasons:
- Avoid very early historical data with inconsistent schema
- Focus on recent data with stable ingestion patterns
- Ensure sufficient volume for time-series analysis
- Align both cities on overlapping data availability windows
Data is ingested from CrateDB using monthly extraction into two layers:
- Raw: data/raw/crate/{city}/{pollutant}/{YYYY-MM}.json
- Bronze: data/bronze/crate/{city}/{pollutant}/{YYYY-MM}.csv
Characteristics:
- Partitioned by city / pollutant / month
- Raw API response preserved
- Minimal transformation
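The partition layout above can be sketched with small helpers. This is an illustrative sketch, not the project's actual code; the function names (`raw_path`, `bronze_path`, `months`) are hypothetical:

```python
def raw_path(city: str, pollutant: str, month: str) -> str:
    """Raw API response, preserved as-is (hypothetical helper)."""
    return f"data/raw/crate/{city}/{pollutant}/{month}.json"

def bronze_path(city: str, pollutant: str, month: str) -> str:
    """Minimally transformed CSV partition (hypothetical helper)."""
    return f"data/bronze/crate/{city}/{pollutant}/{month}.csv"

def months(start: str, end: str):
    """Yield 'YYYY-MM' strings from start to end inclusive (monthly extraction)."""
    y, m = map(int, start.split("-"))
    ey, em = map(int, end.split("-"))
    while (y, m) <= (ey, em):
        yield f"{y:04d}-{m:02d}"
        m += 1
        if m > 12:
            y, m = y + 1, 1
```

Partitioning by city / pollutant / month keeps each extraction unit small and makes failed months re-runnable in isolation.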
All pollutants are normalized into a unified schema:
data/silver/air_quality_long/
Schema:
| column | description |
|---|---|
| time_index | epoch timestamp (ms) |
| observed_at | timestamp |
| city | city name |
| pollutant | pollutant type (co, no2, o3...) |
| value | measurement value |
Transformations:
- Pivot wide → long
- Remove null pollutant values
- Normalize timestamps
- Union across cities
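The Silver transformations above can be sketched in plain Python (the project itself uses Spark). The wide input column names (`co`, `no2`, `o3`) and the function name `to_long` are illustrative assumptions:

```python
def to_long(rows, city, pollutants=("co", "no2", "o3")):
    """Pivot wide -> long and drop null pollutant values (plain-Python sketch)."""
    out = []
    for row in rows:
        for pollutant in pollutants:
            value = row.get(pollutant)
            if value is None:  # remove null pollutant values
                continue
            out.append({
                "time_index": row["time_index"],    # epoch timestamp (ms)
                "observed_at": row["observed_at"],
                "city": city,
                "pollutant": pollutant,
                "value": float(value),
            })
    return out
```

The union across cities is then just concatenation of the per-city long outputs.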
Three Gold tables support analysis:

data/gold/city_hourly_air_quality/
- Aggregated by hour
- Average pollutant values
- Per city + pollutant
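The hourly aggregation can be illustrated without Spark by truncating the epoch-ms timestamp to the hour and averaging per (city, pollutant, hour); this is a sketch, not the project's implementation:

```python
from collections import defaultdict

HOUR_MS = 3_600_000

def hourly_average(long_rows):
    """Average pollutant values per (city, pollutant, hour) — plain-Python sketch."""
    sums = defaultdict(lambda: [0.0, 0])
    for r in long_rows:
        hour = (r["time_index"] // HOUR_MS) * HOUR_MS  # truncate epoch ms to the hour
        key = (r["city"], r["pollutant"], hour)
        sums[key][0] += r["value"]
        sums[key][1] += 1
    return {k: total / n for k, (total, n) in sums.items()}
```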
data/gold/city_hourly_comparison/
- Align cities on same timestamp
- Enable direct comparison
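One way to align the two cities (an inner join on pollutant and hour, so only hours both cities observed survive) can be sketched as follows; the input shape — a dict mapping (city, pollutant, hour_epoch_ms) to a mean value — and the function name are assumptions for illustration:

```python
def align_cities(hourly, city_a, city_b):
    """Keep only (pollutant, hour) pairs present for both cities (sketch)."""
    aligned = []
    for (city, pollutant, hour), value_a in hourly.items():
        if city != city_a:
            continue
        value_b = hourly.get((city_b, pollutant, hour))
        if value_b is None:  # hour missing for the other city: drop it
            continue
        aligned.append((hour, pollutant, value_a, value_b))
    return sorted(aligned)
```

An inner join is the conservative choice here: it guarantees every comparison row has real data on both sides, at the cost of discarding hours only one city reports.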
data/gold/city_hourly_coverage/
- Tracks data availability
- Identifies missing data patterns
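A simple coverage metric — the share of expected hours in a window that actually have data for a city/pollutant — can be sketched like this (illustrative only; names are assumptions):

```python
HOUR_MS = 3_600_000

def coverage_rate(observed_hours, window_start_ms, window_end_ms):
    """observed_hours: set of hour-truncated epoch-ms timestamps with data (sketch)."""
    expected = range(window_start_ms, window_end_ms, HOUR_MS)
    total = len(expected)
    hit = sum(1 for h in expected if h in observed_hours)
    return hit / total if total else 0.0
```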
Key observations:
- Pollutant availability differs significantly across cities
- Some city-pollutant combinations have zero data in certain periods
- Data density varies heavily by time
- Multiple records may exist for the same timestamp
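Duplicate records for the same timestamp have to be resolved before comparison. One option (not necessarily the project's choice) is to average repeated (city, pollutant, time_index) records; a minimal sketch:

```python
from collections import defaultdict

def dedupe_by_mean(long_rows):
    """Collapse duplicate (city, pollutant, time_index) records by averaging (sketch)."""
    groups = defaultdict(list)
    for r in long_rows:
        groups[(r["city"], r["pollutant"], r["time_index"])].append(r["value"])
    return [
        {"city": c, "pollutant": p, "time_index": t, "value": sum(vs) / len(vs)}
        for (c, p, t), vs in groups.items()
    ]
```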
This project demonstrates how to:
- Build a reproducible ingestion pipeline from open APIs
- Normalize heterogeneous environmental data
- Design scalable data layers (Bronze / Silver / Gold)
- Evaluate real-world data usability beyond basic validity checks
It highlights the gap between "data exists" and "data is usable for analysis".
Tech stack:
- Bash (data ingestion)
- Spark (data transformation)
- Parquet (storage format)
- CrateDB (data source)
Future work:
- Add data quality metrics (null rate, coverage rate)
- Extend to more cities
- Build dashboard / visualization layer