Goal: Predict NO₂ concentration one hour ahead from meteorological features and pollutant data.
This notebook uses the UCI Air Quality Dataset with the following features:
["Date", "Time", "CO(GT)", "PT08.S1(CO)", "NMHC(GT)", "C6H6(GT)", "PT08.S2(NMHC)", "NOx(GT)", "PT08.S3(NOx)", "NO2(GT)", "PT08.S4(NO2)", "PT08.S5(O3)", "T", "RH", "AH"]
Timeframe: March 2004 to February 2005 (Torino, Italy).
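A minimal loading sketch, assuming the raw file follows the UCI CSV layout (semicolon separator, comma decimal mark, `DD/MM/YYYY` dates, `HH.MM.SS` times, and `-200` as the missing-value tag). The helper name `load_air_quality` and the two-row sample are illustrative and not taken from the notebook.

```python
import io

import numpy as np
import pandas as pd

# Two-row sample in the assumed layout of the UCI CSV; only a few columns shown.
SAMPLE = (
    "Date;Time;CO(GT);NO2(GT);T;RH\n"
    "10/03/2004;18.00.00;2,6;113;13,6;48,9\n"
    "10/03/2004;19.00.00;2;-200;13,3;47,7\n"
)


def load_air_quality(source):
    """Read the CSV, build a datetime index, and mark -200 as missing."""
    df = pd.read_csv(source, sep=";", decimal=",")
    # Drop fully empty rows/columns that trailing separators can produce.
    df = df.dropna(axis=1, how="all").dropna(axis=0, how="all")
    df["Datetime"] = pd.to_datetime(
        df["Date"] + " " + df["Time"], format="%d/%m/%Y %H.%M.%S"
    )
    df = df.drop(columns=["Date", "Time"]).set_index("Datetime").sort_index()
    # Assumption: -200 is the dataset's placeholder for a missing reading.
    return df.replace(-200, np.nan)


df = load_air_quality(io.StringIO(SAMPLE))
print(df)
```

Passing a file path instead of the `io.StringIO` buffer should work the same way, since `pd.read_csv` accepts both.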
Air pollution has direct implications for public health and city planning.
Why NO₂?: NO₂ is a strong indicator of traffic-related pollution and has respiratory health impacts.
Motivation: Explore the data with machine learning models and predict short-term pollutant levels from the other features and their lagged values. Such forecasts could support an early warning system, city dashboards, and further analysis.
This notebook follows this roadmap:
- Load & Clean Data
- Basic Exploratory Data Analysis (EDA)
- Time Series Analysis
- Feature Engineering & Feature Selection
- Modeling & Evaluation & Feature Importance
- Conclusion & Future Work
- Residual analysis
- Cross-validation & hyperparameter tuning
- Model interpretability (feature importance, reduced models)
- Error metrics (MAE, RMSE, R², Relative error rates)
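The feature engineering and evaluation steps above can be sketched as follows. This is an illustration under assumptions, not the notebook's code: the function names, the lag set `(1, 2, 3)`, the 3-hour rolling window, and the 80/20 split are all hypothetical choices, and the rows are assumed to be contiguous hourly readings so that shifting by one row means shifting by one hour.

```python
import numpy as np
import pandas as pd


def make_supervised(df, target="NO2(GT)", lags=(1, 2, 3), window=3):
    """Build lag/rolling features and a one-hour-ahead target."""
    out = pd.DataFrame(index=df.index)
    out[target] = df[target]  # value known at prediction time
    for lag in lags:
        # Value observed `lag` rows (hours) before the current row.
        out[f"{target}_lag{lag}"] = df[target].shift(lag)
    # Rolling mean over the current and previous (window - 1) rows only.
    out[f"{target}_roll{window}"] = df[target].rolling(window).mean()
    out["hour"] = df.index.hour
    # Target: the value one row (one hour) after the current row.
    out["target"] = df[target].shift(-1)
    # Rows without a full history or without a next-hour value are dropped.
    return out.dropna()


def chronological_split(frame, test_fraction=0.2):
    """Split without shuffling so the test set lies strictly in the future."""
    cut = int(len(frame) * (1 - test_fraction))
    return frame.iloc[:cut], frame.iloc[cut:]


# Toy hourly series standing in for the cleaned dataset.
index = pd.date_range("2004-03-10 00:00", periods=10, freq="h")
toy = pd.DataFrame({"NO2(GT)": np.arange(10, dtype=float)}, index=index)

frame = make_supervised(toy)
train, test = chronological_split(frame)
print(frame.head())
```

Keeping the split chronological matters here because a random shuffle would let lagged values from the test period leak into training.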
| Model | MAE | RMSE | R² (%) | Relative MAE (%) | Relative RMSE (%) |
|---|---|---|---|---|---|
| Linear Regression | 33.73 | 44.53 | 24.92 | 9.98 | 13.17 |
| Support Vector Regression | 34.13 | 44.99 | 23.37 | 10.10 | 13.31 |
| Random Forest | 15.76 | 22.46 | 80.91 | 4.66 | 6.64 |
| Random Forest (Reduced) | 15.70 | 22.26 | 81.24 | 4.64 | 6.59 |
| LightGBM | 15.58 | 21.93 | 81.80 | 4.61 | 6.49 |
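For reference, the metrics in the table can be computed roughly as below. The normalization used for the relative errors is an assumption: dividing each MAE by its relative MAE in the table gives a divisor of about 338 in every row, which is consistent with dividing by the range of the target, but the source does not state this. The function name `regression_report` is illustrative.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return MAE, RMSE, R² and range-normalized relative errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    # ASSUMPTION: relative errors are normalized by the observed target range.
    span = y_true.max() - y_true.min()
    return {
        "MAE": mae,
        "RMSE": rmse,
        "R2 (%)": 100 * r2,
        "Relative MAE (%)": 100 * mae / span,
        "Relative RMSE (%)": 100 * rmse / span,
    }


report = regression_report([0, 10, 20, 30], [2, 8, 22, 28])
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```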
Based on the results, the tree-based models (with time-aware features) show the best predictive power: compared with the Linear Regression and SVR models, they cut MAE and RMSE by roughly half and raise R² by about 57 percentage points. With some permutation-importance filtering, grid-search tuning, and a gradient boosting machine, we were able to improve the evaluation metrics further. With a relative MAE of ~4.6% and a relative RMSE of ~6.5%, our best model is well suited to use cases such as an early warning system, a public information dashboard, and city planning/analysis.
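The permutation-importance filtering mentioned above might look like the following sketch. It uses scikit-learn's `permutation_importance` with a Random Forest on synthetic data; the feature names, the `0.01` threshold, and the 300/100 split are hypothetical, and the notebook's actual selection rule may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))
# Synthetic target: depends on columns 0 and 1; column 2 is pure noise.
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)
names = ["signal_strong", "signal_weak", "noise"]

# Unshuffled split: fit on the first 300 rows, score on the remaining 100.
X_train, X_val = X[:300], X[300:]
y_train, y_val = y[:300], y[300:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Importance = drop in validation score when one column is shuffled.
result = permutation_importance(
    model, X_val, y_val, n_repeats=10, random_state=0
)

# Keep features whose mean importance exceeds a (hypothetical) threshold.
threshold = 0.01
kept = [
    name
    for name, importance in zip(names, result.importances_mean)
    if importance > threshold
]
print(dict(zip(names, result.importances_mean.round(3))))
print("kept:", kept)
```

Refitting the model on only the `kept` columns would give a reduced model comparable in spirit to the "Random Forest (Reduced)" row.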
# Clone repo
git clone https://github.com/JacobL04/air-quality-prediction.git
cd air-quality-prediction
# Run notebook
jupyter notebook notebooks/air_quality_analysis.ipynb

To conclude this project, we built and benchmarked multiple machine learning models to forecast hourly NO₂ concentrations. Through feature engineering (lags, rolling statistics) and feature reduction (permutation importance), we demonstrated that tree-based models, particularly LightGBM, achieve strong predictive performance (R² = 0.82, MAE = 15.58, RMSE = 21.93), significantly outperforming the linear model.
Some future improvements to consider for this project:
- Integration with real world APIs: Connect to live air quality and weather data sources (e.g. NASA EarthData or city-level pollution APIs) to validate model performance beyond the UCI dataset.
- Automated EDA & feature engineering: Develop a more streamlined pipeline that automatically performs exploratory analysis and generates lagged/rolling features, making the approach scalable to different pollutants and datasets.
- Expanded feature set: Experiment with new variables, such as wind speed/direction, feature interactions (e.g. wind × NO₂), or cross-pollutant relationships, to better capture the complex dynamics of air quality.
- Vito, S. (2008). Air Quality [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C59K5F.
- https://www.youtube.com/@statquest
- https://www.kaggle.com/code/ryanholbrook/time-series-as-features
- https://scikit-learn.org/stable/auto_examples/applications/plot_time_series_lagged_features.html
- https://www.geeksforgeeks.org/machine-learning/what-is-lag-in-time-series-forecasting/