Forecasting Air Quality: Hourly NO₂ Concentration Prediction

Goal: Predict the NO₂ concentration one hour ahead from meteorological features and other pollutant measurements.

Dataset

This notebook uses the UCI Air Quality Dataset with the following features:
["Date", "Time", "CO(GT)", "PT08.S1(CO)", "NMHC(GT)", "C6H6(GT)", "PT08.S2(NMHC)", "NOx(GT)", "PT08.S3(NOx)", "NO2(GT)", "PT08.S4(NO2)", "PT08.S5(O3)", "T", "RH", "AH"]
Timeframe: March 2004 to February 2005 (Torino, Italy).
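
A minimal loading sketch, not taken from the notebook. It assumes the semicolon-delimited `AirQualityUCI.csv` distribution of the dataset, which writes decimals with a comma, times as `HH.MM.SS`, and missing readings as `-200`. The function name `load_rows` and the inline sample rows are illustrative only.

```python
import csv
import io
from datetime import datetime

MISSING = -200.0  # sentinel assumed to mark missing readings in the CSV


def load_rows(handle):
    """Parse semicolon-delimited rows into dicts: a timestamp plus float readings."""
    rows = []
    for raw in csv.DictReader(handle, delimiter=";"):
        if not raw.get("Date"):  # skip blank trailer rows
            continue
        record = {
            "timestamp": datetime.strptime(
                f"{raw['Date']} {raw['Time']}", "%d/%m/%Y %H.%M.%S"
            )
        }
        for name, text in raw.items():
            if not name or name in ("Date", "Time"):
                continue  # unnamed trailing columns and the two time columns
            if not text:
                record[name] = None
                continue
            value = float(text.replace(",", "."))  # decimal comma -> decimal point
            record[name] = None if value == MISSING else value
        rows.append(record)
    return rows


# Tiny inline sample in the assumed file format (three of the columns only).
sample = io.StringIO(
    "Date;Time;CO(GT);NO2(GT);T\n"
    "10/03/2004;18.00.00;2,6;113;13,6\n"
    "10/03/2004;19.00.00;-200;92;13,3\n"
)
rows = load_rows(sample)
print(rows[1]["CO(GT)"], rows[0]["T"])
```

Missing readings come back as `None`, so a later cleaning step can decide whether to drop or interpolate them.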

Problem/Motive

Air pollution has direct implications for public health and city planning.
Why NO₂? NO₂ is a strong indicator of traffic-related pollution and has respiratory health impacts.
Motivation: Explore the data with machine learning models and predict short-term pollutant levels from the other features and from lagged features. Such predictions could support an early warning system, city dashboards, and further analysis.

Roadmap & Methodology

This notebook follows this roadmap:

Load & Clean Data
Basic Exploratory Data Analysis (EDA)
Time Series Analysis
Feature Engineering & Feature Selection
Modeling, Evaluation & Feature Importance
Conclusion & Future Work
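
The feature engineering step can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: the series is hourly and gap-free, the target is the value one hour ahead, and the function name `make_supervised`, the lag choices, and the window size are all hypothetical. It uses plain Python so it runs without dependencies; with pandas, `shift` and `rolling` would be the usual equivalents.

```python
from statistics import mean


def make_supervised(series, lags=(1, 2, 24), window=3):
    """Build (features, target) pairs where the target is the next hour's value.

    At time t only values up to and including t are used, so no information
    from the target hour leaks into the features.
    """
    start = max(max(lags) - 1, window - 1)  # first t with a full history
    samples = []
    for t in range(start, len(series) - 1):
        # lag_k is the value k hours before the target hour t + 1
        features = {f"lag_{k}": series[t - k + 1] for k in lags}
        # rolling mean over the last `window` observed hours
        features[f"roll_mean_{window}"] = mean(series[t - window + 1 : t + 1])
        samples.append((features, series[t + 1]))
    return samples


no2 = [10, 12, 15, 11, 9, 14]  # made-up hourly readings
samples = make_supervised(no2, lags=(1, 2), window=3)
for features, target in samples:
    print(features, "->", target)
```

Rows at the start of the series are dropped because they lack a full lag history, which is the usual cost of lagged features.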

Techniques applied:

Residual analysis
Cross-validation & hyperparameter tuning
Model interpretability (feature importance, reduced models)
Error metrics (MAE, RMSE, R², Relative error rates)
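
The error metrics listed above can be computed as in the sketch below. The README does not define the relative error rates, so normalising by the range of the true values is an assumption here (the reported ratios, e.g. 15.58 / 4.61%, point to a normaliser of about 338). The function name `regression_metrics` and the sample values are illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R² and range-normalised relative errors (in %)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    span = max(y_true) - min(y_true)  # ASSUMPTION: normalise by target range
    return {
        "mae": mae,
        "rmse": rmse,
        "r2": r2,
        "rel_mae_pct": 100.0 * mae / span,
        "rel_rmse_pct": 100.0 * rmse / span,
    }


# Made-up values, only to exercise the function.
metrics = regression_metrics([100, 150, 200, 250], [110, 140, 210, 240])
print(metrics)
```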

Results

Model	MAE	RMSE	R² (%)	Relative MAE (%)	Relative RMSE (%)
Linear Regression	33.73	44.53	24.92	9.98	13.17
Support Vector Regression	34.13	44.99	23.37	10.10	13.31
Random Forest	15.76	22.46	80.91	4.66	6.64
Random Forest (Reduced)	15.70	22.26	81.24	4.64	6.59
LightGBM	15.58	21.93	81.80	4.61	6.49

Based on the results, the tree-based models (with time-aware features) show the best predictive power. Compared with Linear Regression, LightGBM cuts MAE by about 54% (15.58 vs. 33.73) and RMSE by about 51% (21.93 vs. 44.53), with a similar margin over SVR. Permutation-importance filtering, grid-search tuning, and switching to a gradient boosting machine further improved the evaluation metrics. With a relative MAE of ~4.6% and a relative RMSE of ~6.5%, the best model is suitable for use cases such as an early warning system, a public information dashboard, and city planning or analysis.
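
For readers unfamiliar with permutation importance, the idea behind the feature filtering is sketched below: shuffle one feature column, re-score the model, and treat the increase in error as that feature's importance. This is a dependency-free illustration with a toy model, not the notebook's procedure; the function names, the toy data, and the choice of MAE as the score are assumptions. scikit-learn offers a ready-made version as `sklearn.inspection.permutation_importance`.

```python
import random


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MAE when a single column of X (a list of lists) is shuffled."""
    rng = random.Random(seed)
    baseline = mae(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(mae(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy setup: the "model" uses only column 0, so column 1 should score zero.
X = [[i, (i * 7) % 5] for i in range(20)]
y = [2 * i for i in range(20)]
importances = permutation_importance(lambda row: 2 * row[0], X, y)
print(importances)
```

Features whose importance stays near zero are candidates for removal, which is the idea behind a reduced model.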

How to Run

# Clone repo
git clone https://github.com/JacobL04/air-quality-prediction.git
cd air-quality-prediction

# Run notebook
jupyter notebook notebooks/air_quality_analysis.ipynb

Future Improvements

To conclude, we built and benchmarked multiple machine learning models to forecast hourly NO₂ concentrations. Through feature engineering (lags, rolling statistics) and feature reduction (permutation importance), we showed that tree-based models, particularly LightGBM, achieve strong predictive performance (R² = 0.82, MAE = 15.58, RMSE = 21.93) and significantly outperform the linear model.

Some future improvements to consider for this project:

Integration with real world APIs: Connect to live air quality and weather data sources (e.g. NASA EarthData or city-level pollution APIs) to validate model performance beyond the UCI dataset.
Automated EDA & feature engineering: Develop a more streamlined pipeline that automatically performs exploratory analysis and generates lagged/rolling features, making the approach scalable to different pollutants and datasets.
Expanded feature set: Experiment with new variables, such as wind speed/direction, interaction terms (e.g. wind × NO₂), or cross-pollutant relationships, to better capture the complex dynamics of air quality.

Acknowledgments and Resources

Vito, S. (2008). Air Quality [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C59K5F.
https://www.youtube.com/@statquest
https://www.kaggle.com/code/ryanholbrook/time-series-as-features
https://scikit-learn.org/stable/auto_examples/applications/plot_time_series_lagged_features.html
https://www.geeksforgeeks.org/machine-learning/what-is-lag-in-time-series-forecasting/
