JacobL04/air-quality-prediction


Forecasting Air Quality: Hourly NO₂ Concentration Prediction

Goal: Predict NO₂ concentration one hour ahead from meteorological features and pollutant measurements.

Dataset

This notebook uses the UCI Air Quality Dataset with the following features:
["Date", "Time", "CO(GT)", "PT08.S1(CO)", "NMHC(GT)", "C6H6(GT)", "PT08.S2(NMHC)", "NOx(GT)", "PT08.S3(NOx)", "NO2(GT)", "PT08.S4(NO2)", "PT08.S5(O3)", "T", "RH", "AH"]
Timeframe: March 2004 to February 2005 (Torino, Italy).
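A minimal loading-and-cleaning sketch for these columns follows. It rests on assumptions about the raw file that this README does not state: that missing readings are encoded as -200, that `Date` is day-first (e.g. `10/03/2004`), and that `Time` uses dots (e.g. `18.00.00`). The function name `clean_air_quality` is hypothetical, not taken from the notebook.

```python
import numpy as np
import pandas as pd

MISSING_SENTINEL = -200  # assumption: the raw file marks missing readings with -200


def clean_air_quality(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy indexed by timestamp, with sentinel values set to NaN."""
    df = raw.copy()
    # Assumption: Date looks like "10/03/2004" (day first), Time like "18.00.00".
    timestamp = pd.to_datetime(
        df["Date"] + " " + df["Time"], format="%d/%m/%Y %H.%M.%S"
    )
    df = df.drop(columns=["Date", "Time"])
    df.index = pd.DatetimeIndex(timestamp, name="timestamp")
    df = df.sort_index()
    # Turn the sentinel into NaN so it cannot distort means, lags, or models.
    return df.replace(MISSING_SENTINEL, np.nan)
```

If the raw CSV is semicolon-separated with decimal commas (also an assumption), it could be read with `pd.read_csv(path, sep=";", decimal=",")` before being passed to this function.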

Problem/Motive

Air pollution has direct implications for public health and city planning.
Why NO2? NO2 is a strong indicator of traffic-related pollution and has respiratory health impacts.
Motivation: Explore the data with machine learning models and predict short-term pollutant levels from other features and lagged features, which could support an early warning system, city dashboards, and further analysis.

Roadmap & Methodology

The notebook follows this roadmap:

  1. Load & Clean Data
  2. Basic Exploratory Data Analysis (EDA)
  3. Time Series Analysis
  4. Feature Engineering & Feature Selection
  5. Modeling & Evaluation & Feature Importance
  6. Conclusion & Future Work
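The feature engineering step can be sketched as follows. This is an illustration only: the lag set, the rolling window, and the names `make_supervised` and `y_next` are assumptions, not the notebook's actual choices. It also assumes rows are consecutive hourly readings, since `shift` moves by rows rather than by time.

```python
import pandas as pd


def make_supervised(df, target="NO2(GT)", lags=(0, 1, 2), window=3):
    """Build lag/rolling features at hour t and the target at hour t + 1."""
    out = pd.DataFrame(index=df.index)
    for lag in lags:
        # lag 0 is the current reading, lag 1 the reading one hour earlier, etc.
        out[f"{target}_lag{lag}"] = df[target].shift(lag)
    # Rolling mean over the current hour and the previous (window - 1) hours.
    out[f"{target}_roll{window}"] = df[target].rolling(window).mean()
    out["hour"] = df.index.hour
    # Target: the value one row (one hour) ahead of the feature row.
    out["y_next"] = df[target].shift(-1)
    # Rows without a full history or without a next-hour value are dropped.
    return out.dropna()
```

Every feature uses only information available at hour t, so the one-hour-ahead target never leaks into the inputs.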

Techniques applied:

  • Residual analysis
  • Cross-validation & hyperparameter tuning
  • Model interpretability (feature importance, reduced models)
  • Error metrics (MAE, RMSE, R², Relative error rates)
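For time-ordered data, cross-validation folds are usually built so that each test fold comes after its training data. The README does not say which scheme the notebook uses, so the following is a dependency-free sketch of one common option, an expanding window; the function name is hypothetical. scikit-learn's `TimeSeriesSplit` offers a similar splitter.

```python
def expanding_window_splits(n_samples, n_splits=3):
    """Yield (train_indices, test_indices); each test fold follows its train fold."""
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = fold * k
        # The last fold absorbs any leftover samples at the end of the series.
        test_end = n_samples if k == n_splits else train_end + fold
        yield list(range(train_end)), list(range(train_end, test_end))
```

With 10 samples and 3 splits, the training window grows from 2 to 4 to 6 rows, and the model is always scored on rows that come later in time.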

Results

| Model | MAE | RMSE | R² (%) | Relative MAE (%) | Relative RMSE (%) |
|---|---|---|---|---|---|
| Linear Regression | 33.73 | 44.53 | 24.92 | 9.98 | 13.17 |
| Support Vector Regression | 34.13 | 44.99 | 23.37 | 10.10 | 13.31 |
| Random Forest | 15.76 | 22.46 | 80.91 | 4.66 | 6.64 |
| Random Forest (Reduced) | 15.70 | 22.26 | 81.24 | 4.64 | 6.59 |
| LightGBM | 15.58 | 21.93 | 81.80 | 4.61 | 6.49 |

Based on these results, the tree-based models (with time-aware features) show the best predictive power: compared with Linear Regression and SVR, they roughly halve both MAE and RMSE (for example, MAE drops from 33.73 to 15.58 with LightGBM). Permutation importance filtering, a tuned grid search, and a gradient boosting machine each improved the evaluation metrics slightly further. With a relative MAE of ~4.6% and a relative RMSE of ~6.5%, our best model is suitable for use cases such as an early warning system, a public information dashboard, and city planning or analysis.
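The metrics can be computed as in the sketch below. One point is an inference rather than something stated here: the reported relative errors are consistent with dividing each error by a constant of about 338 (for example, 15.58 / 0.0461 ≈ 338), which would fit normalization by the observed range of the target. The code assumes that definition; the function name and dictionary keys are hypothetical.

```python
import math


def regression_report(y_true, y_pred):
    """Return MAE, RMSE, R² and range-normalized relative errors (in percent)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # undefined if y_true is constant
    # Assumption: "relative" means the error divided by the target's range.
    target_range = max(y_true) - min(y_true)
    return {
        "mae": mae,
        "rmse": rmse,
        "r2": r2,
        "rel_mae_pct": 100.0 * mae / target_range,
        "rel_rmse_pct": 100.0 * rmse / target_range,
    }
```

If the notebook normalizes by the mean instead of the range, only the `target_range` line would change.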

How to Run

# Clone repo
git clone https://github.com/JacobL04/air-quality-prediction.git
cd air-quality-prediction

# Run notebook
jupyter notebook notebooks/air_quality_analysis.ipynb

Future Improvements

To conclude, we built and benchmarked multiple machine learning models to forecast hourly NO2 concentrations. Through feature engineering (lags, rolling statistics) and feature reduction (permutation importance), we showed that tree-based models, particularly LightGBM, achieve strong predictive performance (R² = 0.82, MAE = 15.58, RMSE = 21.93) and significantly outperform the linear model.

Some future improvements to consider for this project:
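Permutation importance, used above for feature reduction, measures how much a model's error grows when one feature column is shuffled. The sketch below is a generic NumPy version on synthetic data with a least-squares stand-in model. It is not the notebook's code or its LightGBM model, and the function name is hypothetical. scikit-learn ships an equivalent utility as `sklearn.inspection.permutation_importance`.

```python
import numpy as np


def permutation_importance_mae(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MAE when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(np.abs(y - predict(X)))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean(np.abs(y - predict(X_perm))) - baseline)
        importances[j] = np.mean(increases)
    return importances


# Stand-in model: ordinary least squares on synthetic data (illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
scores = permutation_importance_mae(lambda A: A @ coef, X, y)
```

Features whose score stays near zero are candidates for removal, which is the idea behind a reduced model. In practice the scores would normally be computed on held-out data rather than on the training set.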

  1. Integration with real world APIs: Connect to live air quality and weather data sources (e.g. NASA EarthData or city-level pollution APIs) to validate model performance beyond the UCI dataset.
  2. Automated EDA & feature engineering: Develop a more streamlined pipeline that automatically performs exploratory analysis and generates lagged/rolling features, making the approach scalable to different pollutants and datasets.
  3. Expanded feature set: Experiment with new variables, such as wind speed/direction, interaction terms (e.g. wind × NO2), or cross-pollutant relationships, to better capture the complex dynamics of air quality.

Acknowledgments and Resources

