
This is an API ETL project that creates an SQL database for analyzing, visualizing, and manipulating data, following a logical, well-defined pipeline model.




📊 Air Quality Bristol API ETL & Visualization Project
A Python-based ETL pipeline that extracts air quality monitoring data from the official Bristol City Council API, processes and transforms the data, and loads the results into a PostgreSQL database using Apache Spark. The project also includes data normalization, geospatial coordinate handling, and visualization of monitoring locations across the city of Bristol.

📌 Project Overview
This project demonstrates a complete data engineering and analytics workflow:

Extract: Data is collected from a public online API in GeoJSON format (a sketch of the extract and transform steps follows this list).

Transform:

Geospatial data is processed using GeoPandas.

Coordinates are normalized using MinMaxScaler.

The cleaned and transformed data is stored both as CSV and in the PostgreSQL database.

Load: The processed data is loaded into a PostgreSQL database via Apache Spark.

Visualize: Monitoring site locations are visualized using Matplotlib on a normalized coordinate grid.
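
A minimal sketch of the extract and transform steps described above, assuming the `requests` library; the endpoint URL and column names below are placeholders, not the project's actual values:

```python
import requests
import geopandas as gpd
from shapely.geometry import Point
from sklearn.preprocessing import MinMaxScaler

# Placeholder endpoint; the real Bristol City Council API URL is set in the project script.
API_URL = "https://example.bristol.gov.uk/air-quality-monitoring-sites.geojson"

# Extract: fetch the GeoJSON payload from the public API.
response = requests.get(API_URL, timeout=30)
response.raise_for_status()

# Parse the GeoJSON features into a GeoPandas GeoDataFrame.
gdf = gpd.GeoDataFrame.from_features(response.json()["features"])

# Transform: scale longitude/latitude into the 0-1 range with MinMaxScaler.
coords = list(zip(gdf.geometry.x, gdf.geometry.y))
gdf[["norm_x", "norm_y"]] = MinMaxScaler().fit_transform(coords)

# Rebuild the geometry points from the normalized coordinates.
gdf["geometry"] = [Point(x, y) for x, y in zip(gdf["norm_x"], gdf["norm_y"])]

# Keep and rename the relevant columns ("location" is an assumed column name), then export to CSV.
sites = gdf[["location", "norm_x", "norm_y"]].rename(columns={"location": "site_name"})
sites.to_csv("air_quality_sites.csv", index=False)
```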

🚀 Technologies Used
Python 3.x

GeoPandas

Pandas

Scikit-learn

Matplotlib

Apache Spark (PySpark)

PostgreSQL

JDBC Connector

Shapely

📊 Data Source
Source: Bristol City Council Air Quality API

Format: GeoJSON containing monitoring station data for Bristol City.

📂 Project Workflow
API Call: Fetches live air quality data in GeoJSON format.

Read & Parse GeoJSON: Loads the data into a GeoPandas GeoDataFrame.

Coordinate Normalization: Scales longitude and latitude values between 0 and 1.

Data Transformation:

Adds normalized coordinates.

Updates geometry points.

Selects and renames relevant columns.

Export to CSV: Saves the transformed dataset.

Load into PostgreSQL via Apache Spark: Ingests the final data into a database table (a sketch of this step follows the list).

Visualization: Plots normalized monitoring locations on a scatter plot.

Summary Statistics: Counts unique monitoring locations in the dataset.
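
The Spark load and summary steps might look roughly like this; the JDBC URL, driver version, table name, and credentials are placeholders, not the project's actual configuration:

```python
from pyspark.sql import SparkSession

# Start Spark with the PostgreSQL JDBC driver on the classpath (driver version is illustrative).
spark = (
    SparkSession.builder
    .appName("BristolAirQualityETL")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

# Read the transformed CSV produced by the earlier steps into a Spark DataFrame.
df = spark.read.csv("air_quality_sites.csv", header=True, inferSchema=True)

# Load: write the rows into a PostgreSQL table over JDBC (placeholder URL, table, and credentials).
df.write.jdbc(
    url="jdbc:postgresql://localhost:5432/air_quality",
    table="monitoring_sites",
    mode="overwrite",
    properties={"user": "postgres", "password": "change_me", "driver": "org.postgresql.Driver"},
)

# Summary statistics: count the distinct monitoring locations.
print(df.select("site_name").distinct().count())
```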

📈 Example Visualization
A scatter plot showing the distribution of air quality monitoring stations in Bristol, using normalized coordinate values.
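
A plot along these lines could be produced with Matplotlib, reading the CSV written by the transform step (the filename and column names are the assumed ones from the sketches above):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Read the transformed sites back in and plot them on the normalized 0-1 grid.
sites = pd.read_csv("air_quality_sites.csv")

plt.figure(figsize=(6, 6))
plt.scatter(sites["norm_x"], sites["norm_y"], color="tab:blue", edgecolors="black")
plt.xlabel("Normalized longitude")
plt.ylabel("Normalized latitude")
plt.title("Air quality monitoring sites in Bristol (normalized coordinates)")
plt.show()
```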

📦 How to Run
Install required Python libraries:

```bash
pip install pandas geopandas scikit-learn matplotlib pyspark psycopg2-binary
```

Set up your PostgreSQL database and update the connection credentials in the script (placeholder connection settings are sketched after these steps).

Run the Python script to execute the complete ETL pipeline and generate the visual output.
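
The connection settings that typically need updating look something like the placeholders below; swap in your own host, database name, and credentials:

```python
# Placeholder PostgreSQL connection settings; replace with your own before running the pipeline.
DB_HOST = "localhost"
DB_PORT = 5432
DB_NAME = "air_quality"
DB_USER = "postgres"
DB_PASSWORD = "change_me"

# JDBC URL used by the Spark write step.
JDBC_URL = f"jdbc:postgresql://{DB_HOST}:{DB_PORT}/{DB_NAME}"
```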

📑 Key Skills Demonstrated
API Integration & Data Extraction

GeoJSON Data Handling

Geospatial Data Normalization

Apache Spark Data Processing

PostgreSQL Data Ingestion via JDBC

Data Visualization

End-to-End ETL Workflow
