
This is an API ETL project that creates an SQL database for analyzing, visualizing, and manipulating data, following a logical, well-defined pipeline model.




📊 Air Quality Bristol API ETL & Visualization Project
A Python-based ETL pipeline that extracts air quality monitoring data from the official Bristol City Council API, processes and transforms the data, and loads the results into a PostgreSQL database using Apache Spark. The project also includes data normalization, geospatial coordinate handling, and visualization of monitoring locations across the city of Bristol.

📌 Project Overview
This project demonstrates a complete data engineering and analytics workflow:

Extract: Data is collected from a public online API in GeoJSON format (a sketch of the extract and transform steps follows this list).

Transform:

Geospatial data is processed using GeoPandas.

Coordinates are normalized using MinMaxScaler.

The cleaned and transformed data is stored both as CSV and in the PostgreSQL database.

Load: The processed data is loaded into a PostgreSQL database via Apache Spark.

Visualize: Monitoring site locations are visualized using Matplotlib on a normalized coordinate grid.
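
A minimal sketch of the extract and transform steps described above, assuming the `requests` library; the endpoint URL and column names below are placeholders, not the project's actual values:

```python
import requests
import geopandas as gpd
from shapely.geometry import Point
from sklearn.preprocessing import MinMaxScaler

# Placeholder endpoint; the real Bristol City Council API URL is set in the project script.
API_URL = "https://example.bristol.gov.uk/air-quality-monitoring-sites.geojson"

# Extract: fetch the GeoJSON payload from the public API.
response = requests.get(API_URL, timeout=30)
response.raise_for_status()

# Parse the GeoJSON features into a GeoPandas GeoDataFrame.
gdf = gpd.GeoDataFrame.from_features(response.json()["features"])

# Transform: scale longitude/latitude into the 0-1 range with MinMaxScaler.
coords = list(zip(gdf.geometry.x, gdf.geometry.y))
gdf[["norm_x", "norm_y"]] = MinMaxScaler().fit_transform(coords)

# Rebuild the geometry points from the normalized coordinates.
gdf["geometry"] = [Point(x, y) for x, y in zip(gdf["norm_x"], gdf["norm_y"])]

# Keep and rename the relevant columns ("location" is an assumed column name), then export to CSV.
sites = gdf[["location", "norm_x", "norm_y"]].rename(columns={"location": "site_name"})
sites.to_csv("air_quality_sites.csv", index=False)
```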

🚀 Technologies Used
Python 3.x

GeoPandas

Pandas

Scikit-learn

Matplotlib

Apache Spark (PySpark)

PostgreSQL

JDBC Connector

Shapely

📊 Data Source
Source: Bristol City Council Air Quality API

Format: GeoJSON containing monitoring station data for Bristol City.

📂 Project Workflow
API Call: Fetches live air quality data in GeoJSON format.

Read & Parse GeoJSON: Loads the data into a GeoPandas GeoDataFrame.

Coordinate Normalization: Scales longitude and latitude values between 0 and 1.

Data Transformation:

Adds normalized coordinates.

Updates geometry points.

Selects and renames relevant columns.

Export to CSV: Saves the transformed dataset.

Load into PostgreSQL via Apache Spark: Ingests the final data into a database table (a sketch of this step follows the list).

Visualization: Plots normalized monitoring locations on a scatter plot.

Summary Statistics: Counts unique monitoring locations in the dataset.
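
The Spark load and summary steps might look roughly like this; the JDBC URL, driver version, table name, and credentials are placeholders, not the project's actual configuration:

```python
from pyspark.sql import SparkSession

# Start Spark with the PostgreSQL JDBC driver on the classpath (driver version is illustrative).
spark = (
    SparkSession.builder
    .appName("BristolAirQualityETL")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

# Read the transformed CSV produced by the earlier steps into a Spark DataFrame.
df = spark.read.csv("air_quality_sites.csv", header=True, inferSchema=True)

# Load: write the rows into a PostgreSQL table over JDBC (placeholder URL, table, and credentials).
df.write.jdbc(
    url="jdbc:postgresql://localhost:5432/air_quality",
    table="monitoring_sites",
    mode="overwrite",
    properties={"user": "postgres", "password": "change_me", "driver": "org.postgresql.Driver"},
)

# Summary statistics: count the distinct monitoring locations.
print(df.select("site_name").distinct().count())
```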

📈 Example Visualization
A scatter plot showing the distribution of air quality monitoring stations in Bristol, using normalized coordinate values.
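
A plot along these lines could be produced with Matplotlib, reading the CSV written by the transform step (the filename and column names are the assumed ones from the sketches above):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Read the transformed sites back in and plot them on the normalized 0-1 grid.
sites = pd.read_csv("air_quality_sites.csv")

plt.figure(figsize=(6, 6))
plt.scatter(sites["norm_x"], sites["norm_y"], color="tab:blue", edgecolors="black")
plt.xlabel("Normalized longitude")
plt.ylabel("Normalized latitude")
plt.title("Air quality monitoring sites in Bristol (normalized coordinates)")
plt.show()
```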

📦 How to Run
Install required Python libraries:

```bash
pip install pandas geopandas scikit-learn matplotlib pyspark psycopg2-binary
```

Set up your PostgreSQL database and update the connection credentials in the script (placeholder connection settings are sketched after these steps).

Run the Python script to execute the complete ETL pipeline and generate the visual output.
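
The connection settings that typically need updating look something like the placeholders below; swap in your own host, database name, and credentials:

```python
# Placeholder PostgreSQL connection settings; replace with your own before running the pipeline.
DB_HOST = "localhost"
DB_PORT = 5432
DB_NAME = "air_quality"
DB_USER = "postgres"
DB_PASSWORD = "change_me"

# JDBC URL used by the Spark write step.
JDBC_URL = f"jdbc:postgresql://{DB_HOST}:{DB_PORT}/{DB_NAME}"
```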

📑 Key Skills Demonstrated
API Integration & Data Extraction

GeoJSON Data Handling

Geospatial Data Normalization

Apache Spark Data Processing

PostgreSQL Data Ingestion via JDBC

Data Visualization

End-to-End ETL Workflow
