This repository contains my final project for the Data Engineering Zoomcamp by DataTalksClub. The course covered a variety of data engineering topics and tools, including Docker, Terraform, dbt, dlt, Google Cloud Platform, Spark, and Kafka. A big thanks to Alexey Grigorev and his team for the wonderful work they do sharing information and creating community.
My course notes and homework can be found here: DE Zoomcamp 2024 Repo.
To reproduce this project, please refer to the Set Up Instructions.
To apply the principles and techniques learned in the data engineering course by building an end-to-end cloud data pipeline for the NYC Collisions dataset. The project ingests the data in monthly batches, transforms it, enriches it with local meteorological data, and visualizes the combined dataset.
The City of New York provides access to the New York City Motor Vehicle Collisions dataset through the NYC Open Data platform. The dataset encompasses a wide range of collision types, including, but not limited to, vehicle-to-vehicle crashes, pedestrian-involved incidents, and cyclist accidents. Key attributes include the date and time of the collision, location details such as the borough and street, as well as contributing factors and the vehicle types involved.
As part of this exercise, the dataset was enriched with meteorological data obtained from World Weather Online. To provide further insight into factors potentially influencing collisions, additional categorical representations were derived from the data, including severity and sun phase, the latter calculated from the day of the year and the latitude/longitude coordinates. This enriched dataset enables a more comprehensive analysis of traffic collisions in New York City.
- Cloud - Google Cloud Platform
- Infrastructure as Code Software - Terraform
- Containerization - Docker and Docker Compose
- Orchestration - Mage
- Data Extract - REST API
- Transformation - dbt & Apache Spark
- Data Lake - Google Cloud Storage
- Data Warehouse - BigQuery
- Data Visualization - Looker Studio
- Languages - Python, SQL, and Bash
Terraform is used to:
- Authenticate the service account.
- Enable Google Cloud Platform APIs.
- Provision a Google Cloud Storage bucket and a BigQuery dataset.
- Destroy the infrastructure upon completion.
Docker and Docker Compose are used to create a local containerized environment for running Mage with integrated dbt and Spark.
Mage is employed for orchestrating the data pipelines, managing dependencies between tasks, and automating workflows.
Collision Data Ingestion
- Collision data is retrieved from the REST API on a monthly basis.
- Requests are made in batches until the full month of data has been retrieved, typically two to three requests per month (see the sketch after this list).
- Raw data is then written to the GCS bucket.
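A minimal sketch of that batched retrieval, assuming the Socrata endpoint for the Motor Vehicle Collisions dataset; the resource ID, page size, and field names are assumptions to verify against the dataset page:

```python
import requests

# NYC Open Data (Socrata) endpoint for Motor Vehicle Collisions - Crashes.
# Resource ID and field names are assumptions -- check the dataset page.
API_URL = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"
PAGE_SIZE = 5_000  # at this size, a typical month takes a few requests

def fetch_month(year: int, month: int) -> list:
    """Page through one month of collision records until the API is exhausted."""
    where = (
        f"crash_date >= '{year}-{month:02d}-01' "
        f"AND crash_date < '{year + (month == 12)}-{month % 12 + 1:02d}-01'"
    )
    rows, offset = [], 0
    while True:
        resp = requests.get(
            API_URL,
            params={"$where": where, "$limit": PAGE_SIZE, "$offset": offset},
            timeout=60,
        )
        resp.raise_for_status()
        batch = resp.json()
        rows.extend(batch)
        if len(batch) < PAGE_SIZE:  # last (partial) page reached
            return rows
        offset += PAGE_SIZE
```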
Weather Data Ingestion
- Because the weather dataset isn't publicly accessible, it was retrieved from the World Weather Online REST API and saved to a CSV file within this repository. This was done for reproducibility, using a Python script found in the /scripts folder.
- The CSV file is ingested into Mage, where column name transformations are performed (sketched below). The file is then written to the GCS bucket and exposed to BigQuery as an external table.
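The renaming step might look like the following pandas sketch; every column name and path here is hypothetical, since the real mapping lives in the Mage pipeline:

```python
import pandas as pd

# Hypothetical mapping from World Weather Online export names to
# warehouse-friendly snake_case names -- the real mapping lives in Mage.
COLUMN_MAP = {
    "date_time": "weather_datetime",
    "tempF": "temperature_f",
    "precipInches": "precipitation_in",
    "windspeedMiles": "wind_speed_mph",
    "visibility": "visibility_mi",
}

df = pd.read_csv("data/nyc_weather.csv")  # path within the repo (assumed)
df = df.rename(columns=COLUMN_MAP)

# Writing straight to GCS via pandas requires the gcsfs package.
df.to_csv("gs://<your-bucket>/weather/nyc_weather.csv", index=False)
```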
Initial processing of the collisions dataset is handled with Spark, which assigns data types and calculates the sun phase (day, night, dusk, or dawn) from the date, time, and location using a user-defined function (UDF). The processed data is written back to the GCS bucket and exposed as external tables in BigQuery.
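The UDF could be implemented along these lines; the `astral` library and the column names are assumptions, not necessarily what the project uses:

```python
import datetime
from zoneinfo import ZoneInfo

from astral import Observer
from astral.sun import sun
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

NYC_TZ = ZoneInfo("America/New_York")

def sun_phase(ts, lat, lon):
    """Classify a local timestamp as dawn, day, dusk, or night."""
    if ts is None or lat is None or lon is None:
        return None
    # astral returns timezone-aware datetimes for dawn/sunrise/sunset/dusk
    s = sun(Observer(latitude=lat, longitude=lon), date=ts.date(), tzinfo=NYC_TZ)
    t = ts.replace(tzinfo=NYC_TZ)  # collision timestamps are local NYC time
    if s["dawn"] <= t < s["sunrise"]:
        return "dawn"
    if s["sunrise"] <= t < s["sunset"]:
        return "day"
    if s["sunset"] <= t < s["dusk"]:
        return "dusk"
    return "night"

sun_phase_udf = udf(sun_phase, StringType())

# Tiny demo frame; the real pipeline reads raw collision data from GCS.
spark = SparkSession.builder.appName("sun-phase").getOrCreate()
df = spark.createDataFrame(
    [(datetime.datetime(2024, 6, 1, 5, 30), 40.7128, -74.0060)],
    "crash_datetime timestamp, latitude double, longitude double",
)
df.withColumn("sun_phase", sun_phase_udf("crash_datetime", "latitude", "longitude")).show()
```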
dbt is employed to further transform the collision data and incorporate the weather information into a partitioned and clustered fact table. Dimensional models, such as monthly and annual views, along with a vehicle view derived from unnested data, are then built on top of it.
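The dbt models themselves are SQL, but the shape of the resulting fact table can be sketched with the BigQuery Python client; the project, dataset, table, column, and partitioning choices below are placeholders, not the project's actual model:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Roughly the shape of the fact table dbt builds: partitioned by crash date
# and clustered by borough. All identifiers here are placeholders.
ddl = """
CREATE OR REPLACE TABLE `my-project.nyc_collisions.fact_collisions`
PARTITION BY DATE(crash_datetime)
CLUSTER BY borough AS
SELECT c.*, w.temperature_f, w.precipitation_in
FROM `my-project.nyc_collisions.stg_collisions` AS c
LEFT JOIN `my-project.nyc_collisions.weather` AS w
  ON DATE(c.crash_datetime) = w.weather_date
"""
client.query(ddl).result()
```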
Looker Studio is used to visualize and analyze the transformed data stored in BigQuery.