JoelVilc/elt-with-gcp-compute-engine-airflow
ELT Data Pipeline with GCP and Airflow

This project demonstrates how to build an ELT (Extract, Load, Transform) data pipeline to process 1 million records using Google Cloud Platform (GCP) and Apache Airflow. The pipeline extracts data from Google Cloud Storage (GCS), loads it into BigQuery, and transforms it to create country-specific tables and views for analysis.


Features

  • Extract data from GCS in CSV format.
  • Load raw data into a staging table in BigQuery.
  • Transform data into country-specific tables and reporting views.
  • Use Apache Airflow to orchestrate the pipeline.
  • Generate clean and structured datasets for analysis.

Architecture

(Image: architecture diagram)

Workflow

  1. Extract: Check for file existence in GCS.
  2. Load: Load raw CSV data into a BigQuery staging table.
  3. Transform:
    • Create country-specific tables in the transform layer.
    • Generate reporting views for each country with filtered insights.

Data Layers

  1. Raw Layer: Raw data from the CSV file.
  2. Transform Layer: Cleaned and transformed tables.
  3. Reporting Layer: Views optimized for analysis and reporting.

Requirements

Tools and Services

  • Google Cloud Platform (GCP):
    • Google Compute Engine (for Airflow)
    • BigQuery
    • Cloud Storage
  • Apache Airflow:
    • Airflow with Google Cloud providers

Setup Instructions

Prerequisites

  1. A Google Cloud project with:
    • The BigQuery and Cloud Storage APIs enabled.
    • A service account with the required BigQuery and Cloud Storage permissions.
  2. A Compute Engine VM with Apache Airflow installed, including the Google Cloud provider package.
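One way to install Airflow with the Google Cloud providers on the VM is shown below. This is a sketch, not the repository's exact setup: the Airflow and Python versions are assumptions, and the constraints file is the officially published one for pinning dependencies.

```shell
# Assumed: a Debian/Ubuntu Compute Engine VM with Python 3 available.
python3 -m venv airflow-venv
source airflow-venv/bin/activate

# Install Airflow with the [google] extra, pinned via the official constraints file
AIRFLOW_VERSION=2.9.3   # assumed version
PYTHON_VERSION=3.11     # assumed version
pip install "apache-airflow[google]==${AIRFLOW_VERSION}" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"

# Start a local scheduler + webserver for development
airflow standalone
```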

End Result

Airflow Pipeline

(Image: Airflow DAG run screenshot)

Looker Studio Report

(Image: Looker Studio report screenshot)
