A fully automated, production-grade data pipeline on GCP using DBT, BigQuery, and GitHub Actions (CI/CD) — built to simulate a real-world healthcare analytics use case.
- Project Overview
- Tools & Technologies Used
- Architecture Diagram
- Step-by-Step Workflow
- Key Learnings
- Conclusion
- Acknowledgements
This project demonstrates my ability to build a scalable, production-grade data pipeline using industry-standard tools. From raw data ingestion and transformation to CI/CD and visualization, this project simulates the daily responsibilities of a Data Engineer.
⚙️ Tech stack: GCP + BigQuery + DBT Core + GitHub Actions + Python
| Tool | Purpose |
|---|---|
| Google Cloud Platform (GCP) | Infrastructure & storage |
| BigQuery | Data warehouse / SQL engine |
| Google Cloud Storage | Raw data file storage |
| DBT Core | Transformations, testing, documentation |
| GitHub Actions | CI/CD automation pipeline |
| Python | Data generation script |
This project follows a modular and automated data engineering architecture on Google Cloud.
Raw synthetic healthcare data is generated and stored in GCS, externalized into BigQuery, transformed via DBT models, and deployed through CI/CD using GitHub Actions.
- Created a new Google Cloud project (`root-matrix-457217-p5`)
- Enabled BigQuery, Cloud Storage, and IAM APIs
- Created a dedicated service account in GCP for secure, programmatic access
- Assigned the necessary IAM roles: `BigQuery Admin`, `Storage Admin`, `BigQuery Job User`
- Downloaded the service account's JSON key and used it for:
  - Local development (`profiles.yml` with a `keyfile` path)
  - GitHub Actions (`gcp-key.json` generated dynamically from GitHub Secrets)
- Created datasets: `dev_healthcare_data` and `prod_healthcare_data`
- The Cloud Storage bucket (`healthcare-data-bucket-amarkhatri`) was created automatically by the Python script
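For local development, DBT reads its BigQuery connection from `profiles.yml`. Below is a minimal sketch of such a profile; the profile name `healthcare`, the key path, the thread count, and the location are my assumptions for illustration, not values taken from the repo:

```yaml
healthcare:              # must match the profile name in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: root-matrix-457217-p5
      dataset: dev_healthcare_data
      keyfile: /path/to/gcp-key.json   # service account JSON key
      threads: 4
      location: US
```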
To simulate a real-world healthcare data pipeline, I wrote a Python script that:

- Generates synthetic data using the Faker library for:
  - Patient demographics (CSV)
  - Electronic health records (JSON)
  - Insurance claims (Parquet)
- Creates a Cloud Storage bucket if it doesn't already exist
- Cleans the target folders before uploading new files
- Uploads raw data directly to GCS (`dev/` and `prod/` folders)
- Writes all files in the appropriate formats using:
  - `pandas` for CSV
  - `json` for newline-delimited JSON
  - `pyarrow` for Parquet

The script performs all ingestion + staging steps programmatically, without manual uploads.

📁 Script location: `data_generator/synthetic_data_generator.py`
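To give a flavor of the generation logic, here is a condensed, standard-library-only sketch of the record-generation and newline-delimited-JSON steps. The actual script uses Faker for realistic values and pandas/pyarrow for the CSV/Parquet writers; the field names below are illustrative, not the repo's real schema.

```python
import csv
import json
import random


def generate_patients(n, seed=42):
    """Generate synthetic patient records (stdlib stand-in for Faker)."""
    rng = random.Random(seed)
    return [
        {
            "patient_id": f"P{i:05d}",
            "age": rng.randint(0, 95),
            "gender": rng.choice(["M", "F"]),
            "state": rng.choice(["CA", "NY", "TX", "WA"]),
        }
        for i in range(n)
    ]


def write_csv(records, path):
    """Write demographics as CSV (the real script uses pandas for this)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)


def write_ndjson(records, path):
    """Write newline-delimited JSON, the layout BigQuery expects for JSON external tables."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```

The bucket-creation and upload steps then hand files like these to the `google-cloud-storage` client.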
After uploading the raw data to GCS, I created external tables in BigQuery that reference those files directly, allowing SQL querying without loading the data into native BigQuery storage.
- `patient_data.csv` → CSV external table
- `ehr_data.json` → newline-delimited JSON external table
- `claims_data.parquet` → Parquet external table (schema-aware)

These external tables were created in both datasets:

- `dev_healthcare_data` (5K test records)
- `prod_healthcare_data` (20K records)
- Cost-efficient for large, raw datasets
- Schema can be auto-detected or explicitly defined
- Queryable via standard SQL like any native table
```sql
CREATE OR REPLACE EXTERNAL TABLE `project_id.dev_healthcare_data.patient_data_external`
OPTIONS (
  format = 'CSV',
  uris = ['gs://healthcare-data-bucket-amarkhatri/dev/patient_data.csv'],
  skip_leading_rows = 1
);
```

Once the raw data was available via external tables in BigQuery, I used DBT Core to build a structured transformation layer on top of it.
In this step, I created several DBT models that:

- Referenced external tables using `{{ source() }}`
- Performed aggregation and filtering logic (e.g., identifying high-claim patients, summarizing chronic conditions)
- Joined datasets (e.g., patients with claims or EHR data)
- Were configured with DBT's built-in materializations (`view`, `incremental`) for flexibility and performance

All transformations were written as modular `.sql` models, configured via `dbt_project.yml`, and executed using the DBT CLI or GitHub Actions.
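As an illustration, a model like `high_claim_patients` could look like the sketch below. The source name `healthcare`, the column names, and the claim threshold are assumptions made for this example; the real models live in the repo's `models/` directory.

```sql
-- models/high_claim_patients.sql (illustrative sketch, not the repo's actual model)
{{ config(materialized='view') }}

select
    p.patient_id,
    sum(c.claim_amount) as total_claim_amount
from {{ source('healthcare', 'patient_data_external') }} as p
join {{ source('healthcare', 'claims_data_external') }} as c
    on p.patient_id = c.patient_id
group by p.patient_id
having sum(c.claim_amount) > 10000
```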
The DBT lineage graph shows how each model in the pipeline is derived from raw external source tables in BigQuery:

- Sources (`SRC`) like `claims_data_external`, `patient_data_external`, and `ehr_data_external` represent external tables that directly query files stored in Google Cloud Storage
- Models (`MDL`) like `high_claim_patients`, `chronic_conditions_summary`, and `health_anomalies` represent transformed tables built using SQL logic in DBT
- Used `{{ source() }}` to connect to external BigQuery tables backed by GCS
- Applied `{{ config(materialized='incremental') }}` to optimize model performance
- Structured models for clarity and reusability
- Defined column-level tests using `schema.yml` (e.g., `not_null`, `unique`)
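For the `{{ source() }}` references to resolve, the external tables must also be declared as DBT sources in YAML. A minimal sketch, in which the source name `healthcare` and the exact layout are my assumptions:

```yaml
version: 2

sources:
  - name: healthcare
    database: root-matrix-457217-p5
    schema: dev_healthcare_data
    tables:
      - name: patient_data_external
      - name: ehr_data_external
      - name: claims_data_external
```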
To ensure data quality and trust in the pipeline, I implemented column-level tests and added documentation using `schema.yml` files in DBT.
DBT allows us to define tests and metadata alongside our models, all inside YAML. These tests run automatically via `dbt test`.
- To enforce data integrity on critical columns (`not_null`, `unique`)
- To validate raw data coming from external sources
- To document model and column purposes using DBT's built-in documentation system
- To support CI/CD by catching schema or data issues automatically in GitHub Actions
Here’s an example from `schema.yml`:
```yaml
version: 2

models:
  - name: high_claim_patients
    description: "Identifies patients with total claim amounts above a threshold"
    columns:
      - name: patient_id
        tests:
          - unique
          - not_null
      - name: total_claim_amount
        tests:
          - not_null
```

To automate testing and deployment of my DBT models, I configured GitHub Actions to handle CI (Continuous Integration) and CD (Continuous Deployment).
This ensures:

- Every pull request runs `dbt test` to validate models before merging
- Every merge to `main` triggers `dbt run` to deploy production models to BigQuery
- All deployments are version-controlled, reproducible, and secure
- Triggered on pull requests to `main`
- Spins up a fresh Ubuntu runner
- Installs Python + DBT
- Injects a secure GCP service account key via GitHub Secrets
- Generates a temporary `profiles.yml`
- Runs `dbt test` to validate schema and model logic
Example snippet from `ci.yml`:
```yaml
on:
  pull_request:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
      - run: pip install dbt-bigquery
      - run: dbt test --profiles-dir /home/runner/.dbt --target dev
```

- Triggered on push to `main`
- Installs DBT and authenticates with GCP
- Runs `dbt run` with `--target prod` to build models into the production dataset
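A sketch of what the corresponding `cd.yml` could look like, mirroring the CI snippet; the secret name `GCP_SA_KEY` and the inline key-writing step are my assumptions about how the key is injected at runtime:

```yaml
on:
  push:
    branches:
      - main

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
      - run: pip install dbt-bigquery
      # Recreate the service account key from GitHub Secrets at runtime
      - run: echo '${{ secrets.GCP_SA_KEY }}' > gcp-key.json
      - run: dbt run --profiles-dir /home/runner/.dbt --target prod
```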
| Config | Value |
|---|---|
| Deployment method | GitHub Actions |
| Trigger | `push` to `main` |
| DBT command | `dbt run --target prod` |
| Destination | `prod_healthcare_data` BigQuery dataset |
| Auth method | GCP service account via GitHub Secrets |
| Environment | Fresh Ubuntu runner (`ubuntu-latest`) |
Secure deployment:

- GCP credentials are stored as GitHub Secrets
- `gcp-key.json` and `profiles.yml` are generated at runtime (not stored in the repo)
- Developed a clear understanding of DBT’s transformation flow and model structure
- Gained hands-on experience integrating Python, BigQuery, and GCS for cloud data pipelines
- Learned how to build secure, production-ready CI/CD pipelines using GitHub Actions
- Practiced writing modular, testable SQL models with automated validations
- Built an end-to-end pipeline that mirrors real-world engineering workflows
This project started as a hands-on learning exercise and became a full-stack, automated data engineering pipeline. I worked with industry-standard tools (GCP, DBT, GitHub Actions), built my own data sources, and pushed transformations all the way to production.
It reflects both the technical skills I’ve developed and my drive to learn independently and build real, usable solutions.
This project was built by closely following a YouTube tutorial by DATA TIME, which covered how to build an end-to-end data pipeline using DBT, BigQuery, and GitHub Actions.
My goal with this project was not to invent something new, but to:
- Rebuild the full pipeline on my own
- Understand every component (GCP, DBT, CI/CD)
- Practice deploying it in a production-ready, portfolio-quality format
- Document my learning in a way that demonstrates my technical depth and readiness for real-world work
All credit for the project architecture and approach goes to the original creator — this repo reflects my hands-on execution of the same, and my own journey of learning through replication.





