A collection of data engineering projects showcasing my skills in building robust, scalable, and secure data pipelines using modern tools and cloud platforms.
This portfolio demonstrates experience across the full data engineering lifecycle — from data ingestion and transformation to orchestration, governance, and monitoring.
Key Technologies:
- Data Lake: Google Cloud Storage
- Data Warehouse: Google BigQuery
- IaC: Terraform
- Data Transformation: dbt, PySpark
- Workflow Orchestration: Kestra, Airflow
- Containerization: Docker
ETL pipeline orchestrated with Kestra that extracts .json files from GH Archive, loads them into a Google Cloud Storage bucket, and transforms them into BigQuery tables via dbt. The data are visualized in Looker Studio. A minimal sketch of the extract-and-load step follows the tool list below.
- Docker (containerization)
- Terraform (infrastructure as code)
- Kestra (workflow orchestration)
- Google Cloud Storage (data lake)
- BigQuery (data warehouse)
- dbt (data transformation)
- Looker Studio (data visualization)
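The real flow is defined declaratively in Kestra YAML; the Python sketch below shows the equivalent extract-and-load logic for one hourly GH Archive dump. The bucket name and `raw/` object prefix are hypothetical.

```python
import requests
from google.cloud import storage  # pip install google-cloud-storage

BUCKET_NAME = "gh-archive-data-lake"  # hypothetical bucket name

def extract_and_load(date: str, hour: int) -> None:
    """Download one hourly GH Archive dump and upload it to GCS."""
    file_name = f"{date}-{hour}.json.gz"
    # GH Archive serves one gzipped NDJSON file per hour.
    url = f"https://data.gharchive.org/{file_name}"
    response = requests.get(url, timeout=60)
    response.raise_for_status()

    # Upload the raw archive to the data-lake bucket as-is; dbt models
    # running in BigQuery own all downstream transformation.
    client = storage.Client()
    blob = client.bucket(BUCKET_NAME).blob(f"raw/{file_name}")
    blob.upload_from_string(response.content, content_type="application/gzip")

if __name__ == "__main__":
    extract_and_load("2024-01-15", 0)
```

Landing the raw files untouched keeps the bucket a faithful data lake, so the BigQuery/dbt layer can be rebuilt from scratch at any time.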
Simple ETL pipeline orchestrated with Airflow that reads data from the CoinCap API and loads it into a PostgreSQL database using a PySpark cluster. A condensed sketch of the PySpark task follows the tool list below.
- Airflow (workflow orchestration)
- PySpark (data transformation)
- PostgreSQL (database)
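A condensed Python sketch of the PySpark task. In the repository this logic runs inside an Airflow DAG; the JDBC connection settings and table name here are hypothetical.

```python
import requests
from pyspark.sql import Row, SparkSession

# Hypothetical connection settings; in practice these would come from
# Airflow connections or environment variables, not hard-coded values.
JDBC_URL = "jdbc:postgresql://postgres:5432/coincap"
JDBC_PROPS = {"user": "etl", "password": "etl", "driver": "org.postgresql.Driver"}

def load_assets() -> None:
    spark = (
        SparkSession.builder.appName("coincap-etl")
        # Pull the PostgreSQL JDBC driver so the write below can find it.
        .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
        .getOrCreate()
    )

    # CoinCap's /v2/assets endpoint returns current data for ~100 crypto
    # assets; numeric fields are served as strings, so cast them here.
    payload = requests.get("https://api.coincap.io/v2/assets", timeout=30).json()
    rows = [
        Row(
            id=a["id"],
            symbol=a["symbol"],
            price_usd=float(a["priceUsd"] or 0),
            market_cap_usd=float(a["marketCapUsd"] or 0),
        )
        for a in payload["data"]
    ]

    # Append the snapshot to PostgreSQL over JDBC.
    spark.createDataFrame(rows).write.jdbc(
        JDBC_URL, table="assets", mode="append", properties=JDBC_PROPS
    )

if __name__ == "__main__":
    load_assets()
```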
This repository contains a Nextflow pipeline that takes a file of sgRNA sequences and reports where they align in the human genome (GRCh38). It includes steps to convert and compare gene information, and builds a simple gene expression matrix from two breast cancer (TCGA-BRCA) samples (a sketch of the matrix-building step follows the tool list below).
- Nextflow (workflow orchestration)
- Docker (containerization)
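The matrix-building step, shown as a hypothetical standalone Python sketch. In the pipeline it runs as a Nextflow process; the file names and the two-column counts format are assumptions.

```python
import pandas as pd

# Hypothetical file names; the pipeline fetches per-sample counts for
# two TCGA-BRCA samples upstream.
SAMPLES = {
    "BRCA_sample_1": "sample1_counts.tsv",
    "BRCA_sample_2": "sample2_counts.tsv",
}

def build_expression_matrix() -> pd.DataFrame:
    """Join per-sample gene counts into one genes x samples matrix."""
    columns = []
    for sample, path in SAMPLES.items():
        # Each counts file is assumed to have two columns: gene_id <tab> count.
        counts = pd.read_csv(
            path, sep="\t", names=["gene_id", sample], index_col="gene_id"
        )
        columns.append(counts)
    # The outer join keeps genes present in either sample; genes missing
    # from one sample get a count of 0.
    return pd.concat(columns, axis=1).fillna(0).astype(int)

if __name__ == "__main__":
    build_expression_matrix().to_csv("expression_matrix.tsv", sep="\t")
```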