A fully automated data engineering pipeline orchestrated using Apache Airflow, running PySpark jobs on AWS EMR, and integrating data from Snowflake into aggregated datasets stored in Amazon S3.
This project demonstrates a production-style workflow including:
- EMR cluster provisioning from Airflow
- Executing multiple PySpark jobs in sequence
- Extract-transform-load (ETL) from Snowflake
- S3 data ingestion + aggregations
- Joining processed datasets into a final output table
- Sanitized, environment-variable driven configuration
- CI workflow included (GitHub Actions)
Airflow DAG → EMR Cluster → PySpark Jobs (S3 + Snowflake sources) → Final Aggregated Output in S3
Pipeline Flow:
- Airflow creates an EMR cluster dynamically using `EmrCreateJobFlowOperator`
- Airflow submits three Spark jobs to EMR:
  - `s3.py`: reads raw data from S3 and aggregates IP counts
  - `snow.py`: reads a Snowflake table via the Spark Snowflake Connector and aggregates scores
  - `master.py`: joins the two outputs and stores the final results in S3
- EMR auto-terminates after job completion
- All configuration is handled via environment variables (no hardcoded secrets)
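The cluster and step definitions described above can be sketched as plain dictionaries built from the environment. This is a minimal sketch: the EMR release label, instance types, and step names are illustrative assumptions, not the project's actual values.

```python
import os

# Illustrative fallbacks only; the real DAG reads these from Airflow's
# environment (see .env.example).
S3_BUCKET = os.environ.get("S3_BUCKET", "example-pipeline-bucket")
LOG_BUCKET = os.environ.get("LOG_BUCKET", "example-pipeline-logs")

# Cluster definition of the kind passed to EmrCreateJobFlowOperator.
JOB_FLOW_OVERRIDES = {
    "Name": "etl-cluster",
    "ReleaseLabel": "emr-6.15.0",            # assumed EMR release
    "Applications": [{"Name": "Spark"}],
    "LogUri": f"s3://{LOG_BUCKET}/emr/",
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # False => the cluster auto-terminates once the steps finish
        "KeepJobFlowAliveWhenNoSteps": False,
        "TerminationProtected": False,
    },
    "JobFlowRole": os.environ.get("EMR_JOBFLOW_ROLE", "EMR_EC2_DefaultRole"),
    "ServiceRole": os.environ.get("EMR_SERVICE_ROLE", "EMR_DefaultRole"),
}

def spark_step(name: str, script: str) -> dict:
    """One spark-submit step; EMR runs the list of steps in sequence."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", f"s3://{S3_BUCKET}/spark_jobs/{script}"],
        },
    }

SPARK_STEPS = [
    spark_step("s3-aggregation", "s3.py"),
    spark_step("snowflake-extract", "snow.py"),
    spark_step("final-join", "master.py"),
]
```

In the real DAG, dictionaries like these would be handed to `EmrCreateJobFlowOperator(job_flow_overrides=...)` and `EmrAddStepsOperator(steps=...)`, typically with an `EmrStepSensor` waiting on the final step before the cluster terminates.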
dags/
  final_dag.py               # Airflow DAG (sanitized, uses env vars)
spark_jobs/
  s3.py                      # PySpark job: S3 ingest + aggregation
  snow.py                    # PySpark job: Snowflake extract + S3 write
  master.py                  # PySpark job: join + final output
.env.example                 # Safe reference for required env variables
.gitignore
README.md
.github/workflows/ci.yml     # CI pipeline for linting
All sensitive values must be provided through environment variables.
See `.env.example` for the full list:
AWS_REGION
AWS_CONN_ID
LOG_BUCKET
S3_BUCKET
EC2_KEYNAME
EC2_SUBNET
MASTER_SG
SLAVE_SG
EMR_SERVICE_ROLE
EMR_JOBFLOW_ROLE
SF_URL
SF_ACCOUNT
SF_USER
SF_PASSWORD
SF_DATABASE
SF_SCHEMA
SF_WAREHOUSE
SF_ROLE
SRC_S3_PATH
DEST_IPCOUNT_PATH
DEST_SCORES_PATH
MASTER_OUT_PATH
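A small helper along these lines can enforce the no-hardcoded-secrets rule by failing fast when a variable is missing. This is a sketch; the helper names are illustrative, though the `sfURL`/`sfUser`/... keys are the Spark Snowflake Connector's standard option names.

```python
import os

def require_env(name: str) -> str:
    """Return the environment variable's value, or fail fast with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

def snowflake_options() -> dict:
    """Collect the connection options the Spark Snowflake Connector expects."""
    return {
        "sfURL": require_env("SF_URL"),
        "sfUser": require_env("SF_USER"),
        "sfPassword": require_env("SF_PASSWORD"),
        "sfDatabase": require_env("SF_DATABASE"),
        "sfSchema": require_env("SF_SCHEMA"),
        "sfWarehouse": require_env("SF_WAREHOUSE"),
        "sfRole": require_env("SF_ROLE"),
    }
```

Failing at DAG-parse or job-start time is preferable to a half-finished Spark job discovering a missing credential mid-run.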
- Upload `spark_jobs/*.py` to the S3 bucket defined in `S3_BUCKET`
- Place `final_dag.py` inside Airflow's `dags/` folder
- Set the required environment variables in Airflow (or use Connections/Variables)
- Airflow triggers the DAG (manually or on schedule)
- EMR spins up → runs Spark jobs → terminates automatically
This repo includes a GitHub Actions workflow (`ci.yml`) that runs:
- Lint checks (`flake8`)
- A basic test structure
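A minimal workflow along these lines would cover the lint step. This is a sketch; the trigger events, Python version, and lint targets are assumptions, not necessarily what `ci.yml` pins.

```yaml
name: CI
on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"   # assumed version
      - run: pip install flake8
      - run: flake8 dags/ spark_jobs/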
Screenshots you can add:
- Airflow DAG view
- EMR cluster step execution
- Snowflake query results
- S3 output folders
(Ensure no secrets or account IDs are visible.)
This repository demonstrates real-world production data engineering skills:
- Workflow orchestration
- Distributed Spark processing
- Data integration (Snowflake ↔ AWS)
- CI/CD and clean repo structure
- Secure coding practices (env vars only)
- Cloud-native automation (EMR, S3, IAM)
Excellent for showcasing on a resume, on LinkedIn, or in interviews.
Open-source. No proprietary secrets included.
