A fully automated data engineering pipeline orchestrated using Apache Airflow, running PySpark jobs on AWS EMR, and integrating data from Snowflake into aggregated datasets stored in Amazon S3.
This project demonstrates a production-style workflow including:
- EMR cluster provisioning from Airflow
- Executing multiple PySpark jobs in sequence
- Extract-transform-load (ETL) from Snowflake
- S3 data ingestion + aggregations
- Joining processed datasets into a final output table
- Sanitized, environment-variable driven configuration
- CI workflow included (GitHub Actions)
Airflow DAG → EMR Cluster → PySpark Jobs (S3 + Snowflake sources) → Final Aggregated Output in S3
Pipeline Flow:
- Airflow creates an EMR cluster dynamically using `EmrCreateJobFlowOperator`
- Airflow submits three Spark jobs to EMR:
  - `s3.py`: reads raw data from S3 and aggregates IP counts
  - `snow.py`: reads a Snowflake table via the Spark Snowflake Connector and aggregates scores
  - `master.py`: joins the two outputs and stores the final results in S3
- EMR auto-terminates after job completion
- All configuration is handled via environment variables (no hardcoded secrets)
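The cluster and step definitions described above can be sketched as plain dictionaries built from the environment. This is a minimal sketch: the EMR release label, instance types, and step names are illustrative assumptions, not the project's actual values.

```python
import os

# Illustrative fallbacks only; the real DAG reads these from Airflow's
# environment (see .env.example).
S3_BUCKET = os.environ.get("S3_BUCKET", "example-pipeline-bucket")
LOG_BUCKET = os.environ.get("LOG_BUCKET", "example-pipeline-logs")

# Cluster definition of the kind passed to EmrCreateJobFlowOperator.
JOB_FLOW_OVERRIDES = {
    "Name": "etl-cluster",
    "ReleaseLabel": "emr-6.15.0",            # assumed EMR release
    "Applications": [{"Name": "Spark"}],
    "LogUri": f"s3://{LOG_BUCKET}/emr/",
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # False => the cluster auto-terminates once the steps finish
        "KeepJobFlowAliveWhenNoSteps": False,
        "TerminationProtected": False,
    },
    "JobFlowRole": os.environ.get("EMR_JOBFLOW_ROLE", "EMR_EC2_DefaultRole"),
    "ServiceRole": os.environ.get("EMR_SERVICE_ROLE", "EMR_DefaultRole"),
}

def spark_step(name: str, script: str) -> dict:
    """One spark-submit step; EMR runs the list of steps in sequence."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", f"s3://{S3_BUCKET}/spark_jobs/{script}"],
        },
    }

SPARK_STEPS = [
    spark_step("s3-aggregation", "s3.py"),
    spark_step("snowflake-extract", "snow.py"),
    spark_step("final-join", "master.py"),
]
```

In the real DAG, dictionaries like these would be handed to `EmrCreateJobFlowOperator(job_flow_overrides=...)` and `EmrAddStepsOperator(steps=...)`, typically with an `EmrStepSensor` waiting on the final step before the cluster terminates.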
dags/
  final_dag.py               # Airflow DAG (sanitized, uses env vars)
spark_jobs/
  s3.py                      # PySpark job: S3 ingest + aggregation
  snow.py                    # PySpark job: Snowflake extract + S3 write
  master.py                  # PySpark job: join + final output
.env.example                 # Safe reference for required env variables
.gitignore
README.md
.github/workflows/ci.yml     # CI pipeline for linting
All sensitive values must be provided through environment variables.
See `.env.example` for the full list:
AWS_REGION
AWS_CONN_ID
LOG_BUCKET
S3_BUCKET
EC2_KEYNAME
EC2_SUBNET
MASTER_SG
SLAVE_SG
EMR_SERVICE_ROLE
EMR_JOBFLOW_ROLE
SF_URL
SF_ACCOUNT
SF_USER
SF_PASSWORD
SF_DATABASE
SF_SCHEMA
SF_WAREHOUSE
SF_ROLE
SRC_S3_PATH
DEST_IPCOUNT_PATH
DEST_SCORES_PATH
MASTER_OUT_PATH
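A small helper along these lines can enforce the no-hardcoded-secrets rule by failing fast when a variable is missing. This is a sketch; the helper names are illustrative, though the `sfURL`/`sfUser`/... keys are the Spark Snowflake Connector's standard option names.

```python
import os

def require_env(name: str) -> str:
    """Return the environment variable's value, or fail fast with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

def snowflake_options() -> dict:
    """Collect the connection options the Spark Snowflake Connector expects."""
    return {
        "sfURL": require_env("SF_URL"),
        "sfUser": require_env("SF_USER"),
        "sfPassword": require_env("SF_PASSWORD"),
        "sfDatabase": require_env("SF_DATABASE"),
        "sfSchema": require_env("SF_SCHEMA"),
        "sfWarehouse": require_env("SF_WAREHOUSE"),
        "sfRole": require_env("SF_ROLE"),
    }
```

Failing at DAG-parse or job-start time is preferable to a half-finished Spark job discovering a missing credential mid-run.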
- Upload `spark_jobs/*.py` to the S3 bucket defined in `S3_BUCKET`
- Place `final_dag.py` inside Airflow's `dags/` folder
- Set the required environment variables in Airflow (or use Connections/Variables)
- Airflow triggers the DAG (manually or on schedule)
- EMR spins up → runs Spark jobs → terminates automatically
This repo includes a GitHub Actions workflow (`ci.yml`) that runs:
- Lint checks (`flake8`)
- A basic test structure
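A minimal workflow along these lines would cover the lint step. This is a sketch; the trigger events, Python version, and lint targets are assumptions, not necessarily what `ci.yml` pins.

```yaml
name: CI
on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"   # assumed version
      - run: pip install flake8
      - run: flake8 dags/ spark_jobs/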
Screenshots you can add:
- Airflow DAG view
- EMR cluster step execution
- Snowflake query results
- S3 output folders
(Ensure no secrets or account IDs are visible.)
This repository demonstrates real-world production data engineering skills:
- Workflow orchestration
- Distributed Spark processing
- Data integration (Snowflake ↔ AWS)
- CI/CD and clean repo structure
- Secure coding practices (env vars only)
- Cloud-native automation (EMR, S3, IAM)
Excellent for showcasing on a resume, on LinkedIn, or in interviews.
Open-source. No proprietary secrets included.
