Sharmila-D3/Airflow-EMR-Snowflake-Pipeline

Automated Data Pipeline using Apache Airflow, AWS EMR, PySpark & Snowflake

A fully automated data engineering pipeline orchestrated with Apache Airflow that runs PySpark jobs on AWS EMR and integrates data from Snowflake and S3 into aggregated datasets written back to Amazon S3.

Architecture Diagram

This project demonstrates a production-style workflow including:

  • EMR cluster provisioning from Airflow
  • Executing multiple PySpark jobs in sequence
  • Extract-transform-load (ETL) from Snowflake
  • S3 data ingestion + aggregations
  • Joining processed datasets into a final output table
  • Sanitized, environment-variable driven configuration
  • CI workflow included (GitHub Actions)

🚀 Architecture Overview

Airflow DAG → EMR Cluster → PySpark Jobs → S3 → Snowflake → S3 Final Output

Pipeline Flow:

  1. Airflow creates an EMR cluster dynamically using EmrCreateJobFlowOperator
  2. Airflow submits 3 Spark jobs to EMR:
    • s3.py: Reads raw data from S3 → aggregates IP counts
    • snow.py: Reads Snowflake table via Spark Snowflake Connector → aggregates scores
    • master.py: Joins the outputs and stores final results in S3
  3. EMR auto-terminates after job completion
  4. All configuration is handled via environment variables (no hardcoded secrets)
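Step 1's cluster spec is typically handed to EmrCreateJobFlowOperator through its job_flow_overrides argument. A minimal sketch of assembling that spec purely from the environment variables listed below; the cluster name and EMR release label are illustrative assumptions, not values from the repo:

```python
import os


def build_job_flow_overrides():
    """Assemble an EMR cluster spec (EMR RunJobFlow fields) entirely from
    environment variables, so no account-specific value is hardcoded."""
    return {
        "Name": "airflow-emr-snowflake-pipeline",  # illustrative name
        "ReleaseLabel": "emr-6.9.0",               # assumed EMR release
        "Applications": [{"Name": "Spark"}],
        "LogUri": f"s3://{os.environ['LOG_BUCKET']}/emr-logs/",
        "Instances": {
            "Ec2KeyName": os.environ["EC2_KEYNAME"],
            "Ec2SubnetId": os.environ["EC2_SUBNET"],
            "EmrManagedMasterSecurityGroup": os.environ["MASTER_SG"],
            "EmrManagedSlaveSecurityGroup": os.environ["SLAVE_SG"],
            # Lets EMR shut the cluster down once all steps finish (step 3).
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "ServiceRole": os.environ["EMR_SERVICE_ROLE"],
        "JobFlowRole": os.environ["EMR_JOBFLOW_ROLE"],
    }
```

In the DAG this dict would be passed as `EmrCreateJobFlowOperator(job_flow_overrides=build_job_flow_overrides(), aws_conn_id=os.environ["AWS_CONN_ID"], ...)`.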

📂 Project Structure

dags/
  final_dag.py              # Airflow DAG (sanitized, uses env vars)
spark_jobs/
  s3.py                     # PySpark job: S3 ingest + aggregation
  snow.py                   # PySpark job: Snowflake extract + S3 write
  master.py                 # PySpark job: join + final output
.env.example                # Safe reference for required env variables
.gitignore
README.md
.github/workflows/ci.yml   # CI pipeline for linting

🔐 Environment Variables (Required)

All sensitive values must come through environment variables.

See .env.example for the full list:

AWS / EMR

AWS_REGION
AWS_CONN_ID
LOG_BUCKET
S3_BUCKET
EC2_KEYNAME
EC2_SUBNET
MASTER_SG
SLAVE_SG
EMR_SERVICE_ROLE
EMR_JOBFLOW_ROLE

Snowflake

SF_URL
SF_ACCOUNT
SF_USER
SF_PASSWORD
SF_DATABASE
SF_SCHEMA
SF_WAREHOUSE
SF_ROLE
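In snow.py these values are typically collected into the Spark Snowflake connector's option map. A sketch, using the connector's standard sfOptions key names (the exact option set used in the repo may differ):

```python
import os


def snowflake_options():
    # Map environment variables onto Spark Snowflake connector options.
    return {
        "sfURL": os.environ["SF_URL"],
        "sfAccount": os.environ["SF_ACCOUNT"],
        "sfUser": os.environ["SF_USER"],
        "sfPassword": os.environ["SF_PASSWORD"],
        "sfDatabase": os.environ["SF_DATABASE"],
        "sfSchema": os.environ["SF_SCHEMA"],
        "sfWarehouse": os.environ["SF_WAREHOUSE"],
        "sfRole": os.environ["SF_ROLE"],
    }
```

A read would then look like `spark.read.format("net.snowflake.spark.snowflake").options(**snowflake_options()).option("dbtable", "<table>").load()`.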

S3 Paths

SRC_S3_PATH
DEST_IPCOUNT_PATH
DEST_SCORES_PATH
MASTER_OUT_PATH

🧪 How to Run (High-Level)

  1. Upload spark_jobs/*.py to the S3 bucket defined by S3_BUCKET
  2. Place final_dag.py inside Airflow's dags/ folder
  3. Set required environment variables in Airflow (or use Connections/Variables)
  4. Airflow triggers the DAG (manually or on schedule)
  5. EMR spins up → runs Spark jobs → terminates automatically

📌 CI Pipeline Included

This repo includes a GitHub Actions workflow (ci.yml) that runs:

  • Lint checks (flake8)
  • Basic test structure

📸 Screenshots

You can add:

  • Airflow DAG view
  • EMR cluster step execution
  • Snowflake query results
  • S3 output folders

(Ensure no secrets or account IDs are visible.)


📘 Why This Project is Valuable

This repository demonstrates real-world production data engineering skills:

  • Workflow orchestration
  • Distributed Spark processing
  • Data integration (Snowflake ↔ AWS)
  • CI/CD and clean repo structure
  • Secure coding practices (env vars only)
  • Cloud-native automation (EMR, S3, IAM)

Excellent for showcasing on a resume, LinkedIn, or interviews.


🏷️ License

Open-source. No proprietary secrets included.
