AyushNagras-07/airflow_docker

PySpark + Airflow Batch ETL Pipeline (API → MySQL)

Overview

This project demonstrates a small-scale batch ETL pipeline that ingests user data from a public API, processes it with PySpark, and loads analytics-ready data into a relational database. Apache Airflow orchestrates the end-to-end workflow, managing task dependencies and retries and ensuring reliable data loading.

The project serves as a proof-of-concept for integrating PySpark-based transformations within an Airflow-managed ETL pipeline before scaling to a larger producer–consumer architecture on AWS.

Business Use Case

  • In data-driven systems (SaaS platforms, consumer applications, analytics teams), user data is generated daily and must be ingested, processed, and stored reliably for reporting and analysis.
  • This pipeline simulates a daily user ingestion process where raw API data is transformed into a structured, query-ready format and loaded into a database with idempotent behavior to support retries and reprocessing.

Architecture Overview

The pipeline follows a batch ETL design with the following components:

  • Data Source

    • Public API (RandomUser API) providing simulated user profile data.
  • Orchestration

    • Apache Airflow manages task execution, dependencies, and retries.
  • Processing

    • PySpark parses nested JSON, applies schema selection, and performs data cleaning.
  • Storage

    • Processed data is loaded into a MySQL database using Airflow-managed connections.
  • Load Strategy

    • Idempotent loading using delete-and-insert logic based on ingestion date.
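The delete-and-insert load strategy above can be sketched in a few lines. This is an illustrative example, not the repository's code: SQLite stands in for MySQL so the snippet is self-contained, and the table and column names (`users`, `ingestion_date`) are assumptions.

```python
import sqlite3

# SQLite stands in for MySQL here; table/column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id TEXT, full_name TEXT, ingestion_date TEXT)"
)

def load_batch(conn, rows, ingestion_date):
    """Idempotent load: wipe this ingestion date's partition, then insert."""
    with conn:  # single transaction, so a failed retry never leaves a half-loaded day
        conn.execute(
            "DELETE FROM users WHERE ingestion_date = ?", (ingestion_date,)
        )
        conn.executemany(
            "INSERT INTO users VALUES (?, ?, ?)",
            [(r["user_id"], r["full_name"], ingestion_date) for r in rows],
        )

batch = [
    {"user_id": "u1", "full_name": "Ada Lovelace"},
    {"user_id": "u2", "full_name": "Alan Turing"},
]
load_batch(conn, batch, "2024-01-01")
load_batch(conn, batch, "2024-01-01")  # re-run of the same day: no duplicates
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # 2
```

Because the delete is keyed on the ingestion date, an Airflow retry or backfill of a given run simply replaces that day's rows rather than appending duplicates.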

An architecture and data flow diagram is included in the repository for clarity.

Data Flow (high level)

  1. Fetch user data from the RandomUser API.
  2. Temporarily store raw API response in Airflow XComs (POC-scale).
  3. Process and flatten nested JSON fields using PySpark.
  4. Add ingestion metadata (ingestion date, source system).
  5. Load cleaned data into MySQL using Airflow hooks.
  6. Support safe retries and re-runs without duplicate records.
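Steps 3 and 4 above can be illustrated with plain Python over a single RandomUser-style record. In the actual pipeline PySpark performs this across a DataFrame; the snippet below only shows the shape of the transformation, and the output column names and `source_system` value are assumptions.

```python
# One record shaped like the RandomUser API's nested payload.
raw = {
    "name": {"first": "Jane", "last": "Doe"},
    "location": {"city": "Oslo", "country": "Norway"},
    "email": "jane.doe@example.com",
}

def flatten_user(rec, ingestion_date, source_system="randomuser_api"):
    """Flatten nested fields (step 3) and attach ingestion metadata (step 4)."""
    return {
        "first_name": rec["name"]["first"],
        "last_name": rec["name"]["last"],
        "city": rec["location"]["city"],
        "country": rec["location"]["country"],
        "email": rec["email"],
        # metadata columns added for lineage and idempotent reloads
        "ingestion_date": ingestion_date,
        "source_system": source_system,
    }

row = flatten_user(raw, "2024-01-01")
print(row["first_name"], row["country"])  # Jane Norway
```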

Key Features

  • Batch ETL pipeline orchestrated with Apache Airflow.
  • PySpark-based transformation of nested JSON data.
  • Clear separation between data processing and data loading.
  • Idempotent database loads to handle retries and backfills.
  • Designed as a foundation for scaling to AWS S3 and larger data volumes.
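An Airflow DAG wiring these pieces together might look like the skeleton below. This is a hedged sketch, not the repository's DAG file: the DAG id, task ids, schedule, and retry settings are assumptions, and the task bodies are elided.

```python
# Illustrative DAG skeleton (requires Apache Airflow; ids and settings
# are assumptions, not this repository's actual configuration).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_users(**context):
    ...  # call the RandomUser API and push the raw JSON to XCom

def transform_users(**context):
    ...  # pull raw JSON from XCom and flatten it with PySpark

def load_users(**context):
    ...  # delete-and-insert into MySQL via an Airflow-managed connection

with DAG(
    dag_id="user_batch_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    fetch = PythonOperator(task_id="fetch_users", python_callable=fetch_users)
    transform = PythonOperator(task_id="transform_users", python_callable=transform_users)
    load = PythonOperator(task_id="load_users", python_callable=load_users)

    # Task dependencies: fetch, then transform, then load.
    fetch >> transform >> load
```

Because the load task is idempotent, the `retries` setting can safely re-run it without producing duplicate rows.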

Tech Stack

  • Apache Airflow
  • PySpark
  • Python
  • MySQL
  • REST API (RandomUser API)

Future Enhancements

  • Replace XCom-based data passing with S3-backed raw and processed data layers.
  • Scale transformations using larger datasets and partitioned Parquet files.
  • Integrate AWS services such as S3 and managed Airflow.
  • Extend analytics storage to Redshift or similar data warehouse solutions.

Notes

This project is intentionally kept small in scale to validate the core ETL design and technology integration. It acts as a stepping stone toward a full producer–consumer data platform with cloud-based storage and processing.
