This project demonstrates an end-to-end data engineering pipeline for ingesting, processing, transforming, and serving large-scale e-commerce data using Azure Data Factory, Azure Databricks, and Azure Synapse Analytics. The project follows a Medallion Architecture: Bronze → Silver → Gold.
The pipeline handles approximately 1 million rows from multiple sources: data is ingested from 7 different GitHub CSV files (~120K rows each) and a MySQL database into Azure Data Lake Storage (ADLS Gen2) using Azure Data Factory.
To make the ingestion fully automated and scalable, an Azure Data Factory (ADF) pipeline was created using dynamic parameterization. The pipeline loops through an array of file URLs and automatically assigns filenames in ADLS, allowing multiple files to be ingested efficiently without manual intervention. This forms the Bronze layer of the Medallion Architecture, storing raw, unprocessed data.
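A minimal sketch of what such a parameterized ForEach pipeline can look like in ADF's JSON definition — the pipeline, dataset, and parameter names here are illustrative assumptions, not the project's actual objects:

```json
{
  "name": "pl_ingest_github_csvs",
  "properties": {
    "parameters": {
      "fileUrls": { "type": "array" }
    },
    "activities": [
      {
        "name": "ForEachFileUrl",
        "type": "ForEach",
        "typeProperties": {
          "items": { "value": "@pipeline().parameters.fileUrls", "type": "Expression" },
          "activities": [
            {
              "name": "CopyCsvToBronze",
              "type": "Copy",
              "inputs": [ { "referenceName": "ds_http_csv", "type": "DatasetReference",
                            "parameters": { "relativeUrl": "@item().url" } } ],
              "outputs": [ { "referenceName": "ds_adls_bronze", "type": "DatasetReference",
                             "parameters": { "fileName": "@item().name" } } ]
            }
          ]
        }
      }
    ]
  }
}
```

The ForEach's `items` expression iterates the URL array, and each iteration's `@item()` values parameterize the source URL and the target filename in ADLS — which is what lets one pipeline ingest all 7 files without manual intervention.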
For the Silver layer, Azure Databricks is used to process large datasets using Spark. The transformations include:
- Data Cleaning: Removing null rows and duplicates
- Data Type Corrections: Ensuring correct formats for dates, numbers, and categorical fields
- Calculated Columns: Creating new metrics such as `delivery_delay = datediff(order_delivered_date, order_placed_date)` to enrich analytics capabilities
- Table Joins: Combining multiple tables into a denormalized wide table for analytics and reporting
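The `delivery_delay` metric mirrors Spark's `datediff(end, start)`; a plain-Python sketch of the same calculation (column names as in the formula above, sample dates invented for illustration):

```python
from datetime import date

def delivery_delay(order_delivered_date: date, order_placed_date: date) -> int:
    """Days between order placement and delivery, like Spark's datediff(end, start)."""
    return (order_delivered_date - order_placed_date).days

# Illustrative sample order: placed Jan 3, delivered Jan 10 -> 7-day delay
print(delivery_delay(date(2024, 1, 10), date(2024, 1, 3)))  # → 7
```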
The Silver layer is further enriched with MongoDB data, adding additional attributes to enhance insights. After processing, the clean and enriched data is stored in Parquet format in ADLS under the folder Silver-Clean-Data, enabling columnar storage and compression for fast query performance.
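The enrichment step is, in essence, a left join of Silver rows against a lookup keyed by a shared ID. A plain-Python sketch of that logic — in the real pipeline the lookup comes from MongoDB (e.g. via a client library), and the `customer_segment` attribute here is a hypothetical example, not a documented field:

```python
# Silver rows and MongoDB attributes are invented sample data.
silver_rows = [
    {"order_id": 1, "customer_id": "c1", "delivery_delay": 7},
    {"order_id": 2, "customer_id": "c2", "delivery_delay": 3},
]
mongo_attributes = {"c1": {"customer_segment": "premium"}}  # hypothetical enrichment data

# Left join: keep every Silver row, attach extra attributes when available.
enriched = [
    {**row, **mongo_attributes.get(row["customer_id"], {"customer_segment": None})}
    for row in silver_rows
]
print(enriched[0]["customer_segment"])  # → premium
```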
Finally, the Gold layer is implemented using Azure Synapse Analytics. IAM roles and permissions are configured to allow Synapse to access the Silver layer. In Synapse:
- Views and aggregations are created for analytical use
- A CETAS (Create External Table As Select) table is generated and stored in `Gold-Serving-Data`
The Gold layer delivers analytics-ready, high-quality data, which can be directly consumed by BI tools such as Power BI or by end-users for reporting, enabling faster insights and decision-making.
Data Sources:
- 7 CSV files from GitHub (~120K rows each; total ~1M rows)
- One MySQL database table
- One MongoDB database for data enrichment
Storage (ADLS Gen2):
- Bronze Layer: `Bronze-Raw-Data` (raw, unprocessed data)
- Silver Layer: `Silver-Clean-Data` (cleaned and enriched data)
- Gold Layer: `Gold-Serving-Data` (analytics-ready datasets)
Tools & Technologies:
- Azure Data Lake Storage (ADLS Gen2) – blob storage
- IAM and Roles – access management for secure data operations
- Azure Data Factory (ADF) – dynamic pipelines for ingestion
- Azure Databricks (Spark) – processing, cleaning, joins, enrichment
- Azure Synapse Analytics – Gold layer storage, views, CETAS tables
- Python / PySpark – data processing
- Parquet format – columnar, compressed storage for analytics
Azure Data Factory (ADF) is used to ingest raw data from GitHub CSV files and MySQL.
Key Steps:
- Dynamic Pipeline Creation
- Pipeline Execution – ingestion of CSV files and MySQL table export to ADLS
- Data Storage in Bronze Layer – all raw data stored in `Bronze-Raw-Data`
Bronze Layer contains raw, unprocessed data that serves as a permanent reference for future processing.
Azure Databricks is used for big data processing using Spark.
Key Steps:
- Connecting Databricks to ADLS
- Data Cleaning & Transformation
  - Remove null rows
  - Remove duplicate records
  - Set correct data types (e.g., dates)
  - Join tables to create a denormalized wide table for analytics
- Data Enrichment using MongoDB
- Writing Clean Data to ADLS (Silver Layer)
- Visualizing Data in Databricks
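The cleaning steps above can be sketched in plain Python — in the actual notebooks this is PySpark (roughly `dropna`, `dropDuplicates`, and `withColumn` with `to_date`); the sample rows here are invented for illustration:

```python
from datetime import datetime

# Invented sample rows standing in for one of the Bronze CSVs.
rows = [
    {"order_id": "1", "order_placed_date": "2024-01-03"},
    {"order_id": "1", "order_placed_date": "2024-01-03"},  # duplicate record
    {"order_id": None, "order_placed_date": "2024-01-05"},  # null row
]

# 1. Remove rows containing nulls (PySpark: df.dropna()).
rows = [r for r in rows if all(v is not None for v in r.values())]

# 2. Remove duplicate records (PySpark: df.dropDuplicates()).
seen, deduped = set(), []
for r in rows:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 3. Correct data types: parse date strings into real dates
#    (PySpark: df.withColumn("order_placed_date", to_date(...))).
for r in deduped:
    r["order_placed_date"] = datetime.strptime(r["order_placed_date"], "%Y-%m-%d").date()

print(len(deduped))  # → 1
```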
Silver Layer contains cleaned, enriched, and analytics-ready data, optimized for storage and processing.
Azure Synapse Analytics is used to create final tables and views for analytics and reporting.
Key Steps:
- Connecting Synapse to ADLS
  - Configured IAM roles to access the `Silver-Clean-Data` folder
- Creating Views & Aggregations
- Writing Final Table to Gold Layer
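The CETAS step can be sketched as follows — the schema, table, column names, and the external data source / file format objects are illustrative assumptions, not the project's actual definitions:

```sql
-- Sketch of a CETAS statement in Synapse: materializes the query result
-- as Parquet files under the Gold container. Object names are hypothetical.
CREATE EXTERNAL TABLE gold.orders_summary
WITH (
    LOCATION = 'Gold-Serving-Data/orders_summary/',
    DATA_SOURCE = adls_gold,          -- external data source pointing at ADLS
    FILE_FORMAT = parquet_format      -- external file format (Parquet)
)
AS
SELECT customer_id,
       COUNT(*)            AS total_orders,
       AVG(delivery_delay) AS avg_delivery_delay
FROM   silver.orders_wide
GROUP  BY customer_id;
```

Unlike a plain external table, CETAS both writes the result files to ADLS and registers the table, so BI tools can query it directly.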
Gold Layer contains final analytics-ready datasets accessible by BI tools like Power BI or directly by end-users for reporting and analysis.
- Dynamic Multi-Source Ingestion – GitHub + MySQL via a single ADF pipeline
- Big Data Processing – Spark handles ~1 million rows efficiently
- Data Quality Management – Null removal, duplicates, data type correction
- Data Enrichment – Integrates MongoDB to enhance datasets
- Optimized Storage – Parquet columnar format for fast analytics
- Analytics-Ready Gold Layer – CETAS tables accessible to stakeholders













