AryanBagoria/End-to-End-Ecom-Data-Engineering

End-to-End E-commerce Data Engineering Project

This project demonstrates a full end-to-end data engineering pipeline for ingesting, processing, transforming, and serving large-scale e-commerce data using Azure Data Factory, Azure Databricks, and Azure Synapse Analytics. The project follows a Medallion Architecture: Bronze → Silver → Gold.

[Project architecture diagram]

Project Introduction

This project implements a robust pipeline that handles approximately 1 million rows from multiple sources. Data is ingested from 7 different GitHub CSV files (~120K rows each) and a MySQL database into Azure Data Lake Storage (ADLS Gen2) using Azure Data Factory.

To make the ingestion fully automated and scalable, an Azure Data Factory (ADF) pipeline was created using dynamic parameterization. The pipeline loops through an array of file URLs and automatically assigns filenames in ADLS, allowing multiple files to be ingested efficiently without manual intervention. This forms the Bronze layer of the Medallion Architecture, storing raw, unprocessed data.
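
The loop-and-rename logic described above can be sketched in plain Python. All URLs and names below are illustrative placeholders, not the project's actual ADF configuration:

```python
# Illustrative sketch of the ADF ForEach logic: loop over an array of file
# URLs and derive each file's target name in ADLS automatically.
# All URLs and filenames here are hypothetical placeholders.
file_urls = [
    "https://raw.githubusercontent.com/example/repo/main/orders.csv",
    "https://raw.githubusercontent.com/example/repo/main/customers.csv",
    "https://raw.githubusercontent.com/example/repo/main/payments.csv",
]

def target_filename(url: str) -> str:
    """Use the last path segment of the URL as the ADLS sink filename."""
    return url.rsplit("/", 1)[-1]

# Each (source, sink) pair would parameterize one Copy activity run.
copy_plan = [(url, f"Bronze-Raw-Data/{target_filename(url)}") for url in file_urls]
for src, dst in copy_plan:
    print(f"{src} -> {dst}")
```

In ADF itself this corresponds to a pipeline array parameter iterated by a ForEach activity, with each item feeding the HTTP source and ADLS sink datasets of a Copy activity.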

For the Silver layer, Azure Databricks is used to process large datasets using Spark. The transformations include:

  • Data Cleaning: Removing null rows and duplicates
  • Data Type Corrections: Ensuring correct formats for dates, numbers, and categorical fields
  • Calculated Columns: Creating new metrics such as delivery_delay = datediff(order_delivered_date, order_placed_date) to enrich analytics capabilities
  • Table Joins: Combining multiple tables into a denormalized wide table for analytics and reporting
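
The delivery_delay metric mirrors Spark's datediff, which returns the whole-day difference between two dates. A minimal pure-Python equivalent (column names as in the bullet above; the sample values are hypothetical):

```python
from datetime import date

# Pure-Python equivalent of Spark's
# delivery_delay = datediff(order_delivered_date, order_placed_date);
# datediff(end, start) returns end - start in whole days.
def datediff(end: date, start: date) -> int:
    return (end - start).days

# Hypothetical order record for illustration.
order = {
    "order_placed_date": date(2024, 3, 1),
    "order_delivered_date": date(2024, 3, 8),
}
order["delivery_delay"] = datediff(order["order_delivered_date"],
                                   order["order_placed_date"])
print(order["delivery_delay"])  # 7
```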

The Silver layer is further enriched with MongoDB data, adding additional attributes to enhance insights. After processing, the clean and enriched data is stored in Parquet format in ADLS under the folder Silver-Clean-Data, enabling columnar storage and compression for fast query performance.
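
The MongoDB enrichment step is essentially a lookup join against a collection. It is sketched here with plain Python dicts standing in for the collection; field names and values are hypothetical, and the real pipeline would read the collection with a Spark-MongoDB connector or pymongo:

```python
# Hypothetical MongoDB collection contents: extra attributes keyed by
# product_category (all names and values illustrative).
mongo_docs = [
    {"product_category": "electronics", "category_group": "hard goods"},
    {"product_category": "apparel", "category_group": "soft goods"},
]
lookup = {d["product_category"]: d for d in mongo_docs}

# Rows from the Silver wide table (hypothetical sample).
silver_rows = [
    {"order_id": 1, "product_category": "electronics"},
    {"order_id": 2, "product_category": "apparel"},
]

# Left-join the extra attribute onto each row.
enriched = [
    {**row,
     "category_group": lookup.get(row["product_category"], {}).get("category_group")}
    for row in silver_rows
]
print(enriched[0])
```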

Finally, the Gold layer is implemented using Azure Synapse Analytics. IAM roles and permissions are configured to allow Synapse to access the Silver layer. In Synapse:

  • Views and aggregations are created for analytical use
  • A CETAS (Create External Table As Select) table is generated and stored in Gold-Serving-Data

The Gold layer delivers analytics-ready, high-quality data, which can be directly consumed by BI tools such as Power BI or by end-users for reporting, enabling faster insights and decision-making.

Project Overview

  • Data Sources:

    • 7 CSV files from GitHub (~120K rows each; total ~1M rows)
    • One MySQL database table
    • One MongoDB database for data enrichment
  • Storage (ADLS Gen2):

    • Bronze Layer: Bronze-Raw-Data (raw unprocessed data)
    • Silver Layer: Silver-Clean-Data (cleaned and enriched data)
    • Gold Layer: Gold-Serving-Data (analytics-ready datasets)
  • Tools & Technologies:

    • Azure Data Lake Storage (ADLS Gen2) – hierarchical blob storage
    • IAM and Roles – access management for secure data operations
    • Azure Data Factory (ADF) – dynamic pipelines for ingestion
    • Azure Databricks (Spark) – processing, cleaning, joins, enrichment
    • Azure Synapse Analytics – Gold layer storage, views, CETAS tables
    • Python / PySpark – data processing
    • Parquet format – columnar, compressed storage for analytics

Architecture & Workflow

1. Bronze Layer – Raw Data Ingestion

Azure Data Factory (ADF) is used to ingest raw data from GitHub CSV files and MySQL.

Key Steps:

  1. Create Linked Services for HTTP (GitHub) and MySQL:
    HTTP Linked Service Setup
    MySQL Linked Service Setup

  2. Dynamic Pipeline Creation

    • ADF pipeline uses dynamic parameters to loop through an array containing:
      • GitHub CSV URLs
      • Target filenames in ADLS
    • This allows ingestion of multiple files in a single pipeline.
      Setting up dynamic parameters
      Array looping through HTTP files
  3. Pipeline Execution – ingestion of CSV files and MySQL table export to ADLS:
    Executing pipeline

  4. Data Storage in Bronze Layer – all raw data stored in Bronze-Raw-Data:
    Bronze layer data in ADLS

Bronze Layer contains raw, unprocessed data that serves as a permanent reference for future processing.


2. Silver Layer – Data Processing & Transformation

Azure Databricks is used to process the large datasets with Spark.

Key Steps:

  1. Connecting Databricks to ADLS

    • Configured Spark cluster with required permissions to access ADLS storage.
      Databricks connection
  2. Data Cleaning & Transformation

    • Remove null rows
    • Remove duplicate records
    • Set correct data types (e.g., dates)
    • Join tables to create a denormalized wide table for analytics
  3. Data Enrichment using MongoDB

    • Enrich the wide table with data from MongoDB
      MongoDB source
      Data enrichment
  4. Writing Clean Data to ADLS (Silver Layer)

    • Stored in Parquet format for columnar storage and compression
    • Ready for analytics
      Writing back to ADLS
      Parquet files in Silver layer
  5. Visualizing Data in Databricks

    • Quick visual checks to validate transformations and joins
      Databricks visualization

Silver Layer contains cleaned, enriched, and analytics-ready data, optimized for storage and processing.
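
The cleaning steps above can be sketched on a few hypothetical rows; the project performs the same operations on PySpark DataFrames (dropna, dropDuplicates, cast, join), so this is an equivalent-in-spirit sketch, not the actual job:

```python
# Minimal sketch of the Silver cleaning steps (all sample data hypothetical).
orders = [
    {"order_id": "1", "customer_id": "c1", "price": "10.5"},
    {"order_id": "1", "customer_id": "c1", "price": "10.5"},  # duplicate
    {"order_id": "2", "customer_id": None, "price": "7.0"},   # null field
]
customers = [{"customer_id": "c1", "city": "Seattle"}]

# 1) Remove rows containing nulls, 2) drop exact duplicates.
non_null = [r for r in orders if all(v is not None for v in r.values())]
deduped = list({tuple(sorted(r.items())): r for r in non_null}.values())

# 3) Correct data types (price as a number instead of a string).
for r in deduped:
    r["price"] = float(r["price"])

# 4) Join with customers to build a denormalized wide row.
cust_by_id = {c["customer_id"]: c for c in customers}
wide = [{**r, **cust_by_id.get(r["customer_id"], {})} for r in deduped]
print(wide)
```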


3. Gold Layer – Analytics-Ready Data

Azure Synapse Analytics is used to create final tables and views for analytics and reporting.

Key Steps:

  1. Connecting Synapse to ADLS

    • Configured IAM roles to access Silver-Clean-Data folder
  2. Creating Views & Aggregations

    • Build views for end-user analytics
      Creating Views
  3. Writing Final Table to Gold Layer

    • CETAS (Create External Table As Select) stores final data in Gold-Serving-Data
      Writing back to Gold layer using CETAS
      Parquet files in Gold layer

Gold Layer contains final analytics-ready datasets accessible by BI tools like Power BI or directly by end-users for reporting and analysis.
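
The CETAS statement from step 3 can be sketched as a T-SQL string (assembled here in Python so the names stay clearly marked as placeholders; the external data source, file format, and table names are hypothetical, not the project's actual Synapse objects):

```python
# Build a hypothetical CETAS statement: Synapse evaluates the SELECT and
# writes the result set to the Gold-Serving-Data location as an external
# Parquet table. All object names below are illustrative placeholders.
table = "gold.sales_summary"
location = "Gold-Serving-Data/sales_summary/"
cetas = f"""
CREATE EXTERNAL TABLE {table}
WITH (
    LOCATION = '{location}',
    DATA_SOURCE = gold_adls,          -- external data source (placeholder)
    FILE_FORMAT = parquet_format      -- external file format (placeholder)
)
AS
SELECT order_status, COUNT(*) AS order_count
FROM silver.wide_orders
GROUP BY order_status;
"""
print(cetas)
```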


Key Features

  • Dynamic Multi-Source Ingestion – GitHub + MySQL via a single ADF pipeline
  • Big Data Processing – Spark handles ~1 million rows efficiently
  • Data Quality Management – Null removal, duplicates, data type correction
  • Data Enrichment – Integrates MongoDB to enhance datasets
  • Optimized Storage – Parquet columnar format for fast analytics
  • Analytics-Ready Gold Layer – CETAS tables accessible to stakeholders
