This project demonstrates a production-grade end-to-end data engineering pipeline for e-commerce data using Microsoft Azure and Databricks Medallion Architecture.
It showcases how raw data is ingested, transformed, and served for analytics using cloud-native, scalable, and enterprise-ready design patterns.
The platform follows a layered Medallion Architecture:
graph TD
%% Define Styles with Black Text
classDef source fill:#FFD1DC,stroke:#333,stroke-width:2px,color:#000;
classDef landing fill:#FFFACD,stroke:#d4a017,stroke-width:2px,color:#000;
classDef bronze fill:#E6CCB2,stroke:#333,stroke-width:2px,color:#000;
classDef silver fill:#E5E4E2,stroke:#333,stroke-width:2px,color:#000;
classDef gold fill:#FFD700,stroke:#333,stroke-width:2px,color:#000;
classDef viz fill:#D6EAF8,stroke:#333,stroke-width:2px,color:#000;
%% Pipeline Flow
S[(Source CSV Files)]:::source --> L1[Landing Zone 1: Raw CSV]:::landing
L1 --> L2[Landing Zone 2: Parquet]:::landing
L2 --> B[Bronze Layer: Raw Delta]:::bronze
B --> Si[Silver Layer: Cleaned & Transformed]:::silver
Si --> G[Gold Layer: Analytics-Ready]:::gold
G --> V{Analytics & Dashboards}:::viz
- Raw CSV files are manually uploaded into Azure Data Lake Storage Gen2
- Data is stored as-is (schema-on-read)
- Acts as a system of record for replay and audit
- Azure Data Factory (ADF) pipeline converts CSV files into Parquet
- Improves:
- Query performance
- Compression efficiency
- Downstream processing speed
- Reads Parquet files from Landing Zone 2
- Performs append-only ingestion into Delta tables
- No transformations applied
- Preserves raw structure and ingestion metadata
- Cleans and standardizes data
- Handles:
- Null values
- Duplicates
- Data type corrections
- Applies business-level transformations
- Implements efficient incremental processing
- Creates business-ready, analytics-optimized tables
- Aggregates and joins cleaned data
- Serves as the single source of truth for reporting
- Event-based triggers activate pipelines when new data arrives
- Single trigger with conditional processing ensures efficiency
- Prevents unnecessary recomputation of static datasets
- Azure Synapse Analytics used for SQL-based analytics
- Databricks Dashboards for interactive visualizations
- Gold layer tables are queried directly for insights
- Azure Data Factory
- Azure Data Lake Storage Gen2
- Azure Databricks
- Delta Lake
- Parquet
- Azure Synapse Analytics
- Designing scalable cloud data architectures
- Implementing Medallion Architecture in Databricks
- Building event-driven ETL pipelines
- Optimizing storage and analytics using Parquet & Delta Lake
- Applying real-world data engineering best practices
- Implement Auto Loader for streaming ingestion
- Add data quality checks and validations
- Integrate Power BI dashboards
- Enable CI/CD for Databricks workflows
If you’re interested in Data Engineering, Azure, or Databricks, feel free to connect or explore the notebooks in this repository.