
Retail Data Engineering Pipeline

Azure Databricks · PySpark

Project Overview

This project demonstrates an end-to-end, enterprise-grade data engineering solution built on the Microsoft Azure cloud platform. It implements a modern data lakehouse using the **Medallion Architecture** (Bronze-Silver-Gold layers) to process retail transaction data from multiple heterogeneous sources.

Key Achievements

  • ✅ Architected an end-to-end ETL solution using Azure Data Factory to ingest heterogeneous data from Azure SQL Database and REST APIs into ADLS Gen2 Storage.
  • ✅ Implemented a robust Medallion Architecture (Bronze → Silver → Gold) for data quality and governance.
  • ✅ Developed high-performance PySpark transformations in Azure Databricks to process and transform data across Delta Lake layers.
  • ✅ Built interactive Power BI dashboards for real-time business intelligence and analytics.
  • ✅ Ensured data quality, consistency, and reliability throughout the entire pipeline.
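The cleaning, deduplication, and validation work called out above happens in the Bronze → Silver step. A minimal sketch of that logic in plain Python (the repository's actual notebooks use PySpark DataFrames on Delta tables; the column names `transaction_id` and `_ingested_at` are illustrative assumptions, not taken from the source schema):

```python
# Illustrative Bronze -> Silver cleaning step. The real pipeline does this
# with PySpark on Delta Lake; this plain-Python stand-in shows the same
# logic: drop rows missing the business key, then deduplicate, keeping the
# most recently ingested record. Column names are hypothetical.

def bronze_to_silver(rows):
    """Clean and deduplicate raw transaction rows."""
    valid = [r for r in rows if r.get("transaction_id") is not None]
    latest = {}
    for r in valid:
        key = r["transaction_id"]
        if key not in latest or r["_ingested_at"] > latest[key]["_ingested_at"]:
            latest[key] = r
    return list(latest.values())

bronze = [
    {"transaction_id": 1, "amount": 10.0, "_ingested_at": "2024-01-01T00:00"},
    {"transaction_id": 1, "amount": 12.0, "_ingested_at": "2024-01-02T00:00"},  # later correction
    {"transaction_id": None, "amount": 5.0, "_ingested_at": "2024-01-01T00:00"},  # invalid row
]
silver = bronze_to_silver(bronze)
# One row survives per transaction_id, carrying the latest amount (12.0).
```

In PySpark the same step would typically be a `filter` on non-null keys followed by a window or `dropDuplicates` ordered by ingestion timestamp.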

🏗️ Architecture Diagram

Retail Data Pipeline Architecture

Architecture Components

The pipeline consists of the following key components:

  1. Data Sources (Left)

    • Azure SQL Database: Three tables (Transaction, Product, Store)
    • REST API: Customer data in JSON format
  2. Ingestion Layer (Azure Data Factory)

    • Orchestrates data movement from multiple sources
    • Handles incremental and full data loads
    • Implements error handling and retry logic
  3. Storage Layer (Azure Data Lake Storage Gen2)

    • Centralized data lake for all raw and processed data
    • Hierarchical namespace for efficient data organization
    • Cost-effective storage with high throughput
  4. Processing Layer (Azure Databricks)

    • Distributed PySpark processing engine
    • Implements business logic and transformations
    • Handles data quality checks and validations
  5. Data Lakehouse (Medallion Architecture)

    • Bronze Layer: Raw data ingestion (as-is from source)
    • Silver Layer: Cleaned, validated, and deduplicated data
    • Gold Layer: Business-level aggregations and analytics-ready datasets
  6. Visualization Layer

    • Interactive dashboards and reports
    • Real-time business metrics
    • Self-service analytics for stakeholders
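The Gold layer's "business-level aggregations" can be illustrated with a small example: daily revenue per store, ready for a Power BI dashboard. Again a plain-Python sketch of logic the project implements with PySpark `groupBy` aggregations over Delta tables; the field names (`store_id`, `txn_date`, `amount`) are assumed for illustration:

```python
from collections import defaultdict

# Illustrative Silver -> Gold aggregation: total revenue per (store, day).
# The real pipeline would express this as a PySpark groupBy over a Delta
# table; field names here are hypothetical.

def silver_to_gold(rows):
    """Aggregate cleaned transactions into daily per-store revenue."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["store_id"], r["txn_date"])] += r["amount"]
    return [
        {"store_id": s, "txn_date": d, "daily_revenue": v}
        for (s, d), v in sorted(totals.items())
    ]

silver = [
    {"store_id": "S1", "txn_date": "2024-01-01", "amount": 10.0},
    {"store_id": "S1", "txn_date": "2024-01-01", "amount": 5.0},
    {"store_id": "S2", "txn_date": "2024-01-01", "amount": 7.5},
]
gold = silver_to_gold(silver)
# Two Gold rows: S1 with 15.0 revenue, S2 with 7.5.
```

Writing the result back as a Gold Delta table gives BI tools a small, pre-aggregated dataset instead of scanning raw transactions.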
