This project implements a production-grade ETL pipeline on Azure Databricks, following the Medallion Architecture (Bronze → Silver → Gold). It uses:
- 🔄 Auto Loader for streaming & incremental data ingestion
- 🔥 Delta Live Tables (DLT) for declarative transformations
- 🏢 Unity Catalog for governance
- 🧠 SCD Type 1 logic for warehouse dimensions
- 💾 ADLS Gen2 for layered cloud storage
```
+---------------------+
|   Source Parquet    |
+---------------------+
          ↓
 Bronze Layer (raw ingest via Auto Loader)
          ↓
 Silver Layer (cleaned DataFrames with schema)
          ↓
 Gold Layer  (fact & dim tables with SCD logic)
```
✅ Ingested source data using Auto Loader
✅ Parameterized ingestion notebook (file_name input widget)
✅ Streaming Delta tables with checkpointing for exactly-once processing
📷 Auto Loader Notebook:
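A minimal sketch of the Bronze ingestion pattern described above; the container paths, schema/checkpoint locations, and widget default are illustrative assumptions rather than the exact notebook code:

```python
# Parameterized Bronze ingestion via Auto Loader (paths and defaults are assumptions)
dbutils.widgets.text("file_name", "orders")            # notebook input widget
file_name = dbutils.widgets.get("file_name")

source_path = f"abfss://source@datalakemushtaaqe2e.dfs.core.windows.net/{file_name}"
bronze_path = f"abfss://bronze@datalakemushtaaqe2e.dfs.core.windows.net/{file_name}"

df = (spark.readStream
        .format("cloudFiles")                          # Auto Loader
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", f"{bronze_path}/_schema")
        .load(source_path))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", f"{bronze_path}/_checkpoint")  # exactly-once via checkpointing
   .option("path", f"{bronze_path}/data")
   .trigger(availableNow=True)                         # picks up only new files each run
   .start())
```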
✅ PySpark transformations
✅ Dropped duplicates, selected columns, schema validation
✅ Tables registered to Unity Catalog
📷 Silver Layer Table in Catalog:
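A condensed sketch of the Silver-layer cleanup; the table and column names below are illustrative assumptions:

```python
# Silver-layer cleanup (table/column names are assumptions)
from pyspark.sql import functions as F

bronze_df = spark.read.table("databricks_catalog.bronze.orders_bronze")

silver_df = (bronze_df
             .select("order_id", "customer_id", "product_id", "order_date", "amount")
             .dropDuplicates(["order_id"])                       # remove duplicate orders
             .withColumn("order_date", F.to_date("order_date"))  # enforce expected types
             .filter(F.col("order_id").isNotNull()))

# Register the cleaned table in Unity Catalog
silver_df.write.mode("overwrite").saveAsTable("databricks_catalog.silver.orders_silver")
```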
✅ Created FactOrders and DimCustomers
✅ Applied SCD Type 1 (overwrite on match, no history)
✅ Used merge() for UPSERT logic
✅ Created surrogate keys (monotonically_increasing_id() + offset)
📷 SCD Logic using Delta + PySpark:
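A sketch of the SCD Type 1 upsert pattern used for the dimension tables; table names, key columns, and attribute columns are assumptions:

```python
# SCD Type 1 upsert into DimCustomers (names, keys, and columns are assumptions)
from delta.tables import DeltaTable
from pyspark.sql import functions as F

src = spark.read.table("databricks_catalog.silver.customers_silver")
dim = DeltaTable.forName(spark, "databricks_catalog.gold.DimCustomers")

# Surrogate key for incoming rows: monotonically_increasing_id() shifted by the current max key
max_key = dim.toDF().agg(F.coalesce(F.max("DimCustomerKey"), F.lit(0)).alias("k")).first()["k"]
src = src.withColumn("DimCustomerKey", F.monotonically_increasing_id() + F.lit(max_key + 1))

(dim.alias("t")
    .merge(src.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdate(set={                 # SCD Type 1: overwrite attributes, keep existing key
        "customer_name": "s.customer_name",
        "email": "s.email",
    })
    .whenNotMatchedInsertAll()               # new customers get the freshly generated key
    .execute())
```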
Used `@dlt.expect_all_or_drop()` to enforce data-quality constraints on streaming tables:

```python
import dlt

my_rules = {
    "rule1": "product_id IS NOT NULL",
    "rule2": "product_name IS NOT NULL"
}

@dlt.table()
@dlt.expect_all_or_drop(my_rules)   # rows violating any rule are dropped
def DimProducts_stage():
    return spark.readStream.table("databricks_catalog.silver.products_silver")
```

Used the ABFSS URI format to access containers:

```python
.option("path", "abfss://gold@datalakemushtaaqe2e.dfs.core.windows.net/FactOrders")
```

Separate containers are used for:
- `source/` (raw data)
- `bronze/`, `silver/`, `gold/` (ETL stages)
- `metastore/` (Unity Catalog backing store)
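For context, that path option typically sits inside a streaming write to the gold container, roughly like this (the source DataFrame, checkpoint location, and trigger are assumptions):

```python
# Writing FactOrders to the gold container via its ABFSS URI
# (fact_orders_df, checkpoint path, and trigger are illustrative assumptions)
(fact_orders_df.writeStream
    .format("delta")
    .option("checkpointLocation",
            "abfss://gold@datalakemushtaaqe2e.dfs.core.windows.net/FactOrders/_checkpoint")
    .option("path", "abfss://gold@datalakemushtaaqe2e.dfs.core.windows.net/FactOrders")
    .trigger(availableNow=True)
    .start())
```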
| Tool | Usage |
|---|---|
| Azure Databricks | Compute, Notebooks, Jobs |
| Delta Lake | Storage Format + ACID |
| Unity Catalog | Centralized Schema Governance |
| Azure ADLS Gen2 | Cloud Object Storage |
| PySpark / SQL | Transformations |
| Delta Live Tables | Declarative Streaming & DWH |
| RocksDB | Stream State Management |
```
AzureLakehouseProject/
│
├── notebooks/
│   ├── bronze_ingestion.ipynb
│   ├── silver_customers.ipynb
│   ├── silver_orders.ipynb
│   ├── gold_dim_customers.ipynb
│   └── gold_fact_orders.ipynb
│
├── screenshots/
│   ├── auto_loader_notebook.png
│   ├── silver_output_catalog.png
│   └── dlt_scd.png
│
├── README.md
└── .gitignore
```
- 🔧 Set up Azure ADLS Gen2 with containers: `source/`, `bronze/`, `silver/`, `gold/`
- 🔐 Set up Unity Catalog + External Locations (see the sketch after this list)
- 📥 Upload source Parquet files to the `source/` container
- 🔄 Run the Bronze Auto Loader notebook
- 🧹 Run Silver notebooks to transform & clean the data
- 🏛️ Run Gold notebooks to build the fact/dim tables
- 🔁 Automate via Databricks Workflows (parameterized)
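One way the external-location step can look, expressed as a minimal sketch; the credential and location names are assumptions and would normally be created once by a metastore admin:

```python
# Register the gold container as a Unity Catalog external location
# (credential name and location name are illustrative assumptions)
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS gold_ext_location
  URL 'abfss://gold@datalakemushtaaqe2e.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL adls_managed_identity_cred)
""")
```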
- RocksDB is used under the hood as the Structured Streaming state store
- Streaming tables and views are used inside DLT
- Handles both initial and incremental loading cleanly
- Parameterized ingestion supports looping through datasets using jobs (see the sketch below)
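One way this parameterized looping can be wired up from a driver notebook; the workspace notebook path and dataset list are assumptions, and in practice the same effect is achieved with one Workflows task per dataset:

```python
# Driver loop over datasets, invoking the parameterized Bronze notebook
# (notebook path and dataset names are illustrative assumptions)
datasets = ["orders", "customers", "products"]

for name in datasets:
    dbutils.notebook.run(
        "/notebooks/bronze_ingestion",   # parameterized ingestion notebook
        3600,                            # timeout in seconds
        {"file_name": name},             # consumed via dbutils.widgets.get("file_name")
    )
```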
Mustaaq Ahmed Khan
Databricks | Azure | F1 Strategy Enthusiast
🔗 LinkedIn | 🧠 Passionate about ETL + Race Data + Real-time Systems
⭐ Star this repo if you found it helpful!

