This project demonstrates a real-time data engineering pipeline using Azure Data Factory, Databricks (PySpark), Delta Lake, and Azure Data Lake Storage Gen2.
The pipeline pulls data directly from the NYC Taxi API, eliminating the need for manual file uploads. It transforms and organizes data using the Medallion Architecture (Bronze → Silver → Gold) and ensures data is secure, optimized, and analytics-ready.
Data sources:
- NYC Taxi Trip Record Data
- NYC TLC Official Data Page
The pipeline follows the Medallion Architecture:
| Layer | Description |
|---|---|
| 🥉 Bronze | Raw data ingested from the API |
| 🥈 Silver | Cleaned and transformed data |
| 🥇 Gold | Modeled data used for analytics/reporting |
- **API Integration:**
  - Pulls live data from the official NYC Taxi API.
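As a minimal illustration of this step, the sketch below builds the download URL for one month of trip records and streams the file to disk. The CloudFront host and `<dataset>_<YYYY-MM>.parquet` naming pattern reflect the TLC's public file layout at the time of writing and should be treated as assumptions, not as the project's exact configuration:

```python
# Sketch of the pull that the pipeline automates. The URL pattern below is an
# assumption based on the TLC's public CloudFront layout -- verify it against
# the official data page before relying on it.
import urllib.request

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"  # assumed host

def trip_file_url(dataset: str, year: int, month: int) -> str:
    """Build the URL for one month of trip records, e.g. yellow_tripdata_2023-01.parquet."""
    return f"{BASE_URL}/{dataset}_{year}-{month:02d}.parquet"

def download_month(dataset: str, year: int, month: int, dest: str) -> None:
    """Stream one monthly Parquet file to disk (a Bronze-style raw landing)."""
    with urllib.request.urlopen(trip_file_url(dataset, year, month), timeout=60) as resp, \
         open(dest, "wb") as f:
        while chunk := resp.read(1 << 20):  # read 1 MiB at a time
            f.write(chunk)

if __name__ == "__main__":
    download_month("yellow_tripdata", 2023, 1, "yellow_tripdata_2023-01.parquet")
```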
- **Ingestion (ADF):**
  - Dynamic, parameterized pipelines ingest data into the Bronze layer in Parquet format.
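A parameterized ADF pipeline is expressed as JSON. The fragment below is a hedged sketch of what such a pipeline can look like; the pipeline, dataset, and parameter names (`pl_ingest_trip_month`, `ds_http_tripdata`, `ds_adls_bronze`) are illustrative, not taken from the project:

```json
{
  "name": "pl_ingest_trip_month",
  "properties": {
    "parameters": {
      "dataset": { "type": "string", "defaultValue": "yellow_tripdata" },
      "month":   { "type": "string", "defaultValue": "2023-01" }
    },
    "activities": [
      {
        "name": "CopyTripData",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "ds_http_tripdata",
            "type": "DatasetReference",
            "parameters": {
              "relativeUrl": {
                "value": "@concat(pipeline().parameters.dataset, '_', pipeline().parameters.month, '.parquet')",
                "type": "Expression"
              }
            }
          }
        ],
        "outputs": [
          { "referenceName": "ds_adls_bronze", "type": "DatasetReference" }
        ]
      }
    ]
  }
}
```

Because the source URL is built from pipeline parameters, the same pipeline can be triggered for any dataset and month instead of hard-coding one file per run.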
- **Transformation (Databricks & PySpark):**
  - Cleansing and data modeling happen here.
  - Output is saved to the Silver layer.
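The cleansing step might look like the sketch below. The column names follow the public yellow-taxi schema (`tpep_pickup_datetime`, `trip_distance`, `fare_amount`), and the specific rules are illustrative assumptions rather than the project's exact logic:

```python
# Hedged sketch of a Silver-layer cleansing step. Column names assume the
# public yellow-taxi schema; the rules are illustrative, not the project's.
def clean_yellow_trips(df):
    """Drop obviously bad records and normalise columns for the Silver layer.

    `df` is a PySpark DataFrame read from the Bronze Parquet files.
    """
    from pyspark.sql import functions as F  # imported lazily so the sketch parses without Spark

    return (
        df.dropDuplicates()
          .filter(F.col("trip_distance") > 0)     # remove zero-distance trips
          .filter(F.col("fare_amount") >= 0)      # negative fares are refunds/errors
          .withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))
          .withColumnRenamed("tpep_pickup_datetime", "pickup_ts")
          .withColumnRenamed("tpep_dropoff_datetime", "dropoff_ts")
    )

# Usage inside Databricks (paths are placeholders):
# bronze = spark.read.parquet("abfss://bronze@<storage>.dfs.core.windows.net/trips/")
# clean_yellow_trips(bronze).write.mode("overwrite").parquet(
#     "abfss://silver@<storage>.dfs.core.windows.net/trips/")
```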
- **Serving (Delta Lake + Parquet):**
  - Modeled data is stored in the Gold layer.
  - Delta Lake enables:
    - Time travel
    - Version control
    - ACID compliance
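The first two of those features can be sketched in a few lines. This is a hedged illustration: `spark` stands for the Databricks session, and the Gold path is a placeholder, not the project's real location:

```python
# Hedged sketch of Delta Lake time travel and versioning; the path is a placeholder.
GOLD_PATH = "abfss://gold@<storage>.dfs.core.windows.net/trips/"

def read_as_of_version(spark, version: int):
    """Time travel: read the Gold table exactly as it was at a past commit."""
    return spark.read.format("delta").option("versionAsOf", version).load(GOLD_PATH)

def commit_history(spark):
    """Version control: every write is a commit recorded in the _delta_log."""
    return spark.sql(f"DESCRIBE HISTORY delta.`{GOLD_PATH}`")
```

ACID compliance needs no extra code: each write shown in `DESCRIBE HISTORY` either committed fully or not at all.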
- **Security:**
  - Azure Active Directory + Key Vault + RBAC for safe access.
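The AAD + Key Vault combination typically looks like the sketch below, using the `azure-identity` and `azure-keyvault-secrets` SDKs; the vault URL and secret name are placeholders, and the function name is mine, not the project's:

```python
# Hedged sketch: authenticate with Azure AD and read a secret from Key Vault.
def get_storage_secret(vault_url: str, secret_name: str) -> str:
    """Resolve an AAD identity, then fetch a secret (e.g. a storage key)."""
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    credential = DefaultAzureCredential()  # service principal, managed identity, or CLI login
    client = SecretClient(vault_url=vault_url, credential=credential)
    return client.get_secret(secret_name).value

# In Databricks the same pattern is usually a Key Vault-backed secret scope:
# dbutils.secrets.get(scope="kv-scope", key="storage-account-key")
```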
- All layers use Parquet for efficient storage.
- The Gold layer uses Delta Lake for advanced features (time travel, version control, ACID compliance).
| Feature | Use Case |
|---|---|
| Azure Active Directory | Identity & access management |
| Azure Key Vault | Secret management |
| Role-Based Access Control (RBAC) | Restrict data layer access |
| Tool | Purpose |
|---|---|
| Azure Data Factory | Orchestration & API ingestion |
| Databricks + PySpark | Data transformation |
| Azure Data Lake Storage Gen2 | Storage across all layers |
| Delta Lake | Time travel, version control |
| Azure Key Vault | Secret management |
| Azure Active Directory | Secure authentication |
Throughout this project, the following key topics and concepts were explored:
- Introduction to Real-Time Data Engineering
- Designing Scalable Data Architecture
- Understanding the Medallion Architecture: Bronze, Silver, Gold Layers
- Azure Fundamentals and Account Setup
- Exploring and Understanding the NYC Taxi Dataset
- Creating Azure Resource Groups and Storage Accounts
- Setting up Azure Data Lake Storage Gen2
- Building Azure Data Factory Pipelines
- Ingesting Data from Public APIs using Azure Data Factory
- Real-Time Data Ingestion Scenarios in ADF
- Creating Dynamic & Parameterized Pipelines in ADF
- Accessing Azure Data Lake using Databricks
- Working with Databricks Clusters
- Reading Data Using PySpark
- Data Transformation Using PySpark Functions
- Data Analysis in PySpark
- Difference Between Managed and External Delta Tables
- Creating and Managing Delta Tables in Databricks
- Exploring Delta Log Files
- Querying Delta Tables with Versioning
- Time Travel in Delta Lake
- Connecting Databricks to Business Intelligence Tools
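The managed-vs-external distinction from the list above comes down to who owns the files. The DDL sketch below is illustrative (table names and the storage path are placeholders, not the project's):

```python
# Hedged sketch contrasting managed and external Delta tables in Spark SQL.
# Run each with spark.sql(...) in Databricks; names and paths are placeholders.
MANAGED_DDL = """
CREATE TABLE IF NOT EXISTS gold.trips_summary (          -- managed: Spark owns data + metadata;
    pickup_date DATE, trips BIGINT, total_fare DOUBLE    -- DROP TABLE also deletes the files
) USING DELTA
"""

EXTERNAL_DDL = """
CREATE TABLE IF NOT EXISTS gold.trips_summary_ext        -- external: only metadata is registered;
USING DELTA                                              -- DROP TABLE leaves the files in place
LOCATION 'abfss://gold@<storage>.dfs.core.windows.net/trips_summary/'
"""
```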
Below are implementation screenshots taken directly from the live project:
By the end of this project, we:
- Built a fully automated, real-time pipeline
- Followed best practices using Medallion Architecture
- Integrated security and transformation at scale
- Created data ready for business intelligence tools
Planned next steps:
- Add Power BI/Looker Studio dashboards for visualization
- Schedule CI/CD deployment with Azure DevOps
- Integrate alerts using Azure Monitor & Log Analytics
💡 Built with ❤️ by Neha Bharti