NYC Taxi Data Engineering 🚖📊

📝 Project Overview

This project demonstrates a real-time data engineering pipeline using Azure Data Factory, Databricks (PySpark), Delta Lake, and Azure Data Lake Storage Gen2.

The pipeline pulls data directly from the NYC Taxi API, eliminating the need for manual file uploads. It transforms and organizes data using the Medallion Architecture (Bronze → Silver → Gold) and ensures data is secure, optimized, and analytics-ready.


🔗 Dataset Source

NYC Taxi Trip Record Data
NYC TLC Official Data Page


🏗️ Architecture

The pipeline follows the Medallion Architecture:

| Layer | Description |
| --- | --- |
| 🥉 Bronze | Raw data ingested from the API |
| 🥈 Silver | Cleaned and transformed data |
| 🥇 Gold | Modeled data used for analytics/reporting |

📌 Architecture Diagram

Architecture Diagram


🔄 End-to-End Flow

  1. API Integration
    • Pulls live data from the official NYC Taxi API.
  2. Ingestion (Azure Data Factory)
    • Dynamic, parameterized pipelines land the data in the Bronze layer in Parquet format.
  3. Transformation (Databricks & PySpark)
    • Cleansing and data modeling happen here; the output is saved to the Silver layer (see the PySpark sketch after this list).
  4. Serving (Delta Lake + Parquet)
    • Modeled data is stored in the Gold layer.
    • Delta Lake provides time travel, versioning, and ACID transactions.
  5. Security
    • Azure Active Directory, Key Vault, and RBAC secure access across the layers.
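As a rough sketch of the Bronze → Silver transformation step, the snippet below reads the raw Parquet files landed by ADF, applies a few typical cleansing rules, and writes the result to the Silver container. The storage account name, container layout, and exact transformations are illustrative assumptions, not necessarily the ones used in this repository's notebooks.

```python
from pyspark.sql import functions as F

# Illustrative ADLS Gen2 paths -- the storage account and folder names are assumptions.
bronze_path = "abfss://bronze@nyctaxistorage.dfs.core.windows.net/trip_data/"
silver_path = "abfss://silver@nyctaxistorage.dfs.core.windows.net/trip_data/"

# Read the raw Parquet files ingested by the ADF pipeline (spark is provided by Databricks).
raw_df = spark.read.parquet(bronze_path)

# Basic cleansing: drop duplicates, remove invalid trips, derive date and duration columns.
clean_df = (
    raw_df.dropDuplicates()
          .filter(F.col("trip_distance") > 0)
          .withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))
          .withColumn("trip_duration_min",
                      (F.unix_timestamp("tpep_dropoff_datetime")
                       - F.unix_timestamp("tpep_pickup_datetime")) / 60)
)

# Persist the cleaned data to the Silver layer in Parquet format.
clean_df.write.mode("overwrite").parquet(silver_path)
```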

💾 Storage Format

All layers store data in Parquet for efficient columnar storage.
The Gold layer uses Delta Lake (Parquet files plus a transaction log) for advanced features such as time travel, versioning, and ACID transactions.
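A minimal sketch of the Gold serving step, reusing the illustrative names from the previous snippet: the modeled data is written as a Delta table so it gains ACID transactions, versioning, and time travel.

```python
gold_path = "abfss://gold@nyctaxistorage.dfs.core.windows.net/trip_summary/"

# Aggregate the Silver data into an analytics-ready model (the grouping is illustrative).
gold_df = (
    clean_df.groupBy("pickup_date")
            .agg(F.count("*").alias("trip_count"),
                 F.round(F.sum("total_amount"), 2).alias("total_revenue"))
)

# Write in Delta format so the Gold layer gains ACID guarantees and time travel.
gold_df.write.format("delta").mode("overwrite").save(gold_path)

# Expose the Delta location as an external table for SQL and BI access
# (assumes a "gold" schema already exists; the name is a placeholder).
spark.sql(f"CREATE TABLE IF NOT EXISTS gold.trip_summary USING DELTA LOCATION '{gold_path}'")
```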


🔐 Security Implementation

| Feature | Use Case |
| --- | --- |
| Azure Active Directory | Identity & access management |
| Azure Key Vault | Secret management |
| Role-Based Access Control (RBAC) | Restrict data layer access |
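For context, one common way to wire these pieces together in Databricks is a Key Vault-backed secret scope plus a service principal for ADLS Gen2 access. The scope, key, tenant, and account names below are placeholders, not values from this project.

```python
# Fetch service-principal credentials from a Key Vault-backed secret scope (names are placeholders).
client_id     = dbutils.secrets.get(scope="kv-nyctaxi", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="kv-nyctaxi", key="sp-client-secret")
tenant_id     = dbutils.secrets.get(scope="kv-nyctaxi", key="tenant-id")

account = "nyctaxistorage"  # placeholder storage account name

# Standard OAuth configuration so Spark can authenticate to ADLS Gen2 as the service principal.
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
```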

⚙️ Tools & Services

| Tool | Purpose |
| --- | --- |
| Azure Data Factory | Orchestration & API ingestion |
| Databricks + PySpark | Data transformation |
| Azure Data Lake Gen2 | Storage across all layers |
| Delta Lake | Time travel, version control |
| Azure Key Vault | Secret management |
| Azure Active Directory | Secure authentication |

📚 Topics Covered

Throughout this project, the following key topics and concepts were explored:

  • Introduction to Real-Time Data Engineering
  • Designing Scalable Data Architecture
  • Understanding the Medallion Architecture: Bronze, Silver, Gold Layers
  • Azure Fundamentals and Account Setup
  • Exploring and Understanding the NYC Taxi Dataset
  • Creating Azure Resource Groups and Storage Accounts
  • Setting up Azure Data Lake Storage Gen2
  • Building Azure Data Factory Pipelines
  • Ingesting Data from Public APIs using Azure Data Factory
  • Real-Time Data Ingestion Scenarios in ADF
  • Creating Dynamic & Parameterized Pipelines in ADF
  • Accessing Azure Data Lake using Databricks
  • Working with Databricks Clusters
  • Reading Data Using PySpark
  • Data Transformation Using PySpark Functions
  • Data Analysis in PySpark
  • Difference Between Managed and External Delta Tables
  • Creating and Managing Delta Tables in Databricks
  • Exploring Delta Log Files
  • Querying Delta Tables with Versioning
  • Time Travel in Delta Lake (see the query sketch after this list)
  • Connecting Databricks to Business Intelligence Tools
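As a quick illustration of the Delta log, versioned queries, and time travel mentioned above (table and path names carried over from the earlier Gold sketch, so they remain placeholders):

```python
# Inspect the commit history recorded in the Delta transaction log.
spark.sql("DESCRIBE HISTORY gold.trip_summary").show(truncate=False)

# Time travel with the DataFrame API: read the table as of an earlier version.
v0_df = spark.read.format("delta").option("versionAsOf", 0).load(gold_path)

# Equivalent time travel in SQL.
spark.sql("SELECT * FROM gold.trip_summary VERSION AS OF 0").show()
```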

📸 Project Screenshots

Below are the real implementation screenshots taken directly from the live project:

🖥️ Pipeline & Monitoring

Images 1–17 (Photo1–Photo17): screenshots of the pipelines and monitoring runs from the project.


🎯 Outcome

By the end of this project, we:

  • Built a fully automated, real-time pipeline
  • Followed best practices using Medallion Architecture
  • Integrated security and transformation at scale
  • Created data ready for business intelligence tools

🚀 Future Scope

  • Add Power BI/Looker Studio dashboards for visualization
  • Set up CI/CD deployment with Azure DevOps
  • Integrate alerts using Azure Monitor & Log Analytics

💡 Built with ❤️ by Neha Bharti
