A production-grade data analytics platform built on Microsoft Fabric, ingesting data from the YouTube Data API v3 and processing it through a Bronze-Silver-Gold medallion architecture. The project was developed across 9 sprints as a hands-on learning exercise for Azure and Fabric best practices.
- Build an end-to-end data pipeline that extracts YouTube channel and video data daily, transforms it through a medallion architecture, and surfaces analytics-ready tables.
- Follow Microsoft Fabric best practices: workspace separation (Processing / Data Stores / Consumption), Git integration, deployment pipelines (Dev > Test > Prod), variable libraries, and Fabric Environments.
- Automate everything: infrastructure provisioning, pipeline orchestration, capacity management, and CI/CD -- all driven by code and GitHub Actions.
- Validate thoroughly: automated verification scripts, AI-assisted checklist (123/130 checks passed), and a full E2E smoke test.
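All of the automation below runs as the Service Principal created in Sprint 1. As a minimal sketch (not the repo's actual workflow code), a GitHub Actions step or local script might acquire a Fabric API token with `azure-identity`; the environment-variable names are assumptions, not the repo's actual secret names:

```python
# Hypothetical sketch: acquire a Microsoft Fabric API token as the Service Principal.
# Environment-variable names are illustrative; the repo's GitHub secrets may differ.
import os

from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id=os.environ["AZURE_TENANT_ID"],
    client_id=os.environ["AZURE_CLIENT_ID"],
    client_secret=os.environ["AZURE_CLIENT_SECRET"],
)

# The .default scope grants whatever Fabric permissions the SP has been assigned.
token = credential.get_token("https://api.fabric.microsoft.com/.default").token
headers = {"Authorization": f"Bearer {token}"}
```

The `headers` dict from this sketch is reused conceptually in the API examples further down.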
```
YouTube Data API v3
        |
GitHub Actions (daily-pipeline.yml)
        |
Fabric Processing Workspace
        |
pp-int-run-youtube (parent pipeline)
|--- RunETL --> pl-int-etl-youtube (child pipeline)
|       |--- nb-extract-youtube        (Extract: API -> Bronze Files)
|       |--- nb-int-1-load-youtube     (Load: Files -> Bronze Tables)
|       |--- nb-int-2-clean-youtube    (Clean: Bronze -> Silver)
|       |--- nb-int-3-model-youtube    (Model: Silver -> Gold)
|       |--- nb-int-4-validate-youtube (Validate: schema checks)
|--- LogSuccess / LogFailure --> nb-int-5-log-pipeline

Fabric Data Stores Workspace
|--- lh_bronze    (raw JSON files + ingested tables)
|--- lh_silver    (cleaned, typed tables)
|--- lh_gold      (modeled with surrogate keys, daily snapshots)
|--- lh_int_admin (validation results + pipeline logs)
```
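As a hedged illustration of the Extract step (`nb-extract-youtube`), a single channel pull against the YouTube Data API v3 with date-partitioned JSON output could look roughly like this; the channel ID, folder layout, and file name are placeholders, not the notebook's actual code:

```python
# Illustrative Extract step: YouTube Data API v3 -> date-partitioned Bronze JSON.
# Channel ID, folder layout, and file name are placeholders.
import json
import os
from datetime import date

import requests

API_KEY = "<retrieved-from-key-vault>"  # see the Key Vault sketch further down
CHANNEL_ID = "<your-channel-id>"        # placeholder

resp = requests.get(
    "https://www.googleapis.com/youtube/v3/channels",
    params={"part": "snippet,statistics", "id": CHANNEL_ID, "key": API_KEY},
    timeout=30,
)
resp.raise_for_status()

# Bronze keeps the raw API response, partitioned by ingestion date; in the
# real notebook this path would point at the lh_bronze Files area via ABFS.
out_dir = f"youtube/channels/ingest_date={date.today().isoformat()}"
os.makedirs(out_dir, exist_ok=True)
with open(os.path.join(out_dir, "channels.json"), "w") as f:
    json.dump(resp.json(), f)
```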
| Sprint | Title | What was built |
|---|---|---|
| 1 | Capacity & Workspace Design | Azure resource group, 2 Fabric capacities (F8), 9 workspaces (3x Processing/Data Stores/Consumption across Dev/Test/Prod), Service Principal, 3 security groups |
| 2 | Version Control & Deployment | Git integration for 3 Dev workspaces, 3 deployment pipelines (Dev > Test > Prod), branch protection, GitHub secrets, capacity-toggle workflow |
| 3 | Data Extraction & Bronze Layer | Azure Key Vault for YouTube API key, lh_bronze lakehouse, extraction notebook pulling channels/playlists/videos as JSON files with date-partitioned folders |
| 4 | Silver & Gold Schema Design | lh_silver and lh_gold lakehouses, DDL notebooks (nb-lhcreate-*) for schema creation across all layers |
| 5 | Data Transformation & Movement | Load, Clean, and Model notebooks implementing Bronze-to-Silver (type casting, deduplication) and Silver-to-Gold (surrogate keys, daily snapshots) transformations; see the sketch after this table |
| 6 | Orchestration & Validation | lh_int_admin lakehouse, validation notebook checking schema conformance, child pipeline (pl-int-etl-youtube) with 6 sequential activities |
| 7 | Go-Live Preparation | Parent pipeline (pp-int-run-youtube) with success/failure logging branches, daily-pipeline.yml GitHub Action for automated runs, 5 documentation files |
| 8 | Reference Architecture | Variable Library (vl-int-config) for environment-specific IDs, Fabric Environment (env-int-pyspark) with Runtime 1.3 and Starter Pool, ABFS paths in notebooks, no pinned lakehouse dependencies |
| 9 | Solution Verification | Comparison report (PDF), automated verification script (44/44 PASS), comprehensive checklist (130 checks, 95% AI-verified), E2E smoke test |
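As a hedged illustration of the Sprint 5 clean step combined with the Sprint 8 ABFS-path convention, the Bronze-to-Silver pass might look roughly like this. It assumes the `spark` session provided by the Fabric notebook runtime; the workspace/lakehouse GUIDs, table name, and column set are placeholders, not the notebook's actual code:

```python
# Illustrative Bronze -> Silver clean step (type casting + deduplication).
# Assumes a Fabric notebook session (`spark` is predefined by the runtime).
# Workspace/lakehouse GUIDs, table and column names are placeholders.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

ONELAKE = "abfss://<workspace-id>@onelake.dfs.fabric.microsoft.com"
bronze_path = f"{ONELAKE}/<lh_bronze-id>/Tables/videos"
silver_path = f"{ONELAKE}/<lh_silver-id>/Tables/videos"

# Keep only the latest extracted record per video_id.
latest = Window.partitionBy("video_id").orderBy(F.col("extracted_at").desc())

clean = (
    spark.read.format("delta").load(bronze_path)
    .withColumn("view_count", F.col("view_count").cast("long"))   # type casting
    .withColumn("published_at", F.to_timestamp("published_at"))
    .withColumn("_rn", F.row_number().over(latest))               # deduplication
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)

clean.write.format("delta").mode("overwrite").save(silver_path)
```

Using explicit ABFS paths rather than a pinned default lakehouse is what lets the same notebook run unchanged in Dev, Test, and Prod once the GUIDs come from the Variable Library.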
```
.
├── .github/workflows/          # GitHub Actions
│   ├── capacity-toggle.yml     # Resume/pause Fabric capacities
│   └── daily-pipeline.yml      # Trigger daily ETL pipeline
│
├── docs/                       # Project documentation
│   ├── architecture-overview.md
│   ├── cicd-strategy.md
│   ├── data-dictionary.md
│   ├── naming-convention.md
│   └── orchestration-guidelines.md
│
├── solution/                   # Fabric artifacts (Git-synced)
│   ├── processing/             # 10 Spark notebooks (.ipynb)
│   ├── datastores/             # (managed by Fabric)
│   └── consumption/            # (managed by Fabric)
│
├── openspec/                   # OpenSpec change management
│   ├── specs/                  # 28 specification files
│   └── changes/archive/        # 9 archived sprint changes
│
├── script/verify/              # Automated verification
│   └── verify_solution.py      # 44-check solution validator
│
├── verify/                     # Verification artifacts
│   ├── manual-verification-checklist.md  # 130-check checklist
│   ├── verification-checklist.pdf        # PDF version with AI status icons
│   ├── comparison-report.pdf             # Spec vs implementation comparison
│   ├── generate_checklist_pdf.py         # PDF generator
│   └── generate_report.py                # Comparison report generator
│
└── README.md
```
| Category | Technology |
|---|---|
| Cloud | Azure (Resource Groups, Fabric Capacities, Key Vault, Entra ID) |
| Data Platform | Microsoft Fabric (Lakehouses, Notebooks, Pipelines, Environments, Variable Libraries) |
| Compute | Apache Spark 3.5, Delta Lake 3.2 (via Fabric Runtime 1.3) |
| Data Source | YouTube Data API v3 (channels, playlists, videos) |
| CI/CD | GitHub Actions, Fabric Deployment Pipelines, Fabric Git Integration |
| Languages | PySpark, Python, PowerShell, YAML |
The platform has been verified through multiple methods:
- Automated script: 44/44 checks PASS (`script/verify/verify_solution.py`)
- AI-assisted checklist: 123/130 checks PASS (95%) across Azure, Fabric, GitHub, SQL, and OneLake APIs
- E2E smoke test: Pipeline triggered via API, completed in ~8 min, data verified in all 4 lakehouses
- 1 warning: the GitHub Actions daily pipeline returns HTTP 401 (Service Principal permissions on the Processing workspace)
- 6 manual checks remaining: Variable Library values (no REST API available)
See `verify/manual-verification-checklist.md` for the full 130-check breakdown.
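The actual validator lives in `script/verify/verify_solution.py`; as a hedged sketch of the kind of check such a script might perform, confirming that the expected notebooks exist in the Processing workspace via the public Fabric REST API could look like this (the workspace GUID and expected-item set are placeholders, not the script's real logic):

```python
# Illustrative verification check: do the expected notebooks exist in the
# Processing workspace? Not the repo's actual verify_solution.py logic.
import requests

FABRIC_API = "https://api.fabric.microsoft.com/v1"
WORKSPACE_ID = "<processing-workspace-guid>"  # placeholder
EXPECTED = {"nb-extract-youtube", "nb-int-1-load-youtube", "nb-int-5-log-pipeline"}

headers = {"Authorization": "Bearer <token>"}  # see the auth sketch near the top

resp = requests.get(
    f"{FABRIC_API}/workspaces/{WORKSPACE_ID}/items", headers=headers, timeout=30
)
resp.raise_for_status()

found = {item["displayName"] for item in resp.json().get("value", [])}
missing = EXPECTED - found
print("PASS" if not missing else f"FAIL: missing {sorted(missing)}")
```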
- Azure subscription with Microsoft Fabric enabled
- Two Fabric capacities (F8 or higher): one for non-prod, one for prod
- YouTube Data API v3 key stored in Azure Key Vault
- GitHub repository with Actions enabled
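For the Key Vault prerequisite, a minimal retrieval sketch with `azure-keyvault-secrets` follows; the vault URL and secret name are assumptions, not the project's actual values (inside a Fabric notebook, the runtime's built-in credential utilities could be used instead):

```python
# Illustrative retrieval of the YouTube API key from Azure Key Vault.
# Vault URL and secret name are placeholders, not the project's actual values.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(
    vault_url="https://<your-vault>.vault.azure.net",
    credential=DefaultAzureCredential(),
)
api_key = client.get_secret("youtube-api-key").value  # secret name is assumed
```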
- Resume capacity: Use the `capacity-toggle.yml` workflow or the Azure Portal
- Trigger pipeline: Run `pp-int-run-youtube` from the Processing workspace, or let `daily-pipeline.yml` run on schedule (see the API sketch below)
- Monitor: Check pipeline status in Fabric Monitor, or query `dbo.log_pipeline_runs` in `lh_int_admin`
- Pause capacity: Use the `capacity-toggle.yml` workflow to save costs
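A hedged sketch of triggering the parent pipeline on demand through the Fabric Job Scheduler API, roughly the kind of call `daily-pipeline.yml` and the E2E smoke test make; the GUIDs are placeholders:

```python
# Illustrative on-demand run of pp-int-run-youtube via the Fabric Job
# Scheduler API. GUIDs are placeholders; the scheduled GitHub Actions
# workflow drives the same kind of endpoint.
import requests

FABRIC_API = "https://api.fabric.microsoft.com/v1"
WORKSPACE_ID = "<processing-workspace-guid>"
PIPELINE_ID = "<pp-int-run-youtube-item-guid>"

headers = {"Authorization": "Bearer <token>"}  # see the auth sketch near the top

resp = requests.post(
    f"{FABRIC_API}/workspaces/{WORKSPACE_ID}/items/{PIPELINE_ID}/jobs/instances",
    params={"jobType": "Pipeline"},
    headers=headers,
    timeout=30,
)
resp.raise_for_status()  # expect 202 Accepted
print("Job instance URL:", resp.headers.get("Location"))
```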
```bash
# Automated solution checks
python script/verify/verify_solution.py

# Generate comparison report
python verify/generate_report.py

# Generate PDF checklist
python verify/generate_checklist_pdf.py
```

| Document | Description |
|---|---|
| Architecture Overview | Workspace layout, medallion layers, pipeline flow |
| CI/CD Strategy | Git integration, deployment pipelines, GitHub Actions |
| Data Dictionary | All tables across Bronze/Silver/Gold/Admin |
| Naming Convention | Azure and Fabric naming patterns |
| Orchestration Guidelines | Pipeline execution and monitoring |
This project is a learning exercise and is not intended for production use without additional security and compliance review.