kor-dojo1 -- YouTube Analytics Data Platform

A production-grade data analytics platform built on Microsoft Fabric, ingesting data from the YouTube Data API v3 and processing it through a Bronze-Silver-Gold medallion architecture. The project was developed across 9 sprints as a hands-on learning exercise for Azure and Fabric best practices.

Goals

  • Build an end-to-end data pipeline that extracts YouTube channel and video data daily, transforms it through a medallion architecture, and surfaces analytics-ready tables.
  • Follow Microsoft Fabric best practices: workspace separation (Processing / Data Stores / Consumption), Git integration, deployment pipelines (Dev > Test > Prod), variable libraries, and Fabric Environments.
  • Automate everything: infrastructure provisioning, pipeline orchestration, capacity management, and CI/CD -- all driven by code and GitHub Actions.
  • Validate thoroughly: automated verification scripts, AI-assisted checklist (123/130 checks passed), and a full E2E smoke test.

Architecture

YouTube Data API v3
        |
   GitHub Actions (daily-pipeline.yml)
        |
   Fabric Processing Workspace
        |
   pp-int-run-youtube (parent pipeline)
        |--- RunETL --> pl-int-etl-youtube (child pipeline)
        |       |--- nb-extract-youtube     (Extract: API -> Bronze Files)
        |       |--- nb-int-1-load-youtube  (Load: Files -> Bronze Tables)
        |       |--- nb-int-2-clean-youtube (Clean: Bronze -> Silver)
        |       |--- nb-int-3-model-youtube (Model: Silver -> Gold)
        |       |--- nb-int-4-validate-youtube (Validate: schema checks)
        |--- LogSuccess / LogFailure --> nb-int-5-log-pipeline
        
   Fabric Data Stores Workspace
        |--- lh_bronze    (raw JSON files + ingested tables)
        |--- lh_silver    (cleaned, typed tables)
        |--- lh_gold      (modeled with surrogate keys, daily snapshots)
        |--- lh_int_admin (validation results + pipeline logs)
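
For orientation, here is a minimal sketch of what the Extract step in this flow looks like: call the YouTube Data API v3 and land the raw JSON in a date-partitioned folder under the lh_bronze Files area. The channel ID, ABFS path, and folder layout are illustrative placeholders, not the repository's actual notebook code.

import json
import datetime
import requests
from notebookutils import mssparkutils  # available by default on Fabric Spark runtimes

# Placeholder values -- in the real solution these come from Key Vault and the variable library
API_KEY = "<youtube-api-key>"
CHANNEL_ID = "<channel-id>"
BRONZE_FILES = "abfss://<datastores-workspace>@onelake.dfs.fabric.microsoft.com/lh_bronze.Lakehouse/Files"

# Pull channel metadata and statistics from the YouTube Data API v3
resp = requests.get(
    "https://www.googleapis.com/youtube/v3/channels",
    params={"part": "snippet,statistics", "id": CHANNEL_ID, "key": API_KEY},
    timeout=30,
)
resp.raise_for_status()

# Land the raw JSON in a date-partitioned folder in the Bronze Files area
run_date = datetime.date.today().isoformat()
target = f"{BRONZE_FILES}/youtube/channels/ingest_date={run_date}/channels.json"
mssparkutils.fs.put(target, json.dumps(resp.json()), True)  # True = overwrite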

Sprint Breakdown

| Sprint | Title | What was built |
|---|---|---|
| 1 | Capacity & Workspace Design | Azure resource group, 2 Fabric capacities (F8), 9 workspaces (3x Processing/Data Stores/Consumption across Dev/Test/Prod), Service Principal, 3 security groups |
| 2 | Version Control & Deployment | Git integration for 3 Dev workspaces, 3 deployment pipelines (Dev > Test > Prod), branch protection, GitHub secrets, capacity-toggle workflow |
| 3 | Data Extraction & Bronze Layer | Azure Key Vault for YouTube API key, lh_bronze lakehouse, extraction notebook pulling channels/playlists/videos as JSON files with date-partitioned folders |
| 4 | Silver & Gold Schema Design | lh_silver and lh_gold lakehouses, DDL notebooks (nb-lhcreate-*) for schema creation across all layers |
| 5 | Data Transformation & Movement | Load, Clean, and Model notebooks implementing Bronze-to-Silver (type casting, deduplication) and Silver-to-Gold (surrogate keys, daily snapshots) transformations; a PySpark sketch follows this table |
| 6 | Orchestration & Validation | lh_int_admin lakehouse, validation notebook checking schema conformance, child pipeline (pl-int-etl-youtube) with 6 sequential activities |
| 7 | Go-Live Preparation | Parent pipeline (pp-int-run-youtube) with success/failure logging branches, daily-pipeline.yml GitHub Action for automated runs, 5 documentation files |
| 8 | Reference Architecture | Variable Library (vl-int-config) for environment-specific IDs, Fabric Environment (env-int-pyspark) with Runtime 1.3 and Starter Pool, ABFS paths in notebooks, no pinned lakehouse dependencies |
| 9 | Solution Verification | Comparison report (PDF), automated verification script (44/44 PASS), comprehensive checklist (130 checks, 95% AI-verified), E2E smoke test |
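
As a rough illustration of the Sprint 5 Silver-to-Gold step, the following PySpark sketch adds a surrogate key and a snapshot date before appending to a Gold Delta table. It is not the repository's nb-int-3-model-youtube notebook: table names, columns, and ABFS paths are placeholders, and the surrogate key is shown as a simple hash for illustration.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook session

# Placeholder ABFS paths -- the real notebooks resolve these from the variable library
SILVER_VIDEOS = "abfss://<ws>@onelake.dfs.fabric.microsoft.com/lh_silver.Lakehouse/Tables/videos"
GOLD_SNAPSHOT = "abfss://<ws>@onelake.dfs.fabric.microsoft.com/lh_gold.Lakehouse/Tables/fact_video_snapshot"

silver = spark.read.format("delta").load(SILVER_VIDEOS)

gold = (
    silver
    .withColumn("video_sk", F.xxhash64("video_id"))   # surrogate key derived from the natural key
    .withColumn("snapshot_date", F.current_date())    # daily snapshot stamp
    .dropDuplicates(["video_sk", "snapshot_date"])
)

# Append today's snapshot to the Gold Delta table
gold.write.format("delta").mode("append").save(GOLD_SNAPSHOT)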

Repository Structure

.
├── .github/workflows/          # GitHub Actions
│   ├── capacity-toggle.yml     #   Resume/pause Fabric capacities
│   └── daily-pipeline.yml      #   Trigger daily ETL pipeline
│
├── docs/                       # Project documentation
│   ├── architecture-overview.md
│   ├── cicd-strategy.md
│   ├── data-dictionary.md
│   ├── naming-convention.md
│   └── orchestration-guidelines.md
│
├── solution/                   # Fabric artifacts (Git-synced)
│   ├── processing/             #   10 Spark notebooks (.ipynb)
│   ├── datastores/             #   (managed by Fabric)
│   └── consumption/            #   (managed by Fabric)
│
├── openspec/                   # OpenSpec change management
│   ├── specs/                  #   28 specification files
│   └── changes/archive/        #   9 archived sprint changes
│
├── script/verify/              # Automated verification
│   └── verify_solution.py      #   44-check solution validator
│
├── verify/                     # Verification artifacts
│   ├── manual-verification-checklist.md  # 130-check checklist
│   ├── verification-checklist.pdf        # PDF version with AI status icons
│   ├── comparison-report.pdf             # Spec vs implementation comparison
│   ├── generate_checklist_pdf.py         # PDF generator
│   └── generate_report.py               # Comparison report generator
│
└── README.md

Key Technologies

| Category | Technology |
|---|---|
| Cloud | Azure (Resource Groups, Fabric Capacities, Key Vault, Entra ID) |
| Data Platform | Microsoft Fabric (Lakehouses, Notebooks, Pipelines, Environments, Variable Libraries) |
| Compute | Apache Spark 3.5, Delta Lake 3.2 (via Fabric Runtime 1.3) |
| Data Source | YouTube Data API v3 (channels, playlists, videos) |
| CI/CD | GitHub Actions, Fabric Deployment Pipelines, Fabric Git Integration |
| Languages | PySpark, Python, PowerShell, YAML |

Verification Status

The platform has been verified through multiple methods:

  • Automated script: 44/44 checks PASS (script/verify/verify_solution.py); an example of this style of check is sketched below
  • AI-assisted checklist: 123/130 checks PASS (95%) across Azure, Fabric, GitHub, SQL, and OneLake APIs
  • E2E smoke test: pipeline triggered via API, completed in ~8 min, data verified in all 4 lakehouses
  • 1 warning: the GitHub Actions daily pipeline returns HTTP 401 (Service Principal permissions on the Processing workspace)
  • 6 manual checks remaining: Variable Library values (no REST API available)

See verify/manual-verification-checklist.md for the full 130-check breakdown.
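
To give a flavour of what the automated checks do (this is not the actual verify_solution.py, and the credentials, workspace ID, and item names are placeholders), a typical check lists a workspace's items through the Fabric REST API and asserts that the expected artifacts exist:

import requests
from azure.identity import ClientSecretCredential

# Placeholder Service Principal credentials and workspace ID
cred = ClientSecretCredential("<tenant-id>", "<sp-client-id>", "<sp-client-secret>")
token = cred.get_token("https://api.fabric.microsoft.com/.default").token

WORKSPACE_ID = "<processing-workspace-id>"
EXPECTED = {"pp-int-run-youtube", "pl-int-etl-youtube", "nb-extract-youtube"}

# List all items in the Processing workspace
resp = requests.get(
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/items",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
found = {item["displayName"] for item in resp.json()["value"]}

missing = EXPECTED - found
print("PASS" if not missing else f"FAIL: missing {sorted(missing)}")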

Getting Started

Prerequisites

  • Azure subscription with Microsoft Fabric enabled
  • Two Fabric capacities (F8 or higher): one for non-prod, one for prod
  • YouTube Data API v3 key stored in Azure Key Vault (see the retrieval sketch after this list)
  • GitHub repository with Actions enabled
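
As an example of how the Key Vault prerequisite is typically consumed, a Fabric notebook can read the API key at run time with the built-in credential utility; the vault URL and secret name below are placeholders:

from notebookutils import mssparkutils  # available by default on Fabric Spark runtimes

# Read the YouTube API key from Key Vault using the executing identity
KEY_VAULT_URL = "https://<your-key-vault>.vault.azure.net/"
api_key = mssparkutils.credentials.getSecret(KEY_VAULT_URL, "youtube-api-key")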

Running the Pipeline

  1. Resume capacity: Use the capacity-toggle.yml workflow or Azure Portal
  2. Trigger pipeline: Run pp-int-run-youtube from the Processing workspace, or let daily-pipeline.yml run on schedule (a REST sketch follows these steps)
  3. Monitor: Check pipeline status in Fabric Monitor, or query dbo.log_pipeline_runs in lh_int_admin
  4. Pause capacity: Use the capacity-toggle.yml workflow to save costs
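
Step 2 can also be driven programmatically, which is essentially what a scheduled workflow does: queue a run of the parent pipeline through the Fabric job-instances API and poll until it finishes. This sketch uses placeholder IDs and credentials and is not the repository's daily-pipeline.yml:

import time
import requests
from azure.identity import ClientSecretCredential

# Placeholder Service Principal credentials -- the real workflow stores these as GitHub secrets
cred = ClientSecretCredential("<tenant-id>", "<sp-client-id>", "<sp-client-secret>")
token = cred.get_token("https://api.fabric.microsoft.com/.default").token
headers = {"Authorization": f"Bearer {token}"}

WORKSPACE_ID = "<processing-workspace-id>"
PIPELINE_ID = "<pp-int-run-youtube-item-id>"

# Queue a run of the parent pipeline (returns 202 Accepted with a status URL)
run = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}"
    f"/items/{PIPELINE_ID}/jobs/instances?jobType=Pipeline",
    headers=headers,
    timeout=30,
)
run.raise_for_status()

# Poll the job instance until it reaches a terminal state
status_url = run.headers["Location"]
while True:
    job = requests.get(status_url, headers=headers, timeout=30).json()
    if job["status"] in ("Completed", "Failed", "Cancelled"):
        print("Pipeline run:", job["status"])
        break
    time.sleep(60)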

Running Verification

# Automated solution checks
python script/verify/verify_solution.py

# Generate comparison report
python verify/generate_report.py

# Generate PDF checklist
python verify/generate_checklist_pdf.py

Documentation

| Document | Description |
|---|---|
| Architecture Overview | Workspace layout, medallion layers, pipeline flow |
| CI/CD Strategy | Git integration, deployment pipelines, GitHub Actions |
| Data Dictionary | All tables across Bronze/Silver/Gold/Admin |
| Naming Convention | Azure and Fabric naming patterns |
| Orchestration Guidelines | Pipeline execution and monitoring |

License

This project is a learning exercise and is not intended for production use without additional security and compliance review.
