kor-dojo1 -- YouTube Analytics Data Platform

A production-grade data analytics platform built on Microsoft Fabric, ingesting data from the YouTube Data API v3 and processing it through a Bronze-Silver-Gold medallion architecture. The project was developed across 9 sprints as a hands-on learning exercise for Azure and Fabric best practices.

Goals

  • Build an end-to-end data pipeline that extracts YouTube channel and video data daily, transforms it through a medallion architecture, and surfaces analytics-ready tables.
  • Follow Microsoft Fabric best practices: workspace separation (Processing / Data Stores / Consumption), Git integration, deployment pipelines (Dev > Test > Prod), variable libraries, and Fabric Environments.
  • Automate everything: infrastructure provisioning, pipeline orchestration, capacity management, and CI/CD -- all driven by code and GitHub Actions.
  • Validate thoroughly: automated verification scripts, AI-assisted checklist (123/130 checks passed), and a full E2E smoke test.

Architecture

YouTube Data API v3
        |
   GitHub Actions (daily-pipeline.yml)
        |
   Fabric Processing Workspace
        |
   pp-int-run-youtube (parent pipeline)
        |--- RunETL --> pl-int-etl-youtube (child pipeline)
        |       |--- nb-extract-youtube     (Extract: API -> Bronze Files)
        |       |--- nb-int-1-load-youtube  (Load: Files -> Bronze Tables)
        |       |--- nb-int-2-clean-youtube (Clean: Bronze -> Silver)
        |       |--- nb-int-3-model-youtube (Model: Silver -> Gold)
        |       |--- nb-int-4-validate-youtube (Validate: schema checks)
        |--- LogSuccess / LogFailure --> nb-int-5-log-pipeline
        
   Fabric Data Stores Workspace
        |--- lh_bronze    (raw JSON files + ingested tables)
        |--- lh_silver    (cleaned, typed tables)
        |--- lh_gold      (modeled with surrogate keys, daily snapshots)
        |--- lh_int_admin (validation results + pipeline logs)
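
For orientation, here is a minimal sketch of what the Extract step in this flow looks like: call the YouTube Data API v3 and land the raw JSON in a date-partitioned folder under the lh_bronze Files area. The channel ID, ABFS path, and folder layout are illustrative placeholders, not the repository's actual notebook code.

import json
import datetime
import requests
from notebookutils import mssparkutils  # available by default on Fabric Spark runtimes

# Placeholder values -- in the real solution these come from Key Vault and the variable library
API_KEY = "<youtube-api-key>"
CHANNEL_ID = "<channel-id>"
BRONZE_FILES = "abfss://<datastores-workspace>@onelake.dfs.fabric.microsoft.com/lh_bronze.Lakehouse/Files"

# Pull channel metadata and statistics from the YouTube Data API v3
resp = requests.get(
    "https://www.googleapis.com/youtube/v3/channels",
    params={"part": "snippet,statistics", "id": CHANNEL_ID, "key": API_KEY},
    timeout=30,
)
resp.raise_for_status()

# Land the raw JSON in a date-partitioned folder in the Bronze Files area
run_date = datetime.date.today().isoformat()
target = f"{BRONZE_FILES}/youtube/channels/ingest_date={run_date}/channels.json"
mssparkutils.fs.put(target, json.dumps(resp.json()), True)  # True = overwrite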

Sprint Breakdown

| Sprint | Title | What was built |
|---|---|---|
| 1 | Capacity & Workspace Design | Azure resource group, 2 Fabric capacities (F8), 9 workspaces (3x Processing/Data Stores/Consumption across Dev/Test/Prod), Service Principal, 3 security groups |
| 2 | Version Control & Deployment | Git integration for 3 Dev workspaces, 3 deployment pipelines (Dev > Test > Prod), branch protection, GitHub secrets, capacity-toggle workflow |
| 3 | Data Extraction & Bronze Layer | Azure Key Vault for YouTube API key, lh_bronze lakehouse, extraction notebook pulling channels/playlists/videos as JSON files with date-partitioned folders |
| 4 | Silver & Gold Schema Design | lh_silver and lh_gold lakehouses, DDL notebooks (nb-lhcreate-*) for schema creation across all layers |
| 5 | Data Transformation & Movement | Load, Clean, and Model notebooks implementing Bronze-to-Silver (type casting, deduplication) and Silver-to-Gold (surrogate keys, daily snapshots) transformations; a PySpark sketch follows this table |
| 6 | Orchestration & Validation | lh_int_admin lakehouse, validation notebook checking schema conformance, child pipeline (pl-int-etl-youtube) with 6 sequential activities |
| 7 | Go-Live Preparation | Parent pipeline (pp-int-run-youtube) with success/failure logging branches, daily-pipeline.yml GitHub Action for automated runs, 5 documentation files |
| 8 | Reference Architecture | Variable Library (vl-int-config) for environment-specific IDs, Fabric Environment (env-int-pyspark) with Runtime 1.3 and Starter Pool, ABFS paths in notebooks, no pinned lakehouse dependencies |
| 9 | Solution Verification | Comparison report (PDF), automated verification script (44/44 PASS), comprehensive checklist (130 checks, 95% AI-verified), E2E smoke test |
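
As a rough illustration of the Sprint 5 Silver-to-Gold step, the following PySpark sketch adds a surrogate key and a snapshot date before appending to a Gold Delta table. It is not the repository's nb-int-3-model-youtube notebook: table names, columns, and ABFS paths are placeholders, and the surrogate key is shown as a simple hash for illustration.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook session

# Placeholder ABFS paths -- the real notebooks resolve these from the variable library
SILVER_VIDEOS = "abfss://<ws>@onelake.dfs.fabric.microsoft.com/lh_silver.Lakehouse/Tables/videos"
GOLD_SNAPSHOT = "abfss://<ws>@onelake.dfs.fabric.microsoft.com/lh_gold.Lakehouse/Tables/fact_video_snapshot"

silver = spark.read.format("delta").load(SILVER_VIDEOS)

gold = (
    silver
    .withColumn("video_sk", F.xxhash64("video_id"))   # surrogate key derived from the natural key
    .withColumn("snapshot_date", F.current_date())    # daily snapshot stamp
    .dropDuplicates(["video_sk", "snapshot_date"])
)

# Append today's snapshot to the Gold Delta table
gold.write.format("delta").mode("append").save(GOLD_SNAPSHOT)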

Repository Structure

.
├── .github/workflows/          # GitHub Actions
│   ├── capacity-toggle.yml     #   Resume/pause Fabric capacities
│   └── daily-pipeline.yml      #   Trigger daily ETL pipeline
│
├── docs/                       # Project documentation
│   ├── architecture-overview.md
│   ├── cicd-strategy.md
│   ├── data-dictionary.md
│   ├── naming-convention.md
│   └── orchestration-guidelines.md
│
├── solution/                   # Fabric artifacts (Git-synced)
│   ├── processing/             #   10 Spark notebooks (.ipynb)
│   ├── datastores/             #   (managed by Fabric)
│   └── consumption/            #   (managed by Fabric)
│
├── openspec/                   # OpenSpec change management
│   ├── specs/                  #   28 specification files
│   └── changes/archive/        #   9 archived sprint changes
│
├── script/verify/              # Automated verification
│   └── verify_solution.py      #   44-check solution validator
│
├── verify/                     # Verification artifacts
│   ├── manual-verification-checklist.md  # 130-check checklist
│   ├── verification-checklist.pdf        # PDF version with AI status icons
│   ├── comparison-report.pdf             # Spec vs implementation comparison
│   ├── generate_checklist_pdf.py         # PDF generator
│   └── generate_report.py               # Comparison report generator
│
└── README.md

Key Technologies

| Category | Technology |
|---|---|
| Cloud | Azure (Resource Groups, Fabric Capacities, Key Vault, Entra ID) |
| Data Platform | Microsoft Fabric (Lakehouses, Notebooks, Pipelines, Environments, Variable Libraries) |
| Compute | Apache Spark 3.5, Delta Lake 3.2 (via Fabric Runtime 1.3) |
| Data Source | YouTube Data API v3 (channels, playlists, videos) |
| CI/CD | GitHub Actions, Fabric Deployment Pipelines, Fabric Git Integration |
| Languages | PySpark, Python, PowerShell, YAML |

Verification Status

The platform has been verified through multiple methods:

  • Automated script: 44/44 checks PASS (script/verify/verify_solution.py); an example of this style of check is sketched below
  • AI-assisted checklist: 123/130 checks PASS (95%) across Azure, Fabric, GitHub, SQL, and OneLake APIs
  • E2E smoke test: pipeline triggered via API, completed in ~8 min, data verified in all 4 lakehouses
  • 1 warning: the GitHub Actions daily pipeline returns HTTP 401 (Service Principal permissions on the Processing workspace)
  • 6 manual checks remaining: Variable Library values (no REST API available)

See verify/manual-verification-checklist.md for the full 130-check breakdown.
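
To give a flavour of what the automated checks do (this is not the actual verify_solution.py, and the credentials, workspace ID, and item names are placeholders), a typical check lists a workspace's items through the Fabric REST API and asserts that the expected artifacts exist:

import requests
from azure.identity import ClientSecretCredential

# Placeholder Service Principal credentials and workspace ID
cred = ClientSecretCredential("<tenant-id>", "<sp-client-id>", "<sp-client-secret>")
token = cred.get_token("https://api.fabric.microsoft.com/.default").token

WORKSPACE_ID = "<processing-workspace-id>"
EXPECTED = {"pp-int-run-youtube", "pl-int-etl-youtube", "nb-extract-youtube"}

# List all items in the Processing workspace
resp = requests.get(
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/items",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
found = {item["displayName"] for item in resp.json()["value"]}

missing = EXPECTED - found
print("PASS" if not missing else f"FAIL: missing {sorted(missing)}")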

Getting Started

Prerequisites

  • Azure subscription with Microsoft Fabric enabled
  • Two Fabric capacities (F8 or higher): one for non-prod, one for prod
  • YouTube Data API v3 key stored in Azure Key Vault (see the retrieval sketch after this list)
  • GitHub repository with Actions enabled
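
As an example of how the Key Vault prerequisite is typically consumed, a Fabric notebook can read the API key at run time with the built-in credential utility; the vault URL and secret name below are placeholders:

from notebookutils import mssparkutils  # available by default on Fabric Spark runtimes

# Read the YouTube API key from Key Vault using the executing identity
KEY_VAULT_URL = "https://<your-key-vault>.vault.azure.net/"
api_key = mssparkutils.credentials.getSecret(KEY_VAULT_URL, "youtube-api-key")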

Running the Pipeline

  1. Resume capacity: Use the capacity-toggle.yml workflow or Azure Portal
  2. Trigger pipeline: Run pp-int-run-youtube from the Processing workspace, or let daily-pipeline.yml run on schedule (a REST sketch follows these steps)
  3. Monitor: Check pipeline status in Fabric Monitor, or query dbo.log_pipeline_runs in lh_int_admin
  4. Pause capacity: Use the capacity-toggle.yml workflow to save costs
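
Step 2 can also be driven programmatically, which is essentially what a scheduled workflow does: queue a run of the parent pipeline through the Fabric job-instances API and poll until it finishes. This sketch uses placeholder IDs and credentials and is not the repository's daily-pipeline.yml:

import time
import requests
from azure.identity import ClientSecretCredential

# Placeholder Service Principal credentials -- the real workflow stores these as GitHub secrets
cred = ClientSecretCredential("<tenant-id>", "<sp-client-id>", "<sp-client-secret>")
token = cred.get_token("https://api.fabric.microsoft.com/.default").token
headers = {"Authorization": f"Bearer {token}"}

WORKSPACE_ID = "<processing-workspace-id>"
PIPELINE_ID = "<pp-int-run-youtube-item-id>"

# Queue a run of the parent pipeline (returns 202 Accepted with a status URL)
run = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}"
    f"/items/{PIPELINE_ID}/jobs/instances?jobType=Pipeline",
    headers=headers,
    timeout=30,
)
run.raise_for_status()

# Poll the job instance until it reaches a terminal state
status_url = run.headers["Location"]
while True:
    job = requests.get(status_url, headers=headers, timeout=30).json()
    if job["status"] in ("Completed", "Failed", "Cancelled"):
        print("Pipeline run:", job["status"])
        break
    time.sleep(60)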

Running Verification

# Automated solution checks
python script/verify/verify_solution.py

# Generate comparison report
python verify/generate_report.py

# Generate PDF checklist
python verify/generate_checklist_pdf.py

Documentation

| Document | Description |
|---|---|
| Architecture Overview | Workspace layout, medallion layers, pipeline flow |
| CI/CD Strategy | Git integration, deployment pipelines, GitHub Actions |
| Data Dictionary | All tables across Bronze/Silver/Gold/Admin |
| Naming Convention | Azure and Fabric naming patterns |
| Orchestration Guidelines | Pipeline execution and monitoring |

License

This project is a learning exercise and is not intended for production use without additional security and compliance review.
