cbfacademy/Data-Pipeline-Testing-and-Monitoring

🎵 Chinook Music Store - dbt Testing & Monitoring Labs

A hands-on dbt project for learning data pipeline testing and monitoring using the complete Chinook database with real production-like data.

📋 Course Overview

This repository accompanies Module 4: Data Pipeline Testing and Monitoring and contains 8 hands-on labs across 4 decks:

Deck  Topic                                Labs
1     Introduction to Testing Tools        Lab 1.1, Lab 1.2
2     Testing Data Pipelines               Lab 2.1, Lab 2.2
3     Writing Unit and Integration Tests   Lab 3.1, Lab 3.2
4     Monitoring and Maintenance           Lab 4.1, Lab 4.2

🎯 What You'll Learn

  • ✅ Connect dbt to real data in BigQuery
  • ✅ Write schema tests (unique, not_null, accepted_values, relationships)
  • ✅ Create singular tests for complex business logic
  • ✅ Build custom generic tests (reusable macros)
  • ✅ Implement unit tests for transformation logic
  • ✅ Set up integration tests across models
  • ✅ Configure monitoring and alerts
  • ✅ Debug pipeline failures systematically
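As a taste of the schema tests listed above, a dbt schema file declares them per column. The model and column names below are illustrative, not necessarily the ones this repo uses:

```yaml
models:
  - name: stg_tracks
    columns:
      - name: track_id
        tests:
          - unique
          - not_null
      - name: album_id
        tests:
          - relationships:
              to: ref('stg_albums')
              field: album_id
      - name: unit_price
        tests:
          - accepted_values:
              values: [0.99, 1.99]
```

Each entry compiles to a query that returns failing rows; `dbt test` reports the test as failed if any rows come back.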

πŸ—„οΈ The Chinook Database - Real Production Data

The Chinook database represents a digital music store (like iTunes) with real, complete data:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Artist      │────▶│      Album      │────▶│      Track      │
│   275 records   │     │   347 records   │     │  3,503 records  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                       │
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Customer     │────▶│     Invoice     │────▶│   InvoiceLine   │
│    59 records   │     │   412 records   │     │  2,240 records  │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Total: 15,000+ records across 11 tables - This is real-world scale data!

Tables Included:

Table          Records  Description
Artist             275  Music artists and bands
Album              347  Albums linked to artists
Track            3,503  Individual songs with pricing
Genre               25  Music genres
MediaType            5  File format types
Customer            59  Customer information
Employee             8  Sales representatives
Invoice            412  Purchase transactions
InvoiceLine      2,240  Line items per invoice
Playlist            18  Music playlists
PlaylistTrack    8,715  Playlist-track associations

🚀 Getting Started

Prerequisites

Before starting, ensure you have:

  • Google Cloud account with a GCP project
  • BigQuery API enabled in your project
  • Python 3.8+ installed
  • Google Cloud SDK installed (gcloud)

Step 1: Clone the Repository

git clone https://github.com/your-org/chinook-dbt-testing-labs.git
cd chinook-dbt-testing-labs

Step 2: Set Up Python Environment

# Create virtual environment
python -m venv venv

# Activate it
# On Mac/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Step 3: Authenticate with Google Cloud

# Login to Google Cloud
gcloud auth login

# Set your project
gcloud config set project YOUR_PROJECT_ID

# Create application default credentials
gcloud auth application-default login

Step 4: Load Chinook Data to BigQuery

This is the key step! Run our data loader to populate your BigQuery project with the complete Chinook database:

python scripts/load_chinook_to_bigquery.py --project YOUR_PROJECT_ID

This script will:

  1. Download the official Chinook database
  2. Create a chinook_raw dataset in your BigQuery project
  3. Load all 11 tables with complete data
  4. Verify the load was successful

Expected output:

🎵 Chinook Database Loader for BigQuery
==========================================
📥 Downloading Chinook database...
   ✅ Downloaded

📊 Loading tables to BigQuery...
   ✅ Artist: 275 rows loaded
   ✅ Album: 347 rows loaded
   ✅ Track: 3,503 rows loaded
   ...

🎉 SUCCESS! Chinook database loaded to BigQuery
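Under the hood, the loader is roughly a SQLite-to-BigQuery copy. A minimal sketch of the extraction half, using only the Python standard library (the function name and structure here are illustrative, not the script's actual API):

```python
import sqlite3

def extract_tables(db_path):
    """Return {table_name: list of row dicts} for every user table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    tables = [
        r["name"] for r in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type='table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    data = {}
    for table in tables:
        rows = conn.execute(f'SELECT * FROM "{table}"')
        data[table] = [dict(r) for r in rows]  # one dict per row, keyed by column
    conn.close()
    return data
```

Each table's rows could then be handed to the BigQuery client (e.g. `load_table_from_json` in google-cloud-bigquery), which is presumably what the real script wraps along with dataset creation and row-count verification.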

Step 5: Configure dbt Profile

Create or edit ~/.dbt/profiles.yml:

chinook_testing:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: YOUR_PROJECT_ID      # <-- Your GCP project ID
      dataset: chinook_dev          # dbt creates this for transformed models
      location: US
      threads: 4
      timeout_seconds: 300

Step 6: Verify Everything Works

# Test the connection
dbt debug

# Install dbt packages
dbt deps

# Build all models
dbt run

# Run all tests
dbt test

You should see:

Completed successfully
Done. PASS=X WARN=0 ERROR=0

🎉 You're ready to start the labs!


πŸ“ Project Structure

chinook-dbt-testing-labs/
├── README.md                    # This file
├── dbt_project.yml              # dbt project configuration
├── packages.yml                 # dbt package dependencies
├── requirements.txt             # Python dependencies
│
├── scripts/
│   ├── load_chinook_to_bigquery.py  # ⬅️ Run this first!
│   ├── setup.sh                 # Quick setup script
│   └── run_tests_with_retry.sh  # Test runner with retries
│
├── models/
│   ├── staging/                 # Source → Staging transformations
│   │   ├── _stg_chinook.yml     # Source definitions & tests
│   │   ├── stg_artists.sql
│   │   ├── stg_albums.sql
│   │   ├── stg_tracks.sql
│   │   ├── stg_genres.sql
│   │   ├── stg_media_types.sql
│   │   ├── stg_customers.sql
│   │   ├── stg_employees.sql
│   │   ├── stg_invoices.sql
│   │   └── stg_invoice_lines.sql
│   │
│   ├── intermediate/            # Business logic transformations
│   │   ├── _int_chinook.yml
│   │   ├── int_tracks_enriched.sql
│   │   └── int_invoice_totals.sql
│   │
│   ├── marts/                   # Analytics-ready models
│   │   ├── _marts_chinook.yml
│   │   ├── dim_customers.sql
│   │   ├── dim_tracks.sql
│   │   └── fct_sales.sql
│   │
│   └── unit_tests/              # Unit test models
│       └── ...
│
├── tests/                       # Test SQL files
│   ├── assert_*.sql             # Singular tests
│   ├── integration/             # Integration tests
│   ├── monitoring/              # Monitoring tests
│   └── unit_tests/              # Unit test assertions
│
├── macros/tests/                # Custom generic tests
│   ├── test_is_positive.sql
│   ├── test_valid_email.sql
│   └── test_within_range.sql
│
├── seeds/                       # Test fixtures only (not source data)
│   ├── test_tracks_input.csv
│   └── test_tracks_expected.csv
│
├── labs/                        # Lab instructions
│   ├── deck1/
│   │   ├── LAB_1_1_explore_chinook.md
│   │   └── LAB_1_2_first_tests.md
│   ├── deck2/
│   │   ├── LAB_2_1_schema_tests.md
│   │   └── LAB_2_2_singular_tests.md
│   ├── deck3/
│   │   ├── LAB_3_1_unit_tests.md
│   │   └── LAB_3_2_integration_tests.md
│   └── deck4/
│       ├── LAB_4_1_monitoring_setup.md
│       └── LAB_4_2_debugging_failures.md
│
└── analyses/                    # Ad-hoc debug queries
    └── debug_invoice_issues.sql

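The custom generic tests under macros/tests/ follow dbt's standard pattern: a `{% test %}` macro whose query selects the rows that violate the rule. A sketch of what test_is_positive.sql might contain (the repo's actual macro may differ):

```sql
{% test is_positive(model, column_name) %}

-- A generic test passes when this query returns zero rows
select {{ column_name }}
from {{ model }}
where {{ column_name }} <= 0

{% endtest %}
```

Once defined, it can be attached to any column in a schema file just like the built-ins, e.g. listing `is_positive` under a column's tests.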
🧪 Labs Overview

Deck 1: Introduction to Testing Tools

  • Lab 1.1: Explore Chinook & Build Your First Models
  • Lab 1.2: Write Your First dbt Tests

Deck 2: Testing Data Pipelines

  • Lab 2.1: Master Schema Tests (unique, not_null, relationships)
  • Lab 2.2: Create Singular Tests for Business Logic
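A singular test (Lab 2.2) is just a SQL file in tests/ that selects rows violating a business rule. As a hedged sketch, a hypothetical assert file checking that each invoice total equals the sum of its line items might look like this (column names are assumptions about the staging models):

```sql
-- Fails if any invoice total disagrees with the sum of its line items
select
    i.invoice_id,
    i.total,
    sum(il.unit_price * il.quantity) as computed_total
from {{ ref('stg_invoices') }} i
join {{ ref('stg_invoice_lines') }} il
  on i.invoice_id = il.invoice_id
group by i.invoice_id, i.total
having abs(i.total - sum(il.unit_price * il.quantity)) > 0.01
```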

Deck 3: Writing Unit and Integration Tests

  • Lab 3.1: Build Unit Tests with Test Fixtures
  • Lab 3.2: Implement Integration Tests Across Models
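For Lab 3.1, note that dbt 1.8+ ships native unit tests declared in YAML with fixture rows and expected output; the CSVs in seeds/ suggest the labs may instead use seed-based fixtures, but the native form looks roughly like this sketch (model, input, and column names are illustrative):

```yaml
unit_tests:
  - name: test_tracks_enrichment
    model: int_tracks_enriched
    given:
      - input: ref('stg_tracks')
        rows:
          - {track_id: 1, unit_price: 0.99}
    expect:
      rows:
        - {track_id: 1, unit_price: 0.99}
```

Unlike schema or singular tests, unit tests run against the fixture rows only, so they verify the transformation logic without touching real data.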

Deck 4: Monitoring and Maintenance

  • Lab 4.1: Set Up Freshness Monitoring & Alerts
  • Lab 4.2: Debug Pipeline Failures Systematically
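Freshness monitoring (Lab 4.1) is configured on the source definition and checked with `dbt source freshness`, which warns or errors when the newest row is older than the thresholds. A sketch, assuming the raw tables carry a load timestamp column (the `loaded_at` field name is an assumption):

```yaml
sources:
  - name: chinook
    schema: chinook_raw
    loaded_at_field: loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: Invoice
```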

πŸƒ Quick Commands

# Load data to BigQuery (run once)
python scripts/load_chinook_to_bigquery.py --project YOUR_PROJECT_ID

# Build all models
dbt run

# Run all tests
dbt test

# Build and test together
dbt build

# Run specific model
dbt run --select stg_customers

# Test specific model
dbt test --select stg_customers

# Run only schema tests
dbt test --select test_type:schema

# Run only singular tests
dbt test --select test_type:singular

# Generate and view documentation
dbt docs generate && dbt docs serve

🆘 Troubleshooting

"Permission Denied" when loading data

# Make sure you're authenticated
gcloud auth application-default login

"Dataset not found" error

The loader creates the chinook_raw dataset. Make sure you ran:

python scripts/load_chinook_to_bigquery.py --project YOUR_PROJECT_ID

"Source not found" error in dbt

Check that your profile points to the correct project where you loaded the data.
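Also check the source block in models/staging/_stg_chinook.yml: it must name the same project and dataset the loader wrote to. A sketch of the relevant lines (in dbt's BigQuery sources, `database` is the GCP project):

```yaml
sources:
  - name: chinook
    database: YOUR_PROJECT_ID   # must match the --project you loaded with
    schema: chinook_raw
    tables:
      - name: Artist
      - name: Invoice
```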

Tests failing unexpectedly

# Store failures for investigation
dbt test --store-failures

# Then query the failure table in BigQuery


πŸ“ License

This project is for educational purposes as part of the CBF Data Engineering curriculum.


Happy Testing! 🎉
