
Complete deployment_testing.md module - Implementation plan #568

@SkafteNicki

Description

Overview

This issue tracks the completion of the s7_deployment/deployment_testing.md module, which was started but never finished. The module currently covers three deployment testing strategies (A/B Testing, Canary Deployment, Shadow Deployment) but has significant incomplete content and placeholder text.

Related PR: #268 (branch: deployment-testing)


Current State Analysis

✅ What's Complete

  • Strong introduction explaining deployment testing importance for ML projects
  • Three main section structures: A/B Testing, Canary Deployment, Shadow Deployment
  • Visual diagrams referenced for all three strategies
  • Knowledge check table at the end
  • Basic exercise scaffolding

❌ What's Incomplete/Missing

  1. A/B Testing Section (Lines 23-105)

    • Missing explanatory text after diagram (line 47 is just [text])
    • Incomplete exercise description (line 62 ends mid-sentence: "The second")
    • Exercise 1 shows geolocation code without context or instructions
    • No statistical analysis guidance despite having a table of tests (lines 50-56)
  2. Canary Deployment Section (Lines 107-126)

    • No conceptual explanation (only diagram)
    • Minimal exercise (just external link + one command mention)
  3. Shadow Deployment Section (Lines 128-169)

    • No conceptual explanation (only diagram)
    • Exercise code needs improvement (currently just random routing, not true shadowing)
    • Missing analysis/comparison steps
  4. General Issues

    • No learning objectives section
    • No prerequisites section
    • Knowledge check table has duplicate column header (line 175)
    • No ML-specific examples throughout
    • Code examples lack error handling, logging, type hints

Implementation Requirements

Based on discussion, the completed module should have:

  • Content depth: Brief (2-3 paragraphs per concept)
  • Exercise level: Intermediate (guidance with room for problem-solving)
  • Examples: Simple ML models (MNIST-based)
  • Statistics: Moderate (hypothesis testing + Python examples)
  • Additional sections: Comparison, best practices, advanced topics

Implementation Plan

Phase 1: Introduction & Structure (Lines 1-22)

Estimated addition: ~30 lines

  • Add Learning Objectives box after line 11

    • Understand three deployment testing strategies
    • Implement A/B testing with statistical validation
    • Deploy canary releases with gradual rollout
    • Set up shadow deployments for risk-free testing
  • Add Prerequisites section

    • Link to: APIs module, Cloud Deployment module, Testing APIs module
    • Required: Deployed FastAPI app, GCP account, basic statistics knowledge
  • Add Quick Comparison Table before line 19

    • Columns: Strategy, Use Case, Risk Level, Complexity, User Impact
    • Help students choose strategy at a glance

Phase 2: A/B Testing Section (Lines 23-105)

Estimated addition: ~80 lines

  • Add conceptual introduction (2-3 paragraphs) after line 27

    • Explain A/B testing for ML deployments
    • Example: Testing MNIST models with different preprocessing
    • When to use: Model version comparison, feature changes
  • Fix line 47 placeholder

    • Replace [text](link) with actual content
    • Explain: Sample size determination, confidence intervals, stopping criteria
    • Link to statistical calculator and Python implementation
  • Add statistical guidance after table (line 56)

    • Brief explanation of each test in the table
    • Python example using scipy.stats
    • Sample size calculator example
  • Complete Exercise Section

    • Fix line 62: Complete the sentence ("The second will test model performance differences")
    • Exercise 1: Geo-based A/B Testing
      • Add context: Route users by geography to test regional variants
      • Setup instructions: Install geoip2, download GeoLite2 database
      • Deploy 2 Cloud Run services with different model versions
      • Add monitoring/logging code to track variant assignment
    • Exercise 2 (NEW): Statistical Analysis
      • Provide sample data (Model A: 92% accuracy, Model B: 94% accuracy)
      • Calculate statistical significance using provided test table
      • Python code example with t-test
    • Exercise 3 (NEW): Simple Traffic Split A/B Test
      • Deploy two MNIST models (baseline vs. with data augmentation)
      • Use Cloud Run traffic splitting (50/50)
      • Collect prediction logs
      • Analyze which model performs better
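The statistical analysis in Exercise 2 could be sketched as follows. This is a minimal illustration, assuming 1000 predictions per variant (the sample size is an assumption; the 92%/94% accuracies come from the exercise description). It encodes each prediction as a 1/0 correctness outcome and runs a two-sample t-test with `scipy.stats`, matching the "t-test" example the plan calls for:

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 1000 predictions per variant, 1 = correct, 0 = wrong.
# Accuracies match the exercise's example (Model A: 92%, Model B: 94%).
n = 1000
model_a = np.array([1] * 920 + [0] * 80)
model_b = np.array([1] * 940 + [0] * 60)

# Two-sample t-test on the binary outcomes (equivalent to comparing accuracies).
t_stat, p_value = stats.ttest_ind(model_a, model_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant at 5%: {p_value < 0.05}")
```

A useful discussion point for students: with a 2-percentage-point difference, significance depends heavily on sample size, which motivates the sample-size-calculator part of the exercise.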

Phase 3: Canary Deployment Section (Lines 107-126)

Estimated addition: ~60 lines

  • Add conceptual explanation after line 107 (2-3 paragraphs)

    • What is canary deployment (origin: canary in coal mine)
    • How it works: Gradual rollout (5%→25%→50%→100%)
    • ML use case: Rolling out retrained model safely
  • Add monitoring guidance

    • Metrics to track: accuracy, latency, error rate, user engagement
    • When to rollback vs. proceed
    • Brief mention of automation possibilities
  • Expand exercises

    • Exercise 1: Keep GCP guide link, add context
      • What they'll build, expected time, prerequisites checklist
    • Exercise 2 (EXPAND from lines 122-126): Step-by-step Canary
      • Deploy MNIST model v1 (baseline)
      • Deploy MNIST model v2 (improved architecture, e.g., added dropout)
      • Use gcloud run services update-traffic with progressive percentages
      • Monitor logs in Cloud Logging
      • Provide script template to automate traffic increases
    • Exercise 3 (NEW): Rollback Scenario
      • Simulate issue (v2 has higher error rate on edge cases)
      • Practice immediate rollback
      • Document decision criteria and learnings
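The "when to rollback vs. proceed" guidance and the traffic-increase script template could be sketched together as below. The function name `should_proceed`, the thresholds, and the service/revision names (`mnist-api`, `mnist-api-v1/v2`) are all illustrative placeholders, not prescribed values; the `gcloud run services update-traffic --to-revisions` command itself is the real mechanism the exercise uses:

```python
def should_proceed(
    canary_error_rate: float,
    baseline_error_rate: float,
    canary_p95_latency_ms: float,
    baseline_p95_latency_ms: float,
    max_error_increase: float = 0.01,
    max_latency_factor: float = 1.25,
) -> bool:
    """Illustrative gate: advance the canary only while error rate and
    latency stay within tolerances relative to the baseline revision."""
    if canary_error_rate > baseline_error_rate + max_error_increase:
        return False  # rollback: error rate regression
    if canary_p95_latency_ms > baseline_p95_latency_ms * max_latency_factor:
        return False  # rollback: latency regression
    return True

# Progressive rollout steps from the plan: 5% -> 25% -> 50% -> 100%.
for pct in (5, 25, 50, 100):
    if should_proceed(0.021, 0.020, 180.0, 170.0):  # metrics from monitoring
        print(f"gcloud run services update-traffic mnist-api "
              f"--to-revisions=mnist-api-v2={pct}")
    else:
        print("gcloud run services update-traffic mnist-api "
              "--to-revisions=mnist-api-v1=100")  # immediate rollback
        break
```

In the exercise, the hard-coded metrics would instead be queried from Cloud Logging/Monitoring between steps, which also sets up the rollback scenario in Exercise 3.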

Phase 4: Shadow Deployment Section (Lines 128-169)

Estimated addition: ~70 lines

  • Add conceptual explanation after line 128 (2-3 paragraphs)

    • What is shadow deployment
    • Zero user risk - perfect for ML model validation
    • Example: Test new MNIST model architecture alongside production
  • Add implementation patterns

    • Load balancer approach (current exercise approach)
    • Service mesh (mention Istio but note complexity)
    • Application-level duplication
    • When to use each pattern
  • Fix & expand Exercise 1

    • Step 1 (IMPROVE lines 144-166): Fix load balancer code
      • Current code does random routing (not true shadowing!)
      • Replace with proper shadow implementation:
        • Send request to both primary and shadow
        • Only return primary response to user
        • Log shadow response for comparison
      • Use httpx for async requests
      • Add structured logging for comparison
    • Step 2 (IMPROVE lines 168-169): Better deployment
      • Deploy to Cloud Run (better fit than Cloud Functions)
      • Provide requirements.txt
      • Full deployment commands with gcloud run deploy
    • Step 3 (NEW): Deploy Two Model Versions
      • Primary: Stable MNIST model
      • Shadow: Experimental model (e.g., CNN vs. ResNet architecture)
      • Configure load balancer to duplicate traffic
    • Step 4 (NEW): Analysis Exercise
      • Query logs from both versions using Cloud Logging
      • Compare predictions on identical inputs
      • Analyze latency differences
      • Make promotion decision: deploy shadow to production or iterate
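The "proper shadow implementation" described in Step 1 boils down to one routing function: send the request to both services, return only the primary response, and log the shadow response. A minimal sketch of that fan-out pattern is below; the callables here are stand-in coroutines, whereas in the exercise they would wrap `httpx.AsyncClient` requests to the two Cloud Run URLs:

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("shadow")

async def shadow_route(
    primary: Callable[[Any], Awaitable[Any]],
    shadow: Callable[[Any], Awaitable[Any]],
    payload: Any,
) -> Any:
    """Send the request to both models, return only the primary response,
    and log the shadow response for offline comparison."""
    primary_task = asyncio.create_task(primary(payload))
    shadow_task = asyncio.create_task(shadow(payload))

    primary_result = await primary_task
    try:
        shadow_result = await shadow_task
        logger.info("primary=%s shadow=%s match=%s",
                    primary_result, shadow_result,
                    primary_result == shadow_result)
    except Exception:
        # A shadow failure must never affect the user-facing response.
        logger.exception("shadow call failed")
    return primary_result

# Stand-ins for the two Cloud Run services (httpx calls in the real exercise).
async def stable_model(x): return {"digit": 7}
async def experimental_model(x): return {"digit": 7}

print(asyncio.run(shadow_route(stable_model, experimental_model, "img")))  # {'digit': 7}
```

The key property (and the fix for the current random-routing code) is that the user always receives the primary result, while the structured log lines feed the Step 4 comparison in Cloud Logging.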

Phase 5: Comparison Section (NEW)

Estimated addition: ~30 lines

Add new section after Shadow Deployment, before Knowledge Check:

  • "When to Use Which Strategy" subsection

    • Decision flowchart (text-based is fine)
    • Criteria: risk tolerance, rollback requirements, testing goals, traffic volume
  • "Combining Strategies" subsection

    • Example workflow: Shadow first → Canary → A/B test
    • Progressive de-risking approach for ML deployments

Phase 6: Best Practices (NEW)

Estimated addition: ~40 lines

Add new section after Comparison:

  • General deployment testing best practices

    • Always monitor key metrics during tests
    • Define success criteria upfront
    • Have rollback plan ready and tested
    • Document all decisions and results
  • ML-specific considerations

    • Monitor for model drift during deployment
    • Watch for feature distribution shifts
    • Consider latency vs. accuracy tradeoffs
    • Calculate A/B test duration for statistical power
    • Version control for models and data preprocessing

Phase 7: Knowledge Check (Lines 171-188)

Estimated addition: ~20 lines

  • Fix table at line 175

    • Remove duplicate "Releasing to users based on conditions" column
    • Replace with "Negative user impact" or "Complexity"
  • Add scenario-based questions after table

    • Q1: "You need to test a new model with zero user risk. Which strategy?"
    • Q2: "You want to statistically compare two models. Which strategy?"
    • Q3: "You want to gradually roll out a model with easy rollback. Which strategy?"
    • Provide solutions with reasoning

Phase 8: Advanced Topics (NEW - Optional)

Estimated addition: ~40 lines

Add new optional section before final "This ends..." line:

  • Blue-Green Deployment

    • Brief explanation (2 paragraphs)
    • Difference from canary (instant switch vs. gradual)
    • Quick GCP implementation note
  • Feature Flags

    • How feature flags enable gradual rollout
    • Tools: LaunchDarkly, Optimizely, or custom
    • Simple implementation example
  • Multi-Armed Bandits (OPTIONAL)

    • Dynamic A/B testing that automatically optimizes
    • When to consider (high traffic scenarios)
    • Brief algorithm overview
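For the bandit overview, one of the simplest algorithms to present is epsilon-greedy: route most traffic to the best-observed variant and explore a random one with probability epsilon. The class below is an illustrative sketch (class and method names are invented for this example), not a production router:

```python
import random

class EpsilonGreedyRouter:
    """Illustrative epsilon-greedy bandit over model variants."""

    def __init__(self, variants: list[str], epsilon: float = 0.1) -> None:
        self.epsilon = epsilon
        self.counts = {v: 0 for v in variants}
        self.successes = {v: 0 for v in variants}

    def _rate(self, v: str) -> float:
        # Optimistic 1.0 for unseen variants so each gets tried at least once.
        return self.successes[v] / self.counts[v] if self.counts[v] else 1.0

    def choose(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))  # explore
        return max(self.counts, key=self._rate)      # exploit best observed

    def record(self, variant: str, success: bool) -> None:
        self.counts[variant] += 1
        self.successes[variant] += int(success)

# Simulated traffic: model_b has a slightly higher true success rate.
random.seed(0)
router = EpsilonGreedyRouter(["model_a", "model_b"])
for _ in range(1000):
    v = router.choose()
    router.record(v, random.random() < (0.92 if v == "model_a" else 0.94))
print(router.counts)  # traffic concentrates on the better-performing variant
```

This also motivates the "high traffic scenarios" caveat: with few requests, the observed success rates are too noisy for the exploit step to be reliable.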

Phase 9: Code Quality & Polish

Estimated addition: Throughout all phases

  • Improve all code examples

    • Add comprehensive error handling
    • Add structured logging statements
    • Add type hints for all functions
    • Add docstrings
    • Include requirements.txt snippets for each exercise
  • Add callout boxes (using mkdocs admonitions)

    • !!! tip for common pitfalls
    • !!! note for GCP-specific features
    • !!! warning for statistical errors to avoid
  • Ensure consistent formatting

    • All exercises follow same structure
    • Consistent code style (Black formatted)
    • Proper markdown formatting

Estimated Impact

| Section               | Current Lines | Estimated Addition | Total Lines |
|-----------------------|---------------|--------------------|-------------|
| Introduction          | 22            | +30                | ~52         |
| A/B Testing           | 82            | +80                | ~162        |
| Canary                | 19            | +60                | ~79         |
| Shadow                | 41            | +70                | ~111        |
| Comparison (NEW)      | 0             | +30                | ~30         |
| Best Practices (NEW)  | 0             | +40                | ~40         |
| Knowledge Check       | 18            | +20                | ~38         |
| Advanced Topics (NEW) | 0             | +40                | ~40         |
| **TOTAL**             | 190           | +370               | ~560        |

Implementation Order

  1. Phase 1-2 (Introduction & A/B Testing) - Most incomplete, foundation
  2. Phase 3-4 (Canary & Shadow) - Build on A/B concepts
  3. Phase 5 (Comparison) - Ties strategies together
  4. Phase 6-7 (Best Practices & Knowledge Check) - Practical application
  5. Phase 8 (Advanced Topics) - Optional enrichment
  6. Phase 9 (Polish) - Final touches throughout

Success Criteria

The module will be considered complete when:

  • ✅ All three strategies have clear 2-3 paragraph conceptual explanations
  • ✅ Each strategy has 2-3 complete, working exercises with full instructions
  • ✅ Exercises use simple ML models (MNIST-based) appropriate for course
  • ✅ A/B testing includes moderate statistical coverage with Python examples
  • ✅ All placeholder/incomplete content is filled or removed
  • ✅ Comparison, best practices, and advanced topics sections exist
  • ✅ All code examples include error handling, logging, and type hints
  • ✅ Knowledge check table is fixed and includes scenario questions
  • ✅ Module follows same style/format as other course modules

Related Files

  • Main file: s7_deployment/deployment_testing.md
  • Related modules:
    • s7_deployment/testing_apis.md (prerequisite)
    • s7_deployment/apis.md (prerequisite)
    • s7_deployment/cloud_deployment.md (prerequisite)
  • Figures needed: All referenced diagrams should exist in ../figures/

Notes

  • This is a substantial completion effort (~370 lines of new content)
  • Each phase can be implemented independently in separate PRs if preferred
  • All exercises should be tested on actual GCP to ensure they work
  • Consider creating sample MNIST models for exercises in a separate directory
  • Statistical examples should use commonly available Python packages (scipy, numpy)
