Complete deployment_testing.md module - Implementation plan #568
Description
Overview
This issue tracks the completion of the s7_deployment/deployment_testing.md module, which was started but never finished. The module currently covers three deployment testing strategies (A/B Testing, Canary Deployment, Shadow Deployment) but has significant incomplete content and placeholder text.
Related PR: #268 (branch: deployment-testing)
Current State Analysis
✅ What's Complete
- Strong introduction explaining deployment testing importance for ML projects
- Three main section structures: A/B Testing, Canary Deployment, Shadow Deployment
- Visual diagrams referenced for all three strategies
- Knowledge check table at the end
- Basic exercise scaffolding
❌ What's Incomplete/Missing
- A/B Testing Section (Lines 23-105)
  - Missing explanatory text after diagram (line 47 is just `[text]`)
  - Incomplete exercise description (line 62 ends mid-sentence: "The second")
  - Exercise 1 shows geolocation code without context or instructions
  - No statistical analysis guidance despite having a table of tests (lines 50-56)
- Canary Deployment Section (Lines 107-126)
  - No conceptual explanation (only diagram)
  - Minimal exercise (just an external link and one command mention)
- Shadow Deployment Section (Lines 128-169)
  - No conceptual explanation (only diagram)
  - Exercise code needs improvement (currently just random routing, not true shadowing)
  - Missing analysis/comparison steps
- General Issues
  - No learning objectives section
  - No prerequisites section
  - Knowledge check table has a duplicate column header (line 175)
  - No ML-specific examples throughout
  - Code examples lack error handling, logging, and type hints
Implementation Requirements
Based on discussion, the completed module should have:
- Content depth: Brief (2-3 paragraphs per concept)
- Exercise level: Intermediate (guidance with room for problem-solving)
- Examples: Simple ML models (MNIST-based)
- Statistics: Moderate (hypothesis testing + Python examples)
- Additional sections: Comparison, best practices, advanced topics
Implementation Plan
Phase 1: Introduction & Structure (Lines 1-22)
Estimated addition: ~30 lines
- Add Learning Objectives box after line 11
  - Understand three deployment testing strategies
  - Implement A/B testing with statistical validation
  - Deploy canary releases with gradual rollout
  - Set up shadow deployments for risk-free testing
- Add Prerequisites section
  - Link to: APIs module, Cloud Deployment module, Testing APIs module
  - Required: deployed FastAPI app, GCP account, basic statistics knowledge
- Add Quick Comparison Table before line 19
  - Columns: Strategy, Use Case, Risk Level, Complexity, User Impact
  - Helps students choose a strategy at a glance
Phase 2: A/B Testing Section (Lines 23-105)
Estimated addition: ~80 lines
- Add conceptual introduction (2-3 paragraphs) after line 27
  - Explain A/B testing for ML deployments
  - Example: testing MNIST models with different preprocessing
  - When to use: model version comparison, feature changes
- Fix line 47 placeholder
  - Replace `[text](link)` with actual content
  - Explain: sample size determination, confidence intervals, stopping criteria
  - Link to a statistical calculator and a Python implementation
- Add statistical guidance after the table (line 56)
  - Brief explanation of each test in the table
  - Python example using `scipy.stats`
  - Sample size calculator example
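The `scipy.stats` example and sample size calculator mentioned above could take roughly this shape: a two-proportion z-test for comparing two models' accuracies, plus a rough per-variant sample size estimate. This is a sketch; the function names and the 920/940-correct counts are illustrative, chosen to match the 92% vs. 94% figures used in Exercise 2.

```python
import math

from scipy.stats import norm


def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> tuple[float, float]:
    """Compare two observed accuracies; return the z statistic and two-sided p-value."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)  # pooled proportion under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided
    return z, p_value


def sample_size_per_variant(p_a: float, p_b: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Rough per-variant sample size to distinguish p_a from p_b (two-sided test)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p_a + p_b) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
         / (p_a - p_b) ** 2)
    return math.ceil(n)


# Model A: 920/1000 correct (92%), Model B: 940/1000 correct (94%)
z, p = two_proportion_z_test(920, 1000, 940, 1000)
n_needed = sample_size_per_variant(0.92, 0.94)
```

With 1000 samples per variant the 2-point accuracy gap is not significant at the 5% level, which motivates the sample size calculator: a 92% vs. 94% difference needs a few thousand predictions per variant.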
- Complete Exercise Section
  - Fix line 62: complete the sentence ("The second will test model performance differences")
  - Exercise 1: Geo-based A/B Testing
    - Add context: route users by geography to test regional variants
    - Setup instructions: install geoip2, download the GeoLite2 database
    - Deploy 2 Cloud Run services with different model versions
    - Add monitoring/logging code to track variant assignment
  - Exercise 2 (NEW): Statistical Analysis
    - Provide sample data (Model A: 92% accuracy, Model B: 94% accuracy)
    - Calculate statistical significance using the provided test table
    - Python code example with a t-test
  - Exercise 3 (NEW): Simple Traffic Split A/B Test
    - Deploy two MNIST models (baseline vs. with data augmentation)
    - Use Cloud Run traffic splitting (50/50)
    - Collect prediction logs
    - Analyze which model performs better
Phase 3: Canary Deployment Section (Lines 107-126)
Estimated addition: ~60 lines
- Add conceptual explanation after line 107 (2-3 paragraphs)
  - What canary deployment is (origin: the canary in the coal mine)
  - How it works: gradual rollout (5% → 25% → 50% → 100%)
  - ML use case: rolling out a retrained model safely
- Add monitoring guidance
  - Metrics to track: accuracy, latency, error rate, user engagement
  - When to roll back vs. proceed
  - Brief mention of automation possibilities
- Expand exercises
  - Exercise 1: keep the GCP guide link, add context
    - What they'll build, expected time, prerequisites checklist
  - Exercise 2 (EXPAND from lines 122-126): Step-by-step Canary
    - Deploy MNIST model v1 (baseline)
    - Deploy MNIST model v2 (improved architecture, e.g., added dropout)
    - Use `gcloud run services update-traffic` with progressive percentages
    - Monitor logs in Cloud Logging
    - Provide a script template to automate traffic increases
  - Exercise 3 (NEW): Rollback Scenario
    - Simulate an issue (v2 has a higher error rate on edge cases)
    - Practice immediate rollback
    - Document decision criteria and learnings
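The "script template to automate traffic increases" in Exercise 2 could be sketched as a thin Python wrapper around `gcloud run services update-traffic`. The service name, revision, region, and soak time below are placeholders; checking metrics and rolling back on a bad canary is deliberately left as part of the exercise.

```python
import subprocess
import time


def build_traffic_command(service: str, revision: str, percent: int,
                          region: str) -> list[str]:
    """Build the gcloud command that routes `percent` of traffic to `revision`."""
    return [
        "gcloud", "run", "services", "update-traffic", service,
        f"--to-revisions={revision}={percent}",
        f"--region={region}",
    ]


def progressive_rollout(service: str, revision: str, region: str,
                        steps: tuple[int, ...] = (5, 25, 50, 100),
                        soak_seconds: int = 600) -> None:
    """Shift traffic to the canary revision in steps, pausing between them."""
    for percent in steps:
        subprocess.run(build_traffic_command(service, revision, percent, region),
                       check=True)  # raise immediately if the gcloud call fails
        print(f"{percent}% of traffic now routed to {revision}")
        if percent < 100:
            # Watch error rate and latency in Cloud Logging/Monitoring here;
            # abort and roll back if the canary misbehaves.
            time.sleep(soak_seconds)
```

Usage would look like `progressive_rollout("mnist-api", "mnist-api-v2", "europe-west1")`, run from a machine already authenticated with `gcloud`.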
Phase 4: Shadow Deployment Section (Lines 128-169)
Estimated addition: ~70 lines
- Add conceptual explanation after line 128 (2-3 paragraphs)
  - What shadow deployment is
  - Zero user risk - perfect for ML model validation
  - Example: test a new MNIST model architecture alongside production
- Add implementation patterns
  - Load balancer approach (the current exercise approach)
  - Service mesh (mention Istio but note its complexity)
  - Application-level duplication
  - When to use each pattern
- Fix & expand Exercise 1
  - Step 1 (IMPROVE lines 144-166): Fix load balancer code
    - Current code does random routing (not true shadowing!)
    - Replace with a proper shadow implementation:
      - Send the request to both primary and shadow
      - Only return the primary response to the user
      - Log the shadow response for comparison
    - Use `httpx` for async requests
    - Add structured logging for comparison
  - Step 2 (IMPROVE lines 168-169): Better deployment
    - Deploy to Cloud Run (a better fit than Cloud Functions)
    - Provide a requirements.txt
    - Full deployment commands with `gcloud run deploy`
  - Step 3 (NEW): Deploy Two Model Versions
    - Primary: stable MNIST model
    - Shadow: experimental model (e.g., CNN vs. ResNet architecture)
    - Configure the load balancer to duplicate traffic
  - Step 4 (NEW): Analysis Exercise
    - Query logs from both versions using Cloud Logging
    - Compare predictions on identical inputs
    - Analyze latency differences
    - Make a promotion decision: deploy shadow to production or iterate
Phase 5: Comparison Section (NEW)
Estimated addition: ~30 lines
Add new section after Shadow Deployment, before Knowledge Check:
- "When to Use Which Strategy" subsection
  - Decision flowchart (text-based is fine)
  - Criteria: risk tolerance, rollback requirements, testing goals, traffic volume
- "Combining Strategies" subsection
  - Example workflow: shadow first → canary → A/B test
  - Progressive de-risking approach for ML deployments
Phase 6: Best Practices (NEW)
Estimated addition: ~40 lines
Add new section after Comparison:
- General deployment testing best practices
  - Always monitor key metrics during tests
  - Define success criteria upfront
  - Have a rollback plan ready and tested
  - Document all decisions and results
- ML-specific considerations
  - Monitor for model drift during deployment
  - Watch for feature distribution shifts
  - Consider latency vs. accuracy tradeoffs
  - Calculate A/B test duration for statistical power
  - Version control for models and data preprocessing
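The "feature distribution shifts" bullet can be made concrete with a small check, e.g. a two-sample Kolmogorov-Smirnov test between training-time and live feature values. A sketch, with simulated data and an illustrative threshold; the normal distributions here merely stand in for, say, per-image mean pixel intensities:

```python
import numpy as np
from scipy.stats import ks_2samp


def feature_shift_detected(reference: np.ndarray, live: np.ndarray,
                           alpha: float = 0.01) -> bool:
    """Flag a shift when the two samples are unlikely to share a distribution."""
    statistic, p_value = ks_2samp(reference, live)
    return bool(p_value < alpha)


rng = np.random.default_rng(seed=0)
train_pixels = rng.normal(loc=0.13, scale=0.3, size=5_000)  # reference sample
same_dist = rng.normal(loc=0.13, scale=0.3, size=5_000)     # no real shift
shifted = rng.normal(loc=0.25, scale=0.3, size=5_000)       # simulated drift

no_shift = feature_shift_detected(train_pixels, same_dist)
shift = feature_shift_detected(train_pixels, shifted)
```

Run periodically against a rolling window of live inputs, a check like this gives an early warning before accuracy metrics visibly degrade.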
Phase 7: Knowledge Check (Lines 171-188)
Estimated addition: ~20 lines
- Fix table at line 175
  - Remove the duplicate "Releasing to users based on conditions" column
  - Replace it with "Negative user impact" or "Complexity"
- Add scenario-based questions after the table
  - Q1: "You need to test a new model with zero user risk. Which strategy?"
  - Q2: "You want to statistically compare two models. Which strategy?"
  - Q3: "You want to gradually roll out a model with easy rollback. Which strategy?"
  - Provide solutions with reasoning
Phase 8: Advanced Topics (NEW - Optional)
Estimated addition: ~40 lines
Add new optional section before final "This ends..." line:
- Blue-Green Deployment
  - Brief explanation (2 paragraphs)
  - Difference from canary (instant switch vs. gradual rollout)
  - Quick GCP implementation note
- Feature Flags
  - How feature flags enable gradual rollout
  - Tools: LaunchDarkly, Optimizely, or a custom solution
  - Simple implementation example
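The "simple implementation example" could be a custom percentage-rollout flag: hash the user ID together with the flag name so each user gets a stable assignment, then bucket into 100 slots. A sketch; the class and flag names are illustrative:

```python
import hashlib


class PercentageFlag:
    """Minimal feature flag: stable per-user percentage rollout."""

    def __init__(self, name: str, rollout_percent: int) -> None:
        self.name = name
        self.rollout_percent = rollout_percent

    def is_enabled(self, user_id: str) -> bool:
        """Hash flag name + user so each user gets a stable, uniform bucket."""
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100  # map the user into one of 100 buckets
        return bucket < self.rollout_percent


new_model_flag = PercentageFlag("use-model-v2", rollout_percent=25)
enabled_users = sum(new_model_flag.is_enabled(f"user-{i}") for i in range(10_000))
```

Because the assignment is a pure function of the IDs, a user keeps seeing the same model version across requests, which hosted tools like LaunchDarkly also guarantee.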
- Multi-Armed Bandits (OPTIONAL)
  - Dynamic A/B testing that automatically optimizes traffic allocation
  - When to consider (high-traffic scenarios)
  - Brief algorithm overview
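The "brief algorithm overview" could center on epsilon-greedy, the simplest bandit: mostly route traffic to the best-observed variant, occasionally explore at random. A self-contained simulation sketch (success rates and step counts are illustrative, echoing the 92%/94% accuracies used earlier):

```python
import random


def epsilon_greedy_bandit(true_rates: list[float], steps: int = 20_000,
                          epsilon: float = 0.1, seed: int = 42) -> list[int]:
    """Simulate epsilon-greedy routing over model variants.

    Returns how many requests each variant received.
    """
    rng = random.Random(seed)
    counts = [0] * len(true_rates)      # requests routed to each variant
    successes = [0] * len(true_rates)   # observed successes per variant

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_rates))  # explore: random variant
        else:
            # Exploit: pick the variant with the best observed rate so far
            # (untried variants get an optimistic 1.0 so each is tried once).
            arm = max(range(len(true_rates)),
                      key=lambda i: successes[i] / counts[i] if counts[i] else 1.0)
        counts[arm] += 1
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
    return counts


pulls = epsilon_greedy_bandit([0.92, 0.94])  # e.g. model A vs. model B
```

Unlike a fixed 50/50 A/B split, the bandit shifts most traffic toward the better variant while the test is still running, which is why it mainly pays off in high-traffic scenarios.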
Phase 9: Code Quality & Polish
Estimated addition: Throughout all phases
- Improve all code examples
  - Add comprehensive error handling
  - Add structured logging statements
  - Add type hints for all functions
  - Add docstrings
  - Include requirements.txt snippets for each exercise
- Add callout boxes (using mkdocs admonitions)
  - `!!! tip` for common pitfalls
  - `!!! note` for GCP-specific features
  - `!!! warning` for statistical errors to avoid
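For reference, one of the `!!! warning` boxes for statistical errors might look like this in mkdocs syntax (the wording is illustrative):

```
!!! warning "Avoid peeking"
    Repeatedly checking significance before the planned sample size is
    reached inflates the false-positive rate. Fix the stopping criterion
    before starting the A/B test.
```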
- Ensure consistent formatting
  - All exercises follow the same structure
  - Consistent code style (Black-formatted)
  - Proper markdown formatting
Estimated Impact
| Section | Current Lines | Estimated Addition | Total Lines |
|---|---|---|---|
| Introduction | 22 | +30 | ~52 |
| A/B Testing | 82 | +80 | ~162 |
| Canary | 19 | +60 | ~79 |
| Shadow | 41 | +70 | ~111 |
| Comparison (NEW) | 0 | +30 | ~30 |
| Best Practices (NEW) | 0 | +40 | ~40 |
| Knowledge Check | 18 | +20 | ~38 |
| Advanced Topics (NEW) | 0 | +40 | ~40 |
| TOTAL | 190 | +370 | ~560 |
Implementation Order
- Phase 1-2 (Introduction & A/B Testing) - Most incomplete, foundation
- Phase 3-4 (Canary & Shadow) - Build on A/B concepts
- Phase 5 (Comparison) - Ties strategies together
- Phase 6-7 (Best Practices & Knowledge Check) - Practical application
- Phase 8 (Advanced Topics) - Optional enrichment
- Phase 9 (Polish) - Final touches throughout
Success Criteria
The module will be considered complete when:
- ✅ All three strategies have clear 2-3 paragraph conceptual explanations
- ✅ Each strategy has 2-3 complete, working exercises with full instructions
- ✅ Exercises use simple ML models (MNIST-based) appropriate for course
- ✅ A/B testing includes moderate statistical coverage with Python examples
- ✅ All placeholder/incomplete content is filled or removed
- ✅ Comparison, best practices, and advanced topics sections exist
- ✅ All code examples include error handling, logging, and type hints
- ✅ Knowledge check table is fixed and includes scenario questions
- ✅ Module follows same style/format as other course modules
Related Files
- Main file: `s7_deployment/deployment_testing.md`
- Related modules:
  - `s7_deployment/testing_apis.md` (prerequisite)
  - `s7_deployment/apis.md` (prerequisite)
  - `s7_deployment/cloud_deployment.md` (prerequisite)
- Figures needed: all referenced diagrams should exist in `../figures/`
Notes
- This is a substantial completion effort (~370 lines of new content)
- Each phase can be implemented independently in separate PRs if preferred
- All exercises should be tested on actual GCP to ensure they work
- Consider creating sample MNIST models for exercises in a separate directory
- Statistical examples should use commonly available Python packages (scipy, numpy)