
Quick Start Guide

Get value from this toolkit in 30 minutes.

Step 1: Run the Audit (15 minutes)

  1. Open the audit template:

    templates/environment-audit-template.md
  2. Fill out the five gap sections:

    • Traffic Realism
    • Dependency Realism
    • Data Realism
    • Configuration Realism
    • Temporal Realism
  3. Don't overthink it. Just answer:

    • What's different between staging and production?
    • Could that difference hide a real problem?
    • What are we going to do about it?

See a completed example: examples/completed-audit-example.md

Step 2: Use the Scripts (5 minutes)

Check Traffic Gaps Automatically

# Compare production metrics to your load test config
python scripts/traffic_gap_detector.py \
  --prod-metrics examples/prod_metrics_example.json \
  --load-test your_load_test_config.yaml

This will tell you if your load test is missing critical patterns.
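The core comparison is simple: peak production traffic versus what the load test is configured to push. Here is a minimal sketch of that kind of check, assuming illustrative field names (`peak_rps`, `baseline_rps`, `target_rps`, `spike_stages`) rather than the script's actual schema:

```python
# Sketch of the detector's core comparison: observed production traffic
# vs. the load test's configured rate. Field names are assumptions for
# illustration, not the toolkit's real input format.

def find_traffic_gaps(prod_metrics: dict, load_test: dict) -> list[str]:
    """Flag load-test settings that fall short of observed production traffic."""
    gaps = []
    prod_peak = prod_metrics["peak_rps"]
    test_rate = load_test["target_rps"]
    if test_rate < prod_peak:
        gaps.append(
            f"[HIGH] Load test targets {test_rate} RPS but production "
            f"peaks at {prod_peak} RPS"
        )
    # Spikes: compare prod peak-to-baseline ratio against the test profile
    spike_ratio = prod_peak / prod_metrics["baseline_rps"]
    if spike_ratio > 1.5 and not load_test.get("spike_stages"):
        gaps.append(
            f"[HIGH] Production spikes to {spike_ratio:.1f}x baseline "
            "but the test profile is steady-state only"
        )
    return gaps

for gap in find_traffic_gaps(
    {"peak_rps": 31500, "baseline_rps": 15000},
    {"target_rps": 15000},
):
    print(gap)
```

A steady-state test at baseline rate passes both thresholds in production's favor here, so both gaps fire.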

Compare Database Configs

# Compare critical DB settings between environments
./scripts/db_config_diff.sh \
  --prod-host prod-db.example.com \
  --staging-host staging-db.example.com

Finds timeout and connection pool mismatches that invalidate tests.
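The idea behind the diff is just a key-by-key comparison over the settings that change test outcomes. A minimal sketch, with setting names and values chosen for illustration rather than pulled from a real database:

```python
# Sketch of the kind of diff the script produces: compare a handful of
# timeout/pool settings between environments and report every mismatch.
# Setting names and values here are illustrative assumptions.

CRITICAL_SETTINGS = [
    "statement_timeout",
    "max_connections",
    "idle_in_transaction_session_timeout",
]

def diff_db_configs(prod: dict, staging: dict) -> dict:
    """Return {setting: (prod_value, staging_value)} for every mismatch."""
    return {
        key: (prod.get(key), staging.get(key))
        for key in CRITICAL_SETTINGS
        if prod.get(key) != staging.get(key)
    }

mismatches = diff_db_configs(
    {"statement_timeout": "30s", "max_connections": 500,
     "idle_in_transaction_session_timeout": "60s"},
    {"statement_timeout": "0", "max_connections": 100,
     "idle_in_transaction_session_timeout": "60s"},
)
for setting, (prod_val, staging_val) in mismatches.items():
    print(f"{setting}: prod={prod_val} staging={staging_val}")
```

In this example, staging has no statement timeout and a fifth of production's connection pool, which is exactly the kind of mismatch that makes a passing test meaningless.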

Step 3: Fix Your Mocks (10 minutes)

If you're testing against mocked dependencies, replace perfect stubs with realistic behavior.

Example: Payment Gateway Mock

Before (useless):

import time

# Always succeeds in 50ms
def mock_payment():
    time.sleep(0.05)
    return {"status": "success"}

After (realistic):

# Run our realistic mock instead
python mocks/payment_gateway_realistic.py

This mock includes:

  • Rate limiting (429 errors at 100 TPS)
  • Tail latency (p99 = 2 seconds)
  • Realistic error rate (2%)
  • Timeout behavior

It will expose issues your perfect mock hides.
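For a sense of what those behaviors look like in code, here is a condensed sketch, not the actual contents of mocks/payment_gateway_realistic.py. The thresholds are the ones cited above; the caller passes its current throughput so the sketch stays single-threaded and simple:

```python
import random
import time

# Condensed sketch of a "mean" payment mock: rate limiting, tail
# latency, and a baseline error rate. Illustrative only; see
# mocks/payment_gateway_realistic.py for the real implementation.

RATE_LIMIT_TPS = 100      # gateway returns 429 beyond this throughput
ERROR_RATE = 0.02         # 2% of calls fail outright
P99_LATENCY_S = 2.0       # the slowest 1% of calls take ~2 seconds

def mock_payment(current_tps: float) -> dict:
    # Rate limiting: push back instead of absorbing unlimited load
    if current_tps > RATE_LIMIT_TPS:
        return {"status": "error", "code": 429}
    # Tail latency: most calls are fast, a few are painfully slow
    time.sleep(P99_LATENCY_S if random.random() < 0.01 else 0.05)
    # Baseline error rate: real gateways fail sometimes
    if random.random() < ERROR_RATE:
        return {"status": "error", "code": 503}
    return {"status": "success"}
```

A client that retries blindly on 429 will now generate a retry storm in your test, which is the point.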

What You Should Find

After running this audit, you'll typically find:

  1. Your load test runs steady-state, production spikes → Add spike simulation
  2. Your mocks are perfect, real dependencies push back → Make mocks meaner
  3. Your timeouts don't match production → Match them or document the gap
  4. You're testing at 2 PM, production breaks at 10 PM during batch jobs → Test during real failure windows
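For finding #1, "add spike simulation" can be as simple as swapping a constant-rate profile for a staged one. A sketch using the numbers from this guide (15K RPS baseline, 2.1x spike); the stage format is an assumption you would adapt to your load-test tool:

```python
# A staged load profile instead of steady state: warm up, ramp to the
# spike, hold it long enough to exhaust connection pools, then recover.
# The (duration_s, rps) tuple format is illustrative, not a real tool's
# config schema.

BASELINE_RPS = 15_000
SPIKE_MULTIPLIER = 2.1

def spike_profile(baseline=BASELINE_RPS, multiplier=SPIKE_MULTIPLIER):
    """Return load stages as (duration_seconds, target_rps) tuples."""
    peak = int(baseline * multiplier)
    return [
        (300, baseline),   # steady-state warm-up
        (60, peak),        # fast ramp to campaign-level traffic
        (300, peak),       # hold the spike long enough to exhaust pools
        (60, baseline),    # recovery: watch for retry storms draining
    ]

for duration, rps in spike_profile():
    print(f"{duration:>4}s @ {rps} RPS")
```

The hold stage matters most: pool exhaustion and retry storms often take minutes to develop, so a one-minute blip at peak rate proves little.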

Example Output

Here's what gaps look like when you find them:

[HIGH] Traffic Spike Gap
  Production spikes to 2.1x baseline during campaigns
  Test runs steady state at 15K RPS
  Impact: May miss connection pool exhaustion during spikes
  Action: Add spike config to load test

[HIGH] Dependency Mock Gap
  Payment gateway throttles at 100 TPS in production
  Mock never throttles
  Impact: Won't see retry storms or circuit breaker behavior
  Action: Replace with realistic mock

[MEDIUM] Data Volume Gap
  Production: 14M rows
  Staging: 400K rows
  Impact: Query performance may differ
  Action: Document gap, consider seeding more data

What to Do With Gaps

You have four options for each gap:

  1. Fix it - Match staging to production (best option if feasible)
  2. Document it - Write down the delta and adjust your conclusions
  3. Test differently - If the gap is too big, test in production instead (with guardrails)
  4. Accept it - Some gaps don't matter for your specific test

Don't let perfect be the enemy of good. You're not trying to make staging identical to production. You're trying to understand where they differ and what that means.

Next Steps

After your first audit:

  1. Run your chaos experiment with eyes open about the gaps
  2. Come back to the template and fill out the "Post-Test Follow-up" section
  3. Build a gap library for your team (which gaps matter for which tests?)
  4. Automate what you can using the scripts in this repo

Need Help?

The goal is simple: stop trusting test results from environments that haven't earned that trust.