A practical toolkit for auditing environment gaps before chaos testing, based on lessons from production failures.
Most chaos tests pass in staging and fail in production. Not because the tests are bad, but because staging lies about what production will do.
This toolkit helps you find those lies before they cost you an incident.
- 30-Minute Audit Template: Spreadsheet format for comparing staging vs production
- Gap Detection Scripts: Automated checks for common environment differences
- Mock Configuration Examples: How to make your mocks behave like real dependencies under stress
- Config Diff Tools: Scripts to compare critical settings across environments
- Real Examples: Anonymized audit results showing what we found and what we did about it
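As a rough illustration of what "behave like real dependencies under stress" can mean, the sketch below models a mock whose response time grows as concurrent load approaches a capacity limit, instead of returning a fixed latency. It is not taken from the toolkit's `/mocks/` directory. The function name, the parameters (`base_ms`, `capacity`, `jitter`), and the queueing-style formula are all assumptions chosen for illustration.

```python
import random
from typing import Optional


def mock_latency_ms(
    in_flight: int,
    base_ms: float = 20.0,   # hypothetical unloaded latency
    capacity: int = 50,      # hypothetical concurrency the real service tolerates
    jitter: float = 0.1,     # +/- fraction of random noise
    rng: Optional[random.Random] = None,
) -> float:
    """Return a simulated latency that inflates as load nears capacity."""
    rng = rng or random.Random()
    # Cap utilization below 1.0 so the latency stays finite.
    utilization = min(in_flight / capacity, 0.99)
    # Queueing-style heuristic (an assumption): latency ~ base / (1 - utilization).
    latency = base_ms / (1.0 - utilization)
    # Apply symmetric jitter so repeated calls are not identical.
    return latency * (1.0 + rng.uniform(-jitter, jitter))


if __name__ == "__main__":
    for load in (0, 25, 45, 100):
        print(load, round(mock_latency_ms(load, jitter=0.0), 1))
```

A mock that returns a constant 20 ms no matter how many requests are in flight will not trigger the timeouts and retries that a saturated real dependency would, which is the kind of gap this toolkit is meant to surface.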
- Run the 30-minute audit using the template in /templates/environment-audit-template.md
- Use the gap detection scripts in /scripts/ to automate common checks
- Review the examples in /examples/ to see what realistic audits look like
- Implement realistic mocks using configs from /mocks/
This toolkit focuses on the five gaps that most commonly invalidate chaos test results:
- Traffic Realism: Does your load pattern mirror production spikes?
- Dependency Realism: Do your mocks behave like real services under stress?
- Data Realism: Are your data volume and distribution realistic?
- Configuration Realism: Do timeouts, pools, and retries match production?
- Temporal Realism: Are you testing during the right time windows?
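The Configuration Realism check above can be sketched as a small diff over two config dictionaries. This is a minimal example and not one of the toolkit's own scripts. The key names (`timeout_ms`, `pool_size`, `max_retries`) and the nested config layout are hypothetical, so substitute whatever your services actually use.

```python
from typing import Any

# Hypothetical setting names to compare; adjust to your own stack.
CRITICAL_KEYS = ("timeout_ms", "pool_size", "max_retries")
MISSING = "<missing>"


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted paths, e.g. 'db.pool_size'."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_critical(staging, production, critical_keys=CRITICAL_KEYS):
    """Return (path, staging_value, production_value) for each critical mismatch."""
    s, p = flatten(staging), flatten(production)
    gaps = []
    for path in sorted(set(s) | set(p)):
        # Only compare settings whose final path segment is a critical key.
        if path.rsplit(".", 1)[-1] not in critical_keys:
            continue
        sv, pv = s.get(path, MISSING), p.get(path, MISSING)
        if sv != pv:
            gaps.append((path, sv, pv))
    return gaps


if __name__ == "__main__":
    staging = {"db": {"pool_size": 5, "timeout_ms": 30000}}
    production = {"db": {"pool_size": 50, "timeout_ms": 30000}}
    for path, sv, pv in diff_critical(staging, production):
        print(f"{path}: staging={sv} production={pv}")
```

Reporting settings that exist in only one environment, rather than skipping them, is a deliberate choice here: a timeout defined in production but absent in staging is itself a gap.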
This toolkit came from a Black Friday incident where we passed every chaos test in staging and production collapsed anyway. The problem wasn't our testing approach. It was the environment we tested in.
Read the full story:
- War Story: We Passed Every Chaos Test in Staging. Production Still Melted Down.
- Field Guide: The Environment Realism Checklist We Wish We'd Had Six Months Ago
This toolkit involves chaos engineering, which can break production systems if used incorrectly.
Before using any scripts or running chaos experiments:
- Read SECURITY.md for critical security and production considerations
- Get appropriate approvals from your organization
- Start in safe, non-production environments
- Have monitoring, kill switches, and rollback plans ready
TL;DR: These are educational examples and starting points, not production-ready tools. Review, customize, and test thoroughly before use. You are responsible for what you break.
Found a gap we missed? Built a useful script? Open a PR or issue.
MIT License - See LICENSE for details.
This toolkit is part of CERA (Chaos Engineering Requirement Analysis), a framework for structured chaos testing: