+| Chaos tests | Resilience | Contextual | Cause failures in a system to test the resilience of that system and its environment, and our ability to respond to failures | Give the team confidence that failures in a given environment will not lead to unplanned downtime or a negative user experience<br/><br/>Ensure that the team has visibility (e.g. dashboards and alerts) to be able to identify issues<br/><br/>Help the team understand their mean time to recovery (MTTR) and build muscle memory and confidence for recovery activities | Regular (at least every couple of months) game days, and:<br/><br/>Builds fail if any test fails. Note that these tests are slow, so they are likely to run in an infrequently triggered (e.g. overnight) build<br/><br/>The tests cover whether the system self-heals, auto-scales, and alerts as expected | | |
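
As a rough illustration of the chaos-test row above, an automated check might inject a failure and then assert that the system recovers and raises an alert within a deadline. The sketch below is a toy simulation only: `FakeService`, `Supervisor`, and `run_chaos_test` are hypothetical names invented for this example, not part of any real tool, and a real test would target actual infrastructure and monitoring.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FakeService:
    """Hypothetical stand-in for a real service instance."""
    healthy: bool = True


@dataclass
class Supervisor:
    """Hypothetical self-healing loop: alerts on, then restarts, an unhealthy service."""
    service: FakeService
    alerts: list = field(default_factory=list)

    def tick(self) -> None:
        if not self.service.healthy:
            self.alerts.append("service_unhealthy")  # simulate firing an alert
            self.service.healthy = True              # simulate a restart


def run_chaos_test(supervisor, recovery_deadline_s=1.0, poll_interval_s=0.01):
    """Inject a failure, then report whether the system healed and alerted in time."""
    supervisor.service.healthy = False  # the injected failure
    start = time.monotonic()
    while time.monotonic() - start < recovery_deadline_s:
        supervisor.tick()
        if supervisor.service.healthy:
            return {
                "recovered": True,
                "alerted": bool(supervisor.alerts),
                "recovery_s": time.monotonic() - start,  # one sample towards MTTR
            }
        time.sleep(poll_interval_s)
    return {"recovered": False, "alerted": bool(supervisor.alerts), "recovery_s": None}


if __name__ == "__main__":
    result = run_chaos_test(Supervisor(FakeService()))
    # Fail the build if the system did not self-heal or did not alert.
    assert result["recovered"] and result["alerted"], result
```

Recording `recovery_s` on each run gives one data point towards the MTTR figure mentioned in the row; auto-scaling checks would follow the same inject-then-assert shape.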