ChaosKit is a Go framework for code-level chaos engineering and fault injection. It enables controlled failures inside your functions — delays, panics, errors, resource pressure — and provides validators to ensure your application behaves correctly under adverse conditions.
Most chaos tools operate at the infrastructure level (containers, nodes, networks). ChaosKit focuses on what they cannot test: business logic, compensations, concurrency behavior, state invariants, and internal error-handling paths.
- Code-level fault injection (panic, delay, error, resource faults)
- Context-based activation (no-op unless chaos context is attached)
- Scenario DSL (define steps, injectors, validators, scopes)
- Severity-aware validators feeding a verdict engine
- Built-in reporters & success thresholds for PASS/UNSTABLE/FAIL outcomes
- Metrics collector & Prometheus exporter for observability
- Go `testing` integration helpers with configurable thresholds
- Integration with ToxiProxy for network chaos
- Optional monkey-patching for advanced cases
- Fully opt-in via build tags (`-tags=chaos`)
ChaosKit is safe to include in your codebase: fault injection never activates accidentally.
```bash
go get github.com/rom8726/chaoskit
```

```go
scenario := chaoskit.NewScenario("workflow-stability").
	WithTarget(engine).
	Step("run workflow", ExecuteWorkflow).
	Inject("delays", injectors.RandomDelay(10*time.Millisecond, 100*time.Millisecond)).
	Inject("panic", injectors.PanicProbability(0.01)).
	Assert("goroutines", validators.GoroutineLimit(200)).
	Assert("no-deadlock", validators.NoInfiniteLoop(5*time.Second)).
	Repeat(100).
	Build()

if err := chaoskit.Run(context.Background(), scenario); err != nil {
	log.Fatalf("Chaos scenario failed: %v", err)
}
```

ChaosKit’s fault injectors are compiled only when explicitly enabled:

```bash
go build -tags=chaos .
```

Without the `chaos` tag:
- all injectors become no-ops,
- chaos code is excluded from the binary,
- scenarios run without injecting any faults.
This ensures ChaosKit never affects production binaries unless intentionally enabled.
Even with the build tag, chaos runs only when a special context is attached:

```go
ctx := chaoskit.AttachChaos(context.Background())
```

Without this context, calls like:

```go
chaoskit.MaybePanic(ctx)
chaoskit.MaybeDelay(ctx)
if err := chaoskit.MaybeError(ctx); err != nil {
	return err
}

child, cancel := chaoskit.MaybeCancelContext(ctx)
defer cancel()

if chaoskit.ApplyChaos(child, "force-retry") {
	// Provider-specific logic
}
```

and event hooks such as

```go
chaoskit.RecordRecursionDepth(ctx, depth)
chaoskit.RecordError(ctx)
```

are strictly no-ops.
Define multi-step, multi-injector stress or chaos tests:

```go
chaoskit.NewScenario("example").
	WithTarget(client).
	Step("call API", callAPI).
	Inject("latency", injectors.RandomDelay(5*time.Millisecond, 50*time.Millisecond)).
	Inject("errors", injectors.ErrorWithProbability(io.ErrUnexpectedEOF, 0.02)).
	Assert("goroutines", validators.GoroutineLimit(100)).
	Repeat(50).
	Build()
```

ChaosKit supports:
- fixed-number runs (`Repeat(n)`)
- duration-based runs (`RunFor(time.Hour)`)
Injectors:
- `PanicProbability(p)`
- `RandomDelay(min, max)`
- `ErrorWithProbability(err, p)`
- `CompositeInjector(...)`
- network chaos via ToxiProxy
- optional monkey-patching for advanced scenarios
Validators:
- `GoroutineLimit(n)`
- `RecursionDepthLimit(n)`
- `NoInfiniteLoop(timeout)`
- `NoSlowIteration(timeout)`
- `MemoryUnderLimit(bytes)`
- `MaxErrors(limit)`
- custom validators via:

```go
type Validator interface {
	Name() string
	Validate(ctx context.Context, target Target) error
	Severity() chaoskit.ValidationSeverity
}
```

Reporting and verdicts:
- `Reporter().GetVerdict(thresholds)` evaluates executions against `SuccessThresholds`.
- Verdicts are `PASS`, `UNSTABLE`, or `FAIL`, with matching exit codes.
- Reports include categorized failures, top error patterns, and JSON/text outputs.
- Threshold helpers: `DefaultThresholds`, `StrictThresholds`, `RelaxedThresholds`.
Metrics:
- `MetricsCollector` tracks executions, success rates, and injector metrics.
- `exporters.PrometheusExporter` exposes `/metrics` compatible with Prometheus.
- HTTP helper: `prom.Handler()` to plug into `net/http`.
- Collect metrics after each run via `executor.Reporter().Results()` and `executor.Metrics().Stats()`.
Testing integration:
- `chaoskit/testing.RunChaos` integrates scenarios with `testing.T`.
- Options: `WithRepeat`, `WithFailurePolicy`, `WithThresholds`, `WithoutReport`, `WithReportToStderr`, etc.
- `RunChaosSimple` accepts plain slices of steps, injectors, and validators for lightweight tests.
ChaosKit is a good fit for:
- workflow engines (Saga, orchestration)
- systems with compensations or rollback logic
- retry-heavy algorithms
- concurrency-sensitive components
- correctness-critical state machines
- CI stress testing

ChaosKit is not designed for:
- black-box testing without code access
- infrastructure-level chaos (use Chaos Mesh / Litmus / Gremlin instead)
ChaosKit integrates with ToxiProxy without modifying your code:
- latency
- bandwidth limits
- timeouts
- connection cuts
- packet shaping
Useful for testing clients, message brokers, databases, etc.
Infrastructure chaos exposes resilience of clusters. ChaosKit exposes resilience of your logic.
Examples of failures ChaosKit can detect:
- rollback recursion loops
- leaked goroutines
- unbounded retries
- inconsistent state after error paths
- panics during compensations
- subtle timing bugs
This is the category of failures that infrastructure-level tools cannot simulate.
ChaosKit is released under the MIT License.
This software is provided “as is”, without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability arising from the use of this software.
See LICENSE for details.
ChaosKit is early-stage but stable enough for research, prototyping, and internal testing workflows. Contributions, issue reports, and design discussions are welcome.
