# Chaos Testing
Stark Orchestrator includes a comprehensive chaos testing system for validating system resilience and fault tolerance. This guide covers how to use the CLI to inject faults into a running cluster.
- Start the database:

  ```bash
  pnpm db:start
  ```

- Start the server with chaos enabled:

  ```bash
  cd packages/server
  STARK_CHAOS_ENABLED=true pnpm dev
  ```

- Authenticate the CLI:

  ```bash
  stark auth login
  ```

⚠️ Warning: Chaos testing is disabled in production mode. The server must have `STARK_CHAOS_ENABLED=true` set.
A typical session looks like this:

```bash
# 1. Enable chaos mode
stark chaos enable

# 2. View connected nodes
stark chaos nodes

# 3. Run a chaos scenario
stark chaos run node-loss

# 4. Check what happened
stark chaos events

# 5. Clean up
stark chaos clear
stark chaos disable
```

| Command | Description |
|---|---|
| `chaos status` | Show chaos system status |
| `chaos enable` | Enable chaos mode |
| `chaos disable` | Disable chaos mode |
| `chaos scenarios` | List available scenarios |
| `chaos run <scenario>` | Run a chaos scenario |
| `chaos connections` | List active WebSocket connections |
| `chaos nodes` | List connected nodes |
| `chaos kill node <id>` | Kill a node connection |
| `chaos kill connection <id>` | Kill a specific connection |
| `chaos pause` | Pause a connection (network freeze) |
| `chaos resume` | Resume a paused connection |
| `chaos ban` | Ban a node (sever connection and block reconnection) |
| `chaos unban` | Unban a node (allow reconnection) |
| `chaos banned` | List banned nodes |
| `chaos partition create` | Create a network partition |
| `chaos partition list` | List active partitions |
| `chaos partition remove <id>` | Remove a partition |
| `chaos latency add` | Inject latency |
| `chaos latency remove <id>` | Remove latency rule |
| `chaos heartbeat-delay <nodeId>` | Add heartbeat delay |
| `chaos message-drop` | Add message drop rule |
| `chaos api-flaky` | Make API calls flaky |
| `chaos clear` | Clear all chaos rules |
| `chaos stats` | Get detailed statistics |
| `chaos events` | Get recent chaos events |
Show current chaos system status and statistics.
```bash
stark chaos status
```

Output includes:
- Whether chaos mode is enabled
- Current running scenario
- Statistics (messages processed, dropped, latency injections, etc.)
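
In scripts it can be useful to gate on chaos mode being enabled before injecting anything. A small sketch, assuming the JSON status payload exposes an `enabled` boolean (the field name is an assumption; check the actual output):

```bash
# Enable chaos mode only if the status report says it is currently off.
# NOTE: ".enabled" is an assumed field name in the status JSON.
if ! stark -o json chaos status | jq -e '.enabled' > /dev/null; then
  stark chaos enable
fi
```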
Enable chaos mode on the server. Required before injecting faults.
```bash
stark chaos enable
```

Disable chaos mode and clear all active rules.
```bash
stark chaos disable
```

Pre-built chaos scenarios for common failure patterns.
List all available chaos scenarios with their options.
```bash
stark chaos scenarios
```

Available scenarios:
| Scenario | Description |
|---|---|
| `node-loss` | Simulates node disappearing from the cluster |
| `pod-crash` | Simulates pod crashes and failure conditions |
| `heartbeat-delay` | Simulates slow or unreliable node heartbeats |
| `scheduler-conflict` | Simulates scheduling conflicts and resource contention |
| `service-backoff` | Tests service crash loops and rollback behavior |
| `orchestrator-restart` | Tests orchestrator restart and state recovery |
| `api-flakiness` | Simulates intermittent API and database failures |
| `node-ban-reconciliation` | Tests ban→reschedule→unban with stale state cleanup |
| `heartbeat-delay-convergence` | Tests heartbeat delays with convergence verification |
| `service-shape-change` | Tests pod cleanup when service definition changes |
Run a chaos scenario.
```bash
stark chaos run node-loss
```

| Option | Description |
|---|---|
| `-o, --option <key=value>` | Scenario-specific options (repeatable) |
| `-t, --timeout <ms>` | Scenario timeout in milliseconds |
Examples:
```bash
# Run with default options
stark chaos run latency-spike

# Run with custom options
stark chaos run latency-spike --option latencyMs=500 --option durationMs=30000

# Run with timeout
stark chaos run network-partition --timeout 60000
```

The scenarios below verify that Stark converges to a single authoritative pod state under realistic failure and recovery conditions. They use precise timing derived from actual codebase constants:
- `HEARTBEAT_TIMEOUT_MS`: 60,000ms (node → SUSPECT)
- `LEASE_TIMEOUT_MS`: 120,000ms (node → OFFLINE)
- `HEALTH_CHECK_INTERVAL`: 30,000ms
- `RECONCILE_INTERVAL`: 10,000ms
- `RECONNECT_EXHAUSTION`: ~200,000ms (10 attempts × backoff)
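
When timing a scenario against these constants, periodic snapshots of the event log make it easier to see when each transition happened. A minimal sketch using only documented commands (the log file name is arbitrary):

```bash
# Append a timestamped snapshot of recent chaos events every 10 seconds,
# so transitions can later be lined up against the timeouts listed above.
while true; do
  echo "=== $(date -u +%H:%M:%S) ===" >> chaos-timeline.log
  stark -o json chaos events --count 20 >> chaos-timeline.log
  sleep 10
done
```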
Tests node ban→reschedule→unban with stale state cleanup.
```bash
stark chaos run node-ban-reconciliation \
  --option nodeAId=<id> \
  --option nodeBId=<id> \
  --option mode=standard
```

| Mode | Description |
|---|---|
| `standard` | Ban, wait for reschedule (~130s), unban, verify stale cleanup |
| `fast_unban` | Unban before SUSPECT (30s); pods should NOT move |
| `late_unban` | Unban after reconnection exhausted (~230s) |
What to verify:
- Exactly ONE pod exists after convergence
- Pod remains on NodeB (NOT reclaimed by NodeA)
- If NodeA reports a stale pod, the orchestrator sends `pod:stop` (one way to check this is sketched below)
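
To spot that `pod:stop` traffic, the recent chaos events can be inspected after the unban step. This is only a rough sketch; it greps the raw JSON so it makes no assumptions about field names beyond the message type string appearing in the payload:

```bash
# Count how often "pod:stop" shows up in the recent chaos event log.
# The JSON structure of each event is not assumed; this is a plain text match.
stark -o json chaos events --count 200 | grep -c 'pod:stop'
```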
Tests heartbeat delays with explicit convergence verification.
```bash
stark chaos run heartbeat-delay-convergence \
  --option nodeId=<id> \
  --option delayMs=45000 \
  --option durationMs=120000
```

| Delay | Expected Behavior |
|---|---|
| < 60s | Node stays ONLINE |
| 60-120s | Node becomes SUSPECT |
| > 120s | Node goes OFFLINE, pods rescheduled |
What to verify:
- No over-service (pod count never exceeds initial)
- No duplicate pods created
- System converges correctly after delay removed
Tests pod cleanup when service definition changes.
```bash
stark chaos run service-shape-change \
  --option serviceId=<id> \
  --option changeType=scale_down \
  --option newReplicas=1
```

| Change Type | Description |
|---|---|
| `scale_down` | Reduce replica count |
| `daemonset_to_replica` | Convert DaemonSet (replicas=0) to fixed count |
What to verify:
- Orchestrator actively stops excess pods
- No pods remain in permanent "stopping" state
- Runtime and database converge within ~10s (reconcile interval)
List all active WebSocket connections.
```bash
stark chaos connections
```

Output columns:
- ID: Connection identifier
- Nodes: Associated node IDs
- User: Authenticated user
- IP: Client IP address
- Auth: Authentication status
- Connected: Connection age
List connected nodes from the chaos perspective.
```bash
stark chaos nodes
```

Output columns:
- Node ID: Node identifier
- Connection ID: Associated WebSocket connection
- User: Owning user
- Connected: Connection age
Forcefully terminate connections.
Kill a node's WebSocket connection.
```bash
stark chaos kill node production-node-1
```

Kill a specific WebSocket connection by ID.

```bash
stark chaos kill connection abc123def456
```

Simulate network freezes by pausing message delivery.
Pause a connection, simulating a network freeze.
```bash
stark chaos pause --node <nodeId>
```

| Option | Description |
|---|---|
| `-n, --node <nodeId>` | Target node ID |
| `-c, --connection <connectionId>` | Target connection ID |
| `-d, --duration <ms>` | Auto-resume after duration |
Examples:
```bash
# Pause indefinitely
stark chaos pause --node production-node-1

# Pause for 5 seconds
stark chaos pause --node production-node-1 --duration 5000

# Pause by connection ID
stark chaos pause --connection abc123
```

Resume a paused connection.
```bash
stark chaos resume --node <nodeId>
```

| Option | Description |
|---|---|
| `-n, --node <nodeId>` | Target node ID |
| `-c, --connection <connectionId>` | Target connection ID |
Ban nodes to completely sever their WebSocket connection AND block any reconnection attempts. Unlike pause, which queues messages for later delivery, ban cuts the node off entirely.
Ban a node - severs the WebSocket and blocks reconnection.
```bash
stark chaos ban --node <nodeId>
```

| Option | Description |
|---|---|
| `-n, --node <nodeId>` | Target node ID (required) |
| `-d, --duration <ms>` | Auto-unban after duration |
Examples:
```bash
# Ban indefinitely
stark chaos ban --node production-node-1

# Ban for 30 seconds
stark chaos ban --node production-node-1 --duration 30000
```

Unban a node, allowing it to reconnect.
```bash
stark chaos unban --node <nodeId>
```

| Option | Description |
|---|---|
| `-n, --node <nodeId>` | Target node ID (required) |
List all currently banned nodes.
```bash
stark chaos banned
```

Output includes:
- Node ID
- When the node was banned
- Auto-unban time (if set)
| Feature | Pause | Ban |
|---|---|---|
| Messages | Queued for delivery on resume | Not queued - lost |
| Connection | Kept open | Severed (terminated) |
| Reconnection | N/A (still connected) | Blocked until unbanned |
| Use case | Simulate network freeze | Simulate node blacklisting |
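
To see the difference in practice, here is a short sequence built only from the commands above (node names are placeholders):

```bash
# Freeze node1's connection for 10 seconds; queued messages flush on resume.
stark chaos pause --node node1 --duration 10000

# Ban node2 for 30 seconds; its connection is severed and reconnects are blocked.
stark chaos ban --node node2 --duration 30000

# Confirm the ban and its auto-unban time.
stark chaos banned
```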
Isolate nodes or connections to simulate network splits.
Create a network partition.
```bash
stark chaos partition create --node node1 --node node2
```

| Option | Description |
|---|---|
| `-n, --node <nodeId>` | Node IDs to partition (repeatable) |
| `-c, --connection <connectionId>` | Connection IDs to partition (repeatable) |
| `-d, --duration <ms>` | Auto-heal after duration |
Examples:
```bash
# Partition two nodes
stark chaos partition create --node node1 --node node2

# Partition with auto-heal
stark chaos partition create --node node1 --duration 10000

# Partition by connection
stark chaos partition create --connection conn1 --connection conn2
```

List all active network partitions.

```bash
stark chaos partition list
```

Remove a partition (heal the network).

```bash
stark chaos partition remove partition-abc123
```

Add artificial latency to connections.
Inject latency into connections.
```bash
stark chaos latency add --node <nodeId> --latency 200
```

| Option | Description |
|---|---|
| `-n, --node <nodeId>` | Target node ID |
| `-c, --connection <connectionId>` | Target connection ID |
| `-l, --latency <ms>` | Latency to inject (required) |
| `-j, --jitter <ms>` | Latency jitter (variance) |
| `-d, --duration <ms>` | Auto-remove after duration |
Examples:
```bash
# Add 200ms latency
stark chaos latency add --node node1 --latency 200

# Add latency with jitter
stark chaos latency add --node node1 --latency 200 --jitter 50

# Auto-remove after 30 seconds
stark chaos latency add --node node1 --latency 500 --duration 30000
```

Remove a latency injection rule.

```bash
stark chaos latency remove rule-abc123
```

Delay or drop heartbeat messages to trigger node health checks.
Add heartbeat delay for a specific node.
```bash
stark chaos heartbeat-delay production-node-1 --delay 3000
```

| Option | Description |
|---|---|
| `--delay <ms>` | Delay in milliseconds (required) |
| `--duration <ms>` | Auto-remove after duration |
| `--drop-rate <rate>` | Probability to drop heartbeats (0-1) |
Examples:
```bash
# Add 3 second heartbeat delay
stark chaos heartbeat-delay node1 --delay 3000

# Drop 50% of heartbeats
stark chaos heartbeat-delay node1 --delay 3000 --drop-rate 0.5

# Auto-remove after 1 minute
stark chaos heartbeat-delay node1 --delay 5000 --duration 60000
```

Drop messages based on type or probability.
Add a message drop rule.
```bash
stark chaos message-drop --node <nodeId> --rate 0.5
```

| Option | Description |
|---|---|
| `-n, --node <nodeId>` | Target node ID |
| `-t, --types <type>` | Message types to drop (repeatable) |
| `-r, --rate <rate>` | Drop rate 0-1 (required) |
Examples:
```bash
# Drop 50% of all messages
stark chaos message-drop --node node1 --rate 0.5

# Drop specific message types
stark chaos message-drop --node node1 --types pod:start --types pod:stop --rate 0.3

# Drop all pod messages globally
stark chaos message-drop --types pod:start --types pod:stop --rate 1.0
```

Make internal API calls unreliable.
Configure flaky API behavior.
```bash
stark chaos api-flaky --error-rate 0.1 --timeout-rate 0.05
```

| Option | Description |
|---|---|
| `--error-rate <rate>` | Probability of API errors (0-1) |
| `--timeout-rate <rate>` | Probability of timeouts (0-1) |
| `--timeout-ms <ms>` | Timeout duration |
Examples:
```bash
# 10% error rate
stark chaos api-flaky --error-rate 0.1

# 5% timeout rate with 5 second timeouts
stark chaos api-flaky --timeout-rate 0.05 --timeout-ms 5000

# Combined chaos
stark chaos api-flaky --error-rate 0.1 --timeout-rate 0.05
```

Clear all active chaos rules.

```bash
stark chaos clear
```

Get detailed chaos statistics.

```bash
stark chaos stats
```

Get recent chaos events.
```bash
stark chaos events
```

| Option | Description |
|---|---|
| `-c, --count <n>` | Number of events to retrieve (default: 50) |
Example:
```bash
# Get last 100 events
stark chaos events --count 100
```

All chaos commands support JSON output for scripting:

```bash
# JSON output
stark -o json chaos status
stark -o json chaos nodes
stark -o json chaos events
```
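
JSON mode combines well with standard tools such as `jq`. A sketch that summarizes recent events by type; the `.type` field name is an assumption about the event schema, so adjust it to whatever the real payload uses:

```bash
# Group the last 100 chaos events by an assumed "type" field and count them.
stark -o json chaos events --count 100 \
  | jq 'group_by(.type) | map({type: .[0].type, count: length})'
```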
For interactive testing during development, use the built-in test runner:

```bash
pnpm dev:test
```

This launches an interactive menu for quick chaos injection without the full CLI.
A complete example workflow, from enabling chaos to cleanup:

```bash
# 1. Enable chaos mode
stark chaos enable

# 2. Check what's connected
stark chaos nodes
stark chaos connections

# 3. Inject some latency
stark chaos latency add --node production-node-1 --latency 300 --jitter 100

# 4. Run a failure scenario
stark chaos run node-loss

# 5. Check the impact
stark chaos stats
stark chaos events

# 6. Create a network partition
stark chaos partition create --node node1 --node node2 --duration 10000

# 7. Watch events in real-time (run in another terminal)
watch -n 1 'stark -o json chaos stats'

# 8. Clean up when done
stark chaos clear
stark chaos disable
```

Best practices:

- Start small: Begin with simple faults before complex scenarios
- Monitor impact: Always watch `chaos stats` and `chaos events`
- Use durations: Set auto-cleanup durations to avoid forgotten chaos
- Test in staging: Never run chaos testing in production
- Document findings: Record how your system responds to each fault
- Incremental complexity: Layer multiple faults to find edge cases (see the sketch below)
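
As an illustration of layering faults while letting durations handle most of the cleanup, a sketch using only commands from the reference above (node names are placeholders):

```bash
# Layer several mild faults; latency and heartbeat delay auto-remove after 60s.
stark chaos latency add --node node1 --latency 250 --jitter 75 --duration 60000
stark chaos message-drop --node node1 --types pod:start --rate 0.2
stark chaos heartbeat-delay node2 --delay 10000 --duration 60000

# Observe the impact, then clear whatever is left (the message-drop rule here).
stark chaos stats
stark chaos clear
```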
See also:

- CLI Reference - Complete CLI documentation
- Architecture - System architecture overview
- Metrics and Observability - Monitoring your cluster