Chaos Testing

Adrian Burlacu edited this page Feb 6, 2026 · 4 revisions

Stark Orchestrator includes a comprehensive chaos testing system for validating system resilience and fault tolerance. This guide covers how to use the CLI to inject faults into a running cluster.

Prerequisites

  1. Start the database:

    pnpm db:start
  2. Start the server with chaos enabled:

    cd packages/server
    STARK_CHAOS_ENABLED=true pnpm dev
  3. Authenticate the CLI:

    stark auth login

⚠️ Warning: Chaos testing is disabled in production mode. The server must have STARK_CHAOS_ENABLED=true set.


Quick Start

# 1. Enable chaos mode
stark chaos enable

# 2. View connected nodes
stark chaos nodes

# 3. Run a chaos scenario
stark chaos run node-loss

# 4. Check what happened
stark chaos events

# 5. Clean up
stark chaos clear
stark chaos disable

Commands Overview

Command                          Description
chaos status                     Show chaos system status
chaos enable                     Enable chaos mode
chaos disable                    Disable chaos mode
chaos scenarios                  List available scenarios
chaos run <scenario>             Run a chaos scenario
chaos connections                List active WebSocket connections
chaos nodes                      List connected nodes
chaos kill node <id>             Kill a node connection
chaos kill connection <id>       Kill a specific connection
chaos pause                      Pause a connection (network freeze)
chaos resume                     Resume a paused connection
chaos ban                        Ban a node (sever connection and block reconnection)
chaos unban                      Unban a node (allow reconnection)
chaos banned                     List banned nodes
chaos partition create           Create a network partition
chaos partition list             List active partitions
chaos partition remove <id>      Remove a partition
chaos latency add                Inject latency
chaos latency remove <id>        Remove latency rule
chaos heartbeat-delay <nodeId>   Add heartbeat delay
chaos message-drop               Add message drop rule
chaos api-flaky                  Make API calls flaky
chaos clear                      Clear all chaos rules
chaos stats                      Get detailed statistics
chaos events                     Get recent chaos events

Chaos Status

chaos status

Show current chaos system status and statistics.

stark chaos status

Output includes:

  • Whether chaos mode is enabled
  • Current running scenario
  • Statistics (messages processed, dropped, latency injections, etc.)

chaos enable

Enable chaos mode on the server. Required before injecting faults.

stark chaos enable

chaos disable

Disable chaos mode and clear all active rules.

stark chaos disable

Scenarios

Pre-built chaos scenarios for common failure patterns.

chaos scenarios

List all available chaos scenarios with their options.

stark chaos scenarios

Available scenarios:

Scenario                      Description
node-loss                     Simulates node disappearing from the cluster
pod-crash                     Simulates pod crashes and failure conditions
heartbeat-delay               Simulates slow or unreliable node heartbeats
scheduler-conflict            Simulates scheduling conflicts and resource contention
service-backoff               Tests service crash loops and rollback behavior
orchestrator-restart          Tests orchestrator restart and state recovery
api-flakiness                 Simulates intermittent API and database failures
node-ban-reconciliation       Tests ban→reschedule→unban with stale state cleanup
heartbeat-delay-convergence   Tests heartbeat delays with convergence verification
service-shape-change          Tests pod cleanup when service definition changes

chaos run <scenario>

Run a chaos scenario.

stark chaos run node-loss

Option                     Description
-o, --option <key=value>   Scenario-specific options (repeatable)
-t, --timeout <ms>         Scenario timeout in milliseconds

Examples:

# Run with default options
stark chaos run node-loss

# Run with custom options (options are scenario-specific; see `stark chaos scenarios`)
stark chaos run heartbeat-delay-convergence --option delayMs=45000 --option durationMs=120000

# Run with a timeout
stark chaos run node-loss --timeout 60000

Reconciliation Test Scenarios

These scenarios verify that Stark converges to a single authoritative pod state under realistic failure and recovery conditions. They use precise timing derived from actual codebase constants.

Timing Constants Reference

HEARTBEAT_TIMEOUT_MS:   60,000ms  (node → SUSPECT)
LEASE_TIMEOUT_MS:      120,000ms  (node → OFFLINE)
HEALTH_CHECK_INTERVAL:  30,000ms
RECONCILE_INTERVAL:     10,000ms
RECONNECT_EXHAUSTION:  ~200,000ms (10 attempts × backoff)
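The reschedule estimates quoted in the scenarios below follow directly from these constants. A quick plain-shell sanity check (no cluster or stark CLI required):

```shell
# Derive the timing windows used by the reconciliation scenarios
# from the constants above (all values in milliseconds).
HEARTBEAT_TIMEOUT_MS=60000    # node -> SUSPECT
LEASE_TIMEOUT_MS=120000       # node -> OFFLINE
RECONCILE_INTERVAL_MS=10000   # orchestrator reconcile loop

# Worst case from losing a node to its pods being rescheduled:
# the node must first go OFFLINE, then the next reconcile pass must run.
RESCHEDULE_MS=$((LEASE_TIMEOUT_MS + RECONCILE_INTERVAL_MS))

echo "SUSPECT after:      ${HEARTBEAT_TIMEOUT_MS} ms"
echo "OFFLINE after:      ${LEASE_TIMEOUT_MS} ms"
echo "Rescheduled after: ~${RESCHEDULE_MS} ms"
```

This is where the ~130s wait in the standard node-ban-reconciliation mode comes from.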

node-ban-reconciliation

Tests node ban→reschedule→unban with stale state cleanup.

stark chaos run node-ban-reconciliation \
  --option nodeAId=<id> \
  --option nodeBId=<id> \
  --option mode=standard

Mode         Description
standard     Ban, wait for reschedule (~130s), unban, verify stale cleanup
fast_unban   Unban before SUSPECT (30s) - pods should NOT move
late_unban   Unban after reconnection exhausted (~230s)

What to verify:

  • Exactly ONE pod exists after convergence
  • Pod remains on NodeB (NOT reclaimed by NodeA)
  • If NodeA reports stale pod, orchestrator sends pod:stop

heartbeat-delay-convergence

Tests heartbeat delays with explicit convergence verification.

stark chaos run heartbeat-delay-convergence \
  --option nodeId=<id> \
  --option delayMs=45000 \
  --option durationMs=120000

Delay     Expected Behavior
< 60s     Node stays ONLINE
60-120s   Node becomes SUSPECT
> 120s    Node goes OFFLINE, pods rescheduled
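For scripted runs, the expected behavior above can be encoded as a small shell helper (a sketch, not part of the CLI; the thresholds mirror HEARTBEAT_TIMEOUT_MS and LEASE_TIMEOUT_MS):

```shell
# Classify the expected node state for a given heartbeat delay (ms),
# using the timing constants listed earlier.
expected_state() {
  delay_ms=$1
  if [ "$delay_ms" -lt 60000 ]; then
    echo ONLINE
  elif [ "$delay_ms" -le 120000 ]; then
    echo SUSPECT
  else
    echo OFFLINE          # pods get rescheduled at this point
  fi
}

expected_state 45000   # -> ONLINE (the delayMs used in the example above)
expected_state 150000  # -> OFFLINE
```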

What to verify:

  • No over-service (pod count never exceeds initial)
  • No duplicate pods created
  • System converges correctly after delay removed

service-shape-change

Tests pod cleanup when service definition changes.

stark chaos run service-shape-change \
  --option serviceId=<id> \
  --option changeType=scale_down \
  --option newReplicas=1

Change Type            Description
scale_down             Reduce replica count
daemonset_to_replica   Convert DaemonSet (replicas=0) to fixed count

What to verify:

  • Orchestrator actively stops excess pods
  • No pods remain in permanent "stopping" state
  • Runtime and database converge within ~10s (reconcile interval)

Connection Inspection

chaos connections

List all active WebSocket connections.

stark chaos connections

Output columns:

  • ID: Connection identifier
  • Nodes: Associated node IDs
  • User: Authenticated user
  • IP: Client IP address
  • Auth: Authentication status
  • Connected: Connection age

chaos nodes

List connected nodes from the chaos perspective.

stark chaos nodes

Output columns:

  • Node ID: Node identifier
  • Connection ID: Associated WebSocket connection
  • User: Owning user
  • Connected: Connection age

Kill Commands

Forcefully terminate connections.

chaos kill node <nodeId>

Kill a node's WebSocket connection.

stark chaos kill node production-node-1

chaos kill connection <connectionId>

Kill a specific WebSocket connection by ID.

stark chaos kill connection abc123def456

Network Freeze

Simulate network freezes by pausing message delivery.

chaos pause

Pause a connection, simulating a network freeze.

stark chaos pause --node <nodeId>

Option                            Description
-n, --node <nodeId>               Target node ID
-c, --connection <connectionId>   Target connection ID
-d, --duration <ms>               Auto-resume after duration

Examples:

# Pause indefinitely
stark chaos pause --node production-node-1

# Pause for 5 seconds
stark chaos pause --node production-node-1 --duration 5000

# Pause by connection ID
stark chaos pause --connection abc123

chaos resume

Resume a paused connection.

stark chaos resume --node <nodeId>

Option                            Description
-n, --node <nodeId>               Target node ID
-c, --connection <connectionId>   Target connection ID

Node Banning

Ban nodes to completely sever their WebSocket connection AND block any reconnection attempts. Unlike pause, which queues messages for later delivery, a ban cuts the node off entirely.

chaos ban

Ban a node - severs the WebSocket and blocks reconnection.

stark chaos ban --node <nodeId>

Option                Description
-n, --node <nodeId>   Target node ID (required)
-d, --duration <ms>   Auto-unban after duration

Examples:

# Ban indefinitely
stark chaos ban --node production-node-1

# Ban for 30 seconds
stark chaos ban --node production-node-1 --duration 30000

chaos unban

Unban a node, allowing it to reconnect.

stark chaos unban --node <nodeId>

Option                Description
-n, --node <nodeId>   Target node ID (required)

chaos banned

List all currently banned nodes.

stark chaos banned

Output includes:

  • Node ID
  • When the node was banned
  • Auto-unban time (if set)

Pause vs Ban

Feature        Pause                           Ban
Messages       Queued for delivery on resume   Not queued - lost
Connection     Kept open                       Severed (terminated)
Reconnection   N/A (still connected)           Blocked until unbanned
Use case       Simulate network freeze         Simulate node blacklisting

Network Partitions

Isolate nodes or connections to simulate network splits.

chaos partition create

Create a network partition.

stark chaos partition create --node node1 --node node2

Option                            Description
-n, --node <nodeId>               Node IDs to partition (repeatable)
-c, --connection <connectionId>   Connection IDs to partition (repeatable)
-d, --duration <ms>               Auto-heal after duration

Examples:

# Partition two nodes
stark chaos partition create --node node1 --node node2

# Partition with auto-heal
stark chaos partition create --node node1 --duration 10000

# Partition by connection
stark chaos partition create --connection conn1 --connection conn2

chaos partition list

List all active network partitions.

stark chaos partition list

chaos partition remove <partitionId>

Remove a partition (heal the network).

stark chaos partition remove partition-abc123

Latency Injection

Add artificial latency to connections.

chaos latency add

Inject latency into connections.

stark chaos latency add --node <nodeId> --latency 200

Option                            Description
-n, --node <nodeId>               Target node ID
-c, --connection <connectionId>   Target connection ID
-l, --latency <ms>                Latency to inject (required)
-j, --jitter <ms>                 Latency jitter (variance)
-d, --duration <ms>               Auto-remove after duration

Examples:

# Add 200ms latency
stark chaos latency add --node node1 --latency 200

# Add latency with jitter
stark chaos latency add --node node1 --latency 200 --jitter 50

# Auto-remove after 30 seconds
stark chaos latency add --node node1 --latency 500 --duration 30000

chaos latency remove <ruleId>

Remove a latency injection rule.

stark chaos latency remove rule-abc123

Heartbeat Manipulation

Delay or drop heartbeat messages to trigger node health-state transitions (SUSPECT, OFFLINE).

chaos heartbeat-delay <nodeId>

Add heartbeat delay for a specific node.

stark chaos heartbeat-delay production-node-1 --delay 3000

Option               Description
--delay <ms>         Delay in milliseconds (required)
--duration <ms>      Auto-remove after duration
--drop-rate <rate>   Probability to drop heartbeats (0-1)

Examples:

# Add 3 second heartbeat delay
stark chaos heartbeat-delay node1 --delay 3000

# Drop 50% of heartbeats
stark chaos heartbeat-delay node1 --delay 3000 --drop-rate 0.5

# Auto-remove after 1 minute
stark chaos heartbeat-delay node1 --delay 5000 --duration 60000

Message Dropping

Drop messages based on type or probability.

chaos message-drop

Add a message drop rule.

stark chaos message-drop --node <nodeId> --rate 0.5

Option                            Description
-n, --node <nodeId>               Target node ID
-t, --types <type>                Message types to drop (repeatable)
-r, --rate <rate>                 Drop rate 0-1 (required)

Examples:

# Drop 50% of all messages
stark chaos message-drop --node node1 --rate 0.5

# Drop specific message types
stark chaos message-drop --node node1 --types pod:start --types pod:stop --rate 0.3

# Drop all pod messages globally
stark chaos message-drop --types pod:start --types pod:stop --rate 1.0

Flaky API

Make internal API calls unreliable.

chaos api-flaky

Configure flaky API behavior.

stark chaos api-flaky --error-rate 0.1 --timeout-rate 0.05

Option                  Description
--error-rate <rate>     Probability of API errors (0-1)
--timeout-rate <rate>   Probability of timeouts (0-1)
--timeout-ms <ms>       Timeout duration

Examples:

# 10% error rate
stark chaos api-flaky --error-rate 0.1

# 5% timeout rate with 5 second timeouts
stark chaos api-flaky --timeout-rate 0.05 --timeout-ms 5000

# Combined chaos
stark chaos api-flaky --error-rate 0.1 --timeout-rate 0.05

Cleanup and Monitoring

chaos clear

Clear all active chaos rules.

stark chaos clear

chaos stats

Get detailed chaos statistics.

stark chaos stats

chaos events

Get recent chaos events.

stark chaos events

Option            Description
-c, --count <n>   Number of events to retrieve (default: 50)

Example:

# Get last 100 events
stark chaos events --count 100

Output Formats

All chaos commands support JSON output for scripting:

# JSON output
stark -o json chaos status
stark -o json chaos nodes
stark -o json chaos events
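JSON output makes chaos state scriptable. A minimal sketch, assuming the status payload exposes an `enabled` boolean (the exact field name is an assumption; verify it against `stark -o json chaos status` on your build):

```shell
# Gate a destructive test script on chaos mode being enabled.
# ASSUMPTION: the status JSON contains an "enabled" boolean field.
require_chaos() {
  if ! command -v stark >/dev/null 2>&1; then
    echo "stark CLI not found; skipping check" >&2
    return 0
  fi
  if stark -o json chaos status | grep -q '"enabled": *true'; then
    return 0
  fi
  echo "chaos mode is NOT enabled; run: stark chaos enable" >&2
  return 1
}

require_chaos && echo "safe to inject faults"
```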

Interactive Testing

For interactive testing during development, use the built-in test runner:

pnpm dev:test

This launches an interactive menu for quick chaos injection without the full CLI.


Example: Full Chaos Test Session

# 1. Enable chaos mode
stark chaos enable

# 2. Check what's connected
stark chaos nodes
stark chaos connections

# 3. Inject some latency
stark chaos latency add --node production-node-1 --latency 300 --jitter 100

# 4. Run a failure scenario
stark chaos run node-loss

# 5. Check the impact
stark chaos stats
stark chaos events

# 6. Create a network partition
stark chaos partition create --node node1 --node node2 --duration 10000

# 7. Watch events in real-time (run in another terminal)
watch -n 1 'stark -o json chaos stats'

# 8. Clean up when done
stark chaos clear
stark chaos disable

Best Practices

  1. Start small: Begin with simple faults before complex scenarios
  2. Monitor impact: Always watch chaos stats and chaos events
  3. Use durations: Set auto-cleanup durations to avoid forgotten chaos
  4. Test in staging: Never run chaos testing in production
  5. Document findings: Record how your system responds to each fault
  6. Incremental complexity: Layer multiple faults to find edge cases
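Practices 2-4 can be baked into a wrapper script: a trap guarantees `chaos clear` and `chaos disable` run even if the scenario aborts. A sketch (assumes the stark CLI from this guide is on PATH and a node named node1 is connected; it exits quietly otherwise):

```shell
# run-chaos-test.sh -- run one scenario with guaranteed cleanup.

run_chaos_test() {
  stark chaos enable
  # Prefer bounded faults so nothing lingers if a step is forgotten.
  stark chaos latency add --node node1 --latency 300 --duration 30000
  stark chaos run node-loss --timeout 60000
  # Record what happened for later analysis.
  stark chaos events --count 100
  stark chaos stats
}

chaos_cleanup() {
  # Runs even if the scenario above fails.
  stark chaos clear
  stark chaos disable
}

if command -v stark >/dev/null 2>&1; then
  trap chaos_cleanup EXIT
  run_chaos_test
else
  echo "stark CLI not found; skipping" >&2
fi
```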
