Chaos Testing

Adrian Burlacu edited this page Feb 6, 2026 · 4 revisions

Stark Orchestrator includes a comprehensive chaos testing system for validating system resilience and fault tolerance. This guide covers how to use the CLI to inject faults into a running cluster.

Prerequisites

  1. Start the database:

    pnpm db:start
  2. Start the server with chaos enabled:

    cd packages/server
    STARK_CHAOS_ENABLED=true pnpm dev
  3. Authenticate the CLI:

    stark auth login

⚠️ Warning: Chaos testing is disabled in production mode. The server must have STARK_CHAOS_ENABLED=true set.


Quick Start

# 1. Enable chaos mode
stark chaos enable

# 2. View connected nodes
stark chaos nodes

# 3. Run a chaos scenario
stark chaos run node-loss

# 4. Check what happened
stark chaos events

# 5. Clean up
stark chaos clear
stark chaos disable

Commands Overview

Command                          Description
chaos status                     Show chaos system status
chaos enable                     Enable chaos mode
chaos disable                    Disable chaos mode
chaos scenarios                  List available scenarios
chaos run <scenario>             Run a chaos scenario
chaos connections                List active WebSocket connections
chaos nodes                      List connected nodes
chaos kill node <id>             Kill a node connection
chaos kill connection <id>       Kill a specific connection
chaos pause                      Pause a connection (network freeze)
chaos resume                     Resume a paused connection
chaos ban                        Ban a node (sever connection and block reconnection)
chaos unban                      Unban a node (allow reconnection)
chaos banned                     List banned nodes
chaos partition create           Create a network partition
chaos partition list             List active partitions
chaos partition remove <id>      Remove a partition
chaos latency add                Inject latency
chaos latency remove <id>        Remove latency rule
chaos heartbeat-delay <nodeId>   Add heartbeat delay
chaos message-drop               Add message drop rule
chaos api-flaky                  Make API calls flaky
chaos clear                      Clear all chaos rules
chaos stats                      Get detailed statistics
chaos events                     Get recent chaos events

Chaos Status

chaos status

Show current chaos system status and statistics.

stark chaos status

Output includes:

  • Whether chaos mode is enabled
  • Current running scenario
  • Statistics (messages processed, dropped, latency injections, etc.)

chaos enable

Enable chaos mode on the server. Required before injecting faults.

stark chaos enable

chaos disable

Disable chaos mode and clear all active rules.

stark chaos disable

Scenarios

Pre-built chaos scenarios for common failure patterns.

chaos scenarios

List all available chaos scenarios with their options.

stark chaos scenarios

Available scenarios:

Scenario                      Description
node-loss                     Simulates node disappearing from the cluster
pod-crash                     Simulates pod crashes and failure conditions
heartbeat-delay               Simulates slow or unreliable node heartbeats
scheduler-conflict            Simulates scheduling conflicts and resource contention
service-backoff               Tests service crash loops and rollback behavior
orchestrator-restart          Tests orchestrator restart and state recovery
api-flakiness                 Simulates intermittent API and database failures
node-ban-reconciliation       Tests ban→reschedule→unban with stale state cleanup
heartbeat-delay-convergence   Tests heartbeat delays with convergence verification
service-shape-change          Tests pod cleanup when service definition changes

chaos run <scenario>

Run a chaos scenario.

stark chaos run node-loss

Option                     Description
-o, --option <key=value>   Scenario-specific options (repeatable)
-t, --timeout <ms>         Scenario timeout in milliseconds

Examples:

# Run with default options
stark chaos run node-loss

# Run with custom options (options are scenario-specific; see `stark chaos scenarios`)
stark chaos run heartbeat-delay-convergence --option delayMs=45000 --option durationMs=120000

# Run with a timeout
stark chaos run node-loss --timeout 60000

Reconciliation Test Scenarios

These scenarios verify that Stark converges to a single authoritative pod state under realistic failure and recovery conditions. They use precise timing derived from actual codebase constants.

Timing Constants Reference

HEARTBEAT_TIMEOUT_MS:   60,000ms  (node → SUSPECT)
LEASE_TIMEOUT_MS:      120,000ms  (node → OFFLINE)
HEALTH_CHECK_INTERVAL:  30,000ms
RECONCILE_INTERVAL:     10,000ms
RECONNECT_EXHAUSTION:  ~200,000ms (10 attempts × backoff)
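The reschedule estimates quoted in the scenarios below follow directly from these constants. A quick plain-shell sanity check (no cluster or stark CLI required):

```shell
# Derive the timing windows used by the reconciliation scenarios
# from the constants above (all values in milliseconds).
HEARTBEAT_TIMEOUT_MS=60000    # node -> SUSPECT
LEASE_TIMEOUT_MS=120000       # node -> OFFLINE
RECONCILE_INTERVAL_MS=10000   # orchestrator reconcile loop

# Worst case from losing a node to its pods being rescheduled:
# the node must first go OFFLINE, then the next reconcile pass must run.
RESCHEDULE_MS=$((LEASE_TIMEOUT_MS + RECONCILE_INTERVAL_MS))

echo "SUSPECT after:      ${HEARTBEAT_TIMEOUT_MS} ms"
echo "OFFLINE after:      ${LEASE_TIMEOUT_MS} ms"
echo "Rescheduled after: ~${RESCHEDULE_MS} ms"
```

This is where the ~130s wait in the standard node-ban-reconciliation mode comes from.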

node-ban-reconciliation

Tests node ban→reschedule→unban with stale state cleanup.

stark chaos run node-ban-reconciliation \
  --option nodeAId=<id> \
  --option nodeBId=<id> \
  --option mode=standard

Mode         Description
standard     Ban, wait for reschedule (~130s), unban, verify stale cleanup
fast_unban   Unban before SUSPECT (30s) - pods should NOT move
late_unban   Unban after reconnection exhausted (~230s)

What to verify:

  • Exactly ONE pod exists after convergence
  • Pod remains on NodeB (NOT reclaimed by NodeA)
  • If NodeA reports stale pod, orchestrator sends pod:stop

heartbeat-delay-convergence

Tests heartbeat delays with explicit convergence verification.

stark chaos run heartbeat-delay-convergence \
  --option nodeId=<id> \
  --option delayMs=45000 \
  --option durationMs=120000

Delay     Expected Behavior
< 60s     Node stays ONLINE
60-120s   Node becomes SUSPECT
> 120s    Node goes OFFLINE, pods rescheduled
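For scripted runs, the expected behavior above can be encoded as a small shell helper (a sketch, not part of the CLI; the thresholds mirror HEARTBEAT_TIMEOUT_MS and LEASE_TIMEOUT_MS):

```shell
# Classify the expected node state for a given heartbeat delay (ms),
# using the timing constants listed earlier.
expected_state() {
  delay_ms=$1
  if [ "$delay_ms" -lt 60000 ]; then
    echo ONLINE
  elif [ "$delay_ms" -le 120000 ]; then
    echo SUSPECT
  else
    echo OFFLINE          # pods get rescheduled at this point
  fi
}

expected_state 45000   # -> ONLINE (the delayMs used in the example above)
expected_state 150000  # -> OFFLINE
```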

What to verify:

  • No over-service (pod count never exceeds initial)
  • No duplicate pods created
  • System converges correctly after delay removed

service-shape-change

Tests pod cleanup when service definition changes.

stark chaos run service-shape-change \
  --option serviceId=<id> \
  --option changeType=scale_down \
  --option newReplicas=1

Change Type            Description
scale_down             Reduce replica count
daemonset_to_replica   Convert DaemonSet (replicas=0) to fixed count

What to verify:

  • Orchestrator actively stops excess pods
  • No pods remain in permanent "stopping" state
  • Runtime and database converge within ~10s (reconcile interval)

Connection Inspection

chaos connections

List all active WebSocket connections.

stark chaos connections

Output columns:

  • ID: Connection identifier
  • Nodes: Associated node IDs
  • User: Authenticated user
  • IP: Client IP address
  • Auth: Authentication status
  • Connected: Connection age

chaos nodes

List connected nodes from the chaos perspective.

stark chaos nodes

Output columns:

  • Node ID: Node identifier
  • Connection ID: Associated WebSocket connection
  • User: Owning user
  • Connected: Connection age

Kill Commands

Forcefully terminate connections.

chaos kill node <nodeId>

Kill a node's WebSocket connection.

stark chaos kill node production-node-1

chaos kill connection <connectionId>

Kill a specific WebSocket connection by ID.

stark chaos kill connection abc123def456

Network Freeze

Simulate network freezes by pausing message delivery.

chaos pause

Pause a connection, simulating a network freeze.

stark chaos pause --node <nodeId>

Option                            Description
-n, --node <nodeId>               Target node ID
-c, --connection <connectionId>   Target connection ID
-d, --duration <ms>               Auto-resume after duration

Examples:

# Pause indefinitely
stark chaos pause --node production-node-1

# Pause for 5 seconds
stark chaos pause --node production-node-1 --duration 5000

# Pause by connection ID
stark chaos pause --connection abc123

chaos resume

Resume a paused connection.

stark chaos resume --node <nodeId>

Option                            Description
-n, --node <nodeId>               Target node ID
-c, --connection <connectionId>   Target connection ID

Node Banning

Ban nodes to completely sever their WebSocket connection AND block any reconnection attempts. Unlike pause, which queues messages for later delivery, a ban cuts the node off entirely.

chaos ban

Ban a node - severs the WebSocket and blocks reconnection.

stark chaos ban --node <nodeId>

Option                Description
-n, --node <nodeId>   Target node ID (required)
-d, --duration <ms>   Auto-unban after duration

Examples:

# Ban indefinitely
stark chaos ban --node production-node-1

# Ban for 30 seconds
stark chaos ban --node production-node-1 --duration 30000

chaos unban

Unban a node, allowing it to reconnect.

stark chaos unban --node <nodeId>

Option                Description
-n, --node <nodeId>   Target node ID (required)

chaos banned

List all currently banned nodes.

stark chaos banned

Output includes:

  • Node ID
  • When the node was banned
  • Auto-unban time (if set)

Pause vs Ban

Feature        Pause                           Ban
Messages       Queued for delivery on resume   Not queued - lost
Connection     Kept open                       Severed (terminated)
Reconnection   N/A (still connected)           Blocked until unbanned
Use case       Simulate network freeze         Simulate node blacklisting

Network Partitions

Isolate nodes or connections to simulate network splits.

chaos partition create

Create a network partition.

stark chaos partition create --node node1 --node node2

Option                            Description
-n, --node <nodeId>               Node IDs to partition (repeatable)
-c, --connection <connectionId>   Connection IDs to partition (repeatable)
-d, --duration <ms>               Auto-heal after duration

Examples:

# Partition two nodes
stark chaos partition create --node node1 --node node2

# Partition with auto-heal
stark chaos partition create --node node1 --duration 10000

# Partition by connection
stark chaos partition create --connection conn1 --connection conn2

chaos partition list

List all active network partitions.

stark chaos partition list

chaos partition remove <partitionId>

Remove a partition (heal the network).

stark chaos partition remove partition-abc123

Latency Injection

Add artificial latency to connections.

chaos latency add

Inject latency into connections.

stark chaos latency add --node <nodeId> --latency 200

Option                            Description
-n, --node <nodeId>               Target node ID
-c, --connection <connectionId>   Target connection ID
-l, --latency <ms>                Latency to inject (required)
-j, --jitter <ms>                 Latency jitter (variance)
-d, --duration <ms>               Auto-remove after duration

Examples:

# Add 200ms latency
stark chaos latency add --node node1 --latency 200

# Add latency with jitter
stark chaos latency add --node node1 --latency 200 --jitter 50

# Auto-remove after 30 seconds
stark chaos latency add --node node1 --latency 500 --duration 30000

chaos latency remove <ruleId>

Remove a latency injection rule.

stark chaos latency remove rule-abc123

Heartbeat Manipulation

Delay or drop heartbeat messages to trigger node health-state transitions (SUSPECT, OFFLINE).

chaos heartbeat-delay <nodeId>

Add heartbeat delay for a specific node.

stark chaos heartbeat-delay production-node-1 --delay 3000

Option               Description
--delay <ms>         Delay in milliseconds (required)
--duration <ms>      Auto-remove after duration
--drop-rate <rate>   Probability to drop heartbeats (0-1)

Examples:

# Add 3 second heartbeat delay
stark chaos heartbeat-delay node1 --delay 3000

# Drop 50% of heartbeats
stark chaos heartbeat-delay node1 --delay 3000 --drop-rate 0.5

# Auto-remove after 1 minute
stark chaos heartbeat-delay node1 --delay 5000 --duration 60000

Message Dropping

Drop messages based on type or probability.

chaos message-drop

Add a message drop rule.

stark chaos message-drop --node <nodeId> --rate 0.5

Option                            Description
-n, --node <nodeId>               Target node ID
-t, --types <type>                Message types to drop (repeatable)
-r, --rate <rate>                 Drop rate 0-1 (required)

Examples:

# Drop 50% of all messages
stark chaos message-drop --node node1 --rate 0.5

# Drop specific message types
stark chaos message-drop --node node1 --types pod:start --types pod:stop --rate 0.3

# Drop all pod messages globally
stark chaos message-drop --types pod:start --types pod:stop --rate 1.0

Flaky API

Make internal API calls unreliable.

chaos api-flaky

Configure flaky API behavior.

stark chaos api-flaky --error-rate 0.1 --timeout-rate 0.05

Option                  Description
--error-rate <rate>     Probability of API errors (0-1)
--timeout-rate <rate>   Probability of timeouts (0-1)
--timeout-ms <ms>       Timeout duration

Examples:

# 10% error rate
stark chaos api-flaky --error-rate 0.1

# 5% timeout rate with 5 second timeouts
stark chaos api-flaky --timeout-rate 0.05 --timeout-ms 5000

# Combined chaos
stark chaos api-flaky --error-rate 0.1 --timeout-rate 0.05

Cleanup and Monitoring

chaos clear

Clear all active chaos rules.

stark chaos clear

chaos stats

Get detailed chaos statistics.

stark chaos stats

chaos events

Get recent chaos events.

stark chaos events

Option            Description
-c, --count <n>   Number of events to retrieve (default: 50)

Example:

# Get last 100 events
stark chaos events --count 100

Output Formats

All chaos commands support JSON output for scripting:

# JSON output
stark -o json chaos status
stark -o json chaos nodes
stark -o json chaos events
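JSON output makes chaos state scriptable. A minimal sketch, assuming the status payload exposes an `enabled` boolean (the exact field name is an assumption; verify it against `stark -o json chaos status` on your build):

```shell
# Gate a destructive test script on chaos mode being enabled.
# ASSUMPTION: the status JSON contains an "enabled" boolean field.
require_chaos() {
  if ! command -v stark >/dev/null 2>&1; then
    echo "stark CLI not found; skipping check" >&2
    return 0
  fi
  if stark -o json chaos status | grep -q '"enabled": *true'; then
    return 0
  fi
  echo "chaos mode is NOT enabled; run: stark chaos enable" >&2
  return 1
}

require_chaos && echo "safe to inject faults"
```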

Interactive Testing

For interactive testing during development, use the built-in test runner:

pnpm dev:test

This launches an interactive menu for quick chaos injection without the full CLI.


Example: Full Chaos Test Session

# 1. Enable chaos mode
stark chaos enable

# 2. Check what's connected
stark chaos nodes
stark chaos connections

# 3. Inject some latency
stark chaos latency add --node production-node-1 --latency 300 --jitter 100

# 4. Run a failure scenario
stark chaos run node-loss

# 5. Check the impact
stark chaos stats
stark chaos events

# 6. Create a network partition
stark chaos partition create --node node1 --node node2 --duration 10000

# 7. Watch events in real-time (run in another terminal)
watch -n 1 'stark -o json chaos stats'

# 8. Clean up when done
stark chaos clear
stark chaos disable

Best Practices

  1. Start small: Begin with simple faults before complex scenarios
  2. Monitor impact: Always watch chaos stats and chaos events
  3. Use durations: Set auto-cleanup durations to avoid forgotten chaos
  4. Test in staging: Never run chaos testing in production
  5. Document findings: Record how your system responds to each fault
  6. Incremental complexity: Layer multiple faults to find edge cases
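Practices 2-4 can be baked into a wrapper script: a trap guarantees `chaos clear` and `chaos disable` run even if the scenario aborts. A sketch (assumes the stark CLI from this guide is on PATH and a node named node1 is connected; it exits quietly otherwise):

```shell
# run-chaos-test.sh -- run one scenario with guaranteed cleanup.

run_chaos_test() {
  stark chaos enable
  # Prefer bounded faults so nothing lingers if a step is forgotten.
  stark chaos latency add --node node1 --latency 300 --duration 30000
  stark chaos run node-loss --timeout 60000
  # Record what happened for later analysis.
  stark chaos events --count 100
  stark chaos stats
}

chaos_cleanup() {
  # Runs even if the scenario above fails.
  stark chaos clear
  stark chaos disable
}

if command -v stark >/dev/null 2>&1; then
  trap chaos_cleanup EXIT
  run_chaos_test
else
  echo "stark CLI not found; skipping" >&2
fi
```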
