File: configs/scenarios/enhanced/sustained-load.yml
Version: 1.0
Tags: sustained, stability, long, endurance
Extended 30-minute test designed to validate system stability under continuous load. This scenario is critical for detecting memory leaks, resource exhaustion, gradual performance degradation, and other issues that only manifest over longer time periods.
- Production readiness testing - Validate system before releases
- Memory leak detection - Identify resource leaks over time
- Long-term stability validation - Ensure system doesn't degrade
- Load testing before releases - Stress test with realistic duration
- Soak testing - Verify sustained performance under load
- Capacity planning - Understand long-term resource usage
| Parameter | Value | Description |
|---|---|---|
| Duration | 1800 seconds (30 min) | Extended duration for stability testing |
| Traders | 25 | Higher concurrency than regression test |
| Expected Orders | ~18,000 | High volume over test duration |
| Trading Pattern | constant_rate | Consistent load for stability observation |
| Base Rate | 600.0 orders/min | 10 orders/second aggregate rate |
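The parameters above are mutually consistent, as a quick arithmetic check confirms:

```python
# Sanity-check the scenario parameters against each other
base_rate_per_min = 600.0                       # orders/min from the config
duration_s = 1800                               # 30 minutes
aggregate_rate = base_rate_per_min / 60         # 10.0 orders/second
expected_orders = aggregate_rate * duration_s   # 18,000 orders over the run

print(f"{aggregate_rate:.1f} orders/sec, ~{expected_orders:,.0f} orders total")
```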
- Market Orders: 45%
- Limit Orders: 45%
- TWAP Orders: 5%
- Stop-Loss Orders: 3%
- Good-After-Time Orders: 2%
Rationale: Realistic mix including conditional orders to test all code paths over extended period.
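To illustrate how such a mix can be realized, here is a minimal weighted-sampling sketch; the type names are illustrative, not actual cow-perf identifiers:

```python
import random

# Illustrative weights matching the mix above (percentages sum to 100)
ORDER_MIX = {
    "market": 45,
    "limit": 45,
    "twap": 5,
    "stop_loss": 3,
    "good_after_time": 2,
}

def pick_order_type(rng: random.Random) -> str:
    """Draw one order type according to the configured percentages."""
    types = list(ORDER_MIX)
    weights = list(ORDER_MIX.values())
    return rng.choices(types, weights=weights, k=1)[0]

rng = random.Random(42)
sample = [pick_order_type(rng) for _ in range(10_000)]
print(sample.count("market") / len(sample))  # roughly 0.45
```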
Minimum:
- Memory: 4.0 GB
- CPU Cores: 4
Recommended:
- Memory: 8.0 GB
- CPU Cores: 8
Note: Higher resource requirements reflect longer duration and larger state accumulation.
The test automatically validates results against these thresholds:
| Metric | Threshold | Rationale |
|---|---|---|
| Success Rate | ≥ 95% | Higher threshold for production readiness |
| P95 Latency | ≤ 10 seconds | Stricter latency requirement |
| Error Rate | ≤ 5% | Lower error tolerance for stability |
| Throughput | ≥ 9.0 orders/sec | Must sustain high throughput |
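A sketch of how these thresholds might be checked programmatically; the `RunResults` fields are assumed names, not the tool's actual result schema:

```python
from dataclasses import dataclass

@dataclass
class RunResults:
    success_rate: float    # fraction, e.g. 0.97
    p95_latency_s: float   # seconds
    error_rate: float      # fraction
    throughput_ops: float  # orders/second

def validate(r: RunResults) -> list[str]:
    """Return a list of threshold violations (empty list means pass)."""
    failures = []
    if r.success_rate < 0.95:
        failures.append(f"success rate {r.success_rate:.1%} < 95%")
    if r.p95_latency_s > 10.0:
        failures.append(f"P95 latency {r.p95_latency_s:.1f}s > 10s")
    if r.error_rate > 0.05:
        failures.append(f"error rate {r.error_rate:.1%} > 5%")
    if r.throughput_ops < 9.0:
        failures.append(f"throughput {r.throughput_ops:.1f} orders/s < 9.0")
    return failures

print(validate(RunResults(0.97, 8.2, 0.02, 9.8)))  # [] -> pass
```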
With healthy system performance, you should see:
- Success rate: 96-99%
- P95 latency: 6-9 seconds
- Error rate: 1-3%
- Throughput: 9.5-10.5 orders/second
- Stable memory usage over entire test
- No degradation in later vs. earlier periods
Watch for these warning signs:
Memory Issues:
- Increasing memory usage over time → Memory leak
- Out of memory errors → Insufficient resources
Performance Degradation:
- Latency increasing over time → Resource exhaustion
- Throughput declining over time → Bottleneck accumulation
- Success rate dropping in later periods → State corruption
System Instability:
- Error rate increasing over time → System instability
- Service crashes or restarts → Resource exhaustion
- Connection timeouts → Network saturation
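Most of these warning signs reduce to "a metric trends upward over the run", which can be flagged automatically. A minimal sketch, assuming metric samples are available as (timestamp, value) arrays:

```python
import numpy as np

def trending_up(timestamps_s, values, min_slope_per_min):
    """Least-squares slope check: is the metric rising faster than the
    allowed rate (threshold expressed per minute)?"""
    slope_per_s = np.polyfit(np.asarray(timestamps_s, float),
                             np.asarray(values, float), 1)[0]
    return bool(slope_per_s * 60 > min_slope_per_min)

# Error-rate samples taken once a minute over the 30-minute run
t = np.arange(0, 1800, 60)
flat = 0.02 + 0.001 * (-1.0) ** np.arange(len(t))  # healthy: jitter around 2%
leaky = 0.02 + t * (0.03 / 1800)                   # unhealthy: climbs 2% -> 5%
print(trending_up(t, flat, 0.0005), trending_up(t, leaky, 0.0005))  # False True
```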
```shell
# Run sustained load test
cow-perf run --config configs/scenarios/enhanced/sustained-load.yml

# Run with extended settlement wait (recommended for long tests)
cow-perf run \
  --config configs/scenarios/enhanced/sustained-load.yml \
  --settlement-wait 600

# Run and save as baseline for production readiness
cow-perf run \
  --config configs/scenarios/enhanced/sustained-load.yml \
  --save-baseline "v2.0-stability" \
  --baseline-description "30-min stability test before v2.0 release"

# Run with Prometheus monitoring
cow-perf run \
  --config configs/scenarios/enhanced/sustained-load.yml \
  --prometheus-port 9091
```

```shell
# In a separate terminal, monitor resource usage
watch -n 5 'docker stats'

# Monitor memory specifically
watch -n 10 'ps aux | grep cow-perf'

# Monitor Prometheus metrics (if enabled)
curl http://localhost:9091/metrics | grep cow_performance
```

The key to sustained load testing is analyzing trends over time, not just final metrics.
Plot these metrics over time:
- Memory usage (should be stable or grow slowly)
- Latency (should remain consistent)
- Success rate (should not trend downward)
- Throughput (should remain constant)
- Error rate (should stay consistently low)
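Before plotting, it helps to aggregate raw samples into fixed windows. A small sketch, assuming per-second samples are available as arrays:

```python
import numpy as np

def window_means(timestamps_s, values, window_s=300):
    """Average a metric into fixed windows (e.g. 5-minute buckets)
    so trends stand out when plotted."""
    timestamps_s = np.asarray(timestamps_s, float)
    values = np.asarray(values, float)
    edges = np.arange(0, timestamps_s.max() + window_s, window_s)
    idx = np.digitize(timestamps_s, edges) - 1
    return [values[idx == i].mean() for i in range(len(edges) - 1)]

# Example: per-second latency samples over 30 minutes
t = np.arange(1800)
latency = 7.0 + 0.5 * np.random.default_rng(0).standard_normal(1800)
print([round(m, 2) for m in window_means(t, latency)])  # six values near 7.0
```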
Compare metrics between different time periods within the same test:
```python
# Pseudo-code for analysis
first_10_min = results[0:600]
middle_10_min = results[600:1200]
last_10_min = results[1200:1800]

# Performance should be similar across all periods
assert abs(first_10_min.success_rate - last_10_min.success_rate) < 0.05
assert last_10_min.p95_latency < 1.2 * first_10_min.p95_latency  # Allow 20% degradation max
```

Indicators of memory leaks:
- Linear memory growth throughout test
- Memory doesn't plateau or stabilize
- Memory usage >> expected based on state size
- System slowdown in later periods
Analysis approach:
- Record memory at start, middle, end
- Calculate growth rate
- Extrapolate to longer durations
- Determine if growth is bounded
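The steps above can be sketched as a small helper; `samples` is assumed to be (seconds, MB) pairs collected during the run:

```python
import numpy as np

def extrapolate_memory(samples, horizon_s=86_400):
    """samples: [(t_seconds, rss_mb), ...]. Fit a linear growth rate and
    project memory usage at the horizon (default: 24 hours)."""
    t, mb = np.asarray(samples, float).T
    slope_mb_per_s, intercept = np.polyfit(t, mb, 1)
    return slope_mb_per_s * 60, intercept + slope_mb_per_s * horizon_s

# Example: 0.5 MB/min growth observed during a 30-minute run
samples = [(t, 2048 + 0.5 * t / 60) for t in range(0, 1800, 60)]
growth_per_min, projected_mb = extrapolate_memory(samples)
print(f"{growth_per_min:.2f} MB/min -> {projected_mb:.0f} MB after 24h")
# 0.50 MB/min -> 2768 MB after 24h
```

If the projected figure is unbounded relative to expected state size, treat the growth as a leak rather than normal accumulation.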
- Run before major releases - Validate stability before production
- Monitor resource usage - Watch memory, CPU, network during test
- Compare time periods - Ensure consistent performance throughout
- Establish baselines - Track stability metrics over releases
- Run in production-like environment - Match deployment configuration
- Allow for settlement - Use a longer --settlement-wait (600s recommended)
- Monitor logs - Watch for errors, warnings, anomalies
- Run overnight - Schedule for off-hours to avoid interference
| Metric | Tool | What to Watch |
|---|---|---|
| Memory Usage | docker stats | Should plateau, not grow linearly |
| CPU Usage | docker stats | Should be steady, not increasing |
| Network I/O | docker stats | Should be consistent |
| Disk I/O | iostat | Should not be a bottleneck |
| Connection Count | netstat | Should be stable |
| Metric | Analysis | Healthy Range |
|---|---|---|
| Success Rate | Mean ± std dev | 96-99% ± 1% |
| P95 Latency | Trend over time | < 10s, flat trend |
| Error Rate | Mean ± std dev | 1-3% ± 1% |
| Throughput | Mean | 9.5-10.5 orders/s |
Memory continuously increasing
- Check for unreleased resources (connections, file handles)
- Review order tracker cleanup
- Monitor metrics storage
Performance degrades over time
- Check for state accumulation
- Review caching strategy
- Monitor database query performance
High error rate in later periods
- Check for connection pool exhaustion
- Review retry logic
- Monitor API rate limits
Service crashes mid-test
- Increase memory allocation
- Review error handling
- Check for race conditions
Before releasing to production, this scenario should demonstrate:
- Success rate > 95% sustained over 30 minutes
- P95 latency < 10 seconds consistently
- No memory leaks (memory plateaus)
- No performance degradation over time
- Error rate < 5% consistently
- No service crashes or restarts
- Resource usage within acceptable limits
- All 18,000+ orders processed
- Conditional orders work correctly
- regression-test.yml - Quick 2-minute smoke test for CI/CD
- high-frequency.yml - Short burst of extreme load
- large-orders.yml - Edge case testing with large order sizes
```python
# Analyze latency distribution changes over test duration
import numpy as np

early_latencies = get_latencies(0, 600)     # First 10 min
late_latencies = get_latencies(1200, 1800)  # Last 10 min

print(f"P50 change: {np.percentile(late_latencies, 50) / np.percentile(early_latencies, 50)}")
print(f"P95 change: {np.percentile(late_latencies, 95) / np.percentile(early_latencies, 95)}")
print(f"P99 change: {np.percentile(late_latencies, 99) / np.percentile(early_latencies, 99)}")
# Should all be close to 1.0 (no change)
```

```python
# Calculate memory growth rate (get_memory_usage is scenario-specific)
memory_samples = [(t, get_memory_usage(t)) for t in range(0, 1800, 60)]
times, memory_mb = zip(*memory_samples)
slope_mb_per_min = np.polyfit(times, memory_mb, 1)[0] * 60

# Acceptable: < 1 MB/minute growth
# Warning: 1-5 MB/minute
# Critical: > 5 MB/minute
```

Test doesn't complete
- Increase timeout settings
- Check for deadlocks
- Review service health
Intermittent failures
- Check for rate limiting
- Review connection timeouts
- Monitor network stability
Results inconsistent between runs
- Ensure consistent environment
- Check for external interference
- Review random seed handling
- 1.0 (Initial) - 30-minute stability test with realistic order mix