
perf: per-user metrics batching to reduce channel pressure#682

Open
jeremyandrews wants to merge 9 commits into tag1consulting:main from jeremyandrews:feature/per-user-metrics-batching

Conversation


@jeremyandrews jeremyandrews commented Mar 7, 2026

Summary

  • Each GooseUser pre-aggregates metrics locally into batches (up to 100 requests or 250ms) before flushing as a single GooseMetric::Batch message, reducing channel traffic from O(RPS) to O(RPS/batch_size)
  • Successful non-CO requests are pre-aggregated (timing BTreeMaps, status codes, status code timing summaries, graph data); failed and CO-affected requests are buffered individually within the batch for proper error logging and CO backfill
  • Epoch-based batch validity handles metrics resets correctly: stale batches are discarded on both the user side and processor side
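
The epoch mechanism can be sketched as follows. This is a minimal illustration with hypothetical names (`MetricsEpoch`, `Batch`, `process_batch` signatures are illustrative, not Goose's actual types), assuming an atomic counter shared between users and the processor:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical sketch of epoch-based batch validity, not Goose's actual API.
struct MetricsEpoch(AtomicU64);

impl MetricsEpoch {
    fn current(&self) -> u64 {
        self.0.load(Ordering::Acquire)
    }
    // A metrics reset bumps the epoch, invalidating all in-flight batches.
    fn reset(&self) {
        self.0.fetch_add(1, Ordering::AcqRel);
    }
}

struct Batch {
    epoch: u64,
    count: usize,
}

// Processor side: merge only batches stamped with the current epoch.
fn process_batch(epoch: &MetricsEpoch, batch: &Batch, total: &mut usize) -> bool {
    if batch.epoch != epoch.current() {
        return false; // stale batch from before a reset: discard it
    }
    *total += batch.count;
    true
}

fn main() {
    let epoch = MetricsEpoch(AtomicU64::new(0));
    let mut total = 0;
    let old = Batch { epoch: epoch.current(), count: 10 };
    epoch.reset(); // reset happens while the batch is in flight
    assert!(!process_batch(&epoch, &old, &mut total)); // discarded
    let fresh = Batch { epoch: epoch.current(), count: 5 };
    assert!(process_batch(&epoch, &fresh, &mut total));
    assert_eq!(total, 5);
}
```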

Design

The implementation adds:

  • GooseMetricBatch struct with pre-aggregated request, error, transaction, and scenario data
  • GooseMetric::Batch variant for the metrics channel
  • MetricsProcessor::process_batch() for merging pre-aggregated data into global aggregates
  • Batch accumulation methods on GooseUser (accumulate_request, accumulate_transaction, accumulate_scenario, flush_batch, ensure_batch_current)
  • Batch-aware graph data recording methods (record_average_response_time_per_second_batch uses proper weighted merge via MovingAverage::merge())
  • Shared MetricsEpoch (Arc) for coordinating metrics resets
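
The dual flush trigger (size or age, whichever fires first) can be sketched like this; the constant and type names are illustrative, not Goose's actual identifiers:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch of the size/age flush trigger described above.
const BATCH_MAX_REQUESTS: usize = 100;
const BATCH_MAX_AGE: Duration = Duration::from_millis(250);

struct LocalBatch {
    items: usize,
    started: Instant,
}

impl LocalBatch {
    fn new() -> Self {
        LocalBatch { items: 0, started: Instant::now() }
    }

    // Flush when the batch is full OR too old, whichever comes first.
    fn should_flush(&self) -> bool {
        self.items >= BATCH_MAX_REQUESTS || self.started.elapsed() >= BATCH_MAX_AGE
    }
}

fn main() {
    let mut batch = LocalBatch::new();
    assert!(!batch.should_flush()); // fresh and empty: keep accumulating
    batch.items = BATCH_MAX_REQUESTS;
    assert!(batch.should_flush()); // size trigger fires
}
```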

Batch size / timing rationale

  • 100 requests: reduces channel traffic by ~100× for hot paths; at 1000 RPS/user a flush happens every ~100ms
  • 250ms max age: ensures dashboard freshness (metrics tools refresh at 1–5s intervals, so 250ms latency is invisible to operators); low-throughput users still report within a quarter-second
| User throughput | Flush trigger | Effective delay |
|---|---|---|
| 1000+ RPS/user | Size (100 requests) | ~100ms |
| 400 RPS/user | Size (100 requests) | ~250ms (hits time limit) |
| 100 RPS/user | Time (250ms) | 250ms |
| 10 RPS/user | Time (250ms) | 250ms |
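
The effective-delay column follows from simple arithmetic: the delay is the time to accumulate 100 requests, capped by the 250ms max batch age. As an illustration (the function name is hypothetical):

```rust
// Illustrative arithmetic behind the table above: time to fill a batch of
// 100 requests at the given per-user rate, capped by the 250ms max age.
fn effective_delay_ms(rps_per_user: f64) -> f64 {
    const BATCH_SIZE: f64 = 100.0;
    const MAX_AGE_MS: f64 = 250.0;
    (BATCH_SIZE / rps_per_user * 1000.0).min(MAX_AGE_MS)
}

fn main() {
    assert_eq!(effective_delay_ms(1000.0), 100.0); // size trigger dominates
    assert_eq!(effective_delay_ms(400.0), 250.0);  // both triggers coincide
    assert_eq!(effective_delay_ms(10.0), 250.0);   // time trigger dominates
}
```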

Bug fix: transaction/scenario time rounding

This PR also fixes a pre-existing bug in TransactionMetricAggregate::set_time and ScenarioMetricAggregate::set_time where the rounding logic for response times >500ms used incorrect multipliers:

// BEFORE (buggy):
501..=1000 => ((time as f64 / 100.0).round() * 10.0) as usize,   // * 10.0 should be * 100.0
_ => ((time as f64 / 1000.0).round() * 10.0) as usize,           // * 10.0 should be * 1000.0

// AFTER (fixed via round_metric_time helper):
501..=1000 => ((time as f64 / 100.0).round() * 100.0) as usize,
_ => ((time as f64 / 1000.0).round() * 1000.0) as usize,

This caused transaction and scenario percentile time buckets above 500ms to be ~10x too low (e.g. a 750ms transaction was bucketed at 80 instead of the correct 800). The request timing path (GooseRequestMetricTimingData::record_time) did not have this bug.

The fix extracts a single round_metric_time() helper that replaces all five copies of the rounding logic across metrics.rs and goose.rs, fixing the bug and eliminating duplication.
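
A sketch of such a helper, based on the corrected match arms shown above; the two sub-500ms arms are assumptions (exact below 100ms, nearest 10ms up to 500ms), extrapolated from the same bucketing pattern rather than taken from the actual diff:

```rust
// Sketch of the extracted rounding helper. The 501..=1000 and catch-all arms
// match the corrected snippet above; the lower arms are assumed.
fn round_metric_time(time: u64) -> usize {
    match time {
        0..=100 => time as usize,
        101..=500 => ((time as f64 / 10.0).round() * 10.0) as usize,
        501..=1000 => ((time as f64 / 100.0).round() * 100.0) as usize,
        _ => ((time as f64 / 1000.0).round() * 1000.0) as usize,
    }
}

fn main() {
    assert_eq!(round_metric_time(750), 800); // was 80 under the buggy rounding
    assert_eq!(round_metric_time(1500), 2000);
    assert_eq!(round_metric_time(42), 42);
}
```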

Note: This changes the percentile distribution for transaction/scenario times >500ms. Historical data or baselines that relied on the old (broken) rounding will not be directly comparable.

Files changed

  • src/metrics.rs — Batch types, process_batch() merging, epoch handling, round_metric_time helper, 14 new unit tests
  • src/goose.rs — GooseUser batch fields and accumulation/flush methods, dead code cleanup, debug assertions
  • src/user.rs — Batch path in record_scenario, batch flush on user shutdown
  • src/graph.rs — Batch-aware graph data recording with proper weighted MovingAverage merge, unit test
  • src/lib.rs — Shared epoch initialization, batch state on GooseUser
  • src/metrics/coordinated_omission.rs — Make update_synthetic_percentage pub(crate)
  • tests/coordinated_omission_integration.rs — Fix pre-existing bug: user_id is 0-indexed
  • tests/status_code_response_times.rs — Fix pre-existing bug: sub-ms min_time can be 0
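
The weighted MovingAverage merge mentioned for src/graph.rs avoids the precision loss of folding one side's truncated average back in sample by sample. The idea, with hypothetical types (not Goose's actual `MovingAverage` API):

```rust
// Illustrative weighted merge of two running averages.
struct MovingAverage {
    average: f64,
    count: u64,
}

impl MovingAverage {
    // Weight each side by its sample count, instead of truncating the other
    // side's average to an integer and re-adding it once per sample.
    fn merge(&mut self, other: &MovingAverage) {
        let total = self.count + other.count;
        if total == 0 {
            return; // nothing to merge
        }
        self.average = (self.average * self.count as f64
            + other.average * other.count as f64)
            / total as f64;
        self.count = total;
    }
}

fn main() {
    let mut a = MovingAverage { average: 10.0, count: 1 };
    let b = MovingAverage { average: 20.0, count: 3 };
    a.merge(&b);
    assert_eq!(a.average, 17.5); // (10*1 + 20*3) / 4
    assert_eq!(a.count, 4);
}
```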

Performance

Benchmarked with httpmock (localhost) at 10/50/100/200 users, 5s each, best of 3:

| Users | Baseline (req/s) | Batch (req/s) | Delta |
|---|---|---|---|
| 10 | 80,021 | 81,579 | +1.9% |
| 50 | 91,402 | 90,766 | -0.7% |
| 100 | 90,828 | 89,445 | -1.5% |
| 200 | 86,722 | 85,519 | -1.4% |

With localhost mock servers the HTTP roundtrip is the bottleneck, not the metrics channel — results are within noise. The batching benefit manifests at scale with real network latency where channel contention becomes the limiting factor.

Load testing at scale

To properly validate the batching benefit, test with real network latency where channel contention is the bottleneck rather than HTTP roundtrips.

Setup

Two machines (VMs or cloud instances), one for goose and one for the target server. The key is having enough network latency (≥1ms) that user threads produce metrics faster than the processor can consume them without batching.

Target server: A simple HTTP server returning a static 200 with ~2ms artificial delay:

// Example using actix-web
use actix_web::{web, App, HttpServer, HttpResponse};

async fn handler() -> HttpResponse {
    tokio::time::sleep(std::time::Duration::from_millis(2)).await;
    HttpResponse::Ok().body("ok")
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().route("/", web::get().to(handler)))
        .bind("0.0.0.0:8080")?
        .run()
        .await
}

Alternatively, add latency with tc netem:

# On the goose machine, add 2ms latency to the interface facing the target
sudo tc qdisc add dev eth0 root netem delay 2ms

Test script

# Build both branches
git checkout main && cargo build --release
cp target/release/goose goose-baseline

git checkout feature/per-user-metrics-batching && cargo build --release
cp target/release/goose goose-batch

# Run baseline (best of 3, 60s each)
for i in 1 2 3; do
  ./goose-baseline --host http://target:8080 \
    --users 200 --increase-rate 50 --run-time 60s \
    --no-reset-metrics 2>&1 | tee baseline-$i.log
done

# Run batch (best of 3, 60s each)
for i in 1 2 3; do
  ./goose-batch --host http://target:8080 \
    --users 200 --increase-rate 50 --run-time 60s \
    --no-reset-metrics 2>&1 | tee batch-$i.log
done

What to measure

  • Total RPS (headline number)
  • P99 response time (should be lower if channel backpressure was slowing user threads)
  • CPU utilization on the goose machine (pidstat -p $(pgrep goose) 1)
  • Optional: perf stat or flamegraph to see channel send/recv overhead reduction

What to vary

  • User counts: 50, 100, 200, 500
  • Network latency: 1ms, 5ms, 10ms (higher latency → more concurrent requests in flight → more channel pressure)

Expected outcome

At 200+ users with real network latency, expect a measurable RPS improvement (5–15%) and lower tail latencies. The win grows with user count and endpoint diversity.

Test plan

  • All 98 unit tests pass (14 new batch-specific tests)
  • All 6 atomic_counters tests pass (including metrics reset scenarios)
  • All 15 coordinated_omission_integration tests pass
  • All integration tests pass (0 failures across entire test suite)
  • cargo fmt clean
  • Performance comparison shows no regression (localhost)
  • Validate at scale with real network latency (see load testing section above)

Addresses #675.

…h RPS

Each GooseUser now pre-aggregates metrics locally into batches of up to
100 requests (or 250ms max age) before flushing them over the channel as
a single GooseMetric::Batch message. This reduces channel traffic from
O(RPS) to O(RPS/batch_size), lowering contention and MetricsProcessor
overhead at high request rates.

Key design decisions:
- Successful non-CO requests are pre-aggregated (timing BTreeMaps,
  status codes, status code timing summaries, graph data)
- Failed requests and CO-affected requests are buffered individually
  within the batch (they need per-request error logging and CO backfill)
- Epoch-based batch validity ensures metrics resets don't contaminate
  post-reset data: the processor discards stale batches, and users
  discard stale local batches before accumulating new metrics

Also fixes two pre-existing test bugs:
- coordinated_omission_integration: user_id is 0-indexed, not 1-indexed
- status_code_response_times: sub-ms responses can have min_time of 0

Addresses tag1consulting#675.
- Add `record_average_response_time_per_second_batch()` to GraphData
  that performs a proper weighted MovingAverage merge instead of the
  previous loop-with-truncation approach (which lost precision from
  `avg as u64` truncation)

- Remove dead code in `accumulate_request`: the `!request_metric.success`
  graph error branch and `!request_metric.error.is_empty()` error batch
  branch were unreachable since only successful non-error requests reach
  this method. Add debug_assert! to document the invariant.

- Add 14 unit tests for batch operations in metrics::test:
  - batch_new_has_correct_epoch
  - batch_serialization_round_trip
  - batch_sent_on_channel
  - process_batch_merges_request_timing
  - process_batch_merges_multiple_batches
  - process_batch_handles_individual_requests
  - process_batch_merges_error_counts
  - process_batch_merges_transaction_timing
  - process_batch_merges_scenario_timing
  - process_batch_discards_stale_epoch
  - process_batch_accepts_current_epoch
  - process_batch_produces_same_result_as_individual
  - process_batch_graph_data_with_report_file
  - batch_constants_are_sensible

- Add 1 unit test for graph batch merge in graph::test:
  - test_record_average_response_time_per_second_batch
Remove ErrorBatchEntry, batch.errors, and batch.graph_eps which were
never populated by user-side accumulation code (failed requests route
through accumulate_individual_request instead). Remove the associated
merge loops in process_batch(), the record_errors_per_second_batch()
graph method, and the process_batch_merges_error_counts test.

Fix extra string clone in accumulate_request's graph data path by
moving each counter_key_buf.clone() directly into entry() instead of
cloning an intermediate variable.

Rename batch_request_count to batch_item_count with expanded doc
comment clarifying it tracks request metrics as the size-based flush
trigger.
…ing bug

The transaction and scenario rounding logic used `* 10.0` instead of
`* 100.0` / `* 1000.0` for the 500ms+ ranges, causing percentile
buckets to be ~10x too low (e.g. a 750ms transaction bucketed as 80
instead of 800). The request timing path had the correct formula.

Extract a single `round_metric_time()` helper that replaces all five
copies of the rounding logic across metrics.rs and goose.rs, fixing
the bug and eliminating duplication.

Also: replace `#[allow(dead_code)]` with `#[cfg(test)]` on
test-only `ItemsPerSecond` methods in graph.rs, and add
`cleanup_files()` calls to user_metrics_graph_reset tests.
- Use is_some_and instead of map_or for simplified option check
- Allow field_reassign_with_default in test module (GooseConfiguration
  and GooseMetrics require Default + field overrides in tests)
- Use const blocks for compile-time constant assertions
Merge graph_rps and graph_avg_rt into a single graph_request_data
HashMap keyed by (request_key, second), eliminating a redundant key
clone in accumulate_request and a redundant iteration in process_batch.

Replace contains_key + insert + get_mut().unwrap() with entry() API
in ItemsPerSecond::initialize_or_increment and GraphData's response
time recording methods. Remove the now-unused contains_key method and
mark insert as #[cfg(test)].

Add early return for count == 0 in the batch average response time
merge path.
Remove extra blank line and expand function arguments to satisfy
the formatting rules in Rust 1.94's rustfmt.
…ate mod

- Update batch test code to use `"TestScenario".into()` for Arc<str>
  fields introduced by PR tag1consulting#683
- Remove duplicate `mod common;` in user_metrics_graph_reset test
jeremyandrews force-pushed the feature/per-user-metrics-batching branch from b02dfef to 3517125 on March 10, 2026
…nds checks

- In accumulate_request, replace the unreachable `if !request_metric.update`
  guard with a debug_assert (update requests bypass this method entirely
  via set_success/set_failure → send_request_metric_now)
- Add bounds checks in process_batch for transaction and scenario indexes
  to prevent panics in Gaggle mode with mismatched worker configurations