
perf: per-user metrics batching to reduce channel pressure#682

Open
jeremyandrews wants to merge 9 commits into tag1consulting:main from jeremyandrews:feature/per-user-metrics-batching

Conversation


@jeremyandrews jeremyandrews commented Mar 7, 2026

Summary

  • Each GooseUser pre-aggregates metrics locally into batches (up to 100 requests or 250ms) before flushing as a single GooseMetric::Batch message, reducing channel traffic from O(RPS) to O(RPS/batch_size)
  • Successful non-CO requests are pre-aggregated (timing BTreeMaps, status codes, status code timing summaries, graph data); failed and CO-affected requests are buffered individually within the batch for proper error logging and CO backfill
  • Epoch-based batch validity handles metrics resets correctly: stale batches are discarded on both the user side and processor side
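
The epoch mechanism can be sketched as follows. This is a minimal illustration with hypothetical names (`MetricsEpoch`, `Batch`, `process_batch` signatures are illustrative, not Goose's actual types), assuming an atomic counter shared between users and the processor:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical sketch of epoch-based batch validity, not Goose's actual API.
struct MetricsEpoch(AtomicU64);

impl MetricsEpoch {
    fn current(&self) -> u64 {
        self.0.load(Ordering::Acquire)
    }
    // A metrics reset bumps the epoch, invalidating all in-flight batches.
    fn reset(&self) {
        self.0.fetch_add(1, Ordering::AcqRel);
    }
}

struct Batch {
    epoch: u64,
    count: usize,
}

// Processor side: merge only batches stamped with the current epoch.
fn process_batch(epoch: &MetricsEpoch, batch: &Batch, total: &mut usize) -> bool {
    if batch.epoch != epoch.current() {
        return false; // stale batch from before a reset: discard it
    }
    *total += batch.count;
    true
}

fn main() {
    let epoch = MetricsEpoch(AtomicU64::new(0));
    let mut total = 0;
    let old = Batch { epoch: epoch.current(), count: 10 };
    epoch.reset(); // reset happens while the batch is in flight
    assert!(!process_batch(&epoch, &old, &mut total)); // discarded
    let fresh = Batch { epoch: epoch.current(), count: 5 };
    assert!(process_batch(&epoch, &fresh, &mut total));
    assert_eq!(total, 5);
}
```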

Design

The implementation adds:

  • GooseMetricBatch struct with pre-aggregated request, error, transaction, and scenario data
  • GooseMetric::Batch variant for the metrics channel
  • MetricsProcessor::process_batch() for merging pre-aggregated data into global aggregates
  • Batch accumulation methods on GooseUser (accumulate_request, accumulate_transaction, accumulate_scenario, flush_batch, ensure_batch_current)
  • Batch-aware graph data recording methods (record_average_response_time_per_second_batch uses proper weighted merge via MovingAverage::merge())
  • Shared MetricsEpoch (Arc) for coordinating metrics resets
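
The dual flush trigger (size or age, whichever fires first) can be sketched like this; the constant and type names are illustrative, not Goose's actual identifiers:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch of the size/age flush trigger described above.
const BATCH_MAX_REQUESTS: usize = 100;
const BATCH_MAX_AGE: Duration = Duration::from_millis(250);

struct LocalBatch {
    items: usize,
    started: Instant,
}

impl LocalBatch {
    fn new() -> Self {
        LocalBatch { items: 0, started: Instant::now() }
    }

    // Flush when the batch is full OR too old, whichever comes first.
    fn should_flush(&self) -> bool {
        self.items >= BATCH_MAX_REQUESTS || self.started.elapsed() >= BATCH_MAX_AGE
    }
}

fn main() {
    let mut batch = LocalBatch::new();
    assert!(!batch.should_flush()); // fresh and empty: keep accumulating
    batch.items = BATCH_MAX_REQUESTS;
    assert!(batch.should_flush()); // size trigger fires
}
```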

Batch size / timing rationale

  • 100 requests: reduces channel traffic by ~100× for hot paths; at 1000 RPS/user a flush happens every ~100ms
  • 250ms max age: ensures dashboard freshness (metrics tools refresh at 1–5s intervals, so 250ms latency is invisible to operators); low-throughput users still report within a quarter-second
| User throughput | Flush trigger | Effective delay |
|---|---|---|
| 1000+ RPS/user | Size (100 requests) | ~100ms |
| 400 RPS/user | Size (100 requests) | ~250ms (hits time limit) |
| 100 RPS/user | Time (250ms) | 250ms |
| 10 RPS/user | Time (250ms) | 250ms |
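
The effective-delay column follows from simple arithmetic: the delay is the time to accumulate 100 requests, capped by the 250ms max batch age. As an illustration (the function name is hypothetical):

```rust
// Illustrative arithmetic behind the table above: time to fill a batch of
// 100 requests at the given per-user rate, capped by the 250ms max age.
fn effective_delay_ms(rps_per_user: f64) -> f64 {
    const BATCH_SIZE: f64 = 100.0;
    const MAX_AGE_MS: f64 = 250.0;
    (BATCH_SIZE / rps_per_user * 1000.0).min(MAX_AGE_MS)
}

fn main() {
    assert_eq!(effective_delay_ms(1000.0), 100.0); // size trigger dominates
    assert_eq!(effective_delay_ms(400.0), 250.0);  // both triggers coincide
    assert_eq!(effective_delay_ms(10.0), 250.0);   // time trigger dominates
}
```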

Bug fix: transaction/scenario time rounding

This PR also fixes a pre-existing bug in TransactionMetricAggregate::set_time and ScenarioMetricAggregate::set_time where the rounding logic for response times >500ms used incorrect multipliers:

// BEFORE (buggy):
501..=1000 => ((time as f64 / 100.0).round() * 10.0) as usize,   // * 10.0 should be * 100.0
_ => ((time as f64 / 1000.0).round() * 10.0) as usize,           // * 10.0 should be * 1000.0

// AFTER (fixed via round_metric_time helper):
501..=1000 => ((time as f64 / 100.0).round() * 100.0) as usize,
_ => ((time as f64 / 1000.0).round() * 1000.0) as usize,

This caused transaction and scenario percentile time buckets above 500ms to be ~10x too low (e.g. a 750ms transaction was bucketed at 80 instead of the correct 800). The request timing path (GooseRequestMetricTimingData::record_time) did not have this bug.

The fix extracts a single round_metric_time() helper that replaces all five copies of the rounding logic across metrics.rs and goose.rs, fixing the bug and eliminating duplication.
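
A sketch of such a helper, based on the corrected match arms shown above; the two sub-500ms arms are assumptions (exact below 100ms, nearest 10ms up to 500ms), extrapolated from the same bucketing pattern rather than taken from the actual diff:

```rust
// Sketch of the extracted rounding helper. The 501..=1000 and catch-all arms
// match the corrected snippet above; the lower arms are assumed.
fn round_metric_time(time: u64) -> usize {
    match time {
        0..=100 => time as usize,
        101..=500 => ((time as f64 / 10.0).round() * 10.0) as usize,
        501..=1000 => ((time as f64 / 100.0).round() * 100.0) as usize,
        _ => ((time as f64 / 1000.0).round() * 1000.0) as usize,
    }
}

fn main() {
    assert_eq!(round_metric_time(750), 800); // was 80 under the buggy rounding
    assert_eq!(round_metric_time(1500), 2000);
    assert_eq!(round_metric_time(42), 42);
}
```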

Note: This changes the percentile distribution for transaction/scenario times >500ms. Historical data or baselines that relied on the old (broken) rounding will not be directly comparable.

Files changed

  • src/metrics.rs — Batch types, process_batch() merging, epoch handling, round_metric_time helper, 14 new unit tests
  • src/goose.rs — GooseUser batch fields and accumulation/flush methods, dead code cleanup, debug assertions
  • src/user.rs — Batch path in record_scenario, batch flush on user shutdown
  • src/graph.rs — Batch-aware graph data recording with proper weighted MovingAverage merge, unit test
  • src/lib.rs — Shared epoch initialization, batch state on GooseUser
  • src/metrics/coordinated_omission.rs — Make update_synthetic_percentage pub(crate)
  • tests/coordinated_omission_integration.rs — Fix pre-existing bug: user_id is 0-indexed
  • tests/status_code_response_times.rs — Fix pre-existing bug: sub-ms min_time can be 0
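
The weighted MovingAverage merge mentioned for src/graph.rs avoids the precision loss of folding one side's truncated average back in sample by sample. The idea, with hypothetical types (not Goose's actual `MovingAverage` API):

```rust
// Illustrative weighted merge of two running averages.
struct MovingAverage {
    average: f64,
    count: u64,
}

impl MovingAverage {
    // Weight each side by its sample count, instead of truncating the other
    // side's average to an integer and re-adding it once per sample.
    fn merge(&mut self, other: &MovingAverage) {
        let total = self.count + other.count;
        if total == 0 {
            return; // nothing to merge
        }
        self.average = (self.average * self.count as f64
            + other.average * other.count as f64)
            / total as f64;
        self.count = total;
    }
}

fn main() {
    let mut a = MovingAverage { average: 10.0, count: 1 };
    let b = MovingAverage { average: 20.0, count: 3 };
    a.merge(&b);
    assert_eq!(a.average, 17.5); // (10*1 + 20*3) / 4
    assert_eq!(a.count, 4);
}
```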

Performance

Benchmarked with httpmock (localhost) at 10/50/100/200 users, 5s each, best of 3:

| Users | Baseline (req/s) | Batch (req/s) | Delta |
|---|---|---|---|
| 10 | 80,021 | 81,579 | +1.9% |
| 50 | 91,402 | 90,766 | -0.7% |
| 100 | 90,828 | 89,445 | -1.5% |
| 200 | 86,722 | 85,519 | -1.4% |

With localhost mock servers the HTTP roundtrip is the bottleneck, not the metrics channel — results are within noise. The batching benefit manifests at scale with real network latency where channel contention becomes the limiting factor.

Load testing at scale

To properly validate the batching benefit, test with real network latency where channel contention is the bottleneck rather than HTTP roundtrips.

Setup

Two machines (VMs or cloud instances), one for goose and one for the target server. The key is having enough network latency (≥1ms) that user threads produce metrics faster than the processor can consume them without batching.

Target server: A simple HTTP server returning a static 200 with ~2ms artificial delay:

// Example using actix-web
use actix_web::{web, App, HttpServer, HttpResponse};

async fn handler() -> HttpResponse {
    tokio::time::sleep(std::time::Duration::from_millis(2)).await;
    HttpResponse::Ok().body("ok")
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().route("/", web::get().to(handler)))
        .bind("0.0.0.0:8080")?
        .run()
        .await
}

Alternatively, add latency with tc netem:

# On the goose machine, add 2ms latency to the interface facing the target
sudo tc qdisc add dev eth0 root netem delay 2ms

Test script

# Build both branches
git checkout main && cargo build --release
cp target/release/goose goose-baseline

git checkout feature/per-user-metrics-batching && cargo build --release
cp target/release/goose goose-batch

# Run baseline (best of 3, 60s each)
for i in 1 2 3; do
  ./goose-baseline --host http://target:8080 \
    --users 200 --increase-rate 50 --run-time 60s \
    --no-reset-metrics 2>&1 | tee baseline-$i.log
done

# Run batch (best of 3, 60s each)
for i in 1 2 3; do
  ./goose-batch --host http://target:8080 \
    --users 200 --increase-rate 50 --run-time 60s \
    --no-reset-metrics 2>&1 | tee batch-$i.log
done

What to measure

  • Total RPS (headline number)
  • P99 response time (should be lower if channel backpressure was slowing user threads)
  • CPU utilization on the goose machine (pidstat -p $(pgrep goose) 1)
  • Optional: perf stat or flamegraph to see channel send/recv overhead reduction

What to vary

  • User counts: 50, 100, 200, 500
  • Network latency: 1ms, 5ms, 10ms (higher latency → more concurrent requests in flight → more channel pressure)

Expected outcome

At 200+ users with real network latency, expect a measurable RPS improvement (5–15%) and lower tail latencies. The win grows with user count and endpoint diversity.

Test plan

  • All 98 unit tests pass (14 new batch-specific tests)
  • All 6 atomic_counters tests pass (including metrics reset scenarios)
  • All 15 coordinated_omission_integration tests pass
  • All integration tests pass (0 failures across entire test suite)
  • cargo fmt clean
  • Performance comparison shows no regression (localhost)
  • Validate at scale with real network latency (see load testing section above)

Addresses #675.

…h RPS

Each GooseUser now pre-aggregates metrics locally into batches of up to
100 requests (or 250ms max age) before flushing them over the channel as
a single GooseMetric::Batch message. This reduces channel traffic from
O(RPS) to O(RPS/batch_size), lowering contention and MetricsProcessor
overhead at high request rates.

Key design decisions:
- Successful non-CO requests are pre-aggregated (timing BTreeMaps,
  status codes, status code timing summaries, graph data)
- Failed requests and CO-affected requests are buffered individually
  within the batch (they need per-request error logging and CO backfill)
- Epoch-based batch validity ensures metrics resets don't contaminate
  post-reset data: the processor discards stale batches, and users
  discard stale local batches before accumulating new metrics

Also fixes two pre-existing test bugs:
- coordinated_omission_integration: user_id is 0-indexed, not 1-indexed
- status_code_response_times: sub-ms responses can have min_time of 0

Addresses tag1consulting#675.
- Add `record_average_response_time_per_second_batch()` to GraphData
  that performs a proper weighted MovingAverage merge instead of the
  previous loop-with-truncation approach (which lost precision from
  `avg as u64` truncation)

- Remove dead code in `accumulate_request`: the `!request_metric.success`
  graph error branch and `!request_metric.error.is_empty()` error batch
  branch were unreachable since only successful non-error requests reach
  this method. Add debug_assert! to document the invariant.

- Add 14 unit tests for batch operations in metrics::test:
  - batch_new_has_correct_epoch
  - batch_serialization_round_trip
  - batch_sent_on_channel
  - process_batch_merges_request_timing
  - process_batch_merges_multiple_batches
  - process_batch_handles_individual_requests
  - process_batch_merges_error_counts
  - process_batch_merges_transaction_timing
  - process_batch_merges_scenario_timing
  - process_batch_discards_stale_epoch
  - process_batch_accepts_current_epoch
  - process_batch_produces_same_result_as_individual
  - process_batch_graph_data_with_report_file
  - batch_constants_are_sensible

- Add 1 unit test for graph batch merge in graph::test:
  - test_record_average_response_time_per_second_batch
Remove ErrorBatchEntry, batch.errors, and batch.graph_eps which were
never populated by user-side accumulation code (failed requests route
through accumulate_individual_request instead). Remove the associated
merge loops in process_batch(), the record_errors_per_second_batch()
graph method, and the process_batch_merges_error_counts test.

Fix extra string clone in accumulate_request's graph data path by
moving each counter_key_buf.clone() directly into entry() instead of
cloning an intermediate variable.

Rename batch_request_count to batch_item_count with expanded doc
comment clarifying it tracks request metrics as the size-based flush
trigger.
…ing bug

The transaction and scenario rounding logic used `* 10.0` instead of
`* 100.0` / `* 1000.0` for the 500ms+ ranges, causing percentile
buckets to be ~10x too low (e.g. a 750ms transaction bucketed as 80
instead of 800). The request timing path had the correct formula.

Extract a single `round_metric_time()` helper that replaces all five
copies of the rounding logic across metrics.rs and goose.rs, fixing
the bug and eliminating duplication.

Also: replace `#[allow(dead_code)]` with `#[cfg(test)]` on
test-only `ItemsPerSecond` methods in graph.rs, and add
`cleanup_files()` calls to user_metrics_graph_reset tests.
- Use is_some_and instead of map_or for simplified option check
- Allow field_reassign_with_default in test module (GooseConfiguration
  and GooseMetrics require Default + field overrides in tests)
- Use const blocks for compile-time constant assertions
Merge graph_rps and graph_avg_rt into a single graph_request_data
HashMap keyed by (request_key, second), eliminating a redundant key
clone in accumulate_request and a redundant iteration in process_batch.

Replace contains_key + insert + get_mut().unwrap() with entry() API
in ItemsPerSecond::initialize_or_increment and GraphData's response
time recording methods. Remove the now-unused contains_key method and
mark insert as #[cfg(test)].

Add early return for count == 0 in the batch average response time
merge path.
Remove extra blank line and expand function arguments to satisfy
the formatting rules in Rust 1.94's rustfmt.
…ate mod

- Update batch test code to use `"TestScenario".into()` for Arc<str>
  fields introduced by PR tag1consulting#683
- Remove duplicate `mod common;` in user_metrics_graph_reset test
jeremyandrews force-pushed the feature/per-user-metrics-batching branch from b02dfef to 3517125 on March 10, 2026
…nds checks

- In accumulate_request, replace the unreachable `if !request_metric.update`
  guard with a debug_assert (update requests bypass this method entirely
  via set_success/set_failure → send_request_metric_now)
- Add bounds checks in process_batch for transaction and scenario indexes
  to prevent panics in Gaggle mode with mismatched worker configurations