perf: per-user metrics batching to reduce channel pressure#682
Open
jeremyandrews wants to merge 9 commits into tag1consulting:main from
Conversation
…h RPS

Each GooseUser now pre-aggregates metrics locally into batches of up to 100 requests (or 250ms max age) before flushing them over the channel as a single GooseMetric::Batch message. This reduces channel traffic from O(RPS) to O(RPS/batch_size), lowering contention and MetricsProcessor overhead at high request rates.

Key design decisions:
- Successful non-CO requests are pre-aggregated (timing BTreeMaps, status codes, status code timing summaries, graph data)
- Failed requests and CO-affected requests are buffered individually within the batch (they need per-request error logging and CO backfill)
- Epoch-based batch validity ensures metrics resets don't contaminate post-reset data: the processor discards stale batches, and users discard stale local batches before accumulating new metrics

Also fixes two pre-existing test bugs:
- coordinated_omission_integration: user_id is 0-indexed, not 1-indexed
- status_code_response_times: sub-ms responses can have min_time of 0

Addresses tag1consulting#675.
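The size/age flush logic described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `Batch` struct and the `push`/`should_flush` names are hypothetical; only the 100-item and 250ms constants come from the description.

```rust
use std::time::{Duration, Instant};

// Constants from the PR description; everything else here is illustrative.
const BATCH_MAX_ITEMS: usize = 100; // size-based flush trigger
const BATCH_MAX_AGE: Duration = Duration::from_millis(250); // age-based flush trigger

struct Batch {
    items: Vec<u64>, // stand-in for pre-aggregated request metrics
    created: Instant,
}

impl Batch {
    fn new() -> Self {
        Batch {
            items: Vec::new(),
            created: Instant::now(),
        }
    }

    fn push(&mut self, item: u64) {
        self.items.push(item);
    }

    // Flush when either the batch is full or it has aged out.
    fn should_flush(&self) -> bool {
        self.items.len() >= BATCH_MAX_ITEMS || self.created.elapsed() >= BATCH_MAX_AGE
    }
}

fn main() {
    let mut batch = Batch::new();
    for i in 0..100 {
        batch.push(i);
    }
    // Size trigger reached long before the 250ms age trigger.
    assert!(batch.should_flush());
    println!("flushing {} items in one channel send", batch.items.len());
}
```

Either trigger firing results in a single channel send, which is where the O(RPS) to O(RPS/batch_size) reduction comes from.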
- Add `record_average_response_time_per_second_batch()` to GraphData that performs a proper weighted MovingAverage merge instead of the previous loop-with-truncation approach (which lost precision from `avg as u64` truncation)
- Remove dead code in `accumulate_request`: the `!request_metric.success` graph error branch and `!request_metric.error.is_empty()` error batch branch were unreachable since only successful non-error requests reach this method. Add debug_assert! to document the invariant.
- Add 14 unit tests for batch operations in metrics::test:
  - batch_new_has_correct_epoch
  - batch_serialization_round_trip
  - batch_sent_on_channel
  - process_batch_merges_request_timing
  - process_batch_merges_multiple_batches
  - process_batch_handles_individual_requests
  - process_batch_merges_error_counts
  - process_batch_merges_transaction_timing
  - process_batch_merges_scenario_timing
  - process_batch_discards_stale_epoch
  - process_batch_accepts_current_epoch
  - process_batch_produces_same_result_as_individual
  - process_batch_graph_data_with_report_file
  - batch_constants_are_sensible
- Add 1 unit test for graph batch merge in graph::test:
  - test_record_average_response_time_per_second_batch
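The weighted merge the first bullet describes amounts to combining two (average, count) pairs. A standalone sketch, with a hypothetical function name (this is not the actual `MovingAverage::merge()` signature):

```rust
// Count-weighted merge of two running averages. Truncating each average
// to u64 before combining (the old approach) loses the fractional part;
// weighting by sample count preserves it.
fn merge_weighted_avg(a_avg: f64, a_count: u64, b_avg: f64, b_count: u64) -> f64 {
    let total = a_count + b_count;
    if total == 0 {
        return 0.0; // guard against an empty merge
    }
    (a_avg * a_count as f64 + b_avg * b_count as f64) / total as f64
}

fn main() {
    // Two samples averaging 10.0 merged with two samples averaging 40.0.
    assert_eq!(merge_weighted_avg(10.0, 2, 40.0, 2), 25.0);
    // An empty side leaves the other side's average unchanged.
    assert_eq!(merge_weighted_avg(12.5, 4, 0.0, 0), 12.5);
    println!("weighted merge ok");
}
```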
Remove ErrorBatchEntry, batch.errors, and batch.graph_eps, which were never populated by user-side accumulation code (failed requests route through accumulate_individual_request instead). Remove the associated merge loops in process_batch(), the record_errors_per_second_batch() graph method, and the process_batch_merges_error_counts test.

Fix an extra string clone in accumulate_request's graph data path by moving each counter_key_buf.clone() directly into entry() instead of cloning an intermediate variable.

Rename batch_request_count to batch_item_count with an expanded doc comment clarifying that it tracks request metrics as the size-based flush trigger.
…ing bug

The transaction and scenario rounding logic used `* 10.0` instead of `* 100.0` / `* 1000.0` for the 500ms+ ranges, causing percentile buckets to be ~10x too low (e.g. a 750ms transaction bucketed as 80 instead of 800). The request timing path had the correct formula.

Extract a single `round_metric_time()` helper that replaces all five copies of the rounding logic across metrics.rs and goose.rs, fixing the bug and eliminating duplication.

Also:
- replace `#[allow(dead_code)]` with `#[cfg(test)]` on test-only `ItemsPerSecond` methods in graph.rs
- add `cleanup_files()` calls to user_metrics_graph_reset tests
- Use is_some_and instead of map_or for a simplified option check
- Allow field_reassign_with_default in the test module (GooseConfiguration and GooseMetrics require Default + field overrides in tests)
- Use const blocks for compile-time constant assertions
Merge graph_rps and graph_avg_rt into a single graph_request_data HashMap keyed by (request_key, second), eliminating a redundant key clone in accumulate_request and a redundant iteration in process_batch. Replace contains_key + insert + get_mut().unwrap() with entry() API in ItemsPerSecond::initialize_or_increment and GraphData's response time recording methods. Remove the now-unused contains_key method and mark insert as #[cfg(test)]. Add early return for count == 0 in the batch average response time merge path.
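The entry() refactor mentioned above replaces a three-step lookup with a single map operation. A minimal sketch; the function mirrors `ItemsPerSecond::initialize_or_increment` in spirit, but the exact signature is assumed:

```rust
use std::collections::HashMap;

// Before: contains_key + insert + get_mut().unwrap() — three lookups and an unwrap.
// After: a single entry() call that inserts a default and returns a mutable ref.
fn initialize_or_increment(map: &mut HashMap<usize, u32>, second: usize) -> u32 {
    let count = map.entry(second).or_insert(0);
    *count += 1;
    *count
}

fn main() {
    let mut per_second = HashMap::new();
    assert_eq!(initialize_or_increment(&mut per_second, 3), 1); // key created
    assert_eq!(initialize_or_increment(&mut per_second, 3), 2); // key incremented
    println!("{:?}", per_second);
}
```

Besides removing the unwrap, entry() performs one hash lookup instead of two or three, which matters on a per-request hot path.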
Remove extra blank line and expand function arguments to satisfy the formatting rules in Rust 1.94's rustfmt.
…ate mod

- Update batch test code to use `"TestScenario".into()` for Arc<str> fields introduced by PR tag1consulting#683
- Remove duplicate `mod common;` in user_metrics_graph_reset test
Force-pushed from b02dfef to 3517125.
…nds checks

- In accumulate_request, replace the unreachable `if !request_metric.update` guard with a debug_assert (update requests bypass this method entirely via set_success/set_failure → send_request_metric_now)
- Add bounds checks in process_batch for transaction and scenario indexes to prevent panics in Gaggle mode with mismatched worker configurations
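The second bullet amounts to preferring a checked lookup over direct indexing. A hypothetical sketch; the function name and the drop-on-mismatch policy are assumptions, not the PR's code:

```rust
// In Gaggle mode a worker can report a transaction index that the manager's
// aggregate vector doesn't contain; .get_mut() returns None instead of panicking
// the way `totals[index]` would.
fn merge_transaction_time(totals: &mut [u64], index: usize, time: u64) -> bool {
    match totals.get_mut(index) {
        Some(total) => {
            *total += time;
            true
        }
        // Mismatched worker configuration: drop the metric rather than panic.
        None => false,
    }
}

fn main() {
    let mut totals = vec![0u64; 2];
    assert!(merge_transaction_time(&mut totals, 1, 50));
    assert!(!merge_transaction_time(&mut totals, 5, 50)); // out of bounds, no panic
    assert_eq!(totals, vec![0, 50]);
}
```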
Summary
Each GooseUser pre-aggregates metrics locally into batches that are flushed over the channel as a single `GooseMetric::Batch` message, reducing channel traffic from O(RPS) to O(RPS/batch_size).

Design
The implementation adds:
- `GooseMetricBatch` struct with pre-aggregated request, error, transaction, and scenario data
- `GooseMetric::Batch` variant for the metrics channel
- `MetricsProcessor::process_batch()` for merging pre-aggregated data into global aggregates
- Batch accumulation methods on GooseUser (`accumulate_request`, `accumulate_transaction`, `accumulate_scenario`, `flush_batch`, `ensure_batch_current`)
- Batch-aware graph recording (`record_average_response_time_per_second_batch` uses a proper weighted merge via `MovingAverage::merge()`)
- `MetricsEpoch` (Arc) for coordinating metrics resets

Batch size / timing rationale
Bug fix: transaction/scenario time rounding
This PR also fixes a pre-existing bug in `TransactionMetricAggregate::set_time` and `ScenarioMetricAggregate::set_time` where the rounding logic for response times >500ms used incorrect multipliers. This caused transaction and scenario percentile time buckets above 500ms to be ~10x too low (e.g. a 750ms transaction was bucketed at 80 instead of the correct 800). The request timing path (`GooseRequestMetricTimingData::record_time`) did not have this bug.

The fix extracts a single `round_metric_time()` helper that replaces all five copies of the rounding logic across metrics.rs and goose.rs, fixing the bug and eliminating duplication.

Note: This changes the percentile distribution for transaction/scenario times >500ms. Historical data or baselines that relied on the old (broken) rounding will not be directly comparable.
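A sketch of what the corrected helper looks like. The 100/500/1000 ms bucket boundaries are assumptions inferred from the description; only the >500ms multipliers and the 750 → 800 example are stated in the PR:

```rust
// Bucket response times for percentile reporting: exact below ~100ms,
// nearest 10ms up to 500ms, nearest 100ms up to 1000ms, nearest second above.
// The bug was using * 10.0 where the upper ranges need * 100.0 / * 1000.0.
fn round_metric_time(time: u64) -> u64 {
    match time {
        0..=100 => time,
        101..=500 => ((time as f64 / 10.0).round() * 10.0) as u64,
        501..=1000 => ((time as f64 / 100.0).round() * 100.0) as u64,
        _ => ((time as f64 / 1000.0).round() * 1000.0) as u64,
    }
}

fn main() {
    assert_eq!(round_metric_time(750), 800); // previously bucketed at 80
    assert_eq!(round_metric_time(123), 120);
    assert_eq!(round_metric_time(42), 42);
    assert_eq!(round_metric_time(2499), 2000);
    println!("rounding ok");
}
```

With the broken `* 10.0` multiplier, 750 / 100.0 rounds to 8 and then scales to 80 instead of 800, which is exactly the ~10x-too-low bucketing the PR describes.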
Files changed
- src/metrics.rs — Batch types, process_batch() merging, epoch handling, round_metric_time helper, 14 new unit tests
- src/goose.rs — GooseUser batch fields and accumulation/flush methods, dead code cleanup, debug assertions
- src/user.rs — Batch path in record_scenario, batch flush on user shutdown
- src/graph.rs — Batch-aware graph data recording with proper weighted MovingAverage merge, unit test
- src/lib.rs — Shared epoch initialization, batch state on GooseUser
- src/metrics/coordinated_omission.rs — Make update_synthetic_percentage pub(crate)
- tests/coordinated_omission_integration.rs — Fix pre-existing bug: user_id is 0-indexed
- tests/status_code_response_times.rs — Fix pre-existing bug: sub-ms min_time can be 0

Performance
Benchmarked with httpmock (localhost) at 10/50/100/200 users, 5s each, best of 3:
With localhost mock servers the HTTP roundtrip is the bottleneck, not the metrics channel — results are within noise. The batching benefit manifests at scale with real network latency where channel contention becomes the limiting factor.
Load testing at scale
To properly validate the batching benefit, test with real network latency where channel contention is the bottleneck rather than HTTP roundtrips.
Setup
Two machines (VMs or cloud instances), one for goose and one for the target server. The key is having enough network latency (≥1ms) that user threads produce metrics faster than the processor can consume them without batching.
Target server: A simple HTTP server returning a static 200 with ~2ms artificial delay:
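The PR text references such a server without including one. Below is a minimal std-only sketch under that description; none of this code is from the PR, and the ~2ms delay is the only detail taken from the text. The main function serves in a background thread and smoke-tests one round-trip so the example terminates; a real target would run the accept loop forever on a fixed port.

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

// Build a fixed HTTP/1.1 200 response for a small body.
fn http_200(body: &str) -> String {
    format!(
        "HTTP/1.1 200 OK\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        body.len(),
        body
    )
}

fn handle(mut stream: TcpStream) {
    let mut buf = [0u8; 1024];
    let _ = stream.read(&mut buf); // read (and discard) the request
    thread::sleep(Duration::from_millis(2)); // ~2ms artificial delay
    let _ = stream.write_all(http_200("ok").as_bytes());
}

fn main() {
    // Ephemeral port for the self-test; use e.g. 0.0.0.0:8080 for a real run.
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind failed");
    let addr = listener.local_addr().unwrap();

    thread::spawn(move || {
        for stream in listener.incoming().flatten() {
            thread::spawn(move || handle(stream));
        }
    });

    // Smoke-test one round-trip.
    let mut client = TcpStream::connect(addr).expect("connect failed");
    client
        .write_all(b"GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")
        .unwrap();
    let mut response = String::new();
    client.read_to_string(&mut response).unwrap();
    assert!(response.starts_with("HTTP/1.1 200 OK"));
    println!("{}", response.lines().next().unwrap());
}
```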
Alternatively, add latency with tc netem:

```
# On the goose machine, add 2ms latency to the interface facing the target
sudo tc qdisc add dev eth0 root netem delay 2ms
```

Test script
What to measure
- Per-process stats for goose (`pidstat -p $(pgrep goose) 1`)
- `perf stat` or a flamegraph to see channel send/recv overhead reduction

What to vary
Expected outcome
At 200+ users with real network latency, expect a measurable RPS improvement (5–15%) and lower tail latencies. The win grows with user count and endpoint diversity.
Test plan
- `cargo fmt` clean

Addresses #675.