
Conversation


@dashpole dashpole commented Oct 2, 2025

This improves the concurrent performance of the histogram reservoir's Offer function by roughly 4x (a ~76% reduction in time per operation).

This is accomplished by locking each measurement individually, rather than locking around the entire storage. It also defers extracting the trace context from the context.Context until collection time. Offer is on the measure hot path, and exemplars are often overwritten, so deferring that work until Collect reduces the overall work done.

goos: linux
goarch: amd64
pkg: go.opentelemetry.io/otel/sdk/metric/exemplar
cpu: AMD EPYC 7B12
                           │   main.txt   │              hist.txt              │
                           │    sec/op    │   sec/op     vs base               │
FixedSizeReservoirOffer-24    211.4n ± 3%   177.5n ± 3%  -16.04% (p=0.002 n=6)
HistogramReservoirOffer-24   200.85n ± 2%   47.41n ± 2%  -76.40% (p=0.002 n=6)
geomean                       206.1n        91.73n       -55.48%

                           │   main.txt   │              hist.txt              │
                           │     B/op     │    B/op     vs base                │
FixedSizeReservoirOffer-24   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
HistogramReservoirOffer-24   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                 ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                           │   main.txt   │              hist.txt              │
                           │  allocs/op   │ allocs/op   vs base                │
FixedSizeReservoirOffer-24   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
HistogramReservoirOffer-24   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                 ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

I explored using a []atomic.Pointer[measurement], but this had similar performance while being much more complex (needing a sync.Pool to eliminate allocations). The single-threaded performance was also much worse for that solution. See main...dashpole:optimize_histogram_reservoir_old.


codecov bot commented Oct 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.2%. Comparing base (9dea78c) to head (7d0f036).

Additional details and impacted files


@@           Coverage Diff           @@
##            main   #7443     +/-   ##
=======================================
- Coverage   86.2%   86.2%   -0.1%     
=======================================
  Files        295     295             
  Lines      25864   25863      -1     
=======================================
- Hits       22307   22303      -4     
- Misses      3184    3187      +3     
  Partials     373     373             
Files with missing lines Coverage Δ
sdk/metric/exemplar/fixed_size_reservoir.go 97.6% <100.0%> (ø)
sdk/metric/exemplar/histogram_reservoir.go 92.0% <100.0%> (-1.4%) ⬇️
sdk/metric/exemplar/storage.go 100.0% <100.0%> (ø)

... and 2 files with indirect coverage changes


@dashpole dashpole force-pushed the optimize_histogram_reservoir branch 2 times, most recently from 7457c73 to 7c1476f Compare October 4, 2025 03:47
@dashpole dashpole force-pushed the optimize_histogram_reservoir branch from 7c1476f to 7b79e43 Compare October 7, 2025 14:40
dashpole added a commit that referenced this pull request Oct 7, 2025
Forked from this discussion here:
#7443 (comment)

It seems like a good idea for us as a group to align on and document
what we are comfortable with in terms of how ordered measurements are
reflected in collected metric data.

---------

Co-authored-by: Tyler Yahn <[email protected]>
@pellared pellared mentioned this pull request Oct 10, 2025
@bboreham

On further reflection, I fixed the copying issue before running the benchmark, so it is perhaps reasonable that less racy code runs slower.

It would be good if the tests and/or linter detected the issue. I note that NoCopy was removed from atomic.Value here: golang/go#21504.

@dashpole dashpole force-pushed the optimize_histogram_reservoir branch from 433ff16 to e4dfbac Compare October 15, 2025 01:01
@dashpole

I also see slightly worse results, but agree it is definitely better to be correct. I'll work on a test.

@dashpole dashpole force-pushed the optimize_histogram_reservoir branch from e4dfbac to 67df837 Compare October 15, 2025 15:38
@dashpole

I added a ConcurrentSafe test, and verified that it fails (quite spectacularly) with the previous atomic.Value implementation.

@bboreham bboreham left a comment

lgtm

@dashpole

The concurrent-safe test found another race condition around my usage of sync.Pool, which I'm looking into.

@dashpole dashpole force-pushed the optimize_histogram_reservoir branch from 597d23c to 81231b8 Compare October 15, 2025 20:00
@dashpole dashpole force-pushed the optimize_histogram_reservoir branch from 81231b8 to 2c82611 Compare October 15, 2025 20:02
@dashpole
Contributor Author

The other race had to do with my usage of sync.Pool. After Collect loaded an element, that element could be placed into the sync.Pool by a subsequent store() that replaced the measurement, and then modified by another store() that retrieved it from the sync.Pool. I worked out a way to fix this, but it brought performance to around ~45ns. In the end, I decided to just lock around each measurement, since that has the same parallel performance, is much simpler and more readable, and has better single-threaded performance.

@dashpole dashpole requested review from MrAlias and bboreham October 15, 2025 20:08
@dashpole dashpole force-pushed the optimize_histogram_reservoir branch from 2c82611 to 5e17e43 Compare October 15, 2025 20:12
@bboreham bboreham left a comment

Much simpler now.


r.mu.Lock()
defer r.mu.Unlock()
if int(r.count) < cap(r.measurements) {

drive-by: I think this (and all similar code) should use len, not cap.
In the current code they are always the same, but it's a slight jar when reading to wonder what was intended.

@MrAlias MrAlias added this to the v1.39.0 milestone Oct 16, 2025