
Commit 86533fd

Improve histogram, summary performance under contention by striping observationCount (#1794)
Was working on improving the performance of opentelemetry-java metrics under high contention, and realized that the same strategy I identified to help over there helps for the Prometheus implementation as well!

The idea here is recognizing that `Buffer.observationCount` is the bottleneck under contention. In contrast to the other histogram / summary `LongAdder` fields, `Buffer.observationCount` is an `AtomicLong`, which performs much worse than `LongAdder` under high contention. It's necessary that the type is `AtomicLong`, because the CAS APIs accommodate the two-way communication that the record / collect paths need to signal that a collection has started and all records have successfully completed (preventing partial writes). However, we can "have our cake and eat it too" by striping `Buffer.observationCount` into many instances, such that the contention on any one instance is reduced. This is actually what `LongAdder` does under the covers. This implementation stripes it into `Runtime.getRuntime().availableProcessors()` instances, and uses `Thread.currentThread().getId() % stripedObservationCounts.length` to select which instance any particular recording thread should use. The performance increase is substantial.
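The striping strategy can be sketched standalone as follows. This is a minimal illustration, not code from this PR: the class name `StripedCounter` and the stripe/thread/iteration counts are chosen for the example.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of counter striping: many AtomicLong stripes so
// concurrent increments rarely contend on the same instance.
public class StripedCounter {
  private final AtomicLong[] stripes;

  public StripedCounter(int nStripes) {
    stripes = new AtomicLong[nStripes];
    for (int i = 0; i < stripes.length; i++) {
      stripes[i] = new AtomicLong(0);
    }
  }

  // Each recording thread increments one stripe, chosen by its thread id,
  // so contention on any single AtomicLong is reduced.
  public void increment() {
    int index = Math.abs((int) Thread.currentThread().getId()) % stripes.length;
    stripes[index].incrementAndGet();
  }

  // The collect path sums all stripes to recover the total count.
  public long sum() {
    long total = 0;
    for (AtomicLong stripe : stripes) {
      total += stripe.get();
    }
    return total;
  }

  public static void main(String[] args) throws InterruptedException {
    StripedCounter counter = new StripedCounter(4);
    Thread[] threads = new Thread[8];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(() -> {
        for (int j = 0; j < 10_000; j++) {
          counter.increment();
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println(counter.sum()); // prints 80000: no increments lost
  }
}
```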
Here's the before and after of `HistogramBenchmark` on my machine (Apple M4 Mac Pro w/ 48gb RAM):

Before:

```
Benchmark                                     Mode  Cnt      Score      Error  Units
HistogramBenchmark.openTelemetryClassic      thrpt   25   1138.465 ±  165.921  ops/s
HistogramBenchmark.openTelemetryExponential  thrpt   25    677.483 ±   28.765  ops/s
HistogramBenchmark.prometheusClassic         thrpt   25   5126.048 ±  153.878  ops/s
HistogramBenchmark.prometheusNative          thrpt   25   3854.323 ±  107.789  ops/s
HistogramBenchmark.simpleclient              thrpt   25  13285.351 ± 1784.506  ops/s
```

After:

```
Benchmark                                     Mode  Cnt      Score      Error  Units
HistogramBenchmark.openTelemetryClassic      thrpt   25    925.528 ±   13.744  ops/s
HistogramBenchmark.openTelemetryExponential  thrpt   25    584.404 ±   32.762  ops/s
HistogramBenchmark.prometheusClassic         thrpt   25  14623.971 ± 2117.588  ops/s
HistogramBenchmark.prometheusNative          thrpt   25   7405.672 ±  857.611  ops/s
HistogramBenchmark.simpleclient              thrpt   25  13102.822 ± 3081.096  ops/s
```

---------

Signed-off-by: Jack Berg <34418638+jack-berg@users.noreply.github.com>
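For context on the `LongAdder` comparison above: `LongAdder` applies the same striping internally, yet `sum()` still recovers an exact total. A minimal demonstration (the class name `LongAdderDemo` is just for this example):

```java
import java.util.concurrent.atomic.LongAdder;

// Demonstrates that LongAdder, which stripes its state internally to
// reduce contention, still yields an exact total via sum().
public class LongAdderDemo {
  public static void main(String[] args) throws InterruptedException {
    LongAdder adder = new LongAdder();
    Thread[] threads = new Thread[8];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(() -> {
        for (int j = 0; j < 10_000; j++) {
          adder.increment();
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println(adder.sum()); // prints 80000
  }
}
```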
1 parent e07e6dd commit 86533fd

File tree

1 file changed: +31, -7 lines

  • prometheus-metrics-core/src/main/java/io/prometheus/metrics/core/metrics

prometheus-metrics-core/src/main/java/io/prometheus/metrics/core/metrics/Buffer.java

Lines changed: 31 additions & 7 deletions
```diff
@@ -18,7 +18,15 @@
 class Buffer {
 
   private static final long bufferActiveBit = 1L << 63;
-  private final AtomicLong observationCount = new AtomicLong(0);
+  // Tracking observation counts requires an AtomicLong for coordination between recording and
+  // collecting. AtomicLong does much worse under contention than the LongAdder instances used
+  // elsewhere to hold aggregated state. To improve, we stripe the AtomicLong into N instances,
+  // where N is the number of available processors. Each record operation chooses the appropriate
+  // instance to use based on the modulo of its thread id and N. This is a more naive / simple
+  // implementation compared to the striping used under the hood in java.util.concurrent classes
+  // like LongAdder - contention and hot spots can still occur if recording thread ids happen to
+  // resolve to the same index. Further improvement is possible.
+  private final AtomicLong[] stripedObservationCounts;
   private double[] observationBuffer = new double[0];
   private int bufferPos = 0;
   private boolean reset = false;
@@ -27,8 +35,17 @@ class Buffer {
   ReentrantLock runLock = new ReentrantLock();
   Condition bufferFilled = appendLock.newCondition();
 
+  Buffer() {
+    stripedObservationCounts = new AtomicLong[Runtime.getRuntime().availableProcessors()];
+    for (int i = 0; i < stripedObservationCounts.length; i++) {
+      stripedObservationCounts[i] = new AtomicLong(0);
+    }
+  }
+
   boolean append(double value) {
-    long count = observationCount.incrementAndGet();
+    int index = Math.abs((int) Thread.currentThread().getId()) % stripedObservationCounts.length;
+    AtomicLong observationCountForThread = stripedObservationCounts[index];
+    long count = observationCountForThread.incrementAndGet();
     if ((count & bufferActiveBit) == 0) {
       return false; // sign bit not set -> buffer not active.
     } else {
@@ -69,7 +86,10 @@ <T extends DataPointSnapshot> T run(
     runLock.lock();
     try {
       // Signal that the buffer is active.
-      Long expectedCount = observationCount.getAndAdd(bufferActiveBit);
+      long expectedCount = 0L;
+      for (AtomicLong observationCount : stripedObservationCounts) {
+        expectedCount += observationCount.getAndAdd(bufferActiveBit);
+      }
 
       while (!complete.apply(expectedCount)) {
         // Wait until all in-flight threads have added their observations to the histogram /
@@ -81,14 +101,18 @@ <T extends DataPointSnapshot> T run(
       result = createResult.get();
 
       // Signal that the buffer is inactive.
-      int expectedBufferSize;
+      long expectedBufferSize = 0;
       if (reset) {
-        expectedBufferSize =
-            (int) ((observationCount.getAndSet(0) & ~bufferActiveBit) - expectedCount);
+        for (AtomicLong observationCount : stripedObservationCounts) {
+          expectedBufferSize += observationCount.getAndSet(0) & ~bufferActiveBit;
+        }
         reset = false;
       } else {
-        expectedBufferSize = (int) (observationCount.addAndGet(bufferActiveBit) - expectedCount);
+        for (AtomicLong observationCount : stripedObservationCounts) {
+          expectedBufferSize += observationCount.addAndGet(bufferActiveBit);
+        }
       }
+      expectedBufferSize -= expectedCount;
 
       appendLock.lock();
       try {
```
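The two-way signalling that forces the counts to stay `AtomicLong` can be seen in isolation. The sketch below (class name `SignBitSketch` is hypothetical) walks through one collect cycle on a single, unstriped counter: bit 63 marks the buffer as active while the low bits keep counting.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the sign-bit signalling in Buffer: getAndAdd(1L << 63) sets
// the sign bit and returns the pre-collection count in one atomic step;
// adding the bit again wraps it back to zero, ending the collection.
public class SignBitSketch {
  private static final long bufferActiveBit = 1L << 63;

  public static void main(String[] args) {
    AtomicLong count = new AtomicLong(0);
    count.addAndGet(5); // five observations recorded before collection

    // Collection starts: sign bit is now set, and we learn how many
    // observations must complete before the snapshot is consistent.
    long expectedCount = count.getAndAdd(bufferActiveBit);
    // expectedCount == 5; (count.get() & bufferActiveBit) != 0, so
    // append() now routes new observations into the buffer.

    count.incrementAndGet(); // one observation lands during collection

    // Collection ends: adding bufferActiveBit again overflows bit 63
    // back to zero; masking it off yields the total observation count.
    long total = count.addAndGet(bufferActiveBit) & ~bufferActiveBit;
    System.out.println(total - expectedCount); // prints 1: one buffered observation
  }
}
```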
