Conversation

Contributor

@JonasKunz JonasKunz commented Jul 14, 2025

Implements exponential histogram merging and quantile estimation as a library.

Uses UDDSketch as the merging algorithm and the OpenTelemetry definition of base-2, perfectly-subsetting exponential histograms as the data structure. See the README included in this PR for more details.
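For context, the OpenTelemetry base-2 exponential histogram defines the bucket base at scale s as 2^(2^-s), with bucket i covering the range (base^i, base^(i+1)]. The sketch below (illustrative only, not this PR's actual code) shows the bucket-index computation; real implementations typically use bit manipulation instead of floating-point logarithms to avoid rounding error at bucket boundaries.

```java
// Sketch (not the PR's code): bucket index of a value in an
// OpenTelemetry-style base-2 exponential histogram.
public class ExponentialBucketSketch {

    /** Returns the index of the bucket containing {@code value} (value > 0) at the given scale. */
    static long computeIndex(double value, int scale) {
        // log_base(value) = log2(value) * 2^scale; bucket i covers (base^i, base^(i+1)]
        double log2 = Math.log(value) / Math.log(2);
        return (long) Math.ceil(Math.scalb(log2, scale)) - 1;
    }

    public static void main(String[] args) {
        // At scale 0 the base is 2: bucket 1 covers (2, 4], so 4.0 maps to index 1.
        System.out.println(computeIndex(4.0, 0)); // 1
        // Raising the scale to 1 (base sqrt(2)) doubles the resolution:
        System.out.println(computeIndex(4.0, 1)); // 3
    }
}
```

The "perfectly subsetting" property follows from this definition: every bucket at scale s is the union of exactly two buckets at scale s + 1, which is what makes lossless downscaling and merging possible.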

  • Have you signed the contributor license agreement?
  • Have you followed the contributor guidelines?
  • If submitting code, have you built your formula locally prior to submission with gradle check?
  • If submitting code, is your pull request against main? Unless there is a good reason otherwise, we prefer pull requests against main and will backport as needed.
  • If submitting code, have you checked that your submission is for an OS and architecture that we support?
  • If you are submitting this code for a class then read our policy for that.

@elasticsearchmachine elasticsearchmachine added v9.2.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Jul 14, 2025
@JonasKunz JonasKunz force-pushed the exponentional-histos branch from 7711093 to a46914a on July 15, 2025
Contributor

@kkrik-es kkrik-es left a comment

LGTM overall. Some parts could be more readable, but histograms are tricky to implement.

I suggest you wait for Mark to take a look as well, unless he can't complete the review this week. Also, @felixbarny should stamp.

Member

@not-napoleon not-napoleon left a comment

At this point, I've reviewed all of the main code but have not yet gone through the tests. It's the end of the day for me, and since I have a lot of comments on the main code, I don't want to hold up this review for another day while I get to the tests.

A couple of general considerations:

Memory Accounting

We need to think about how we're going to track memory for this data structure. In our T-Digest fork, we did this by creating an interface that wraps array-like behavior. This allows us to build an implementation in core Elasticsearch that interacts with the circuit breakers for memory tracking within the library, without needing to expose the circuit-breaker infrastructure. I would like us to do something similar here.

I'm open to adding this in a follow-up, but I don't think we should build the rest of the feature without memory accounting.
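The indirection described above could look roughly like the following hypothetical sketch. All names here are illustrative, not the actual Elasticsearch or T-Digest-fork API; in particular, `Breaker` is a stand-in for the real `CircuitBreaker` interface, whose methods differ.

```java
// Hypothetical: the library allocates its backing arrays through an interface,
// so core Elasticsearch can plug in an implementation that charges a circuit breaker.
interface HistogramArrays {
    long[] newLongArray(int size);
    void close(long[] array);
}

/** Stand-in for the real Elasticsearch CircuitBreaker. */
interface Breaker {
    void addBytes(long bytes); // a real breaker may throw when the memory limit is exceeded
}

/** Plain heap allocation, for standalone library use. */
class PlainHistogramArrays implements HistogramArrays {
    @Override public long[] newLongArray(int size) { return new long[size]; }
    @Override public void close(long[] array) { /* nothing to release; the GC handles it */ }
}

/** Breaker-accounting allocation, as core Elasticsearch might provide it. */
class AccountingHistogramArrays implements HistogramArrays {
    private final Breaker breaker;

    AccountingHistogramArrays(Breaker breaker) { this.breaker = breaker; }

    @Override public long[] newLongArray(int size) {
        breaker.addBytes((long) size * Long.BYTES); // charge before allocating
        return new long[size];
    }

    @Override public void close(long[] array) {
        breaker.addBytes(-(long) array.length * Long.BYTES); // release the charge
    }
}
```

The key design point is that the library itself never sees the circuit-breaker types; it only depends on the small allocation interface.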

Exception Typing

There are many places where this library throws IllegalArgumentException. If these bubble up to the REST error handler, they are returned to the user as 400 errors, which is probably not correct: errors such as values being out of range seem much more likely to be the result of misuse elsewhere in Elasticsearch than of a user sending bad data or queries.

This might be worth a larger discussion.

Clarity of Terminology

The concept of "index" (and "index order") in the sparse representation is quite difficult to follow. We have arrays with consecutive indices that store the bucket-index values, which are not consecutive integers as I understand it (because of the sparse representation). I've left several comments to this effect throughout the review. I think adding some definitions of terms to the README would help, and possibly some additional clarity as to when we are referring to the bucket ID number vs its position in the array of buckets. I expect that we are not going to be touching this code frequently, so I want to make it as easy as possible for people to ramp up on to this in the future.

```java
int getRequiredScaleReductionToReduceBucketCountBy(int desiredCollapsedBucketCount) {
    if (desiredCollapsedBucketCount < 0) {
        throw new IllegalArgumentException("desiredCollapsedBucketCount must be greater than or equal to 0");
```
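For context on what scale reduction does (a sketch, not this PR's implementation): in the OpenTelemetry base-2 format, reducing the scale by one maps bucket index i to i >> 1, merging each pair of adjacent buckets into one, so repeated reduction shrinks the bucket count until it fits a capacity limit.

```java
// Illustrative sketch of the downscaling relation behind scale reduction.
public class ScaleReductionSketch {

    /** Maps a bucket index to its index after reducing the scale by {@code reduction} steps. */
    static long downscaledIndex(long index, int reduction) {
        return index >> reduction; // arithmetic shift also handles negative indices correctly
    }

    public static void main(String[] args) {
        // Buckets 4 and 5 at the current scale collapse into the same bucket (2)
        // after one reduction step:
        System.out.println(downscaledIndex(4, 1)); // 2
        System.out.println(downscaledIndex(5, 1)); // 2
    }
}
```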
Member

Building on what @kkrik-es said above, I am dubious about throwing IllegalArgumentException from so many places in this library. Elasticsearch maps exception classes onto HTTP status codes when replying to requests, and IllegalArgumentException maps to a 400 error, indicating a bad user query. But for an error this deep in the library, it seems unlikely that the user of Elasticsearch can adjust their query to correct for it.

I think asserts (and enough testing to be confident they aren't being tripped) are the right choice here. Errors here are non-actionable at run time and potentially confusing to users.

Contributor Author

IMO, throwing IllegalArgumentException is standard Java practice when validating inputs. Even if we avoid it in our code, even core java.lang methods will raise it if you use them wrongly. So to me, avoiding IllegalArgumentException because it causes 400 errors feels like a code smell.

I think asserts (and enough testing to be confident they aren't being tripped) are the right choice here

Tests can never prevent 100% of bugs; there may always be the 0.1% case that slips through. My problem with assertions is that they are (to my knowledge) not active in production. This means that if an invariant is broken, the code will either (a) explode later with e.g. an ArrayIndexOutOfBoundsException, or (b) return wrong results, which in my opinion is much worse because it can go undetected.
For (a), this makes dealing with production issues much harder: most of the time you have to be able to reproduce the issue with assertions enabled to understand what's happening. For (b), it is close to impossible, because the problem first needs to be detected at all.
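The failure mode described above can be sketched as follows (names hypothetical): Java `assert` statements are compiled in but skipped unless the JVM runs with `-ea`/`-enableassertions`, which is off by default, so a broken invariant that an assert would have caught surfaces later as a less informative error, or not at all.

```java
// Sketch: an assert guards an invariant, but only when the JVM enables assertions.
public class AssertionSketch {

    /** Detects at runtime whether assertions are enabled for this class. */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true; // the assignment side effect only runs with -ea
        return enabled;
    }

    static long bucketCount(long[] counts, int slot) {
        assert slot >= 0 && slot < counts.length : "slot out of range: " + slot;
        // With -ea the assert above fails fast with a clear message. Without it,
        // an out-of-range slot falls through to an ArrayIndexOutOfBoundsException
        // (case (a) above) deeper in the call, with no hint of the broken invariant.
        return counts[slot];
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
    }
}
```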

That said, I'm fine with leaving my opinion aside and following the style and decisions of your team.

I already changed a lot of exceptions to assertions for package-private methods in c63ba90, which I think I pushed too late for it to make it into your review. In 941223a I have now also avoided IllegalArgumentException entirely due to the 400s problem, with one exception:
the quantile computation will throw one if the input is outside of the [0, 1] range, which should be fine IMO.

Other than that, only IllegalStateExceptions are left; they are thrown when there is no other way of continuing the control flow in a meaningful way. I guess those should be fine, as they end up as 5xx errors?

Member

So to me avoiding IllegalArgumentException because it causes 400 errors feels like a code smell.

I don't disagree with that, but changing that behavior in Elasticsearch right now would be very challenging.

I'm open to leaving them in the library and catching them when we use them in Elasticsearch, but we need to remember to do that. The goal is to make sure that 400 errors that make it back to users are actionable for them, and I'm open to multiple ways of achieving that.

Contributor Author

I don't disagree with that, but changing that behavior in Elasticsearch right now would be very challenging.

100%. That's why I have now just removed the IllegalArgumentExceptions. So I assume the remaining IllegalStateExceptions are fine?

@JonasKunz JonasKunz force-pushed the exponentional-histos branch from 337d426 to d34da19 on July 29, 2025
@JonasKunz
Contributor Author

The concept of "index" (and "index order") in the sparse representation is quite difficult to follow. We have arrays with consecutive indices that store the bucket-index values, which are not consecutive integers as I understand it (because of the sparse representation). I've left several comments to this effect throughout the review. I think adding some definitions of terms to the README would help, and possibly some additional clarity as to when we are referring to the bucket ID number vs its position in the array of buckets. I expect that we are not going to be touching this code frequently, so I want to make it as easy as possible for people to ramp up on to this in the future.

@not-napoleon :

I don't think this is worth documenting in the README, as it is an implementation detail of the FixedCapacityExponentialHistogram which does not leak outside of that class. Within the class, I refer to positions in the array as slots, so as not to confuse them with the bucket indices of the exponential histogram definition.

I've reworked the specific javadoc you highlighted to avoid confusion, enhanced the documentation at the top explaining the difference, and double-checked that we don't use "index" wrongly anywhere.

Hope this is good enough?
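The slot-vs-index distinction can be illustrated with a minimal sketch (names hypothetical, not the actual FixedCapacityExponentialHistogram code): slots are consecutive positions in the backing arrays, while the bucket indices stored in them are sparse and generally not consecutive, because empty buckets are simply not stored.

```java
// Sketch: sparse bucket storage with parallel arrays.
// Slot s holds the bucket with index bucketIndices[s] and count counts[s].
public class SparseBucketsSketch {
    private final long[] bucketIndices;
    private final long[] counts;
    private int usedSlots = 0;

    SparseBucketsSketch(int capacity) {
        this.bucketIndices = new long[capacity];
        this.counts = new long[capacity];
    }

    /** Appends a bucket; bucket indices must be added in ascending order. */
    void addBucket(long bucketIndex, long count) {
        assert usedSlots == 0 || bucketIndex > bucketIndices[usedSlots - 1];
        bucketIndices[usedSlots] = bucketIndex;
        counts[usedSlots] = count;
        usedSlots++;
    }

    /** Returns the count for the given bucket index, or 0 if the bucket is empty. */
    long countFor(long bucketIndex) {
        for (int slot = 0; slot < usedSlots; slot++) {
            if (bucketIndices[slot] == bucketIndex) {
                return counts[slot];
            }
        }
        return 0; // not stored means empty
    }
}
```

For example, buckets with indices 3, 7, and 42 would occupy slots 0, 1, and 2 respectively.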

Member

@not-napoleon not-napoleon left a comment

I think this looks good, with the understanding that we can't start using this in production until it's using the circuit breakers for memory protection. Once tests are passing and we're satisfied with the licensing plan, I'm +1 to merge this.

@JonasKunz JonasKunz merged commit 37b6f8f into elastic:main Aug 8, 2025
33 checks passed
@JonasKunz JonasKunz deleted the exponentional-histos branch August 8, 2025 10:32

Labels

external-contributor (Pull request authored by a developer outside the Elasticsearch team) · >non-issue · :StorageEngine/Mapping (The storage related side of mappings) · Team:StorageEngine · v9.2.0
