Conversation

Contributor

@JonasKunz JonasKunz commented Jul 14, 2025

Implements exponential histogram merging and quantile estimation as a library.

Uses UDDSketch as the merging algorithm and the OpenTelemetry definition of base-2, perfectly-subsetting exponential histograms as the data structure. See the README included in this PR for more details.
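For context, the OpenTelemetry base-2 exponential histogram defines the bucket base at scale s as 2^(2^-s), with bucket i covering the range (base^i, base^(i+1)]. The sketch below (illustrative only, not this PR's actual code) shows the bucket-index computation; real implementations typically use bit manipulation instead of floating-point logarithms to avoid rounding error at bucket boundaries.

```java
// Sketch (not the PR's code): bucket index of a value in an
// OpenTelemetry-style base-2 exponential histogram.
public class ExponentialBucketSketch {

    /** Returns the index of the bucket containing {@code value} (value > 0) at the given scale. */
    static long computeIndex(double value, int scale) {
        // log_base(value) = log2(value) * 2^scale; bucket i covers (base^i, base^(i+1)]
        double log2 = Math.log(value) / Math.log(2);
        return (long) Math.ceil(Math.scalb(log2, scale)) - 1;
    }

    public static void main(String[] args) {
        // At scale 0 the base is 2: bucket 1 covers (2, 4], so 4.0 maps to index 1.
        System.out.println(computeIndex(4.0, 0)); // 1
        // Raising the scale to 1 (base sqrt(2)) doubles the resolution:
        System.out.println(computeIndex(4.0, 1)); // 3
    }
}
```

The "perfectly subsetting" property follows from this definition: every bucket at scale s is the union of exactly two buckets at scale s + 1, which is what makes lossless downscaling and merging possible.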

  • Have you signed the contributor license agreement?
  • Have you followed the contributor guidelines?
  • If submitting code, have you built your formula locally prior to submission with gradle check?
  • If submitting code, is your pull request against main? Unless there is a good reason otherwise, we prefer pull requests against main and will backport as needed.
  • If submitting code, have you checked that your submission is for an OS and architecture that we support?
  • If you are submitting this code for a class then read our policy for that.

@elasticsearchmachine elasticsearchmachine added v9.2.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Jul 14, 2025
@JonasKunz JonasKunz force-pushed the exponentional-histos branch from 7711093 to a46914a on July 15, 2025
Contributor

@kkrik-es kkrik-es left a comment

LGTM overall. Some parts could be more readable, but histograms are tricky to implement.

I suggest you wait for Mark to take a look as well, unless he can't complete the review this week. Also, @felixbarny should stamp.

Member

@not-napoleon not-napoleon left a comment

At this point, I've reviewed all of the main code but have not yet gone through the tests. It's the end of the day for me, and since I have a lot of comments on the main code, I don't want to hold up this review for another day while I get to the tests.

A couple of general considerations:

Memory Accounting

We need to think about how we're going to track memory for this data structure. In our T-Digest fork, we did this by creating an interface that wraps array-like behavior. This allows us to build an implementation in core Elasticsearch that interacts with the circuit breakers for memory tracking within the library, without needing to expose the circuit-breaker infrastructure. I would like us to do something similar here.

I'm open to adding this in a follow-up, but I don't think we should build the rest of the feature without memory accounting.
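The indirection described above could look roughly like the following hypothetical sketch. All names here are illustrative, not the actual Elasticsearch or T-Digest-fork API; in particular, `Breaker` is a stand-in for the real `CircuitBreaker` interface, whose methods differ.

```java
// Hypothetical: the library allocates its backing arrays through an interface,
// so core Elasticsearch can plug in an implementation that charges a circuit breaker.
interface HistogramArrays {
    long[] newLongArray(int size);
    void close(long[] array);
}

/** Stand-in for the real Elasticsearch CircuitBreaker. */
interface Breaker {
    void addBytes(long bytes); // a real breaker may throw when the memory limit is exceeded
}

/** Plain heap allocation, for standalone library use. */
class PlainHistogramArrays implements HistogramArrays {
    @Override public long[] newLongArray(int size) { return new long[size]; }
    @Override public void close(long[] array) { /* nothing to release; the GC handles it */ }
}

/** Breaker-accounting allocation, as core Elasticsearch might provide it. */
class AccountingHistogramArrays implements HistogramArrays {
    private final Breaker breaker;

    AccountingHistogramArrays(Breaker breaker) { this.breaker = breaker; }

    @Override public long[] newLongArray(int size) {
        breaker.addBytes((long) size * Long.BYTES); // charge before allocating
        return new long[size];
    }

    @Override public void close(long[] array) {
        breaker.addBytes(-(long) array.length * Long.BYTES); // release the charge
    }
}
```

The key design point is that the library itself never sees the circuit-breaker types; it only depends on the small allocation interface.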

Exception Typing

There are many places where this library throws IllegalArgumentException. If these bubble up to the REST error handler, they are returned to the user as 400 errors, which is probably not correct: errors such as values being out of range seem much more likely to be the result of misuse elsewhere in Elasticsearch than of a user sending bad data or queries.

This might be worth a larger discussion.

Clarity of Terminology

The concept of "index" (and "index order") in the sparse representation is quite difficult to follow. We have arrays with consecutive indices that store the bucket-index values, which are not consecutive integers as I understand it (because of the sparse representation). I've left several comments to this effect throughout the review. I think adding some definitions of terms to the README would help, and possibly some additional clarity as to when we are referring to the bucket ID number vs its position in the array of buckets. I expect that we are not going to be touching this code frequently, so I want to make it as easy as possible for people to ramp up on to this in the future.

```java
int getRequiredScaleReductionToReduceBucketCountBy(int desiredCollapsedBucketCount) {
    if (desiredCollapsedBucketCount < 0) {
        throw new IllegalArgumentException("desiredCollapsedBucketCount must be greater than or equal to 0");
```
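For context on what scale reduction does (a sketch, not this PR's implementation): in the OpenTelemetry base-2 format, reducing the scale by one maps bucket index i to i >> 1, merging each pair of adjacent buckets into one, so repeated reduction shrinks the bucket count until it fits a capacity limit.

```java
// Illustrative sketch of the downscaling relation behind scale reduction.
public class ScaleReductionSketch {

    /** Maps a bucket index to its index after reducing the scale by {@code reduction} steps. */
    static long downscaledIndex(long index, int reduction) {
        return index >> reduction; // arithmetic shift also handles negative indices correctly
    }

    public static void main(String[] args) {
        // Buckets 4 and 5 at the current scale collapse into the same bucket (2)
        // after one reduction step:
        System.out.println(downscaledIndex(4, 1)); // 2
        System.out.println(downscaledIndex(5, 1)); // 2
    }
}
```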
Member

Building on what @kkrik-es said above, I am dubious about throwing IllegalArgumentException from so many places in this library. Elasticsearch maps exception classes onto HTTP status codes when replying to requests, and IllegalArgumentException maps to a 400 error, indicating a bad user query. But for an error this deep in the library, it seems unlikely that the user of Elasticsearch can adjust their query to correct for it.

I think asserts (and enough testing to be confident they aren't being tripped) are the right choice here. Errors here are non-actionable at run time and potentially confusing to users.

Contributor Author

IMO, throwing IllegalArgumentException is standard Java practice when validating inputs. Even if we avoid it in our code, even core java.lang methods will raise it if you use them wrongly. So to me, avoiding IllegalArgumentException because it causes 400 errors feels like a code smell.

I think asserts (and enough testing to be confident they aren't being tripped) are the right choice here

Tests can never prevent 100% of bugs; there may always be the 0.1% case that slips through. My problem with assertions is that they are (to my knowledge) not active in production. This means that if an invariant is broken, the code will either (a) explode later with e.g. an ArrayIndexOutOfBoundsException, or (b) return wrong results, which in my opinion is much worse because it can go undetected.
For (a), this makes dealing with production issues much harder: most of the time you have to be able to reproduce the issue with assertions enabled to understand what's happening. For (b), it is close to impossible, because the problem first needs to be detected at all.
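The failure mode described above can be sketched as follows (names hypothetical): Java `assert` statements are compiled in but skipped unless the JVM runs with `-ea`/`-enableassertions`, which is off by default, so a broken invariant that an assert would have caught surfaces later as a less informative error, or not at all.

```java
// Sketch: an assert guards an invariant, but only when the JVM enables assertions.
public class AssertionSketch {

    /** Detects at runtime whether assertions are enabled for this class. */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true; // the assignment side effect only runs with -ea
        return enabled;
    }

    static long bucketCount(long[] counts, int slot) {
        assert slot >= 0 && slot < counts.length : "slot out of range: " + slot;
        // With -ea the assert above fails fast with a clear message. Without it,
        // an out-of-range slot falls through to an ArrayIndexOutOfBoundsException
        // (case (a) above) deeper in the call, with no hint of the broken invariant.
        return counts[slot];
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
    }
}
```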

That said, I'm fine with leaving my opinion aside and following the style and decisions of your team.

I already changed a lot of exceptions to assertions for package-private methods in c63ba90, which I think I pushed too late for it to make it into your review. In 941223a I have now also avoided IllegalArgumentException entirely due to the 400s problem, with one exception:
the quantile computation will throw one if the input is outside of the [0, 1] range, which should be fine IMO.

Other than that, only IllegalStateExceptions are left; they are thrown when there is no other way of continuing the control flow in a meaningful way. I guess those should be fine, as they end up as 5xx errors?

Member

So to me avoiding IllegalArgumentException because it causes 400 errors feels like a code smell.

I don't disagree with that, but changing that behavior in Elasticsearch right now would be very challenging.

I'm open to leaving them in the library and catching them when we use them in Elasticsearch, but we need to remember to do that. The goal is to make sure that 400 errors that make it back to users are actionable for them, and I'm open to multiple ways of achieving that.

Contributor Author

I don't disagree with that, but changing that behavior in Elasticsearch right now would be very challenging.

100%. That's why I have now just removed the IllegalArgumentExceptions. So I assume the remaining IllegalStateExceptions are fine?

@JonasKunz JonasKunz force-pushed the exponentional-histos branch from 337d426 to d34da19 on July 29, 2025
@JonasKunz
Contributor Author

The concept of "index" (and "index order") in the sparse representation is quite difficult to follow. We have arrays with consecutive indices that store the bucket-index values, which are not consecutive integers as I understand it (because of the sparse representation). I've left several comments to this effect throughout the review. I think adding some definitions of terms to the README would help, and possibly some additional clarity as to when we are referring to the bucket ID number vs its position in the array of buckets. I expect that we are not going to be touching this code frequently, so I want to make it as easy as possible for people to ramp up on to this in the future.

@not-napoleon :

I don't think this is worth documenting in the README, as it is an implementation detail of the FixedCapacityExponentialHistogram which does not leak outside of that class. Within the class, I refer to positions in the array as slots, so as not to confuse them with the bucket indices of the exponential histogram definition.

I've reworked the specific javadoc you highlighted to avoid confusion, enhanced the documentation at the top explaining the difference, and double-checked that we don't use "index" wrongly anywhere.

Hope this is good enough?
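The slot-vs-index distinction can be illustrated with a minimal sketch (names hypothetical, not the actual FixedCapacityExponentialHistogram code): slots are consecutive positions in the backing arrays, while the bucket indices stored in them are sparse and generally not consecutive, because empty buckets are simply not stored.

```java
// Sketch: sparse bucket storage with parallel arrays.
// Slot s holds the bucket with index bucketIndices[s] and count counts[s].
public class SparseBucketsSketch {
    private final long[] bucketIndices;
    private final long[] counts;
    private int usedSlots = 0;

    SparseBucketsSketch(int capacity) {
        this.bucketIndices = new long[capacity];
        this.counts = new long[capacity];
    }

    /** Appends a bucket; bucket indices must be added in ascending order. */
    void addBucket(long bucketIndex, long count) {
        assert usedSlots == 0 || bucketIndex > bucketIndices[usedSlots - 1];
        bucketIndices[usedSlots] = bucketIndex;
        counts[usedSlots] = count;
        usedSlots++;
    }

    /** Returns the count for the given bucket index, or 0 if the bucket is empty. */
    long countFor(long bucketIndex) {
        for (int slot = 0; slot < usedSlots; slot++) {
            if (bucketIndices[slot] == bucketIndex) {
                return counts[slot];
            }
        }
        return 0; // not stored means empty
    }
}
```

For example, buckets with indices 3, 7, and 42 would occupy slots 0, 1, and 2 respectively.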

Member

@not-napoleon not-napoleon left a comment

I think this looks good, with the understanding that we can't start using this in production until it's using the circuit breakers for memory protection. Once tests are passing and we're satisfied with the licensing plan, I'm +1 to merge this.

@JonasKunz JonasKunz merged commit 37b6f8f into elastic:main Aug 8, 2025
33 checks passed
@JonasKunz JonasKunz deleted the exponentional-histos branch August 8, 2025 10:32

Labels

external-contributor (Pull request authored by a developer outside the Elasticsearch team) · >non-issue · :StorageEngine/Mapping (The storage related side of mappings) · Team:StorageEngine · v9.2.0
