-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add compute functions to ExponentialHistogramState #136749
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pinging @elastic/es-analytical-engine (Team:Analytics) |
...r/src/main/java/org/elasticsearch/search/aggregations/metrics/ExponentialHistogramState.java
Outdated
Show resolved
Hide resolved
...r/src/main/java/org/elasticsearch/search/aggregations/metrics/ExponentialHistogramState.java
Show resolved
Hide resolved
public double cdf(double x) { | ||
ExponentialHistogram histogram = histogram(); | ||
long numValuesLess = ExponentialHistogramQuantile.estimateRank(histogram, x, false); | ||
long numValuesLessOrEqual = ExponentialHistogramQuantile.estimateRank(histogram, x, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed, these don't check if the value is outside min/max? You should do that separately, return 0 for < min and 1 for > max.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correctness wise, we already do this here.
So you are suggesting to do this for performance reasons?
I'm not sure if this is worth it, because it increases the complexity: We currently clamp the bucket POLRE to the histogram min / max. So if the POLRE was clamped, we need to return different values based on whether the requested rank was inclusive or exclusive equal values.
I don't think that this is a common enough case to justify this extra code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this increase the complexity? Isn't this a simple condition to return early, that can be added at the top of ExponentialHistogramQuantile.estimateRank
? It should be uncontroversial, in terms of correctness, and more efficient.
Btw, this is unrelated to this pr so not a blocker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are totally right, I had this messed up in my head. I confused the < min()
precondition with <= min()
. The latter would increase complexity due to having to account for inclusivity, the first one is trivial.
Part of #135625 , follow up of #136075.
Makes
ExponentialHistogramState
mimic theTDigestState
provided functionality used by aggregations, so that we can build the drop-in replacementHistogramState
consisting of both. This will then allow us to apply e.g. percentile aggregation on exponential histograms in addition to T-Digests and a mix of both.We also had to implement a
centroids()
functionality, which is implemented by returning the mean values of the populated histogram buckets. Based on my research,centroids()
is only used in the boxplot aggregation in order to define the length of the whiskers, where this usage should be fine.