Add rank estimation for exponential histograms #135692
Conversation
Pinging @elastic/es-storage-engine (Team:StorageEngine)
```java
double bucketMidpoint = ExponentialScaleUtils.getPointOfLeastRelativeError(buckets.peekIndex(), buckets.scale());
bucketMidpoint = Math.min(bucketMidpoint, maxValue);
if (bucketMidpoint < value || (inclusive && bucketMidpoint == value)) {
    // all values of this bucket are assumed to lie on its midpoint, so the
    // entire bucket count falls below (or equals) the queried value
    rank += buckets.peekCount();
    buckets.advance();
```
For buckets where the value is between the lower and the upper boundary, I'm wondering whether we should add a count proportional to where the value falls within the bucket. It seems like that could increase the accuracy of the estimate.
In other words, we'd increment the rank by `(value - lowerBound) / (upperBound - lowerBound) * count`. We can have an optimized special case for `value > upperBound`, where we increment by `count`.
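A minimal sketch of what that could look like (the method and variable names here are hypothetical illustrations, not code from this PR):

```java
// Hypothetical sketch of linear interpolation within a bucket: assumes
// values are uniformly distributed between lowerBound and upperBound.
static double interpolatedRankIncrement(double value, double lowerBound, double upperBound, long count) {
    if (value >= upperBound) {
        return count; // optimized special case: the whole bucket lies below the value
    }
    if (value <= lowerBound) {
        return 0; // the whole bucket lies above the value
    }
    // fraction of the bucket's population assumed to be below the value
    double fraction = (value - lowerBound) / (upperBound - lowerBound);
    return fraction * count;
}
```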
So the algorithm currently assumes that all values in a bucket lie on the point of least relative error, just like the percentiles algorithm does. This ensures that we minimize the relative error of `percentile(rank(someValue) / valueCount)`, meaning that the returned percentile is as close as possible to `someValue`.
If we now change the assumption of how values are distributed within a bucket, I think we'd need to do the same for the percentiles algorithm. While this would smooth the values, it would also increase the worst-case relative error.
Changing this assumption would probably also mean that we should get rid of upscaling in the exponential histogram merging algorithm: the upscaling there happens to make sure that misconfigured SDKs (e.g. with a way too low bucket count) don't drag down the accuracy of the overall aggregation.
While the upscaling barely moves the point of least relative error of buckets, it greatly reduces their size. So with your proposed change, this could lead to the weird behaviour where the rank of a given value shifts by a large margin before and after merging of histograms.
So I'd propose to stay with the "mathematically most correct" way of assuming that all points in a bucket lie on a single point. In practice, buckets should be small enough that this is not really noticeable.
Good points. I definitely agree that we want to keep the percentile ranks implementation in sync with the percentile implementation. Is there a specific percentiles implementation suggested by OTel that uses midpoints?
Maybe add some commentary on why we're using midpoints rather than interpolation.
There is nothing in OTel, but this implementation is what is used in the DDSketch and UDDSketch papers, as they provide proofs for the worst-case relative error.
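For reference, here is a short sketch of where that worst-case bound comes from (my own summary of the DDSketch-style argument, not taken from this PR): for a bucket $[l, u]$, the estimate $m$ minimizing the maximum relative error is the one that equalizes the error at both boundaries.

```latex
\[
  \frac{m - l}{l} = \frac{u - m}{u}
  \;\Longrightarrow\;
  m = \frac{2lu}{l + u},
  \qquad
  \max_{x \in [l,\,u]} \frac{|m - x|}{x}
  = \frac{u - l}{u + l}
  = \frac{b - 1}{b + 1}
  \quad (\text{with bucket growth factor } u = b\,l).
\]
```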
Prometheus does this differently for their native histograms (which are actually exponential histograms):

> The worst case is an estimation at one end of a bucket where the actual value is at the other end of the bucket. Therefore, the maximum possible error is the whole width of a bucket. Not doing any interpolation and using some fixed midpoint within a bucket (for example the arithmetic mean or even the harmonic mean) would minimize the maximum possible error (which would then be half of the bucket width in case of the arithmetic mean), but in practice, the linear interpolation yields an error that is lower on average. Since the interpolation has worked well over many years of classic histogram usage, interpolation is also applied for native histograms.
>
> Therefore, PromQL uses exponential extrapolation for the standard schemas, which models the assumption that dividing a bucket into two when increasing the schema number by one (i.e. doubling the resolution) will on average see similar populations in both new buckets. A more detailed explanation can be found in the PR implementing the interpolation method.

(Source)
So in other words, they assume an exponential distribution within the bucket (fewer values near the bucket boundary closer to zero, more near the boundary further away). We could adopt that approach, which means we would have to drop the upscaling and make converting explicit-bucket histograms more expensive and inaccurate.
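A hedged sketch of what that assumption translates to (again with hypothetical names, not code from this PR): the fraction of the bucket below the value falls out of a logarithmic rather than a linear interpolation.

```java
// Hypothetical sketch of Prometheus-style exponential interpolation: assumes
// that doubling the resolution splits a bucket's population roughly evenly,
// i.e. values are exponentially distributed within [lowerBound, upperBound].
static double exponentialRankIncrement(double value, double lowerBound, double upperBound, long count) {
    if (value >= upperBound) {
        return count;
    }
    if (value <= lowerBound) {
        return 0;
    }
    // fraction of the bucket below the value, measured on a log scale;
    // e.g. the geometric midpoint sqrt(lowerBound * upperBound) yields 0.5
    double fraction = Math.log(value / lowerBound) / Math.log(upperBound / lowerBound);
    return fraction * count;
}
```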
I also noticed, after thinking about it further, that what I said above is wrong:

> we minimize the relative error of `percentile(rank(someValue) / valueCount)`, meaning that the returned percentile is as close as possible to `someValue`.

> If we now change the assumption of how values are distributed in a bucket, I think we'd need to do the same for the percentiles algorithm.

It doesn't matter whether we return the rank of the first or the last element within a bucket: with our current algorithm the resulting percentile would be the same, since any rank falling within a bucket maps back to that bucket's point of least relative error.
I see. I'd say let's leave it as-is for now and add an issue to re-think midpoint vs. interpolation. Should we decide to switch the algorithm, it should be done consistently for both percentile rank and percentile. It's probably also a matter of how strictly we want to be compliant with Prometheus, and whether we actually want to convert explicit-bounds histograms to exponential histograms long-term, or whether we want a dedicated type for them.
Adds a utility function for estimating the number of values less than (or optionally less than or equal to) a given value in an exponential histogram. Just like the percentile algorithm, this algorithm assumes that all values in a bucket lie on the point of least relative error within that bucket.
This is required for #135625 in order to implement the `range` and `percentile_ranks` aggregations. I ran the included randomized test on repeat locally to ensure it works correctly.
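For illustration, here is a hedged sketch of how such a rank estimate could back a `percentile_ranks`-style computation; `estimateRank` and `valueCount` below are hypothetical stand-ins, not necessarily the API added in this PR:

```java
// Hypothetical usage sketch: turning a rank estimate into a percentile rank.
// `estimateRank(histogram, value, inclusive)` stands in for the utility added
// in this PR; `valueCount()` stands in for a total-count accessor.
static double percentileRank(ExponentialHistogram histogram, double value) {
    long total = histogram.valueCount();
    if (total == 0) {
        return Double.NaN; // no data, so the percentile rank is undefined
    }
    long rank = estimateRank(histogram, value, true); // count of values <= value
    return 100.0 * rank / total;
}
```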