Add rank estimation for exponential histograms #135692
Merged
+129 −0
90 changes: 90 additions & 0 deletions
...ial-histogram/src/test/java/org/elasticsearch/exponentialhistogram/RankAccuracyTests.java
/*
 * Copyright Elasticsearch B.V., and/or licensed to Elasticsearch B.V.
 * under one or more license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch B.V. licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 * This file is based on a modification of https://github.com/open-telemetry/opentelemetry-java which is licensed under the Apache 2.0 License.
 */

package org.elasticsearch.exponentialhistogram;

import java.util.Arrays;
import java.util.stream.DoubleStream;

import static org.hamcrest.Matchers.equalTo;

public class RankAccuracyTests extends ExponentialHistogramTestCase {

    public void testRandomDistribution() {
        int numValues = randomIntBetween(10, 10_000);
        double[] values = new double[numValues];

        int valuesGenerated = 0;
        while (valuesGenerated < values.length) {
            double value;
            if (randomDouble() < 0.01) { // 1% chance of exact zero
                value = 0;
            } else {
                value = randomDouble() * 2_000_000 - 1_000_000;
            }
            // Add some duplicates
            for (int i = 0; i < randomIntBetween(1, 10) && valuesGenerated < values.length; i++) {
                values[valuesGenerated++] = value;
            }
        }

        int numBuckets = randomIntBetween(4, 400);
        ExponentialHistogram histo = createAutoReleasedHistogram(numBuckets, values);

        Arrays.sort(values);
        double min = values[0];
        double max = values[values.length - 1];

        double[] valuesRoundedToBucketCenters = DoubleStream.of(values).map(value -> {
            if (value == 0) {
                return 0;
            }
            long index = ExponentialScaleUtils.computeIndex(value, histo.scale());
            double bucketCenter = Math.signum(value) * ExponentialScaleUtils.getPointOfLeastRelativeError(index, histo.scale());
            return Math.clamp(bucketCenter, min, max);
        }).toArray();

        // Test the values at exactly the bucket center for exclusivity correctness
        for (double v : valuesRoundedToBucketCenters) {
            long inclusiveRank = getRank(v, valuesRoundedToBucketCenters, true);
            assertThat(ExponentialHistogramQuantile.estimateRank(histo, v, true), equalTo(inclusiveRank));
            long exclusiveRank = getRank(v, valuesRoundedToBucketCenters, false);
            assertThat(ExponentialHistogramQuantile.estimateRank(histo, v, false), equalTo(exclusiveRank));
        }
        // Test the original values to have values in between bucket centers
        for (double v : values) {
            long inclusiveRank = getRank(v, valuesRoundedToBucketCenters, true);
            assertThat(ExponentialHistogramQuantile.estimateRank(histo, v, true), equalTo(inclusiveRank));
            long exclusiveRank = getRank(v, valuesRoundedToBucketCenters, false);
            assertThat(ExponentialHistogramQuantile.estimateRank(histo, v, false), equalTo(exclusiveRank));
        }
    }

    private static long getRank(double value, double[] sortedValues, boolean inclusive) {
        for (int i = 0; i < sortedValues.length; i++) {
            if (sortedValues[i] > value || (inclusive == false && sortedValues[i] == value)) {
                return i;
            }
        }
        return sortedValues.length;
    }
}
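The test validates ranks against values rounded to bucket centers, because the estimator assumes every value in a bucket sits at that bucket's single representative point. A minimal standalone sketch of that midpoint-rank idea (the bucket boundaries, counts, and class names below are hypothetical illustrations, not the actual ExponentialHistogramQuantile implementation):

```java
import java.util.List;

// Illustrative sketch only: estimate the rank of a value from (midpoint, count)
// pairs, under the assumption that all of a bucket's values equal its midpoint.
// The buckets used here are hypothetical, not real exponential histogram output.
public class MidpointRankSketch {
    record Bucket(double midpoint, long count) {}

    static long estimateRank(List<Bucket> buckets, double value, boolean inclusive) {
        long rank = 0;
        for (Bucket b : buckets) {
            // Inclusive rank counts values <= value; exclusive counts values < value.
            if (b.midpoint() < value || (inclusive && b.midpoint() == value)) {
                rank += b.count();
            }
        }
        return rank;
    }

    public static void main(String[] args) {
        List<Bucket> buckets = List.of(new Bucket(1.5, 3), new Bucket(6.0, 2), new Bucket(24.0, 5));
        System.out.println(estimateRank(buckets, 6.0, true));  // 3 + 2 = 5
        System.out.println(estimateRank(buckets, 6.0, false)); // only midpoints < 6.0 count: 3
    }
}
```

Because whole bucket counts are attributed to a single point, the estimated rank jumps in steps at bucket midpoints rather than varying smoothly, which is exactly what the review discussion below debates.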
For buckets where the value is between the lower and the upper boundary, I'm wondering whether we should add a count proportional to where the value falls into the bucket. It seems like that could increase the accuracy of the estimate.
In other words, we'd increment the rank by `(value - lowerBound) / (upperBound - lowerBound) * count`. We can have an optimized special case for `value > upperBound`, where we increment by `count`.
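The suggested linear interpolation can be sketched as follows (a minimal illustration under the formula in the comment above; the class and parameter names are hypothetical, not the actual implementation):

```java
// Sketch of the suggested interpolation: the share of a bucket's count that is
// attributed to the rank is proportional to where the value falls between the
// bucket's bounds. Names are hypothetical; this is not the production code.
public class InterpolatedRankSketch {
    static double rankContribution(double value, double lowerBound, double upperBound, long count) {
        if (value >= upperBound) {
            return count; // optimized special case: the whole bucket lies below the value
        }
        if (value <= lowerBound) {
            return 0; // nothing in this bucket lies below the value
        }
        return (value - lowerBound) / (upperBound - lowerBound) * count;
    }

    public static void main(String[] args) {
        // A value a quarter of the way into a bucket of count 8 contributes 2.
        System.out.println(rankContribution(2.5, 2.0, 4.0, 8)); // prints 2.0
    }
}
```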
So the algorithm currently assumes that all values in a bucket lie on the point of least relative error, just like for the percentiles algorithm. This ensures that we minimize the relative error of `percentile(rank(someValue) / valueCount)`, meaning that the returned percentile is as close as possible to `someValue`.
If we now change the assumption of how values are distributed in a bucket, I think we'd need to do the same for the percentiles algorithm. While this would smooth the values, it would also increase the worst-case relative error.
Changing this assumption would probably also mean that we should get rid of upscaling in the exponential histogram merging algorithm: the upscaling there happens to make sure that misconfigured SDKs (e.g. a way too low bucket count) don't drag down the accuracy of the overall aggregation. While the upscaling barely moves the point of least relative error of buckets, it greatly reduces their size. So with your proposed change, this can lead to the weird behaviour where the rank of a given value shifts by a large margin before and after merging of histograms.
So I'd propose to stay with the "mathematically most correct" way of assuming that all points in a bucket lie on a single point. In practice, buckets should be small enough that this is not really noticeable.
Good points. Definitely agree that we want to keep the percentile ranks implementation in sync with the percentile implementation. Is there a specific percentiles implementation suggested by OTel that uses midpoints?
Maybe add some commentary on why we're using midpoints rather than interpolation.
There is nothing in OTel, but this implementation is what is used in the `DDSketch` and `UDDSketch` papers, as they provide proofs for the worst-case relative error.
Prometheus does this differently for their native histograms (which are actually exponential histograms):
(Source)
So in other words, they assume an exponential distribution within the bucket (fewer values on the border towards zero, more on the border further away). We could adopt that approach, which means we would have to drop the upscaling and make converting explicit-bucket histograms more expensive and inaccurate.
I also noticed after thinking about it further that what I said above is wrong: it doesn't matter if we return the rank of the first or last element within a bucket; the resulting percentile would be the same with our current algorithm.
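The exponential-within-bucket assumption described above amounts to interpolating linearly in log space rather than in value space. A rough sketch of that idea, simplified to positive buckets (names and simplifications are mine, not Prometheus code):

```java
public class ExponentialInterpolationSketch {
    // Fraction of a positive bucket (lowerBound, upperBound] assumed to lie at
    // or below value, interpolating linearly in log space. This models an
    // exponential distribution of values within the bucket: fewer values near
    // the boundary closer to zero, more near the boundary further away.
    // Simplified sketch for positive buckets only; not the Prometheus source.
    static double fractionBelow(double value, double lowerBound, double upperBound) {
        if (value >= upperBound) {
            return 1.0;
        }
        if (value <= lowerBound) {
            return 0.0;
        }
        return (Math.log(value) - Math.log(lowerBound)) / (Math.log(upperBound) - Math.log(lowerBound));
    }

    public static void main(String[] args) {
        // For a bucket (2, 8], the geometric midpoint 4 sits halfway in log space,
        // so roughly half the bucket's count is attributed below it.
        System.out.println(fractionBelow(4.0, 2.0, 8.0)); // approximately 0.5
    }
}
```

Contrast this with the linear interpolation suggested earlier in the thread, where the arithmetic midpoint of the bucket, 5, would get fraction 0.5.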
I see. I'd say let's leave it as-is for now and add an issue to re-think midpoint vs. interpolation. Should we decide to switch the algorithm, it should be done consistently for both percentile rank and percentile. It's probably also a matter of how strictly we want to be compliant with Prometheus, and whether we actually want to convert explicit-bounds histograms to exponential histograms long-term, or whether we want a dedicated type for them.