@JonasKunz (Contributor) commented Oct 17, 2025

Part of #135625, follow-up to #136075.

Makes ExponentialHistogramState mimic the TDigestState functionality that aggregations use, so that we can build a drop-in replacement, HistogramState, consisting of both. This will then allow us to apply, for example, percentile aggregations to exponential histograms in addition to T-Digests, or to a mix of both.

We also had to implement centroids(), which returns the mean values of the populated histogram buckets. Based on my research, centroids() is only used in the boxplot aggregation to determine the length of the whiskers, and this approach should be fine for that usage.
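To illustrate the idea of deriving centroids from populated buckets, here is a minimal sketch. It is not the code from this PR: the `BucketCentroids` class, the `Centroid` record, and the `lowerBound` helper are hypothetical names, the bucket layout is assumed to be base-2 exponential with `base = 2^(2^-scale)`, and the arithmetic midpoint of a bucket stands in for whatever "mean value" the real implementation computes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the Elasticsearch implementation.
final class BucketCentroids {

    // Hypothetical stand-in for a centroid: a mean value plus the number of samples it represents.
    record Centroid(double mean, long count) {}

    // Lower boundary of positive bucket `index`, assuming base = 2^(2^-scale),
    // i.e. boundary = 2^(index * 2^-scale).
    static double lowerBound(long index, int scale) {
        return Math.pow(2.0, Math.scalb((double) index, -scale));
    }

    // Emits one centroid per populated bucket, using the bucket midpoint as its mean (an assumption).
    static List<Centroid> centroids(long[] indices, long[] counts, int scale) {
        List<Centroid> result = new ArrayList<>(indices.length);
        for (int i = 0; i < indices.length; i++) {
            if (counts[i] == 0) {
                continue; // skip unpopulated buckets
            }
            double lo = lowerBound(indices[i], scale);
            double hi = lowerBound(indices[i] + 1, scale);
            result.add(new Centroid((lo + hi) / 2.0, counts[i]));
        }
        return result;
    }
}
```

With scale 0, bucket 0 spans 1 to 2 and bucket 2 spans 4 to 8, so `BucketCentroids.centroids(new long[] {0, 2}, new long[] {3, 5}, 0)` yields centroids at 1.5 and 6.0 with counts 3 and 5.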

@elasticsearchmachine added the external-contributor and v9.3.0 labels Oct 17, 2025
@JonasKunz marked this pull request as ready for review October 17, 2025 14:22
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine added the Team:Analytics label Oct 17, 2025
     * @return an array of the mean values of the populated histogram buckets with their counts
     */
    public Collection<Centroid> centroids() {
        List<Centroid> centroids = new ArrayList<>();
Member

Would it make sense to pre-allocate the list using centroidCount?
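As a rough sketch of what that suggestion could look like: `ArrayList` accepts an initial capacity, which avoids repeated resizing of the backing array when the final size is known up front. The `PreallocatedList` class and the `expectedSize` parameter are illustrative names only, and whether a centroid count is cheaply available at that point in the PR is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of pre-sizing a list when the element count is known in advance.
final class PreallocatedList {

    // `expectedSize` plays the role of the known centroid count; the name is illustrative.
    static List<Double> collectMeans(double[] bucketMeans, int expectedSize) {
        // The capacity hint sizes the backing array once; the list itself still starts out empty.
        List<Double> means = new ArrayList<>(expectedSize);
        for (double mean : bucketMeans) {
            means.add(mean);
        }
        return means;
    }
}
```

The capacity is only a hint: the list still grows if more elements are added than expected, so an inexact count affects performance but not correctness.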

Comment on lines +170 to +171
// negative buckets are in decreasing order, we want increasing order, therefore reverse
Collections.reverse(centroids);
Member

I'm a bit confused by this.
The following comment led me to believe that you start with the lowest values and go up to the highest ones:

// They store all buckets for the negative range first, with the bucket indices in ascending order,
// followed by all buckets for the positive range, also with their indices in ascending order.
// This means we store the buckets ordered by their boundaries in ascending order (from -INF to +INF).
private final long[] bucketIndices;
private final long[] bucketCounts;

I guess the last sentence in the comment is wrong then? The indices are ascending, but the highest index in the negative range has the lowest value. Did I get that right?
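The ordering question can be checked with a small sketch. It assumes a base-2 exponential bucket layout in which the negative range mirrors the positive one, so that negative bucket `index` has `-(2^(index * 2^-scale))` as its boundary closest to zero. That layout is an assumption here, not something confirmed by this thread, and `NegativeBucketOrder` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; assumes negative buckets mirror the positive base-2 exponential layout.
final class NegativeBucketOrder {

    // Boundary closest to zero of negative bucket `index`, assuming base = 2^(2^-scale).
    static double negativeBucketUpperBound(long index, int scale) {
        return -Math.pow(2.0, Math.scalb((double) index, -scale));
    }

    // Takes negative-range bucket indices in ascending order and returns boundaries in ascending value order.
    static List<Double> ascendingValues(long[] ascendingIndices, int scale) {
        List<Double> bounds = new ArrayList<>(ascendingIndices.length);
        for (long index : ascendingIndices) {
            bounds.add(negativeBucketUpperBound(index, scale));
        }
        // Ascending indices produced descending values, so reverse to order from -INF towards zero.
        Collections.reverse(bounds);
        return bounds;
    }
}
```

Under that assumption, indices 0, 1, 2 at scale 0 map to -1, -2, -4 before the reversal, which matches the reading that ascending indices in the negative range correspond to decreasing values.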

Labels

:Analytics/Aggregations, external-contributor, >non-issue, Team:Analytics, v9.3.0
