[Profiling] Ignore events count value #130293
Merged
Background
To display statistically sound data, Universal Profiling queries need to fetch many documents (aka profiling events). This can be more than 100k, even if only 20k documents are needed.
For this, the number of documents allowed in a single response is set to 150k. It also requires increasing the cluster-wide setting `search.max_buckets` to 150k. The alternative, paginating the response, is/was not an option as it increased the query latency unacceptably. The new (and still experimental) `random_sampler` aggregation cannot be used as long as every profiling event document has a weight (aka `count`).

Problem

On serverless, the cluster-wide setting `search.max_buckets` can no longer be changed and defaults to 64k. The option to fetch the data in a paginated way is too slow (up to 15 sequential requests).

Solution
The profiling events were recently switched to nanosecond-precise timestamps.
This made it very unlikely for events to have a count value != 1.
So with ES 9.2, the count value is either dropped or always set to 1 (see also open-telemetry/opentelemetry-collector-contrib#40947).
With all profiling event documents having the same weight, we can leverage the `random_sampler` aggregation to reduce the number of documents to be fetched to a maximum of 20k. This allows using a paginated response (1 additional request/response roundtrip) without massively increasing the total query latency.
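As a rough sketch of where this is headed (index and field names here are illustrative, not the actual Universal Profiling schema), a `random_sampler` aggregation wraps the existing bucketing aggregation and downsamples the matched documents before bucketing:

```json
POST /profiling-events-*/_search
{
  "size": 0,
  "aggs": {
    "sampled": {
      "random_sampler": {
        "probability": 0.1,
        "seed": 42
      },
      "aggs": {
        "group_by": {
          "terms": {
            "field": "Stacktrace.id",
            "size": 20000
          }
        }
      }
    }
  }
}
```

This only produces statistically sound results if every document carries the same weight, which is exactly what the switch to a constant count of 1 guarantees.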
This PR is the first step: it uses the aggregated `doc_count` value instead of aggregated `count` values. In a second PR, we'll switch to using the `random_sampler` aggregation in combination with pagination.
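To illustrate the difference (field names are hypothetical): previously, the per-bucket weight had to be computed with an explicit `sum` sub-aggregation over the `count` field. With every document weighted 1, that sum equals the bucket's built-in `doc_count`, so the sub-aggregation can be dropped:

```json
{
  "aggs": {
    "group_by": {
      "terms": { "field": "Stacktrace.id", "size": 20000 },
      "aggs": {
        "count": { "sum": { "field": "Stacktrace.count" } }
      }
    }
  }
}
```

After this PR, the consumer simply reads `doc_count` from each `terms` bucket instead of the `count` sub-aggregation's value.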