@chrisparrinello
Contributor

For https://elasticco.atlassian.net/browse/ES-12391, splitting DFS metrics from #135285 per @javanna's suggestion.

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.2.0 labels Sep 29, 2025
@chrisparrinello chrisparrinello added >enhancement :Search Foundations/Search Catch all for Search Foundations and removed needs:triage Requires assignment of a team area label v9.2.0 labels Sep 29, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Sep 29, 2025
@elasticsearchmachine
Collaborator

Hi @chrisparrinello, I've created a changelog YAML for you.

@chrisparrinello
Contributor Author

@javanna I implemented your suggestion to not pull out any of the search execution attributes for the DFS phase metrics, as we talked about on #135285.

"1"
);
final List<Measurement> dfsMeasurements = getTestTelemetryPlugin().getLongHistogramMeasurement(DFS_SEARCH_PHASE_METRIC);
assertEquals(num_primaries, dfsMeasurements.size());
Member

Can you check that the measurements make sense? For instance, are they always greater than 0? Are they always lower than the total took time?

Contributor Author

@chrisparrinello chrisparrinello Sep 30, 2025

Unfortunately, they're not always greater than zero, because we convert nanoseconds to milliseconds before storing them in the histogram, so if something took less than a millisecond we record a zero. This definitely happens in the unit tests. I took a stab at checking against took time, but that means I need to pull apart all of the asserts to get the SearchResponse, for example:

public void testMetricsDfsQueryThenFetch() {
    SearchRequestBuilder requestBuilder = client().prepareSearch(indexName)
        .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
        .setQuery(simpleQueryStringQuery("doc1"));
    SearchResponse searchResponse = requestBuilder.get();
    try {
        assertNoFailures(searchResponse);
        assertHitCount(searchResponse, 1);
        assertSearchHits(searchResponse, "1");
        final List<Measurement> dfsMeasurements = getTestTelemetryPlugin().getLongHistogramMeasurement(DFS_SEARCH_PHASE_METRIC);
        assertMeasurements(dfsMeasurements, num_primaries, searchResponse.getTook().millis());
        final List<Measurement> queryMeasurements = getTestTelemetryPlugin().getLongHistogramMeasurement(QUERY_SEARCH_PHASE_METRIC);
        assertEquals(num_primaries, queryMeasurements.size());
        final List<Measurement> fetchMeasurements = getTestTelemetryPlugin().getLongHistogramMeasurement(FETCH_SEARCH_PHASE_METRIC);
        assertEquals(1, fetchMeasurements.size());
        assertAttributes(fetchMeasurements, false, false);
    } finally {
        searchResponse.decRef();
    }
}

where assertMeasurements checks that we have the right number of measurements and that each one is less than or equal to the took time from the response. Let me know if you want to take this approach and I'll modify all of the tests to make sure we're asserting valid measurements.
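For illustration only, a helper along these lines might look like the sketch below. It is not the code from this PR: the class name is hypothetical, and it takes plain millisecond values rather than the telemetry plugin's Measurement objects, so it compiles on its own. Zero is accepted as a valid value because sub-millisecond phases are recorded as 0.

```java
import java.util.List;

// Hypothetical stand-in for the assertMeasurements helper described above.
// Assumption: each recorded histogram value is available as a long in milliseconds.
final class MeasurementChecks {

    static void assertMeasurements(List<Long> measuredMillis, int expectedCount, long tookMillis) {
        // One measurement is expected per shard-level phase execution.
        if (measuredMillis.size() != expectedCount) {
            throw new AssertionError("expected " + expectedCount + " measurements but got " + measuredMillis.size());
        }
        for (long value : measuredMillis) {
            // Zero is allowed: durations under one millisecond are recorded as 0.
            if (value < 0) {
                throw new AssertionError("negative measurement: " + value);
            }
            // A single phase should not take longer than the whole request.
            if (value > tookMillis) {
                throw new AssertionError("measurement " + value + " ms exceeds took time " + tookMillis + " ms");
            }
        }
    }

    public static void main(String[] args) {
        // Three shard-level measurements checked against a took time of 7 ms.
        assertMeasurements(List.of(0L, 3L, 7L), 3, 7L);
        System.out.println("measurements are consistent with took time");
    }
}
```

A test would pass the values extracted from each Measurement together with searchResponse.getTook().millis() as the upper bound.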

About the nanoseconds getting converted to 0 milliseconds: one thought I had was to change the units from milliseconds to microseconds or nanoseconds. The issue is that the underlying OpenTelemetry implementation of the histogram buckets the measurements before reporting to the APM server, and there is an upper bound to the buckets (something like 110k), so if you choose the wrong scale you lose precision for measurements greater than 110k. There is a way to control the bucketing, but it is buried deep in the OpenTelemetry meter code.
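To make the trade-off concrete, here is a small standalone sketch of the arithmetic. The class and method names are hypothetical, and the bucket upper bound of 110,000 is only the approximate figure quoted above, not a value taken from the OpenTelemetry code.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the unit trade-off; not code from this PR.
final class PhaseDurationRounding {

    // Assumption: approximate bucket upper bound quoted in the comment above ("something like 110k").
    static final long ASSUMED_BUCKET_UPPER_BOUND = 110_000L;

    // Converting nanoseconds to whole milliseconds truncates, so sub-millisecond durations become 0.
    static long toRecordedMillis(long elapsedNanos) {
        return TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
    }

    // Longest duration, in nanoseconds, that stays at or below the assumed bound when recorded in the given unit.
    static long maxRepresentableNanos(TimeUnit recordingUnit) {
        return recordingUnit.toNanos(ASSUMED_BUCKET_UPPER_BOUND);
    }

    public static void main(String[] args) {
        long[] samples = { 250_000L, 999_999L, 1_000_000L, 2_500_000L };
        for (long nanos : samples) {
            System.out.println(nanos + " ns is recorded as " + toRecordedMillis(nanos) + " ms");
        }
        for (TimeUnit unit : new TimeUnit[] { TimeUnit.MILLISECONDS, TimeUnit.MICROSECONDS, TimeUnit.NANOSECONDS }) {
            System.out.println(unit + " covers durations up to " + maxRepresentableNanos(unit) + " ns");
        }
    }
}
```

Under that assumed bound, recording in milliseconds covers durations up to 110 seconds, microseconds up to 110 milliseconds, and nanoseconds only up to 0.11 milliseconds, which is why a finer unit fixes the zero readings at the cost of precision for slower phases.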

Member

Right, sorry, took time is at the coordinator level, so it's not possible to get it here. I see! Thanks for the explanation about the rounding, and for checking further about precision. I think we are good here!

Member

@javanna javanna left a comment

LGTM

@chrisparrinello chrisparrinello merged commit cb2907a into elastic:main Sep 30, 2025
33 checks passed

Labels

>enhancement :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.2.0

3 participants