-
Notifications
You must be signed in to change notification settings - Fork 25.6k
DFS search phase per shard duration APM metric #135652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DFS search phase per shard duration APM metric #135652
Conversation
…hards.phases.query.duration.histogram
|
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
|
Hi @chrisparrinello, I've created a changelog YAML for you. |
| "1" | ||
| ); | ||
| final List<Measurement> dfsMeasurements = getTestTelemetryPlugin().getLongHistogramMeasurement(DFS_SEARCH_PHASE_METRIC); | ||
| assertEquals(num_primaries, dfsMeasurements.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you check that the measurement make some sense? For instance, are they always greater than 0? Are they always lower than the total took time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, they're not always greater than zero because we convert nanoseconds to milliseconds before storing them in the histogram so if something took less than a millisecond, we record a zero. This definitely happens in the unit tests. I took a stab at checking against took time but that I means I need to pull apart all of the asserts to get the SearchResponse, for example:
public void testMetricsDfsQueryThenFetch() {
SearchRequestBuilder requestBuilder = client().prepareSearch(indexName)
.setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
.setQuery(simpleQueryStringQuery("doc1"));
SearchResponse searchResponse = requestBuilder.get();
try {
assertNoFailures(searchResponse);
assertHitCount(searchResponse, 1);
assertSearchHits(searchResponse, "1");
final List<Measurement> dfsMeasurements = getTestTelemetryPlugin().getLongHistogramMeasurement(DFS_SEARCH_PHASE_METRIC);
assertMeasurements(dfsMeasurements, num_primaries, searchResponse.getTook().millis());
final List<Measurement> queryMeasurements = getTestTelemetryPlugin().getLongHistogramMeasurement(QUERY_SEARCH_PHASE_METRIC);
assertEquals(num_primaries, queryMeasurements.size());
final List<Measurement> fetchMeasurements = getTestTelemetryPlugin().getLongHistogramMeasurement(FETCH_SEARCH_PHASE_METRIC);
assertEquals(1, fetchMeasurements.size());
assertAttributes(fetchMeasurements, false, false);
} finally {
searchResponse.decRef();
}
}
where assertMeasurements checks to make sure the measurements are less than or equal to the took time from the response and we have the right number of measurements. Let me know if you want to take this approach and I'll modify all of the tests to make sure we're asserting valid measurements.
About the nanoseconds getting converted to 0 milliseconds, one thought I had was to change the units from milliseconds to microseconds or nanoseconds but the issue is that the underlying OpenTelemetry implementation of the histogram buckets the measurements before reporting to the APM server and there is an upper bound to the buckets (something like 110k) so if you choose the wrong scale you lose precision for measurements greater than 110k. There is a way to control the bucketing but it is deep deep in the OpenTelemetry meter code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, sorry, took time is at the coord level, it's not possible to get it here. I see! and thanks for the explanation about the rounding. And for checking further about precision. I think we are good here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
For https://elasticco.atlassian.net/browse/ES-12391, splitting DFS metrics from #135285 per @javanna 's suggestion.