Search query phase coordinator duration APM metric. #136059

chrisparrinello · 2025-10-06T20:47:17Z

Added the es.search.coordinator.phases.query.duration.histogram APM metric to track the duration of the search query phase at the coordinator level.

elasticsearchmachine · 2025-10-07T14:02:51Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

javanna

I left some comments, thanks!

javanna · 2025-10-07T16:57:59Z

server/src/main/java/org/elasticsearch/action/search/SearchQueryThenFetchAsyncAction.java


    @Override
    protected void doRun(Map<SearchShardIterator, Integer> shardIndexMap) {
+        phaseStartTimeNanos = System.nanoTime();


This could potentially be streamlined into the run method in the parent class. We may even be able to report the latency at the coordinator in a generic manner, with all the code in AbstractSearchAsyncAction?

My other PR attempted the "generic" path as well. There is some weirdness around the PIT creation and queries that caused a lot of issues in CI but I think I have an idea of where the issue was. I'll try and move this code into the AbstractSearchAsyncAction and that should cover at least DFS and query phases.

I'll have to check whether that helps fetch and the other subsequent phases. The issue with them is that they don't subclass AbstractSearchAsyncAction but reference it via a passed-in context, so those phases might not hit run to set the start time of the phase. I'll have to do some debug tracing.
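A minimal sketch of the "generic" approach discussed above, assuming the start time is captured in the parent's run method and the duration is recorded when the phase completes. TimedPhaseAction, QueryPhaseAction, and DurationRecorder are hypothetical names invented for illustration; they are not the real Elasticsearch classes, and the injected clock exists only to make the sketch testable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical stand-in for a histogram: it only collects recorded values.
final class DurationRecorder {
    final List<Long> recordedMillis = new ArrayList<>();

    void record(long millis) {
        recordedMillis.add(millis);
    }
}

// Hypothetical base class; it is NOT the real AbstractSearchAsyncAction.
abstract class TimedPhaseAction {
    private final LongSupplier nanoClock;
    private final DurationRecorder recorder;
    private long phaseStartTimeNanos;

    TimedPhaseAction(LongSupplier nanoClock, DurationRecorder recorder) {
        this.nanoClock = nanoClock;
        this.recorder = recorder;
    }

    // The start time is captured once here, so subclasses never touch the clock.
    final void run() {
        phaseStartTimeNanos = nanoClock.getAsLong();
        doRun();
    }

    // Subclasses implement only the phase-specific work.
    protected abstract void doRun();

    // Called when the phase completes; records the elapsed time in milliseconds.
    protected final void onPhaseDone() {
        long elapsedNanos = nanoClock.getAsLong() - phaseStartTimeNanos;
        recorder.record(TimeUnit.NANOSECONDS.toMillis(elapsedNanos));
    }
}

// Hypothetical subclass standing in for the query phase.
final class QueryPhaseAction extends TimedPhaseAction {
    QueryPhaseAction(LongSupplier nanoClock, DurationRecorder recorder) {
        super(nanoClock, recorder);
    }

    @Override
    protected void doRun() {
        // Real work would fan out shard requests; here the phase finishes immediately.
        onPhaseDone();
    }
}
```

In production code the clock would presumably be System::nanoTime. Phases that only hold a reference to the action through a context would still need their own way to set the start time, which is the open question raised in the reply.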

javanna · 2025-10-07T16:59:26Z

server/src/main/java/org/elasticsearch/action/search/SearchDfsQueryThenFetchAsyncAction.java

            request.getMaxConcurrentShardRequests(),
-            clusters
+            clusters,
+            coordinatorSearchPhaseAPMMetrics


I am a bit surprised that we don't record latency for this. I don't want to confuse you: I don't mean reporting DFS phase latency at the coordinator. What I mean is that DFS query then fetch has an additional DFS roundtrip at the beginning, but after DFS it executes the query phase, yet the code path is all in SearchDfsQueryThenFetchAsyncAction.

Yeah, it's a separate code path if you do a DFS query. It joins the "normal" code path when you get back to the fetch and subsequent phases. In my original PR, it was reporting a "dfs" and a "dfs_query" phase duration. Not for this PR, but for the future one where we record the DFS phase metric: do we want to record a separate DFS roundtrip metric? Also, do we want to differentiate the two query code paths with two different metrics (DFS and non-DFS query phases), or record both paths with the same query phase metric?

javanna · 2025-10-07T17:00:16Z

server/src/main/java/org/elasticsearch/index/search/stats/CoordinatorSearchPhaseAPMMetrics.java

+/**
+ * Coordinator level APM metrics for search phases. Records phase execution times as histograms.
+ */
+public class CoordinatorSearchPhaseAPMMetrics {


Have you considered reusing SearchResponseMetrics for this? Perhaps we would need to rename it? Any other downsides?

At one point I assumed SearchResponseMetrics was related to the metrics we return as part of the search response (took time, etc.), but on a second look you're right that it might be a better option than creating a separate class and having to inject a new object. I'll look into the logistics of moving these metrics to this class. Thanks for pointing that out!
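As a rough illustration of that suggestion, the sketch below keeps response-level and phase-level timings in one injected holder instead of two. CoordinatorSearchMetrics and InMemoryHistograms are hypothetical names, the took-time metric name is a placeholder invented for the sketch, and none of this reflects the actual shape of SearchResponseMetrics; only the query phase metric name comes from the PR description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory registry standing in for a real meter registry.
final class InMemoryHistograms {
    private final Map<String, List<Long>> values = new HashMap<>();

    void record(String name, long value) {
        values.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
    }

    List<Long> valuesFor(String name) {
        return values.getOrDefault(name, List.of());
    }
}

// Hypothetical merged holder: a single object to inject for both kinds of timing.
final class CoordinatorSearchMetrics {
    // Metric name taken from the PR description.
    static final String QUERY_PHASE_DURATION = "es.search.coordinator.phases.query.duration.histogram";
    // Placeholder name, invented for this sketch only.
    static final String TOOK_DURATION = "example.search.took.duration.histogram";

    private final InMemoryHistograms histograms;

    CoordinatorSearchMetrics(InMemoryHistograms histograms) {
        this.histograms = histograms;
    }

    // Response-level timing.
    void recordTookTime(long millis) {
        histograms.record(TOOK_DURATION, millis);
    }

    // Phase-level timing measured at the coordinator.
    void recordQueryPhaseDuration(long millis) {
        histograms.record(QUERY_PHASE_DURATION, millis);
    }
}
```

The trade-off is the one raised in the comment: one fewer class to inject, at the cost of a holder whose name may no longer describe everything it records.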

smalyshev · 2025-10-07T17:50:31Z

server/src/main/java/org/elasticsearch/action/search/AbstractSearchAsyncAction.java

        executeNextPhase(getName(), this::getNextPhase);
    }

+    protected void recordPhaseLatency() {}


It feels a bit strange to me that this concept is split between the base class and a specific class. On one hand, we always call recordPhaseLatency now; on the other hand, only one class does anything with it. Will more classes override this method in the future? If not, then it may make sense to have that class just override onPhaseDone, do its own peculiar thing, and then call super for the rest of the common work. Then you don't need to introduce knowledge in this class that is not actually being used by this class.

Part of the reason the code is structured this way is that we want to break up recording the metrics for each of the phases into individual PRs, to reduce the size of the changes and their potential impact. So it only has one class now but will have more in subsequent PRs. @javanna please correct me if I'm misunderstanding the approach you suggested I take on these changes.

OK, if there's more to come then it's fine. I'd just add a comment explaining what this method is intended for. This is especially important for methods meant to be overridden, where the context may live in a different class, so it's sometimes hard to tell whether a particular class should override it.
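The two structures weighed in this thread can be sketched side by side. All class names below are hypothetical and the bodies only log events; the sketch shows the shape of each option, not the actual Elasticsearch code. Variant A is the documented no-op hook in the base class, and variant B keeps the base class unaware of metrics and lets the one interested subclass override onPhaseDone and call super.

```java
import java.util.ArrayList;
import java.util.List;

// Variant A: the base class owns a documented no-op hook.
abstract class HookedPhase {
    final List<String> events = new ArrayList<>();

    /**
     * Extension point for phases that report a coordinator-level duration metric.
     * The default is a no-op; override it only in phases that own such a metric.
     */
    protected void recordPhaseLatency() {}

    final void onPhaseDone() {
        recordPhaseLatency();
        events.add("executeNextPhase");
    }
}

final class HookedQueryPhase extends HookedPhase {
    @Override
    protected void recordPhaseLatency() {
        events.add("recordQueryLatency");
    }
}

// Variant B: the base class knows nothing about metrics.
abstract class PlainPhase {
    final List<String> events = new ArrayList<>();

    void onPhaseDone() {
        events.add("executeNextPhase");
    }
}

final class OverridingQueryPhase extends PlainPhase {
    @Override
    void onPhaseDone() {
        // Phase-specific work first, then the common behaviour from the parent.
        events.add("recordQueryLatency");
        super.onPhaseDone();
    }
}
```

Both variants produce the same sequence of events for the query phase. Variant A scales better once several phases record a latency, which is the stated plan; variant B avoids carrying unused knowledge in the base class if only one phase ever needs it.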

smalyshev · 2025-10-07T17:51:34Z

server/src/main/java/org/elasticsearch/action/search/AbstractSearchAsyncAction.java

    private final Map<String, PendingExecutions> pendingExecutionsPerNode;
    private final AtomicBoolean requestCancelled = new AtomicBoolean();
    private final int skippedCount;
+    protected final CoordinatorSearchPhaseAPMMetrics coordinatorSearchPhaseAPMMetrics;


Here again I wonder why we have this if only a single child class is using it. Is it the case that more classes are going to use it in the future?

smalyshev

Some code structure comments

Added es.search.coordinator.phases.query.duration.histogram APM metric to track the duration of the search query phase.

9239f46

chrisparrinello requested a review from a team as a code owner October 6, 2025 20:47

Merge branch 'main' into query_phase_coordinator_metric

70b59c1

chrisparrinello requested review from javanna and smalyshev and removed request for a team and javanna October 6, 2025 20:47

elasticsearchmachine added the v9.3.0 and needs:triage labels Oct 6, 2025

chrisparrinello requested a review from javanna October 6, 2025 20:48

chrisparrinello added the >enhancement, Team:Search Foundations, and :Search Foundations/Search labels and removed the needs:triage label Oct 6, 2025

Merge branch 'main' into query_phase_coordinator_metric

0ff47ab

chrisparrinello added 2 commits October 7, 2025 09:07

clean up unused code

5973b2c

remove redundant asserts

58d237e

javanna reviewed Oct 7, 2025

smalyshev reviewed Oct 7, 2025

smalyshev reviewed Oct 7, 2025

smalyshev reviewed Oct 7, 2025

PR fixes

182b619
