Skip to content

Conversation

chrisparrinello
Copy link
Contributor

Added es.search.coordinator.phases.query.duration.histogram APM metric to track the duration of the search query phase at the coordinator level..

to track the duration of the search query phase.
@chrisparrinello chrisparrinello requested a review from a team as a code owner October 6, 2025 20:47
@chrisparrinello chrisparrinello requested review from javanna and smalyshev and removed request for a team and javanna October 6, 2025 20:47
@elasticsearchmachine elasticsearchmachine added v9.3.0 needs:triage Requires assignment of a team area label labels Oct 6, 2025
@chrisparrinello chrisparrinello added >enhancement Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch :Search Foundations/Search Catch all for Search Foundations and removed needs:triage Requires assignment of a team area label labels Oct 6, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, thanks!


@Override
protected void doRun(Map<SearchShardIterator, Integer> shardIndexMap) {
phaseStartTimeNanos = System.nanoTime();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be potentially streamlined into the run method in the parent class. We may be able to even report the latency at the coordinator in a generic manner, with all the code in AbstractSearchAsyncAction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My other PR attempted the "generic" path as well. There is some weirdness around the PIT creation and queries that caused a lot of issues in CI but I think I have an idea of where the issue was. I'll try and move this code into the AbstractSearchAsyncAction and that should cover at least DFS and query phases.

I'll have to check if that helps fetch other subsequent phases. The issue with them is that they don't subclass off of AbstractSearchAsyncAction but reference it via a passed in context so those phases might not hit run to set the start time of the phase. I'll have to do some debug tracing.

request.getMaxConcurrentShardRequests(),
clusters
clusters,
coordinatorSearchPhaseAPMMetrics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit surprised that we don't record latency for this. I don't want to confuse you , I don't mean reporting dfs phase latency at the coordinator. What I mean is that DFS query then fetch has an additional DFS roundtrip in the beginning, but after DFS it executes the query phase, yet the codepath is all in SearchDfsQueryThenFetchAction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah its a separate code path if you do a DFS query. It joins the "normal" code path when you get back to the Fetch and subsequent phases. In my original PR, it was reporting a "dfs" and a "dfs_query" phase duration. Not for this PR but for the future one where we record the DFS phase metric, do we want to record a separate DFS roundtrip metric? Also, do we want to differentiate the two code query code paths with two different metrics (DFS and non-DFS query phases) or record both paths with the same query phase metric?

/**
* Coordinator level APM metrics for search phases. Records phase execution times as histograms.
*/
public class CoordinatorSearchPhaseAPMMetrics {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you considered reusing SearchResponseMetrics for this? Perhaps we would need to rename it? Other downsides?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I might have thought at one point SearchResponseMetrics was related to the metrics we were returning as part of the search response (took time, etc.) but on a second look, you're right that this might be a good option as opposed to creating a separate class and having to inject a new object. I'll look into the logistics of moving these metrics to this class. Thanks for pointing that out!

executeNextPhase(getName(), this::getNextPhase);
}

protected void recordPhaseLatency() {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a bit strange to me, this concept being split between base class and specific class. On one hand, we always call recordPhaseLatency now, on the other hand, only one class does anything with it. Is it the case that in the future more classes would override this method? If not, then it may make sense to make that class just override onPhaseDone maybe, do it's own peculiar thing and then call super for the rest of the common thing? Then you don't need to introduce knowledge in this class that is not actually being used by this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the reason why the code is structured this was is because we want to break up recording the metrics for each of the phases into individual PRs to reduce the size of the changes and potential impact. So it only has one class now but will have more in subsequent PRs. @javanna please correct me if I'm misunderstanding the approach you suggested I take on these changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK if there's more to come then it's fine. I'd just add a comment as to what this method is intended for, this is especially important for methods targeted for override where the context may be in a different class so sometimes it's hard to understand whether or not a particular class should override it.

private final Map<String, PendingExecutions> pendingExecutionsPerNode;
private final AtomicBoolean requestCancelled = new AtomicBoolean();
private final int skippedCount;
protected final CoordinatorSearchPhaseAPMMetrics coordinatorSearchPhaseAPMMetrics;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here again I wonder why we have this if only a single child class is using it. Is it the case that more classes are going to use it in the future?

Copy link
Contributor

@smalyshev smalyshev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some code structure comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants