-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Search query phase coordinator duration APM metric. #136059
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Search query phase coordinator duration APM metric. #136059
Conversation
to track the duration of the search query phase.
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments, thanks!
|
||
@Override | ||
protected void doRun(Map<SearchShardIterator, Integer> shardIndexMap) { | ||
phaseStartTimeNanos = System.nanoTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be potentially streamlined into the run
method in the parent class. We may be able to even report the latency at the coordinator in a generic manner, with all the code in AbstractSearchAsyncAction
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My other PR attempted the "generic" path as well. There is some weirdness around the PIT creation and queries that caused a lot of issues in CI but I think I have an idea of where the issue was. I'll try and move this code into the AbstractSearchAsyncAction
and that should cover at least DFS and query phases.
I'll have to check if that helps fetch other subsequent phases. The issue with them is that they don't subclass off of AbstractSearchAsyncAction
but reference it via a passed in context
so those phases might not hit run
to set the start time of the phase. I'll have to do some debug tracing.
request.getMaxConcurrentShardRequests(), | ||
clusters | ||
clusters, | ||
coordinatorSearchPhaseAPMMetrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit surprised that we don't record latency for this. I don't want to confuse you , I don't mean reporting dfs phase latency at the coordinator. What I mean is that DFS query then fetch has an additional DFS roundtrip in the beginning, but after DFS it executes the query phase, yet the codepath is all in SearchDfsQueryThenFetchAction
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah its a separate code path if you do a DFS query. It joins the "normal" code path when you get back to the Fetch and subsequent phases. In my original PR, it was reporting a "dfs" and a "dfs_query" phase duration. Not for this PR but for the future one where we record the DFS phase metric, do we want to record a separate DFS roundtrip metric? Also, do we want to differentiate the two code query code paths with two different metrics (DFS and non-DFS query phases) or record both paths with the same query phase metric?
/** | ||
* Coordinator level APM metrics for search phases. Records phase execution times as histograms. | ||
*/ | ||
public class CoordinatorSearchPhaseAPMMetrics { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have you considered reusing SearchResponseMetrics for this? Perhaps we would need to rename it? Other downsides?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I might have thought at one point SearchResponseMetrics
was related to the metrics we were returning as part of the search response (took time, etc.) but on a second look, you're right that this might be a good option as opposed to creating a separate class and having to inject a new object. I'll look into the logistics of moving these metrics to this class. Thanks for pointing that out!
...st/java/org/elasticsearch/search/TelemetryMetrics/CoordinatorSearchPhaseAPMMetricsTests.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/search/stats/CoordinatorSearchPhaseAPMMetrics.java
Outdated
Show resolved
Hide resolved
executeNextPhase(getName(), this::getNextPhase); | ||
} | ||
|
||
protected void recordPhaseLatency() {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels a bit strange to me, this concept being split between base class and specific class. On one hand, we always call recordPhaseLatency
now, on the other hand, only one class does anything with it. Is it the case that in the future more classes would override this method? If not, then it may make sense to make that class just override onPhaseDone
maybe, do it's own peculiar thing and then call super
for the rest of the common thing? Then you don't need to introduce knowledge in this class that is not actually being used by this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of the reason why the code is structured this was is because we want to break up recording the metrics for each of the phases into individual PRs to reduce the size of the changes and potential impact. So it only has one class now but will have more in subsequent PRs. @javanna please correct me if I'm misunderstanding the approach you suggested I take on these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK if there's more to come then it's fine. I'd just add a comment as to what this method is intended for, this is especially important for methods targeted for override where the context may be in a different class so sometimes it's hard to understand whether or not a particular class should override it.
private final Map<String, PendingExecutions> pendingExecutionsPerNode; | ||
private final AtomicBoolean requestCancelled = new AtomicBoolean(); | ||
private final int skippedCount; | ||
protected final CoordinatorSearchPhaseAPMMetrics coordinatorSearchPhaseAPMMetrics; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here again I wonder why we have this if only a single child class is using it. Is it the case that more classes are going to use it in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some code structure comments
Added es.search.coordinator.phases.query.duration.histogram APM metric to track the duration of the search query phase at the coordinator level..