-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Send max of two types of max queue latency to ClusterInfo #132675
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Send max of two types of max queue latency to ClusterInfo #132675
Conversation
The TransportNodeUsageStatsForThreadPoolsAction now takes the max latency of any task currently queued in the write thread pool queue AND the previously collected max queue latency of any task dequeued since the last call. This covers the possibility that queue times can rise greatly before being reflected in execution: imagine all the write threads are stalled. Adds additional IT testing to exercise both forms of queue latency, a followup for ES-12316.
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
|
Hi @DiannaHohensee, I've created a changelog YAML for you. |
| * that is returned, which waits for total-write-threads + 1 callers. The caller can release the tasks by calling | ||
| * {@code barrier.await()} or interrupt them with {@code barrier.reset()}. | ||
| */ | ||
| public CyclicBarrier blockDataNodeIndexing(String dataNodeName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stole this from some serverless testing.
| ); | ||
| } | ||
|
|
||
| public static void waitForTimeToElapse(long elapsedMillis) throws InterruptedException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pulling this from another test file, so it can be reused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do not need to reuse this - so prefer to keep it local instead. Ideally we'd not have such unqualified waits, but we can look at that separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed. ee2159a
henningandersen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a number of comments, otherwise looks good.
I wonder if we really need this in a first version, but now it is here, it is fine to get in.
.../org/elasticsearch/common/util/concurrent/TaskExecutionTimeTrackingEsThreadPoolExecutor.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
| ); | ||
| } | ||
|
|
||
| public static void waitForTimeToElapse(long elapsedMillis) throws InterruptedException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do not need to reuse this - so prefer to keep it local instead. Ideally we'd not have such unqualified waits, but we can look at that separately.
…e write thread pool: eliminate replicas requiring sync
DiannaHohensee
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, I've addressed the feedback.
I also realized that the peek() method was returning Nanos, not Millis, so I fixed that (a23c716).
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
| ); | ||
| } | ||
|
|
||
| public static void waitForTimeToElapse(long elapsedMillis) throws InterruptedException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed. ee2159a
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
.../org/elasticsearch/common/util/concurrent/TaskExecutionTimeTrackingEsThreadPoolExecutor.java
Show resolved
Hide resolved
henningandersen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
…must wait for 1ms, not 1ns, to see a change in queue latency
…2675) The TransportNodeUsageStatsForThreadPoolsAction now takes the max latency of any task currently queued in the write thread pool queue AND the previously collected max queue latency of any task dequeued since the last call. This covers the possibility that queue times can rise greatly before being reflected in execution: imagine all the write threads are stalled or have long running tasks. This action feeds a max queue latency stat to the ClusterInfo. Follow up from ES-12233. Adds additional IT testing to exercise both forms of queue latency, a followup for ES-12316. ------------- Completing the follow up testing [Henning requested previously](https://github.com/elastic/elasticsearch/pull/131480/files#r2243610807).
The TransportNodeUsageStatsForThreadPoolsAction now
takes the max latency of any task currently queued
in the write thread pool queue AND the previously
collected max queue latency of any task dequeued
since the last call. This covers the possibility
that queue times can rise greatly before being
reflected in execution: imagine all the write
threads are stalled or have long running tasks.
This action feeds a max queue latency stat to
the ClusterInfo. Follow up from ES-12233.
Adds additional IT testing to exercise both
forms of queue latency, a followup for ES-12316.
Completing the follow up testing Henning requested previously.