
Conversation


@DiannaHohensee commented Aug 11, 2025

The TransportNodeUsageStatsForThreadPoolsAction now takes the max latency of any task currently queued in the write thread pool queue AND the previously collected max queue latency of any task dequeued since the last call. This covers the possibility that queue times can rise greatly before being reflected in execution: imagine all the write threads are stalled or have long-running tasks. This action feeds a max queue latency stat to the ClusterInfo. Follow-up from ES-12233.

Adds additional IT testing to exercise both forms of queue latency, a follow-up for ES-12316.


Completing the [follow-up testing Henning requested previously](https://github.com/elastic/elasticsearch/pull/131480/files#r2243610807).
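
For readers skimming the description, a minimal sketch of the idea, using illustrative names rather than the actual Elasticsearch classes: the reported stat is the max of (a) the age of the oldest task still sitting in the write queue and (b) the largest queue latency seen for tasks dequeued since the previous poll, so a pool whose threads are all stalled still surfaces rising latency.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; names and structure are assumptions, not the real classes.
class MaxQueueLatencySketch {
    // Max queue latency of tasks dequeued since the last poll; reset when read.
    private final AtomicLong maxDequeuedLatencyMillis = new AtomicLong();

    // Called by a worker thread when it pulls a task off the write queue.
    void onTaskDequeued(long queuedForMillis) {
        maxDequeuedLatencyMillis.accumulateAndGet(queuedForMillis, Math::max);
    }

    // Called when usage stats are collected for ClusterInfo.
    long pollMaxQueueLatencyMillis(long oldestQueuedTaskAgeMillis) {
        long dequeuedMax = maxDequeuedLatencyMillis.getAndSet(0);
        // Max of the front-of-queue age and the dequeued max: if all write threads
        // are stalled and nothing is being dequeued, the first term keeps growing.
        return Math.max(oldestQueuedTaskAgeMillis, dequeuedMax);
    }
}
```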

@DiannaHohensee self-assigned this Aug 11, 2025
@DiannaHohensee requested a review from a team as a code owner August 11, 2025 17:23
@DiannaHohensee added the >enhancement, :Distributed Coordination/Allocation, Team:Distributed Coordination, and v9.2.0 labels Aug 11, 2025
@elasticsearchmachine commented:

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine commented:

Hi @DiannaHohensee, I've created a changelog YAML for you.

* that is returned, which waits for total-write-threads + 1 callers. The caller can release the tasks by calling
* {@code barrier.await()} or interrupt them with {@code barrier.reset()}.
*/
public CyclicBarrier blockDataNodeIndexing(String dataNodeName) {
@DiannaHohensee (PR author) commented:

I stole this from some serverless testing.
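
As context for the javadoc above, a minimal sketch of the pattern it describes (assumed names and structure, not the actual serverless test utility): each write thread gets a task that parks on a shared barrier sized at write-threads + 1, so the test thread is the final party that either releases them via await() or breaks them out via reset().

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;

// Illustrative only; not the real blockDataNodeIndexing implementation.
final class BlockWritersSketch {
    static CyclicBarrier blockAllWriteThreads(ExecutorService writeExecutor, int writeThreadCount) {
        // writeThreadCount parties park inside the pool, plus one party for the test thread.
        CyclicBarrier barrier = new CyclicBarrier(writeThreadCount + 1);
        for (int i = 0; i < writeThreadCount; i++) {
            writeExecutor.submit(() -> {
                try {
                    barrier.await(); // parked until the test calls await(), or reset() breaks the barrier
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (BrokenBarrierException e) {
                    // the test called reset() to release the workers without running anything further
                }
            });
        }
        return barrier;
    }
}
```

While the barrier holds every write thread, any further indexing requests have to wait in the write queue, which is exactly the condition the queue-latency stat is meant to expose.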

);
}

public static void waitForTimeToElapse(long elapsedMillis) throws InterruptedException {
@DiannaHohensee (PR author) commented:

Pulling this from another test file, so it can be reused.

@henningandersen commented:

I think we do not need to reuse this - so prefer to keep it local instead. Ideally we'd not have such unqualified waits, but we can look at that separately.

@DiannaHohensee (PR author) commented:

Removed. ee2159a
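
The helper's body is not shown in this thread; purely for context, a minimal sketch of what a wall-clock wait with this signature typically looks like (an assumption, not the code that was removed in ee2159a): sleep until at least elapsedMillis of real time has passed, re-checking because sleep can return before the full duration has elapsed.

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; not the removed helper.
final class WaitSketch {
    public static void waitForTimeToElapse(long elapsedMillis) throws InterruptedException {
        final long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(elapsedMillis);
        long remainingNanos;
        // Loop until the deadline is reached, since sleep may wake slightly early.
        while ((remainingNanos = deadlineNanos - System.nanoTime()) > 0) {
            TimeUnit.NANOSECONDS.sleep(remainingNanos);
        }
    }
}
```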

@DiannaHohensee changed the title from "Add second max queue latency stat to ClusterInfo" to "Send max of two types of max queue latency to ClusterInfo" Aug 11, 2025
@henningandersen left a comment:

Left a number of comments, otherwise looks good.

I wonder if we really need this in a first version, but now that it is here, it is fine to get in.


@DiannaHohensee (PR author) left a comment:

Thanks for the review, I've addressed the feedback.

I also realized that the peek() method was returning Nanos, not Millis, so I fixed that (a23c716).
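
For context on the nanos-vs-millis fix (illustrative only; the actual peek() touched in a23c716 is not shown here): queue-insertion timestamps taken with System.nanoTime() have to be converted before the value is reported alongside millisecond stats.

```java
import java.util.concurrent.TimeUnit;

// Illustrative only; not the actual peek() method.
final class PeekSketch {
    // Age of the oldest queued task, reported in milliseconds.
    static long peekOldestTaskQueueLatencyMillis(long oldestTaskInsertionNanoTime) {
        long queuedForNanos = System.nanoTime() - oldestTaskInsertionNanoTime;
        return TimeUnit.NANOSECONDS.toMillis(queuedForNanos); // the stat is consumed in millis
    }
}
```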


@henningandersen left a comment:

LGTM.

@DiannaHohensee added the auto-merge-without-approval label Aug 14, 2025
@elasticsearchmachine merged commit 97a6dc8 into elastic:main Aug 14, 2025
33 checks passed
@DiannaHohensee deleted the 2025/08/11/max-of-two-latencies branch August 14, 2025 16:27
joshua-adams-1 pushed a commit to joshua-adams-1/elasticsearch that referenced this pull request Aug 15, 2025
…2675)

