
Conversation

@DiannaHohensee
Contributor

@DiannaHohensee DiannaHohensee commented Jul 15, 2025

Relates ES-12233


This was described here and requested here.

This has a lot of type casting... I'm not sure what our policy is on using this much type casting in production code?

@DiannaHohensee DiannaHohensee self-assigned this Jul 15, 2025
@DiannaHohensee DiannaHohensee requested a review from a team as a code owner July 15, 2025 21:33
@DiannaHohensee DiannaHohensee added the >non-issue, :Distributed Coordination/Allocation, Team:Distributed Coordination, and v9.1.1 labels Jul 15, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

public void testFrontOfQueueLatency() throws Exception {
ThreadContext context = new ThreadContext(Settings.EMPTY);
RecordingMeterRegistry meterRegistry = new RecordingMeterRegistry();
final var threadPoolName = randomIdentifier();
Contributor Author

Both of these were unused, so I took the opportunity to delete them. It does mess up the diff a bit, though.

// Release any potentially running task. This could be racy (a task may start executing and hit the barrier afterward) and
// is best-effort.
safeAwait(barrier);
}
Contributor Author

The test failure output focuses on the CyclicBarrier timeout because a task is hanging during execution in the thread pool. An assert from the try-block is also reported if you dig through the failure stacks, but I think this might help make failure causes more obvious when a task is left running.

Contributor

Hmm, I think we're ok with digging through exception logs looking for the right one, so this isn't really needed. Moreover, I think it doesn't work reliably: if we're stuck at the first barrier, then releasing it will briefly leave barrier.getNumberWaiting() == 0 until the executor gets to the second barrier, so we might leave the loop anyway. I'd rather just drop this code and (like we do in lots of other places) accept that the first-reported exception isn't necessarily the root cause of the test failure.
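
For context, a hypothetical reconstruction of the kind of best-effort cleanup loop under discussion (not the code under review; barrier is the test's CyclicBarrier and safeAwait is the ESTestCase helper already used above):

while (barrier.getNumberWaiting() > 0) {
    // Racy: a stuck task briefly leaves getNumberWaiting() == 0 between the two barrier points,
    // so the loop can exit before every waiter has actually been released.
    safeAwait(barrier);
}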

Contributor Author

Yes, it is best-effort only. I didn't have the test running successfully at first, so I had to debug it :) Sure, removed, since that's the normal practice 👍

@DiannaHohensee
Contributor Author

@nicktindall Not sure if you've been busy, or have concerns about whether we want to continue using queue latency? I brought up a new discussion thread in the project channel to clarify. I do think queue latency is still important.

Contributor

@nicktindall nicktindall left a comment

Sorry for the delay here; not sure what was discussed today, but let's hold off merging until we know what role (if any) queue latency will play in the decider.

var adjustableTimedRunnable = new AdjustableQueueTimeWithExecutionBarrierTimedRunnable(
    barrier,
    TimeUnit.NANOSECONDS.toNanos(1000000) // Until changed, queue latencies will always be 1 millisecond.
Contributor

Why not TimeUnit.MILLISECONDS.toNanos(1); ?

Contributor Author

Sure, changed: 20641af

@DiannaHohensee
Contributor Author

Thanks for the review. Yep, we'll discuss in the sync 👍

Contributor

@DaveCTurner DaveCTurner left a comment

Makes sense to me

executor.execute(() -> {});

var frontOfQueueDuration = executor.peekMaxQueueLatencyInQueue();
assertThat("Expected a task to be queued", frontOfQueueDuration, greaterThan(0L));
Contributor

This is assuming that System.nanoTime() advances by at least 1ns between the second call to executor.execute() and the call to peekMaxQueueLatencyInQueue(), and that's not a safe assumption: the clock ticks can be coarse enough to see the same time in both places. We need to sleep in a loop until we ourselves see nanoTime() advance.

Contributor Author

I've made this an assertBusy, since there's no problem peeking at the front of the queue repeatedly until we see something.


var frontOfQueueDuration = executor.peekMaxQueueLatencyInQueue();
assertThat("Expected a task to be queued", frontOfQueueDuration, greaterThan(0L));
safeSleep(10);
Contributor

Likewise here: the scheduler and/or clock might be coarse enough that no time passes. We again need to sleep in a loop until we see time pass. See e.g. org.elasticsearch.cluster.service.ClusterServiceIT#waitForTimeToElapse

Contributor Author

Oh, that's a very fancy method. Hmm, I could make a variation of that method for this file (the other is an integration test, not a unit test). But how about I wrap these calls in an assertBusy and document the concern? I've gone ahead with that; let me know if you prefer a variation of ClusterServiceIT#waitForTimeToElapse instead.

Contributor

Yeah, I'd rather we waited just for time to pass. It doesn't need to do everything that waitForTimeToElapse does; that method is also designed to deal with the caching in ThreadPool::relativeTimeInMillis (across many different ThreadPool instances). We just need to check System.nanoTime.
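
A minimal sketch of that simpler wait, written as a private helper in the unit test class (the helper name is illustrative, not necessarily what landed in 7b29402):

private static void waitForNanoTimeToAdvance() throws InterruptedException {
    final long startNanoTime = System.nanoTime();
    while (System.nanoTime() == startNanoTime) {
        Thread.sleep(1); // clock ticks can be coarse; keep sleeping until we observe any advance at all
    }
}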

Contributor Author

Ah, that is simpler than I was initially imagining. Updated in 7b29402; I think that's what you mean.


Comment on lines 166 to 167
executor.shutdown();
executor.awaitTermination(10, TimeUnit.SECONDS);
Contributor

Suggest using org.elasticsearch.threadpool.ThreadPool#terminate(java.util.concurrent.ExecutorService, long, java.util.concurrent.TimeUnit) here; it handily calls shutdownNow() if shutdown() isn't enough.

(acking that the other tests in this suite don't use ThreadPool#terminate either, but they probably should)
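
A sketch of the suggested cleanup, assuming executor is the test's ExecutorService and org.elasticsearch.threadpool.ThreadPool is imported:

// terminate() calls shutdown(), waits for the timeout, and falls back to shutdownNow() if the executor still hasn't stopped.
ThreadPool.terminate(executor, 10, TimeUnit.SECONDS);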

Contributor Author

Ah, good to know, thanks! Updated.

I also went ahead and updated the other callers, so copy-paste propagation will cease in this file at least.

Replaces executor shutdowns with more reliable ThreadPool#terminate calls
Added assertBusy around queue latency checks, to avoid races with ThreadPool clock not moving forward
Contributor Author

@DiannaHohensee DiannaHohensee left a comment

Thanks for the review. I've updated per the feedback in 44e5f18


Comment on lines 145 to 146
assertBusy(
// Wrap this call in an assertBusy because it's feasible for the thread pool's clock to see no time pass.
Contributor

I'd rather we were specific about what we're waiting for: it should be enough to check that time has passed, so we should fail the test if time has passed but we don't see anything in the queue for some reason.

(also this won't work anyway because you need to re-read frontOfQueueDuration if you're retrying)
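
For illustration, a hedged sketch of what a retrying version would have to look like, re-reading the peeked value on every attempt (assuming ESTestCase.assertBusy is in scope; this is not the approach that was ultimately kept):

assertBusy(() -> {
    // Re-read inside the lambda so each retry sees a fresh value rather than a stale capture.
    var frontOfQueueDuration = executor.peekMaxQueueLatencyInQueue();
    assertThat("Expected a task to be queued", frontOfQueueDuration, greaterThan(0L));
});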



Contributor Author

@DiannaHohensee DiannaHohensee left a comment

Replaced the assertBusy with a check to ensure time passes: 7b29402



Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

Comment on lines 465 to 466
while (TimeUnit.MILLISECONDS.convert(System.nanoTime() - startNanoTime, TimeUnit.NANOSECONDS) <= 100) {
Thread.sleep(100);
Contributor

I think the test will be fine if the clock advances at all, even by 1ns; no need to keep retrying until it hits 100,000,000ns :)
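
A sketch of the tightened wait (hedged; the actual change landed in e2f8e78), keeping the same sleep-based loop but only requiring the clock to advance at all:

final long startNanoTime = System.nanoTime();
while (System.nanoTime() - startNanoTime < 1) {
    Thread.sleep(1); // any observable advance of System.nanoTime() is enough here
}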

Contributor Author

@DiannaHohensee DiannaHohensee Aug 8, 2025

Updated to an itty-bitty single nano :) Not sure I'll ever write such a small number again. e2f8e78

@DiannaHohensee DiannaHohensee merged commit 93b16dc into elastic:main Aug 8, 2025
33 checks passed

Labels

:Distributed Coordination/Allocation, >non-issue, Team:Distributed Coordination, v9.1.2, v9.2.0
