Peek at queued task's latency in a thread pool #131329
Conversation
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
public void testFrontOfQueueLatency() throws Exception {
    ThreadContext context = new ThreadContext(Settings.EMPTY);
    RecordingMeterRegistry meterRegistry = new RecordingMeterRegistry();
    final var threadPoolName = randomIdentifier();
Both of these were unused, so I took the opportunity to delete them. It does mess up the diff a bit, though.
// Release any potentially running task. This could be racy (a task may start executing and hit the barrier afterward) and
// is best-effort.
safeAwait(barrier);
}
The test failure output focuses on the CyclicBarrier timeout because a task is hanging during execution in the thread pool. An assertion failure from the try-block is also reported if you dig around the failure stacks, but I think this might help make failure causes more obvious when a task is left running.
Hmm, I think we're ok with digging through exception logs looking for the right one, so this isn't really needed. Moreover, I think it doesn't work reliably: if we're stuck at the first barrier then releasing it will briefly leave barrier.getNumberWaiting() == 0 until the executor reaches the second barrier, so we might leave the loop anyway. I'd rather just drop this code and (like we do in lots of other places) accept that the first-reported exception isn't necessarily the root cause of the test failure.
Yes, it is best-effort only. I didn't have the test running successfully at first, so I had to debug it :) Sure, removed, since that's the normal approach 👍
@nicktindall Not sure if you've been busy, or have concerns about whether we want to continue using queue latency? I brought up a new discussion thread in the project channel to clarify. I do think queue latency is still important.
Sorry for the delay here, not sure what was discussed today, but let's hold off merging until we know what (if any) role queue latency will play in the decider.
var adjustableTimedRunnable = new AdjustableQueueTimeWithExecutionBarrierTimedRunnable(
    barrier,
    TimeUnit.NANOSECONDS.toNanos(1000000) // Until changed, queue latencies will always be 1 millisecond.
Why not TimeUnit.MILLISECONDS.toNanos(1)?
Sure, changed 20641af
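(For reference, the amended line would presumably read as follows; commit 20641af isn't quoted here.)

```java
TimeUnit.MILLISECONDS.toNanos(1) // Until changed, queue latencies will always be 1 millisecond.
```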
Thanks for the review. Yep, we'll discuss in the sync 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me
executor.execute(() -> {});

var frontOfQueueDuration = executor.peekMaxQueueLatencyInQueue();
assertThat("Expected a task to be queued", frontOfQueueDuration, greaterThan(0L));
This is assuming that System.nanoTime() advances by at least 1ns between the second call to executor.execute() and the call to peekMaxQueueLatencyInQueue(), and that's not a safe assumption: the clock ticks can be coarse enough to see the same time in both places. It needs us to sleep in a loop until we ourselves see nanoTime() advance.
I've made this an assertBusy, since there's no problem peeking at the front of the queue repeatedly until we see something.
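For illustration, a minimal sketch of that shape, assuming ESTestCase#assertBusy (the lambda is re-run on each attempt, so the latency is re-read every time):

```java
// Retry the peek until the clock has visibly advanced past the enqueue time;
// assertBusy re-runs the whole lambda, so frontOfQueueDuration is re-read on each attempt.
assertBusy(() -> {
    var frontOfQueueDuration = executor.peekMaxQueueLatencyInQueue();
    assertThat("Expected a task to be queued", frontOfQueueDuration, greaterThan(0L));
});
```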
var frontOfQueueDuration = executor.peekMaxQueueLatencyInQueue();
assertThat("Expected a task to be queued", frontOfQueueDuration, greaterThan(0L));
safeSleep(10);
Likewise here: the scheduler and/or clock might be coarse enough that no time passes. We again need to sleep in a loop until we see time pass. See e.g. org.elasticsearch.cluster.service.ClusterServiceIT#waitForTimeToElapse
Oh, that's a very fancy method. Hmm, I could make a variation of that method for this file (the other is an integration test, not a unit test). But how about I wrap these calls in an assertBusy and document the concern? I've gone ahead with that; let me know if you prefer a variation of ClusterServiceIT#waitForTimeToElapse instead.
Yeah, I'd rather we waited just for time to pass. It doesn't need to do everything that waitForTimeToElapse does; that one is also designed to deal with the caching in ThreadPool::relativeTimeInMillis (across many different ThreadPool instances). We just need to check System.nanoTime.
Ah, that is simpler than I was initially imagining. Updated in 7b29402; I think that's what you mean.
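A minimal sketch of such a helper (name hypothetical), checking only System.nanoTime:

```java
// Spin-sleep until System.nanoTime() visibly advances, so a task queued before this
// call is guaranteed to report a latency greater than zero afterward.
private static void waitForNanoTimeToAdvance() throws InterruptedException {
    final long startNanoTime = System.nanoTime();
    while (System.nanoTime() - startNanoTime <= 0) {
        Thread.sleep(1); // clock ticks can be coarse; keep yielding until one is observed
    }
}
```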
executor.shutdown();
executor.awaitTermination(10, TimeUnit.SECONDS);
Suggest using org.elasticsearch.threadpool.ThreadPool#terminate(java.util.concurrent.ExecutorService, long, java.util.concurrent.TimeUnit) here; it handily falls back to shutdownNow() if shutdown() isn't enough.
(Acking that the other tests in this suite don't use ThreadPool#terminate either, but they probably should.)
Ah, good to know, thanks! Updated.
I also went ahead and updated the other callers, so copy-paste propagation will cease in this file at least.
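Based on the signature quoted above, the replacement would be a one-liner along these lines:

```java
// ThreadPool#terminate shuts the executor down, awaits termination, and falls back to
// shutdownNow() if it still hasn't stopped within the timeout.
ThreadPool.terminate(executor, 10, TimeUnit.SECONDS);
```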
Replaces executor shutdowns with more reliable ThreadPool#terminate calls. Adds assertBusy around queue latency checks, to avoid races with the ThreadPool clock not moving forward.
Thanks for the review. I've updated per the feedback in 44e5f18
assertBusy(
    // Wrap this call in an assertBusy because it's feasible for the thread pool's clock to see no time pass.
I'd rather we were specific about what we're waiting for: it should be enough to check that time has passed, and we should fail the test if time has passed but we don't see anything in the queue for some reason.
(Also, this won't work anyway, because you need to re-read frontOfQueueDuration if you're retrying.)
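The pitfall in that parenthetical, sketched against a hypothetical shape of the test:

```java
// Broken: the latency is captured once, before assertBusy, so every retry re-asserts
// the same stale value and the assertion can never start passing.
var frontOfQueueDuration = executor.peekMaxQueueLatencyInQueue(); // read once...
assertBusy(() -> assertThat(frontOfQueueDuration, greaterThan(0L))); // ...retried against a stale value
```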
Replaced the assertBusy with a check to ensure time passes: 7b29402
LGTM
while (TimeUnit.MILLISECONDS.convert(System.nanoTime() - startNanoTime, TimeUnit.NANOSECONDS) <= 100) {
    Thread.sleep(100);
I think the test will be fine if the clock advances at all, even by 1ns - no need to keep retrying until it hits 100,000,000ns :)
Updated to an itty-bitty single nano :) Not sure I'll ever write such a small number again. e2f8e78
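Presumably (commit e2f8e78 isn't quoted here) the loop now only waits for the clock to move at all, along these lines:

```java
// Any observed forward movement of the clock, even a single nanosecond, is enough.
final long startNanoTime = System.nanoTime();
while (System.nanoTime() - startNanoTime < 1) {
    Thread.sleep(1);
}
```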
Relates ES-12233
This was described here and requested here.
This has a lot of type casting... I'm not sure what our policy is on using this much type casting in production code?