Fix ClusterInfoServiceIT#testMaxQueueLatenciesInClusterInfo #136461
Conversation
}
Arrays.stream(threadsToJoin).forEach(thread -> assertFalse(thread.isAlive()));
// Wait for the write executor to go idle
assertBusy(() -> assertThat(trackingWriteExecutor.getActiveCount(), equalTo(0)));
I don't know if this could still have a very rare race condition where we get a zero in between tasks being executed, but given that the write pool defaults to the number of cores, it seems unlikely we'd go to 0 active and then back to non-zero as we drain the queue.
I'm not sure if there's a more satisfying approach to knowing when the work is all done.
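For context, the assertBusy call under discussion retries an assertion until it passes or a deadline expires. The sketch below is a hypothetical, minimal re-implementation of that idea (not Elasticsearch's actual ESTestCase.assertBusy), shown to illustrate why it helps here: the active count only needs to reach zero at some point within the timeout, not on the first check. The AssertBusyDemo class name and the simulated activeCount are illustrative assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AssertBusyDemo {
    // Hypothetical sketch of the assertBusy idea: retry an assertion with
    // backoff until it passes or the deadline expires, then rethrow.
    static void assertBusy(Runnable assertion, long timeoutMillis) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long sleep = 1;
        while (true) {
            try {
                assertion.run();
                return; // assertion passed within the timeout
            } catch (AssertionError e) {
                if (System.currentTimeMillis() >= deadline) {
                    throw e; // give up and surface the last failure
                }
                Thread.sleep(sleep);
                sleep = Math.min(sleep * 2, 100); // simple backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate an executor draining its active tasks in the background.
        AtomicInteger activeCount = new AtomicInteger(3);
        new Thread(() -> {
            while (activeCount.get() > 0) {
                activeCount.decrementAndGet();
                try { Thread.sleep(10); } catch (InterruptedException ignored) {}
            }
        }).start();
        // Poll until the "executor" looks idle, like the test does.
        assertBusy(() -> {
            if (activeCount.get() != 0) throw new AssertionError("still active");
        }, 5_000);
        System.out.println("idle");
    }
}
```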
Should we add this as a comment and then link the existing test failure in case the test fails again in X weeks for the reason mentioned above?
Done in 2ec6995
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
LGTM - I'm a ++ for adding a comment to expedite future debugging, but that doesn't need a separate review
LGTM

Two minor comments for you to pick and choose:
- Since we now have to use assertBusy, maybe it is more straightforward to use it on the things we actually want to observe, e.g. MaxQueueLatencyInQueueMillis, instead of asserting on a secondary effect like the active thread count?
- I think the previous "fix" (#134180) is no longer necessary and can be reverted?
It's a good point. The only thing I wonder is: because the polling is destructive and represents activity since the last poll, polling in a tight loop with assertBusy could consume the recorded latency before the assertion that matters observes it. I'd love it if the metric reads weren't destructive and there was no "observer effect", but that doesn't seem high priority at the moment.
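The "destructive read" concern above can be sketched as follows. This is a hypothetical tracker (the MaxLatencyTracker name and API are assumptions, not Elasticsearch code) illustrating why polling the max-latency metric in a loop is risky: each poll resets the value, so a second poll only sees activity since the first.

```java
import java.util.concurrent.atomic.AtomicLong;

public class MaxLatencyTracker {
    // Hypothetical sketch: record the max observed latency; reading it
    // resets the value, so each poll reports activity since the last poll.
    private final AtomicLong maxMillis = new AtomicLong(0);

    public void record(long latencyMillis) {
        maxMillis.accumulateAndGet(latencyMillis, Math::max);
    }

    // Destructive read: returns the max since the last poll and resets it.
    public long pollMaxAndReset() {
        return maxMillis.getAndSet(0);
    }

    public static void main(String[] args) {
        MaxLatencyTracker tracker = new MaxLatencyTracker();
        tracker.record(5);
        tracker.record(42);
        System.out.println(tracker.pollMaxAndReset()); // 42
        // The "observer effect": the recorded max is gone on the next poll.
        System.out.println(tracker.pollMaxAndReset()); // 0
    }
}
```

An assertBusy loop wrapping pollMaxAndReset could therefore swallow the interesting value on an early iteration and then fail on later ones.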
I'll take a look.
We do peek from the head of the queue, and that part should not be destructive, right? Unless the task is not even queued when the polling happens. But in that case,

Btw, if we keep the checkpoint sync action running on the generic thread pool, we should be certain that all "write-related" tasks have been submitted to the thread pool by the time the indexing requests return.
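The non-destructive "peek" mentioned above can be illustrated like this. The QueueLatencyPeek class, its TimedTask wrapper, and the explicit millisecond clock are all hypothetical simplifications, not the actual Elasticsearch implementation; the point is only that Queue.peek inspects the oldest queued task without removing it, so repeated polls see the same latency.

```java
import java.util.concurrent.LinkedBlockingQueue;

public class QueueLatencyPeek {
    // Hypothetical task wrapper carrying its enqueue time.
    record TimedTask(Runnable task, long enqueuedAtMillis) {}

    private final LinkedBlockingQueue<TimedTask> queue = new LinkedBlockingQueue<>();

    void submit(Runnable task, long nowMillis) {
        queue.add(new TimedTask(task, nowMillis));
    }

    // Latency of the task at the head of the queue; 0 if the queue is empty
    // (e.g. the task was not yet queued when the poll happened).
    long headLatencyMillis(long nowMillis) {
        TimedTask head = queue.peek(); // peek does NOT remove the task
        return head == null ? 0 : nowMillis - head.enqueuedAtMillis();
    }

    public static void main(String[] args) {
        QueueLatencyPeek q = new QueueLatencyPeek();
        q.submit(() -> {}, 100);
        q.submit(() -> {}, 150);
        System.out.println(q.headLatencyMillis(300)); // 200: the oldest task has waited longest
        System.out.println(q.headLatencyMillis(300)); // still 200: peeking is not destructive
    }
}
```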
If we assertBusy on the metric itself, I think there's a chance this can still happen, hence the comment. But because there's more than one thread in the pool, it seems unlikely for the active count to drop to zero and then go non-zero again while the queue drains.
I noticed a test failure fairly quickly when I reverted it. I think it's a necessary part of the fix. (Edit: confirmed, the test breaks with the revert in place.)
Ah ok, thanks for explaining. I think what you have here is good. 👍
It's not a beautiful fix, but it seems to work.
Closes: #134500