
Conversation

nicktindall (Contributor)

It's not a beautiful fix, but it seems to work

Closes: #134500

@nicktindall nicktindall added the >test-failure and :Distributed Coordination/Allocation labels Oct 13, 2025
}
Arrays.stream(threadsToJoin).forEach(thread -> assertFalse(thread.isAlive()));
// Wait for the write executor to go idle
assertBusy(() -> assertThat(trackingWriteExecutor.getActiveCount(), equalTo(0)));
nicktindall (Contributor, Author) commented Oct 13, 2025

I don't know if this could still have a very rare race condition where we get a zero in between tasks being executed, but given that the write pool defaults to the number of cores, it seems unlikely we'd go to 0 active and then back to non-zero again as we drain the queue.

I'm not sure if there's a more satisfying approach to knowing when the work is all done.
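
(For context on the race described above, here is a minimal, self-contained sketch. It is not code from this PR; the executor setup and the combined check are illustrative only. With a plain java.util.concurrent.ThreadPoolExecutor, a worker stops counting as active once it finishes a task, so a poll that only looks at getActiveCount() could in principle observe 0 while tasks are still sitting in the queue; also requiring the queue to be empty narrows that window, though it is not an absolute guarantee.)

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class IdleCheckSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the test's tracking write executor.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

        for (int i = 0; i < 100; i++) {
            executor.execute(() -> {
                // simulate a short write task
            });
        }

        // Polling only getActiveCount() can, in principle, observe 0 in the gap
        // between one task completing and the next queued task being picked up.
        // Also requiring the queue to be empty makes a premature "idle"
        // observation far less likely, though not strictly impossible.
        while (!(executor.getQueue().isEmpty() && executor.getActiveCount() == 0)) {
            Thread.sleep(10);
        }

        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
    }
}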

Contributor

Should we add this as a comment and then link the existing test failure in case the test fails again in X weeks for the reason mentioned above?

nicktindall (Contributor, Author)

Done in 2ec6995

@nicktindall nicktindall requested a review from ywangd October 13, 2025 06:03
@nicktindall nicktindall marked this pull request as ready for review October 13, 2025 06:04
@elasticsearchmachine elasticsearchmachine added the needs:risk and Team:Distributed Coordination labels Oct 13, 2025
elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

joshua-adams-1 (Contributor) left a comment


LGTM - I'm a ++ for adding a comment to expedite future debugging, but that doesn't need a separate review

ywangd (Member) left a comment


LGTM

Two minor comments for you to pick and choose

  1. Since we now have to use assertBusy, maybe it is more straightforward to use that for things that we want to observe, e.g. MaxQueueLatencyInQueueMillis, instead of asserting a secondary effect like the active thread count?
  2. I think the previous "fix" (#134180) is no longer necessary and can be reverted?

nicktindall (Contributor, Author) commented Oct 14, 2025

  1. Since we now have to use assertBusy, maybe it is more straightforward to use that for things that we want to observe, e.g. MaxQueueLatencyInQueueMillis, instead of asserting a secondary effect like the active thread count?

It's a good point. The only thing I wonder is: because the polling is destructive and represents activity since the last time we polled, polling in a tight loop with assertBusy would probably yield zeros relatively quickly, just because no tasks completed between one poll and the next. By polling the thread pool, which is non-destructive, we're waiting for the tasks to finish and then asserting that the metric pipeline worked correctly when they did, if that makes sense. I just think that, due to the destructive nature of the metric reads, we'd risk a "false negative" by using assertBusy directly on the metrics.

I'd love it if the metric reads weren't destructive and there was no "observer effect", but that doesn't seem high priority at the moment.

  1. I think the previous "fix" ([Test] Use generic for GlobalCheckpoingSyncAction #134180) is no longer necessary and can be reverted?

I'll take a look.
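
(To make the reset-on-read concern above concrete, the sketch below is illustrative only; the gauge class and its method names are invented and are not the Elasticsearch metric code. A metric that reports the maximum value observed since the previous read, and resets on each read, returns 0 on the very next poll if nothing completed in between, so a tight assertBusy-style loop over it can consume the interesting value and then see zeros, independent of whether all the work has actually finished.)

import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a "max since last read" gauge, similar in spirit to a
// reset-on-read metric. The names are invented for this sketch.
class MaxSinceLastReadGauge {
    private final AtomicLong max = new AtomicLong();

    void record(long millis) {
        max.accumulateAndGet(millis, Math::max);
    }

    // Destructive read: returns the max observed since the previous call and resets it.
    long readAndReset() {
        return max.getAndSet(0);
    }
}

public class ObserverEffectSketch {
    public static void main(String[] args) {
        MaxSinceLastReadGauge gauge = new MaxSinceLastReadGauge();

        gauge.record(250);                        // a task reports 250ms of queue latency
        System.out.println(gauge.readAndReset()); // 250 -- this poll consumes the value
        System.out.println(gauge.readAndReset()); // 0   -- nothing was recorded since the last poll

        // A retry loop polling this gauge sees mostly zeros once it has consumed the
        // peak, so asserting on the gauge directly reflects poll timing rather than
        // whether all queued tasks have actually completed.
    }
}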

ywangd (Member) commented Oct 14, 2025

polling is destructive

We do peek from the head of the queue and this part should not be destructive, right? Unless the task is not even queued when the polling happens. But in that case, getActiveCount may also have a chance to return a false positive?

Btw, if we keep the checkpoint sync action running on the generic thread pool, we should be certain that all "write-related" tasks have been submitted to the thread pool by the time the indexing requests return.

nicktindall (Contributor, Author) commented Oct 14, 2025

We do peek from the head of the queue and this part should not be destructive, right? Unless the task is not even queued when the polling happens.

If we assertBusy on peekMaxQueueLatencyInQueueMillis going to zero, it will finish as soon as the queue is empty, I think? We want to block until all the jobs have completed and run afterExecute.

But in that case, getActiveCount may also have a chance to return a false positive?

I think there's a chance this can still happen, hence the comment. But because there's more than one thread in the pool, it would seem that for getActiveCount to go to zero before the queue was drained, you'd need all the running tasks to finish at precisely the same moment? (I didn't look too deeply into how this is tracked.)

Btw, if we keep the checkpoint sync action running on the generic thread pool, we should be certain that all "write-related" tasks have been submitted to the thread pool by the time the indexing requests return.

I noticed a test failure fairly quickly when I reverted it. I think it's a necessary part of the fix. (Edit: confirmed, the test breaks when that change is reverted.)
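
(As background for the peek-vs-afterExecute distinction above, here is a small illustrative sketch of plain java.util.concurrent behaviour; it is not the code under test and the class names are invented. A task leaves the executor's work queue as soon as a worker picks it up, so any queue-based signal goes quiet while the task is still running and before its afterExecute hook, where latency accounting would happen, has fired.)

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueVsAfterExecuteSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch afterExecuteRan = new CountDownLatch(1);
        CountDownLatch taskMayFinish = new CountDownLatch(1);

        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>()) {
            @Override
            protected void afterExecute(Runnable r, Throwable t) {
                // In the real test, per-task accounting happens in a hook like this.
                afterExecuteRan.countDown();
            }
        };
        executor.prestartAllCoreThreads();

        executor.execute(() -> {
            try {
                taskMayFinish.await(); // hold the task "in flight"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread.sleep(100); // give the idle worker time to take the task off the queue

        // The queue is already empty even though the task has not completed, so a
        // queue-based check (e.g. peeking at the queue head) stops reporting anything
        // before afterExecute has run for that task.
        System.out.println("queue empty: " + executor.getQueue().isEmpty());          // true
        System.out.println("afterExecute ran: " + (afterExecuteRan.getCount() == 0)); // false

        taskMayFinish.countDown();
        afterExecuteRan.await();
        executor.shutdown();
    }
}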

ywangd (Member) commented Oct 14, 2025

Ah ok thanks for explaining. I think what you have here is good. 👍

@nicktindall nicktindall merged commit e0acfc3 into elastic:main Oct 14, 2025
34 checks passed

Labels

:Distributed Coordination/Allocation - All issues relating to the decision making around placing a shard (both master logic & on the nodes)
needs:risk - Requires assignment of a risk label (low, medium, blocker)
Team:Distributed Coordination - Meta label for Distributed Coordination team
>test-failure - Triaged test failures from CI
v9.3.0

Development

Successfully merging this pull request may close these issues.

[CI] ClusterInfoServiceIT testMaxQueueLatenciesInClusterInfo failing

4 participants