
Conversation

nicktindall (Contributor)

It's not a beautiful fix, but it seems to work

Closes: #134500

@nicktindall nicktindall added the >test-failure and :Distributed Coordination/Allocation labels Oct 13, 2025
}
Arrays.stream(threadsToJoin).forEach(thread -> assertFalse(thread.isAlive()));
// Wait for the write executor to go idle
assertBusy(() -> assertThat(trackingWriteExecutor.getActiveCount(), equalTo(0)));
nicktindall (Contributor, Author) commented Oct 13, 2025

I don't know if this could still have a very rare race condition where we get a zero in between tasks being executed, but given that the write pool defaults to the number of cores, it seems unlikely we'd go to 0 active and then back to non-zero again as we drain the queue.

I'm not sure if there's a more satisfying approach to knowing when the work is all done.
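
(For context on the race described above, here is a minimal, self-contained sketch. It is not code from this PR; the executor setup and the combined check are illustrative only. With a plain java.util.concurrent.ThreadPoolExecutor, a worker stops counting as active once it finishes a task, so a poll that only looks at getActiveCount() could in principle observe 0 while tasks are still sitting in the queue; also requiring the queue to be empty narrows that window, though it is not an absolute guarantee.)

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class IdleCheckSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the test's tracking write executor.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

        for (int i = 0; i < 100; i++) {
            executor.execute(() -> {
                // simulate a short write task
            });
        }

        // Polling only getActiveCount() can, in principle, observe 0 in the gap
        // between one task completing and the next queued task being picked up.
        // Also requiring the queue to be empty makes a premature "idle"
        // observation far less likely, though not strictly impossible.
        while (!(executor.getQueue().isEmpty() && executor.getActiveCount() == 0)) {
            Thread.sleep(10);
        }

        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
    }
}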

Contributor

Should we add this as a comment and then link the existing test failure in case the test fails again in X weeks for the reason mentioned above?

nicktindall (Contributor, Author)

Done in 2ec6995

@nicktindall nicktindall requested a review from ywangd October 13, 2025 06:03
@nicktindall nicktindall marked this pull request as ready for review October 13, 2025 06:04
@elasticsearchmachine elasticsearchmachine added the needs:risk and Team:Distributed Coordination labels Oct 13, 2025
elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

joshua-adams-1 (Contributor) left a comment


LGTM - I'm a ++ for adding a comment to expedite future debugging, but that doesn't need a separate review

ywangd (Member) left a comment


LGTM

Two minor comments for you to pick and choose

  1. Since we now have to use assertBusy, maybe it is more straightforward to use that for things that we want to observe, e.g. MaxQueueLatencyInQueueMillis, instead of asserting a secondary effect like the active thread count?
  2. I think the previous "fix" (#134180) is no longer necessary and can be reverted?

nicktindall (Contributor, Author) commented Oct 14, 2025

  1. Since we now have to use assertBusy, maybe it is more straightforward to use that for things that we want to observe, e.g. MaxQueueLatencyInQueueMillis, instead of asserting a secondary effect like the active thread count?

It's a good point. The only thing I wonder is: because the polling is destructive and represents activity since the last time we polled, polling in a tight loop with assertBusy would probably yield zeros relatively quickly, just because no tasks completed between one poll and the next. By polling the thread pool, which is non-destructive, we're waiting for the tasks to finish and then asserting that the metric pipeline worked correctly when they did, if that makes sense. I just think that, due to the destructive nature of the metric reads, we'd risk a "false negative" by using assertBusy directly on the metrics.

I'd love it if the metric reads weren't destructive and there was no "observer effect", but that doesn't seem high priority at the moment.

  1. I think the previous "fix" ([Test] Use generic for GlobalCheckpoingSyncAction #134180) is no longer necessary and can be reverted?

I'll take a look.
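
(To make the reset-on-read concern above concrete, the sketch below is illustrative only; the gauge class and its method names are invented and are not the Elasticsearch metric code. A metric that reports the maximum value observed since the previous read, and resets on each read, returns 0 on the very next poll if nothing completed in between, so a tight assertBusy-style loop over it can consume the interesting value and then see zeros, independent of whether all the work has actually finished.)

import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a "max since last read" gauge, similar in spirit to a
// reset-on-read metric. The names are invented for this sketch.
class MaxSinceLastReadGauge {
    private final AtomicLong max = new AtomicLong();

    void record(long millis) {
        max.accumulateAndGet(millis, Math::max);
    }

    // Destructive read: returns the max observed since the previous call and resets it.
    long readAndReset() {
        return max.getAndSet(0);
    }
}

public class ObserverEffectSketch {
    public static void main(String[] args) {
        MaxSinceLastReadGauge gauge = new MaxSinceLastReadGauge();

        gauge.record(250);                        // a task reports 250ms of queue latency
        System.out.println(gauge.readAndReset()); // 250 -- this poll consumes the value
        System.out.println(gauge.readAndReset()); // 0   -- nothing was recorded since the last poll

        // A retry loop polling this gauge sees mostly zeros once it has consumed the
        // peak, so asserting on the gauge directly reflects poll timing rather than
        // whether all queued tasks have actually completed.
    }
}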

ywangd (Member) commented Oct 14, 2025

polling is destructive

We do peek from the head of the queue and this part should not be destructive, right? Unless the task is not even queued when the polling happens. But in that case, getActiveCount may also have a chance to return a false positive?

Btw, if we keep the checkpoint sync action running on the generic thread pool, we should be certain that all "write-related" tasks have been submitted to the thread pool by the time the indexing requests return.

nicktindall (Contributor, Author) commented Oct 14, 2025

We do peek from the head of the queue and this part should not be destructive, right? Unless the task is not even queued when the polling happens.

If we assertBusy on peekMaxQueueLatencyInQueueMillis going to zero, it will finish as soon as the queue is empty, I think? We want to block until all the jobs have completed and run afterExecute.

But in that case, getActiveCount may also have a chance to return a false positive?

I think there's a chance this can still happen, hence the comment. But because there's more than one thread in the pool, it would seem that for getActiveCount to go to zero before the queue was drained, you'd need all the running tasks to finish at precisely the same moment? (I didn't look too deeply into how this is tracked.)

Btw, if we keep the checkpoint sync action running on the generic thread pool, we should be certain that all "write-related" tasks have been submitted to the thread pool by the time the indexing requests return.

I noticed a test failure fairly quickly when I reverted it. I think it's a necessary part of the fix. (Edit: confirmed, the test breaks when that change is reverted.)
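
(As background for the peek-vs-afterExecute distinction above, here is a small illustrative sketch of plain java.util.concurrent behaviour; it is not the code under test and the class names are invented. A task leaves the executor's work queue as soon as a worker picks it up, so any queue-based signal goes quiet while the task is still running and before its afterExecute hook, where latency accounting would happen, has fired.)

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueVsAfterExecuteSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch afterExecuteRan = new CountDownLatch(1);
        CountDownLatch taskMayFinish = new CountDownLatch(1);

        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>()) {
            @Override
            protected void afterExecute(Runnable r, Throwable t) {
                // In the real test, per-task accounting happens in a hook like this.
                afterExecuteRan.countDown();
            }
        };
        executor.prestartAllCoreThreads();

        executor.execute(() -> {
            try {
                taskMayFinish.await(); // hold the task "in flight"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread.sleep(100); // give the idle worker time to take the task off the queue

        // The queue is already empty even though the task has not completed, so a
        // queue-based check (e.g. peeking at the queue head) stops reporting anything
        // before afterExecute has run for that task.
        System.out.println("queue empty: " + executor.getQueue().isEmpty());          // true
        System.out.println("afterExecute ran: " + (afterExecuteRan.getCount() == 0)); // false

        taskMayFinish.countDown();
        afterExecuteRan.await();
        executor.shutdown();
    }
}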

ywangd (Member) commented Oct 14, 2025

Ah ok thanks for explaining. I think what you have here is good. 👍

@nicktindall nicktindall merged commit e0acfc3 into elastic:main Oct 14, 2025
34 checks passed

Labels

:Distributed Coordination/Allocation - All issues relating to the decision making around placing a shard (both master logic & on the nodes)
needs:risk - Requires assignment of a risk label (low, medium, blocker)
Team:Distributed Coordination - Meta label for Distributed Coordination team
>test-failure - Triaged test failures from CI
v9.3.0

Development

Successfully merging this pull request may close these issues.

[CI] ClusterInfoServiceIT testMaxQueueLatenciesInClusterInfo failing

4 participants