
Conversation

@ywangd
Member

@ywangd ywangd commented Sep 4, 2025

Redirect the GlobalCheckpointSyncAction to the generic threadpool so that we have precise control over the write threadpool for load and latency assertions.

Resolves: #134088

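The gist of the change, as a minimal hypothetical sketch: plain ExecutorService instances stand in for Elasticsearch's write and generic thread pools (this is not the actual GlobalCheckpointSyncAction code). Dispatching the background checkpoint sync onto the generic pool keeps the test's write-pool queue and latency measurements limited to its own indexing work.

// Hypothetical sketch only: plain executors stand in for the WRITE and GENERIC
// thread pools; not the actual GlobalCheckpointSyncAction implementation.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CheckpointSyncDispatchSketch {
    private final ExecutorService writePool = Executors.newFixedThreadPool(4);   // stand-in for the write pool
    private final ExecutorService genericPool = Executors.newCachedThreadPool(); // stand-in for the generic pool

    void syncGlobalCheckpoint(Runnable syncTask) {
        // Before this PR the sync ran on the write pool, competing with indexing work
        // and skewing the test's write-pool queue-latency assertions:
        // writePool.execute(syncTask);
        genericPool.execute(syncTask); // after: the background sync no longer touches the write pool
    }
}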
@ywangd ywangd requested a review from nicktindall September 4, 2025 23:44
@ywangd ywangd added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v9.2.0 labels Sep 4, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Sep 4, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Comment on lines -499 to -519
// Wait for async post replication actions to complete
final var checkpointsSyncLatch = new CountDownLatch(numShards);
for (int i = 0; i < numShards; ++i) {
    final var indexShard = indexService.getShard(i);
    final long expectedGlobalCheckpoint = indexShard.seqNoStats().getGlobalCheckpoint();
    logger.info("--> shard [{}] waiting for global checkpoint {}", i, expectedGlobalCheckpoint);
    indexShard.addGlobalCheckpointListener(expectedGlobalCheckpoint, new GlobalCheckpointListeners.GlobalCheckpointListener() {
        @Override
        public Executor executor() {
            return EsExecutors.DIRECT_EXECUTOR_SERVICE;
        }

        @Override
        public void accept(long globalCheckpoint, Exception e) {
            assertNull(e); // should have no error
            logger.info("--> shard [{}] global checkpoint updated to {}", indexShard.shardId().id(), globalCheckpoint);
            checkpointsSyncLatch.countDown();
        }
    }, TimeValue.THIRTY_SECONDS);
}
safeAwait(checkpointsSyncLatch);
Member Author


This is the previous attempt at fixing the issue. Unfortunately it does not work, because:

  1. The listener is called while the task is still running on the write threadpool. There is no guarantee about when the task completely finishes its lifecycle, i.e. goes through the afterExecute phase (see the sketch after this comment).
  2. An earlier checkpoint-update action can sometimes see the latest in-memory value and update to it. When this happens, the listener is called while later checkpoint-update actions are still queued. Those later actions are essentially no-ops, which is fine for the checkpoint update itself, but it breaks our test assumption.

I tried different ways to determine when the sync actions are completely off the thread pool but didn't manage to find a solution. Therefore, I went with redirecting them to a different threadpool, hence this PR.
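To illustrate point 1 above, here is a minimal self-contained sketch (hypothetical, not the test code): a latch released from inside a task can trip before the executor has run afterExecute for that task, so "listener fired" does not imply "task fully finished its lifecycle".

// Hypothetical sketch of point 1: the latch released inside the task can be
// observed before the executor runs afterExecute for that task.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ListenerVsAfterExecuteSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch listenerFired = new CountDownLatch(1);
        AtomicBoolean afterExecuteRan = new AtomicBoolean(false);

        ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 0, TimeUnit.SECONDS, new LinkedBlockingQueue<>()) {
            @Override
            protected void afterExecute(Runnable r, Throwable t) {
                super.afterExecute(r, t);
                afterExecuteRan.set(true); // runs only after the task body has returned
            }
        };

        pool.execute(() -> {
            // The "listener" is invoked from within the task, like the global checkpoint listener.
            listenerFired.countDown();
            // The task is still occupying the worker thread at this point.
        });

        listenerFired.await();
        // afterExecuteRan may still be false here: the worker thread has not
        // necessarily finished the task's lifecycle yet.
        System.out.println("afterExecute already ran? " + afterExecuteRan.get());
        pool.shutdown();
    }
}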

true,
true
)
);
Contributor


Is it possible we're fixing the wrong problem here, i.e. if the test is sensitive to other things happening in the write pool, is it likely to flap any time someone does some new work on the write pool? Perhaps we could instead isolate the thing we're measuring, or change the assertion somehow?

Member Author


I think it is not very feasible to "isolate the thing we're measuring" at the level of a single thread pool. The alternative is assertBusy, but based on the original discussion the preference is to avoid it and be explicit about the other activities. Therefore, I would consider it somewhat a "feature" if the test fails in future for new "write" activities: we should know exactly what is going on with the write thread pool, since every action matters in production.
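For context, an assertBusy-style alternative would look roughly like the polling sketch below (hypothetical code, not from the PR or from ESTestCase): keep retrying the latency assertion until it holds, which tolerates, rather than surfaces, unrelated write-pool activity.

// Hypothetical polling-based alternative (roughly what an assertBusy-style
// check does): retry the assertion until it passes or a timeout elapses.
// The PR deliberately avoids this so that unexpected write-pool work fails loudly.
import java.util.function.BooleanSupplier;

final class PollingAssertSketch {
    static void awaitCondition(BooleanSupplier condition, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (condition.getAsBoolean() == false) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError("condition not met within " + timeoutMillis + "ms");
            }
            Thread.sleep(100); // back off briefly between checks
        }
    }
}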

Contributor

@nicktindall nicktindall left a comment


LGTM

@ywangd ywangd added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) and removed auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) labels Sep 5, 2025
@ywangd
Member Author

ywangd commented Sep 5, 2025

@elasticmachine update branch

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 5, 2025
@ywangd
Member Author

ywangd commented Sep 5, 2025

@elasticmachine update branch

@ywangd
Member Author

ywangd commented Sep 5, 2025

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit c05c61d into elastic:main Sep 5, 2025
33 checks passed
@ywangd ywangd deleted the es-134088-fix branch September 5, 2025 10:22


Development

Successfully merging this pull request may close these issues.

[CI] ClusterInfoServiceIT testMaxQueueLatenciesInClusterInfo failing
