
Conversation

@ywangd
Member

@ywangd ywangd commented Sep 4, 2025

Redirect the GlobalCheckpointSyncAction to the generic threadpool so that we have precise control over the write threadpool for load and latency assertions.

Resolves: #134088

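The gist of the change, as a minimal hypothetical sketch: plain ExecutorService instances stand in for Elasticsearch's write and generic thread pools (this is not the actual GlobalCheckpointSyncAction code). Dispatching the background checkpoint sync onto the generic pool keeps the test's write-pool queue and latency measurements limited to its own indexing work.

// Hypothetical sketch only: plain executors stand in for the WRITE and GENERIC
// thread pools; not the actual GlobalCheckpointSyncAction implementation.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CheckpointSyncDispatchSketch {
    private final ExecutorService writePool = Executors.newFixedThreadPool(4);   // stand-in for the write pool
    private final ExecutorService genericPool = Executors.newCachedThreadPool(); // stand-in for the generic pool

    void syncGlobalCheckpoint(Runnable syncTask) {
        // Before this PR the sync ran on the write pool, competing with indexing work
        // and skewing the test's write-pool queue-latency assertions:
        // writePool.execute(syncTask);
        genericPool.execute(syncTask); // after: the background sync no longer touches the write pool
    }
}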
@ywangd ywangd requested a review from nicktindall September 4, 2025 23:44
@ywangd ywangd added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v9.2.0 labels Sep 4, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Sep 4, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Comment on lines -499 to -519
// Wait for async post replication actions to complete
final var checkpointsSyncLatch = new CountDownLatch(numShards);
for (int i = 0; i < numShards; ++i) {
    final var indexShard = indexService.getShard(i);
    final long expectedGlobalCheckpoint = indexShard.seqNoStats().getGlobalCheckpoint();
    logger.info("--> shard [{}] waiting for global checkpoint {}", i, expectedGlobalCheckpoint);
    indexShard.addGlobalCheckpointListener(expectedGlobalCheckpoint, new GlobalCheckpointListeners.GlobalCheckpointListener() {
        @Override
        public Executor executor() {
            return EsExecutors.DIRECT_EXECUTOR_SERVICE;
        }

        @Override
        public void accept(long globalCheckpoint, Exception e) {
            assertNull(e); // should have no error
            logger.info("--> shard [{}] global checkpoint updated to {}", indexShard.shardId().id(), globalCheckpoint);
            checkpointsSyncLatch.countDown();
        }
    }, TimeValue.THIRTY_SECONDS);
}
safeAwait(checkpointsSyncLatch);
Member Author


This is the previous attempt at fixing the issue. Unfortunately it does not work, because:

  1. The listener is called while the task is still running on the write threadpool. There is no guarantee about when the task completely finishes its lifecycle, i.e. goes through the afterExecute phase (see the sketch after this comment).
  2. An earlier checkpoint-update action can sometimes see the latest in-memory value and update to it. When this happens, the listener is called while later checkpoint-update actions are still queued. Those later actions are essentially no-ops, which is fine for the checkpoint update itself, but it breaks our test assumption.

I tried different ways to determine when the sync actions are completely off the thread pool but didn't manage to find a solution. Therefore, I went with redirecting them to a different threadpool, hence this PR.
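To illustrate point 1 above, here is a minimal self-contained sketch (hypothetical, not the test code): a latch released from inside a task can trip before the executor has run afterExecute for that task, so "listener fired" does not imply "task fully finished its lifecycle".

// Hypothetical sketch of point 1: the latch released inside the task can be
// observed before the executor runs afterExecute for that task.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ListenerVsAfterExecuteSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch listenerFired = new CountDownLatch(1);
        AtomicBoolean afterExecuteRan = new AtomicBoolean(false);

        ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 0, TimeUnit.SECONDS, new LinkedBlockingQueue<>()) {
            @Override
            protected void afterExecute(Runnable r, Throwable t) {
                super.afterExecute(r, t);
                afterExecuteRan.set(true); // runs only after the task body has returned
            }
        };

        pool.execute(() -> {
            // The "listener" is invoked from within the task, like the global checkpoint listener.
            listenerFired.countDown();
            // The task is still occupying the worker thread at this point.
        });

        listenerFired.await();
        // afterExecuteRan may still be false here: the worker thread has not
        // necessarily finished the task's lifecycle yet.
        System.out.println("afterExecute already ran? " + afterExecuteRan.get());
        pool.shutdown();
    }
}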

true,
true
)
);
Contributor


Is it possible we're fixing the wrong problem here, i.e. if the test is sensitive to other things happening in the write pool, is it likely to flap any time someone does some new work on the write pool? Perhaps we could instead isolate the thing we're measuring, or change the assertion somehow?

Member Author


I think it is not very feasible to "isolate the thing we're measuring" at the level of a single thread pool. The alternative is assertBusy, but based on the original discussion the preference is to avoid it and be explicit about the other activities. Therefore, I would consider it somewhat a "feature" if the test fails in future for new "write" activities: we should know exactly what is going on with the write thread pool, since every action matters in production.
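For context, an assertBusy-style alternative would look roughly like the polling sketch below (hypothetical code, not from the PR or from ESTestCase): keep retrying the latency assertion until it holds, which tolerates, rather than surfaces, unrelated write-pool activity.

// Hypothetical polling-based alternative (roughly what an assertBusy-style
// check does): retry the assertion until it passes or a timeout elapses.
// The PR deliberately avoids this so that unexpected write-pool work fails loudly.
import java.util.function.BooleanSupplier;

final class PollingAssertSketch {
    static void awaitCondition(BooleanSupplier condition, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (condition.getAsBoolean() == false) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError("condition not met within " + timeoutMillis + "ms");
            }
            Thread.sleep(100); // back off briefly between checks
        }
    }
}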

Contributor

@nicktindall nicktindall left a comment


LGTM

@ywangd ywangd added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) and removed auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) labels Sep 5, 2025
@ywangd
Member Author

ywangd commented Sep 5, 2025

@elasticmachine update branch

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 5, 2025
@ywangd
Member Author

ywangd commented Sep 5, 2025

@elasticmachine update branch

@ywangd
Member Author

ywangd commented Sep 5, 2025

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit c05c61d into elastic:main Sep 5, 2025
33 checks passed
@ywangd ywangd deleted the es-134088-fix branch September 5, 2025 10:22


Development

Successfully merging this pull request may close these issues.

[CI] ClusterInfoServiceIT testMaxQueueLatenciesInClusterInfo failing
