[Test] Use generic for GlobalCheckpointSyncAction #134180
Redirect the GlobalCheckpointSyncAction to the generic threadpool so that we have precise control over the write threadpool for load and latency assertions. Resolves: elastic#134088
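To illustrate the idea, here is a minimal sketch using plain `java.util.concurrent` executors rather than Elasticsearch's thread pools. The class name `IsolatedPoolSketch` and the pools `writePool` and `genericPool` are hypothetical stand-ins, and the empty follow-up tasks only play the role of the sync action. The point is that once follow-up work runs elsewhere, the completed-task count of the measured pool depends only on the tasks the test submitted itself.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class IsolatedPoolSketch {

    // Fixed-size pool with an unbounded queue; a stand-in, not an Elasticsearch executor.
    static ThreadPoolExecutor newPool(int threads) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    static void shutdownAndWait(ThreadPoolExecutor pool) throws InterruptedException {
        pool.shutdown();
        if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("pool did not terminate in time");
        }
    }

    /** Returns {tasks completed on the measured pool, tasks completed on the auxiliary pool}. */
    static long[] run(int writes, int syncsPerWrite) throws InterruptedException {
        ThreadPoolExecutor writePool = newPool(2);   // the pool the test asserts on
        ThreadPoolExecutor genericPool = newPool(2); // where follow-up work is redirected
        CountDownLatch writesDone = new CountDownLatch(writes);

        for (int i = 0; i < writes; i++) {
            writePool.execute(() -> {
                // Follow-up work (hypothetical stand-in for a sync action) goes to the other pool,
                // so it never shows up in the measured pool's statistics.
                for (int j = 0; j < syncsPerWrite; j++) {
                    genericPool.execute(() -> {});
                }
                writesDone.countDown();
            });
        }

        writesDone.await();
        shutdownAndWait(writePool);
        shutdownAndWait(genericPool);
        return new long[] { writePool.getCompletedTaskCount(), genericPool.getCompletedTaskCount() };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] counts = run(5, 2);
        System.out.println("write=" + counts[0] + " generic=" + counts[1]);
    }
}
```

With this arrangement the first count equals the number of submitted writes regardless of how much follow-up work each write triggers.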
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
    // Wait for async post replication actions to complete
    final var checkpointsSyncLatch = new CountDownLatch(numShards);
    for (int i = 0; i < numShards; ++i) {
        final var indexShard = indexService.getShard(i);
        final long expectedGlobalCheckpoint = indexShard.seqNoStats().getGlobalCheckpoint();
        logger.info("--> shard [{}] waiting for global checkpoint {}", i, expectedGlobalCheckpoint);
        indexShard.addGlobalCheckpointListener(expectedGlobalCheckpoint, new GlobalCheckpointListeners.GlobalCheckpointListener() {
            @Override
            public Executor executor() {
                return EsExecutors.DIRECT_EXECUTOR_SERVICE;
            }

            @Override
            public void accept(long globalCheckpoint, Exception e) {
                assertNull(e); // should have no error
                logger.info("--> shard [{}] global checkpoint updated to {}", indexShard.shardId().id(), globalCheckpoint);
                checkpointsSyncLatch.countDown();
            }
        }, TimeValue.THIRTY_SECONDS);
    }
    safeAwait(checkpointsSyncLatch);
This is the previous attempt at fixing the issue. Unfortunately it does not work because:
- The listener is called while the task is still running on the write threadpool. There is no guarantee about when the task completely finishes its lifecycle, i.e. has gone through the afterExecute phase.
- An earlier checkpoint update action can sometimes see the latest in-memory value and update to it. When this happens, the listener is called while later checkpoint update actions are still queued. The later actions are effectively no-ops. This is fine for checkpoint updates, but it breaks our test assumption.
I tried different ways to determine when the sync actions are completely off the thread pool but didn't manage to find a solution. Therefore, I went with redirecting them to a different threadpool. Hence this PR.
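The first point can be reproduced with a plain `java.util.concurrent.ThreadPoolExecutor`. The sketch below is an analogy under stated assumptions, not Elasticsearch code: the class name `AfterExecuteRace` and both latches are hypothetical, and `afterExecute` is deliberately held open so the effect is deterministic. A waiter that is released from inside the task samples the pool's completed-task count while the worker is still in its post-task phase.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class AfterExecuteRace {

    /** Returns {completed count seen right after the callback fired, completed count after termination}. */
    static long[] sampleCompletedCounts() throws InterruptedException {
        CountDownLatch listenerCalled = new CountDownLatch(1);
        CountDownLatch waiterHasLooked = new CountDownLatch(1);

        ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>()) {
            @Override
            protected void afterExecute(Runnable r, Throwable t) {
                // Keep the worker in its post-task phase until the waiter has sampled the stats.
                // A real pool would not block here; this only makes the window observable.
                try {
                    waiterHasLooked.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                super.afterExecute(r, t);
            }
        };

        // The "listener" fires from inside the task, i.e. before afterExecute runs.
        pool.execute(listenerCalled::countDown);

        listenerCalled.await();
        // The callback has fired, but the pool has not yet accounted for the task.
        long seenByWaiter = pool.getCompletedTaskCount();
        waiterHasLooked.countDown();

        pool.shutdown();
        if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("pool did not terminate in time");
        }
        return new long[] { seenByWaiter, pool.getCompletedTaskCount() };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] counts = sampleCompletedCounts();
        System.out.println("seen by waiter=" + counts[0] + ", after termination=" + counts[1]);
    }
}
```

The waiter sees a count of 0 even though its callback has already run; the count only reaches 1 once the worker has finished its bookkeeping. Any assertion on pool statistics made right after the callback is exposed to the same window.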
            true,
            true
        )
    );
Is it possible we're fixing the wrong problem here, i.e. if the test is sensitive to other things happening in the write pool, is it likely to flap any time someone does some new work on the write pool? Perhaps we could instead isolate the thing we're measuring, or change the assertion somehow?
I think it is not very feasible to "isolate the thing we're measuring" at a single thread pool level. The alternative is assertBusy. But based on the original discussion, the preference is to avoid it and be explicit about the other activities. Therefore, I would consider it somewhat a "feature" if it fails in future for new "write" activities, as in we should be aware of exactly what's going on with the write thread pool since every action matters in production.
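For contrast, the polling alternative mentioned above has roughly the following shape. This is a simplified, hypothetical helper (`retryUntilPasses` in a class named `BusyAssertSketch`), not the actual assertBusy implementation: it re-runs an assertion with a short backoff until it passes or a timeout expires.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class BusyAssertSketch {

    /**
     * Re-runs the assertion until it stops throwing AssertionError or the timeout expires.
     * Returns the number of attempts made. Hypothetical helper for illustration only.
     */
    static int retryUntilPasses(Runnable assertion, long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        long sleepMillis = 1;
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                assertion.run();
                return attempts;
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // give up and surface the last failure
                }
            }
            Thread.sleep(sleepMillis);
            sleepMillis = Math.min(sleepMillis * 2, 50); // capped exponential backoff
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();
        int attempts = retryUntilPasses(() -> {
            // Stand-in for "the pool has gone quiet": only true from the third check onwards.
            if (calls.incrementAndGet() < 3) {
                throw new AssertionError("not settled yet");
            }
        }, 5_000);
        System.out.println("passed after " + attempts + " attempts");
    }
}
```

Polling like this tolerates unknown background activity, which is exactly why it can hide new work landing on the measured pool; the explicit approach fails instead.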
LGTM
@elasticmachine update branch