-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add FailedShardEntry info to shard-failed task source string #125520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
|
Hi @JeremyDahlgren, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While reading through this code and verifying the change I was using ClusterDisruptionIT.testSendingShardFailure() to see the task being submitted in ShardFailedTransportHandler.messageReceived() with the updated source string. I didn't see an easy way to modify the test to directly verify the change.
I ended up adding a unit test case to collect the submitted task and inspect the source string, 8cada48.
Refactored to use an integration test case, per Yang's review comment. Commit 0f7b047.
server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java
Outdated
Show resolved
Hide resolved
Appends the FailedShardEntry request to the 'shard-failed' task source string in ShardFailedTransportHandler.messageReceived(). This information will now be available in the 'source' string for shard failed task entries in the Cluster Pending Tasks API response. This source string change matches what is done in the ShardStartedTransportHandler. Closes elastic#102606.
8cada48 to
25d46b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have comment on how the changes should be tested.
| } | ||
| } | ||
|
|
||
| public void testShardFailedTransportHandlerSubmitTaskSourceStringIncludesRequestInfo() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer an integration test for this change which will more directly get the information from MasterService instead of mocking the queue and bypassing MasterService. Technically, what an user sees is the source field of PendingClusterTask and that is what we want to fix. Currently it is indeed copied all the way from the source argument of submitTask method. But I'd rather to not make that implementation assumption in the test.
Concretely, I think we can add a test to ShardStateIT that does the following:
- Create an index and find its associated node and
IndicesServicesimilar to this. For simplicity, the index can have just 1 shard and no replica. - Create a blocking task queue on the masterService and submit a task to ensure it is blocked similar to this
- Fail the shard similar to this
- While the MasterService is blocked, assert that it receive a new pending task for shard failure and check its source string, e.g. something like
assertThat(clusterService.getMasterService().pendingTasks().stream().anyMatch(t -> t.getSource()...)wrapped in anassertBusy. - Unblock MasterService
- Wait for the index to recover and finish the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified the test per our call earlier, commit c3698fa. I'll use a separate branch to investigate the possible race condition in the version that attempts to block and wait for the shard-started task.
be33147 to
0f7b047
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| safeAwait(barrier); | ||
| batchExecutionContext.taskContexts().forEach(c -> c.success(() -> {})); | ||
| return batchExecutionContext.initialState(); | ||
| }).submitTask("initial-block", ignored -> {}, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think it's useful to assert that the onFailure of a task is never called, e.g.:
| }).submitTask("initial-block", ignored -> {}, null); | |
| }).submitTask("initial-block", e -> fail(e, "unexpected"), null); |
| safeAwait(barrier); | ||
|
|
||
| // Obtain a reference to the IndexShard for shard 0. | ||
| final var state = clusterAdmin().prepareState(TEST_REQUEST_TIMEOUT).get().getState(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, you can get the state with masterNodeClusterService.state().
|
|
||
| // Unblock the master service from the blocked executor and allow the failed shard task to get processed. | ||
| safeAwait(barrier); | ||
| assertBusy(() -> assertTrue(masterService.pendingTasks().isEmpty())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to drop this assertion since it is not really relevant. It may also add up to the total test time unnecessarily and has a potential (though tiny) to fail if the CI machine is really slow or the cluster does something rare.
…#125520) Appends the FailedShardEntry request to the 'shard-failed' task source string in ShardFailedTransportHandler.messageReceived(). This information will now be available in the 'source' string for shard failed task entries in the Cluster Pending Tasks API response. This source string change matches what is done in the ShardStartedTransportHandler. Closes elastic#102606.
Appends the
FailedShardEntryrequest to the 'shard-failed' task source string inShardFailedTransportHandler.messageReceived(). This information will now be available in the 'source' string for shard failed task entries in the Cluster Pending Tasks API response. This source string change matches what is done in theShardStartedTransportHandler.Closes #102606.