@JeremyDahlgren
Contributor

Appends the FailedShardEntry request to the 'shard-failed' task source string in ShardFailedTransportHandler.messageReceived(). This information will now be available in the 'source' string for shard failed task entries in the Cluster Pending Tasks API response. This source string change matches what is done in the ShardStartedTransportHandler.

Closes #102606.
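
For readers without the diff at hand, here is a minimal sketch of the shape of the change. Everything in it is a simplified stand-in: the nested `FailedShardEntry` class, its fields, its `toString()` format, and the single-space separator are assumptions for illustration, not the real Elasticsearch code.

```java
// Hypothetical, simplified stand-ins; not the real Elasticsearch classes.
final class ShardFailedSourceSketch {

    /** Minimal stand-in for the request; the fields and format are assumptions. */
    static final class FailedShardEntry {
        final String shardId;
        final String allocationId;
        final String message;

        FailedShardEntry(String shardId, String allocationId, String message) {
            this.shardId = shardId;
            this.allocationId = allocationId;
            this.message = message;
        }

        @Override
        public String toString() {
            return "FailedShardEntry{shardId [" + shardId + "], allocationId ["
                + allocationId + "], message [" + message + "]}";
        }
    }

    /** Before the change: a fixed label that carries no request context. */
    static String sourceBefore(FailedShardEntry request) {
        return "shard-failed";
    }

    /** After the change: the label followed by a description of the request. */
    static String sourceAfter(FailedShardEntry request) {
        return "shard-failed " + request;
    }

    public static void main(String[] args) {
        FailedShardEntry request = new FailedShardEntry("[my-index][0]", "abc123", "simulated failure");
        System.out.println(sourceBefore(request));
        System.out.println(sourceAfter(request));
    }
}
```

With the request appended, each pending shard-failed entry identifies the shard it refers to instead of showing only the bare label.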

@JeremyDahlgren added the >enhancement, :Distributed Coordination/Allocation, Supportability, and Team:Distributed Coordination labels on Mar 24, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @JeremyDahlgren, I've created a changelog YAML for you.

Contributor Author

@JeremyDahlgren left a comment

While reading through this code and verifying the change I was using ClusterDisruptionIT.testSendingShardFailure() to see the task being submitted in ShardFailedTransportHandler.messageReceived() with the updated source string. I didn't see an easy way to modify the test to directly verify the change.

I ended up adding a unit test case to collect the submitted task and inspect the source string, 8cada48.

Refactored to use an integration test case, per Yang's review comment. Commit 0f7b047.

Member

@ywangd left a comment

I have a comment on how the changes should be tested.

}
}

public void testShardFailedTransportHandlerSubmitTaskSourceStringIncludesRequestInfo() {
Member

I'd prefer an integration test for this change, one that gets the information more directly from MasterService instead of mocking the queue and bypassing MasterService. Technically, what a user sees is the source field of PendingClusterTask, and that is what we want to fix. Currently it is indeed copied all the way from the source argument of the submitTask method, but I'd rather not make that implementation assumption in the test.

Concretely, I think we can add a test to ShardStateIT that does the following:

  1. Create an index and find its associated node and IndicesService similar to this. For simplicity, the index can have just 1 shard and no replica.
  2. Create a blocking task queue on the masterService and submit a task to ensure it is blocked similar to this
  3. Fail the shard similar to this
  4. While the MasterService is blocked, assert that it receives a new pending task for the shard failure and check its source string, e.g. something like assertThat(clusterService.getMasterService().pendingTasks().stream().anyMatch(t -> t.getSource()...) wrapped in an assertBusy.
  5. Unblock MasterService
  6. Wait for the index to recover and finish the test
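
The block-then-inspect pattern in steps 2 to 5 can be sketched outside Elasticsearch with plain `java.util.concurrent`. This is only a toy model under stated assumptions: `BlockedQueueSketch`, `submitTask`, `pendingSources`, and the source strings are hypothetical names, and a single-threaded executor stands in for the MasterService.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model only: a single worker thread stands in for the master service.
final class BlockedQueueSketch {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final List<String> pending = new CopyOnWriteArrayList<>();

    /** Records the source as pending, then queues the task on the single worker. */
    void submitTask(String source, Runnable task) {
        pending.add(source);
        worker.execute(() -> {
            try {
                task.run();
            } finally {
                pending.remove(source);
            }
        });
    }

    /** Snapshot of the sources of tasks that are queued or currently running. */
    List<String> pendingSources() {
        return new ArrayList<>(pending);
    }

    private static void await(CyclicBarrier barrier) {
        try {
            barrier.await(10, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("barrier wait failed", e);
        }
    }

    /** Blocks the worker, submits a second task, and returns what was pending meanwhile. */
    static List<String> observeWhileBlocked() {
        BlockedQueueSketch queue = new BlockedQueueSketch();
        CyclicBarrier barrier = new CyclicBarrier(2);
        queue.submitTask("initial-block", () -> {
            await(barrier); // tell the caller that the worker is now occupied
            await(barrier); // stay blocked until the caller releases us
        });
        await(barrier); // the worker is blocked from here on
        queue.submitTask("shard-failed FailedShardEntry{shardId [[my-index][0]]}", () -> {});
        List<String> observed = queue.pendingSources();
        await(barrier); // unblock the worker so the queued task can run
        queue.worker.shutdown();
        try {
            if (!queue.worker.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("worker did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return observed;
    }

    public static void main(String[] args) {
        System.out.println(observeWhileBlocked());
    }
}
```

In this model the second submission is synchronous, so the pending list can be checked immediately. In the real test the shard failure reaches the master asynchronously, which is why the check in step 4 is wrapped in an assertBusy.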

Contributor Author

Thanks @ywangd, I've switched to an integration test in 0f7b047 per your outline.

Contributor Author

I simplified the test per our call earlier, commit c3698fa. I'll use a separate branch to investigate the possible race condition in the version that attempts to block and wait for the shard-started task.

Member

@ywangd left a comment

LGTM

safeAwait(barrier);
batchExecutionContext.taskContexts().forEach(c -> c.success(() -> {}));
return batchExecutionContext.initialState();
}).submitTask("initial-block", ignored -> {}, null);
Member

Nit: I think it's useful to assert that the onFailure of a task is never called, e.g.:

Suggested change
- }).submitTask("initial-block", ignored -> {}, null);
+ }).submitTask("initial-block", e -> fail(e, "unexpected"), null);

safeAwait(barrier);

// Obtain a reference to the IndexShard for shard 0.
final var state = clusterAdmin().prepareState(TEST_REQUEST_TIMEOUT).get().getState();
Member

Alternatively, you can get the state with masterNodeClusterService.state().


// Unblock the master service from the blocked executor and allow the failed shard task to get processed.
safeAwait(barrier);
assertBusy(() -> assertTrue(masterService.pendingTasks().isEmpty()));
Member

I'd prefer to drop this assertion since it is not really relevant. It may also add to the total test time unnecessarily, and it has a small chance of failing if the CI machine is really slow or the cluster does something rare.

@JeremyDahlgren merged commit 89467b8 into elastic:main on Mar 28, 2025
17 checks passed
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
…#125520)


Labels

:Distributed Coordination/Allocation: All issues relating to the decision making around placing a shard (both master logic & on the nodes)
>enhancement
Supportability: Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better.
Team:Distributed Coordination: Meta label for Distributed Coordination team
v9.1.0

Development

Successfully merging this pull request may close these issues.

[Pending Cluster Tasks][Allocation] Add context on 'shard-failed'

3 participants