Add FailedShardEntry info to shard-failed task source string #125520

JeremyDahlgren · 2025-03-24T17:37:55Z

Appends the FailedShardEntry request to the 'shard-failed' task source string in ShardFailedTransportHandler.messageReceived(). This information will now be available in the 'source' string for shard failed task entries in the Cluster Pending Tasks API response. This source string change matches what is done in the ShardStartedTransportHandler.
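For readers without the diff at hand, here is a minimal sketch of the kind of change described above. The `FailedShardEntry` record is a hypothetical, heavily simplified stand-in for the real request class, and the exact `toString()` format is an assumption for illustration only; the real handler lives in ShardStateAction and may format the entry differently.

```java
public class ShardFailedSourceSketch {
    /** Hypothetical, simplified stand-in for the real FailedShardEntry request (fields are assumptions). */
    public record FailedShardEntry(String index, int shardId, String allocationId, String message) {
        @Override
        public String toString() {
            // Illustrative format only; not the real class's output.
            return "FailedShardEntry{shardId [[" + index + "][" + shardId + "]], allocationId ["
                + allocationId + "], message [" + message + "]}";
        }
    }

    /** Source string before the change: only the fixed task name. */
    public static String sourceBefore(FailedShardEntry request) {
        return "shard-failed";
    }

    /** Source string after the change: the task name followed by the request details. */
    public static String sourceAfter(FailedShardEntry request) {
        return "shard-failed " + request;
    }

    public static void main(String[] args) {
        var entry = new FailedShardEntry("test-index", 0, "alloc-1", "simulated failure");
        System.out.println(sourceBefore(entry));
        System.out.println(sourceAfter(entry));
    }
}
```

With the entry appended, someone reading a pending-task listing can tell which shard a given shard-failed task refers to without correlating it against logs.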

Closes #102606.

elasticsearchmachine · 2025-03-24T17:38:19Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

elasticsearchmachine · 2025-03-24T17:38:41Z

Hi @JeremyDahlgren, I've created a changelog YAML for you.

JeremyDahlgren

While reading through this code and verifying the change I was using ClusterDisruptionIT.testSendingShardFailure() to see the task being submitted in ShardFailedTransportHandler.messageReceived() with the updated source string. I didn't see an easy way to modify the test to directly verify the change.

~~I ended up adding a unit test case to collect the submitted task and inspect the source string, 8cada48.~~

Refactored to use an integration test case, per Yang's review comment. Commit 0f7b047.

server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java

ywangd

I have a comment on how the changes should be tested.

ywangd · 2025-03-27T04:15:56Z

server/src/test/java/org/elasticsearch/cluster/action/shard/ShardStateActionTests.java

        }
    }

+    public void testShardFailedTransportHandlerSubmitTaskSourceStringIncludesRequestInfo() {


I'd prefer an integration test for this change, one that gets the information directly from MasterService instead of mocking the queue and bypassing MasterService. Technically, what a user sees is the source field of PendingClusterTask, and that is what we want to fix. Currently it is indeed copied all the way from the source argument of the submitTask method, but I'd rather not bake that implementation assumption into the test.

Concretely, I think we can add a test to ShardStateIT that does the following:

1. Create an index and find its associated node and IndicesService, similar to this. For simplicity, the index can have just 1 shard and no replicas.
2. Create a blocking task queue on the MasterService and submit a task to ensure it is blocked, similar to this.
3. Fail the shard, similar to this.
4. While the MasterService is blocked, assert that it receives a new pending task for the shard failure and check its source string, e.g. something like assertThat(clusterService.getMasterService().pendingTasks().stream().anyMatch(t -> t.getSource()...) wrapped in an assertBusy.
5. Unblock the MasterService.
6. Wait for the index to recover and finish the test.
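The block, submit, inspect, unblock pattern outlined above can be sketched outside Elasticsearch with plain JDK concurrency primitives. Everything here is an assumption for illustration: `ToyMasterService` is a hypothetical single-threaded queue, not the real MasterService, and the source string is made up. Because the toy records the task synchronously on submit, no assertBusy-style polling is needed; in the real test the shard failure reaches the master asynchronously, which is why polling is suggested there.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BlockedQueueSketch {
    /** Hypothetical single-threaded task queue that tracks the source of every task not yet completed. */
    public static final class ToyMasterService implements AutoCloseable {
        private final ExecutorService executor = Executors.newSingleThreadExecutor();
        private final List<String> pending = new CopyOnWriteArrayList<>();

        public void submitTask(String source, Runnable task) {
            pending.add(source);
            executor.execute(() -> {
                try {
                    task.run();
                } finally {
                    pending.remove(source);
                }
            });
        }

        public List<String> pendingTasks() {
            return List.copyOf(pending);
        }

        @Override
        public void close() {
            executor.shutdown();
            try {
                executor.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private static void await(CyclicBarrier barrier) {
        try {
            barrier.await(10, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new AssertionError("barrier wait failed", e);
        }
    }

    /** Returns true if the shard-failed task was visible, with its details, while the queue was blocked. */
    public static boolean runScenario() {
        try (var master = new ToyMasterService()) {
            var barrier = new CyclicBarrier(2);
            // Block the queue: the task signals that it is running, then waits to be released.
            master.submitTask("initial-block", () -> {
                await(barrier);
                await(barrier);
            });
            await(barrier); // the blocking task now occupies the only worker thread

            // Submit the task under test; it cannot run while the queue is blocked.
            master.submitTask("shard-failed FailedShardEntry{shardId [[test][0]]}", () -> {});

            // Inspect the pending tasks while the queue is still blocked.
            boolean seen = master.pendingTasks().stream()
                .anyMatch(s -> s.startsWith("shard-failed ") && s.contains("[test][0]"));

            await(barrier); // release the blocking task so the queue drains before close()
            return seen;
        }
    }

    public static void main(String[] args) {
        System.out.println("shard-failed task observed while blocked: " + runScenario());
    }
}
```

Reusing one two-party barrier for both the "blocked" signal and the release keeps the handshake symmetric: each side calls await once per phase.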

Thanks @ywangd, I've switched to an integration test in 0f7b047 per your outline.

I simplified the test per our call earlier, commit c3698fa. I'll use a separate branch to investigate the possible race condition in the version that attempts to block and wait for the shard-started task.

ywangd

LGTM

ywangd · 2025-03-27T23:49:26Z

.../src/internalClusterTest/java/org/elasticsearch/cluster/routing/allocation/ShardStateIT.java

+            safeAwait(barrier);
+            batchExecutionContext.taskContexts().forEach(c -> c.success(() -> {}));
+            return batchExecutionContext.initialState();
+        }).submitTask("initial-block", ignored -> {}, null);


Nit: I think it's useful to assert that the onFailure of a task is never called, e.g.:

Suggested change

-        }).submitTask("initial-block", ignored -> {}, null);
+        }).submitTask("initial-block", e -> fail(e, "unexpected"), null);
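To illustrate why the failing handler is the safer choice, here is a small self-contained sketch. The `submitTask` method below is a hypothetical stand-in that runs the task inline and routes exceptions to the handler; it is not the real task queue API, and `AssertionError` stands in for the test framework's `fail`.

```java
import java.util.function.Consumer;

public class FailureHandlerSketch {
    /** Hypothetical stand-in for a task queue: runs the task and routes any exception to onFailure. */
    public static void submitTask(String source, Runnable task, Consumer<Exception> onFailure) {
        try {
            task.run();
        } catch (Exception e) {
            onFailure.accept(e);
        }
    }

    public static void main(String[] args) {
        Runnable broken = () -> { throw new IllegalStateException("simulated"); };

        // An ignoring handler swallows the failure, so the caller never learns about it.
        submitTask("initial-block", broken, ignored -> {});
        System.out.println("ignoring handler: failure went unnoticed");

        // A failing handler turns the same failure into a visible error.
        try {
            submitTask("initial-block", broken, e -> { throw new AssertionError("unexpected", e); });
        } catch (AssertionError e) {
            System.out.println("failing handler: " + e.getMessage() + ", cause: " + e.getCause().getMessage());
        }
    }
}
```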

ywangd · 2025-03-27T23:50:22Z

.../src/internalClusterTest/java/org/elasticsearch/cluster/routing/allocation/ShardStateIT.java

+        safeAwait(barrier);
+
+        // Obtain a reference to the IndexShard for shard 0.
+        final var state = clusterAdmin().prepareState(TEST_REQUEST_TIMEOUT).get().getState();


Alternatively, you can get the state with masterNodeClusterService.state().

ywangd · 2025-03-27T23:55:17Z

.../src/internalClusterTest/java/org/elasticsearch/cluster/routing/allocation/ShardStateIT.java

+
+        // Unblock the master service from the blocked executor and allow the failed shard task to get processed.
+        safeAwait(barrier);
+        assertBusy(() -> assertTrue(masterService.pendingTasks().isEmpty()));


I'd prefer to drop this assertion since it is not really relevant. It may also add to the total test time unnecessarily, and it has a small chance of failing if the CI machine is really slow or the cluster does something rare.

elasticsearchmachine added the v9.1.0 label Mar 24, 2025

ywangd reviewed Mar 25, 2025

JeremyDahlgren added 3 commits March 25, 2025 13:22

Update docs/changelog/125520.yaml

064b2fe

Don't include stack trace in source string, add unit test case

25d46b5

JeremyDahlgren force-pushed the fix/102606 branch from 8cada48 to 25d46b5 Compare March 25, 2025 17:22

JeremyDahlgren self-assigned this Mar 26, 2025

JeremyDahlgren requested a review from ywangd March 26, 2025 22:56

ywangd reviewed Mar 27, 2025

Switch to adding a new test case in ShardStateIT

0f7b047

JeremyDahlgren force-pushed the fix/102606 branch from be33147 to 0f7b047 Compare March 27, 2025 20:57

JeremyDahlgren added 2 commits March 27, 2025 17:04

Merge branch 'main' into fix/102606

0f5fb12

Simplify integration test case

c3698fa

ywangd approved these changes Mar 28, 2025

JeremyDahlgren added 2 commits March 28, 2025 08:46

Address code review comments

f3b0720

Merge branch 'main' into fix/102606

6a41914

JeremyDahlgren merged commit 89467b8 into elastic:main Mar 28, 2025
17 checks passed
