
Conversation

@schase-es
Contributor

@schase-es schase-es commented Aug 1, 2025

The RestoreInProgressAllocationDecider can issue a grave message about shard restoration failure when declining a shard allocation. Sometimes this is because the restore actually failed, but sometimes it is because another decider declined the allocation. This change adds a check of UnassignedInfo so the message is only issued when it is appropriate.

Fixes ES-11809
Fixes #100233

@schase-es schase-es requested a review from nicktindall August 1, 2025 04:47
@schase-es schase-es added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Aug 1, 2025
@elasticsearchmachine elasticsearchmachine added v9.2.0 Team:Distributed Coordination Meta label for Distributed Coordination team labels Aug 1, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@schase-es
Contributor Author

I'm not sure about the first test case... it issues a message saying that the other allocation deciders did it. But there are no other allocation deciders...

Comment on lines 67 to 71
return allocation.decision(
Decision.NO,
NAME,
"shard was prevented from being allocated on all nodes because of other allocation deciders"
);
Contributor

@mhl-b mhl-b Aug 6, 2025

I don't think a decider should assume anything about other deciders; by design, each decider speaks for itself. If you assume there is at least one other decider that says NO, then return YES here.

Also, it's a bit hard to follow all the prerequisites for the last "else" statement. As I understand it, it could be a restoreInProgress with state.completed == true and/or failed allocations == 0. Why should it say NO in that case?

A comment line would help.
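Roughly the shape I have in mind, just as a sketch (restoreHasFailed here is a made-up placeholder for whatever check this decider actually owns, not a real method; the real branches differ):

// Sketch only: the decider answers for its own concern and otherwise steps aside.
@Override
public Decision canAllocate(ShardRouting shardRouting, RoutingAllocation allocation) {
    if (shardRouting.recoverySource() == null
        || shardRouting.recoverySource().getType() != RecoverySource.Type.SNAPSHOT) {
        // Not a snapshot restore at all: not this decider's concern, so say YES
        // and let the other deciders speak for themselves.
        return allocation.decision(Decision.YES, NAME, "shard is not being restored from a snapshot");
    }
    if (restoreHasFailed(shardRouting, allocation)) {
        // This is the one thing this decider is responsible for.
        return allocation.decision(Decision.NO, NAME, "shard restore from snapshot has failed");
    }
    // Restore still in progress or already finished: nothing for this decider to block.
    return allocation.decision(Decision.YES, NAME, "shard restore is in progress or has completed");
}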

Contributor

I agree, I'm not 100% clear on why we should be returning NO if the specific thing we're responsible for is not in play here.

Am I right in understanding that the shard might have exhausted its retries because either

  • There was some failure to restore the shard (e.g. file corruption)
  • The shard could not be assigned anywhere (i.e. this case)

And we want to distinguish the former from the latter? i.e. the RestoreInProgressAllocationDecider is right to say no, because we have stopped attempting to restore the shard, but there were no failed allocations so we shouldn't indicate that there will be failures in the logs.

I think there's a lot of assumed knowledge about the restore-in-progress lifecycle here and it would benefit from some comments or potentially restructuring.

Contributor Author

Thanks for these comments -- they are helpful in clarifying what the issues are.

This feature is rather small, but because it manages to touch snapshot restores, cluster state, allocation, and index state changes it is a lot of reading to understand how everything connects together.

I agree that the decider should say yes when there is not a failure. The issue of speaking for other allocation deciders came from two places. First, I was trying to preserve the decision of the allocation decider; the first patch changed the message but not the decision. Second, I was trying to interpret David's comments without understanding the lifecycle of a restoration or that much about deciders.

Here is what I've worked out about the state changes.

ShardRouting and RestoreInProgress change during snapshot recovery with four events:

  • startRestore: RestoreService.startRestore starts a recovery (API-driven).
  • allocation: BalancedShardAllocator.allocateUnassigned assigns a shard (routingNodes.initializeShard).
  • failure: AllocationService.applyFailedShards fails a shard.
  • restoreComplete: AllocationService.applyStartedShards completes a shard recovery.

After startRestore, the ShardRoutingState is UNASSIGNED and the RestoreInProgress state is INIT.
After allocation, the ShardRoutingState is INITIALIZING and the RestoreInProgress state is STARTED.
After failure, the ShardRoutingState is UNASSIGNED and the RestoreInProgress state is FAILURE. The failedAllocations count on UnassignedInfo is incremented to at least one, and a message and exception are attached.
After restoreComplete, the ShardRoutingState is STARTED and the RestoreInProgress state is SUCCESS.

The decider is supposed to refuse to allocate a shard anywhere once snapshot recovery fails.

  • in startRestore and allocation, the decision is YES: the shardRestoreStatus is set, and it's not completed (not SUCCESS or FAILURE).
  • in failure, UnassignedInfo's failedAllocations count triggers a NO.
  • in restoreComplete (ShardRouting.moveToStarted), this is screened out in the first test because the recovery source is null.

I'm unclear on when this last YES will be reached.

One detail I noticed: in the failure transition, AllocationService.applyFailedShards does not update the cluster state, and the RestoreInProgress passed through to allocation, until after it runs AllocationService.reroute. So if the ShardRoutingState is newly UNASSIGNED but the RestoreInProgress is still a stale STARTED (left over from the allocation transition), the shard will continue to be considered for assignment for one more allocation round.
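
To keep the mapping straight, here's a rough sketch of how the states above could translate into decisions (illustrative only: statusFor is a made-up helper for pulling the ShardRestoreStatus out of the RestoreInProgress custom, and the real decider's branches differ in the details):

// Illustrative sketch of the state-to-decision mapping above; statusFor() is hypothetical.
RestoreInProgress.ShardRestoreStatus shardRestoreStatus = statusFor(shardRouting, allocation);
UnassignedInfo unassignedInfo = shardRouting.unassignedInfo();

if (shardRestoreStatus == null) {
    // No restore tracked for this shard (e.g. the entry was cleaned up after restoreComplete).
    return allocation.decision(Decision.YES, NAME, "shard is not being restored from a snapshot");
}
if (shardRestoreStatus.state().completed() == false) {
    // startRestore (INIT) or allocation (STARTED): the restore is still running, allow allocation.
    return allocation.decision(Decision.YES, NAME, "shard restore from snapshot is in progress");
}
if (unassignedInfo != null && unassignedInfo.failedAllocations() > 0) {
    // failure: the restore itself failed at least once, so refuse to allocate.
    return allocation.decision(Decision.NO, NAME, "shard has failed to be restored from the snapshot");
}
// Completed but with no failed allocations: something other than a restore failure kept the shard
// unassigned, so this decider should not claim a restore failure.
return allocation.decision(Decision.YES, NAME, "restore is not in progress; other deciders decide allocation");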

Contributor

Ah I missed this conversation. Here's a test that shows the problem:

diff --git a/server/src/internalClusterTest/java/org/elasticsearch/snapshots/RestoreSnapshotIT.java b/server/src/internalClusterTest/java/org/elasticsearch/snapshots/RestoreSnapshotIT.java
index 953cddba0ab7..3741562deb40 100644
--- a/server/src/internalClusterTest/java/org/elasticsearch/snapshots/RestoreSnapshotIT.java
+++ b/server/src/internalClusterTest/java/org/elasticsearch/snapshots/RestoreSnapshotIT.java
@@ -12,6 +12,8 @@ package org.elasticsearch.snapshots;
 import org.apache.logging.log4j.Level;
 import org.elasticsearch.action.ActionFuture;
 import org.elasticsearch.action.ActionRequestBuilder;
+import org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainRequest;
+import org.elasticsearch.action.admin.cluster.allocation.TransportClusterAllocationExplainAction;
 import org.elasticsearch.action.admin.cluster.snapshots.restore.RestoreSnapshotResponse;
 import org.elasticsearch.action.admin.indices.settings.get.GetSettingsResponse;
 import org.elasticsearch.action.admin.indices.template.get.GetIndexTemplatesResponse;
@@ -20,6 +22,8 @@ import org.elasticsearch.client.internal.Client;
 import org.elasticsearch.cluster.block.ClusterBlocks;
 import org.elasticsearch.cluster.metadata.IndexMetadata;
 import org.elasticsearch.cluster.metadata.MappingMetadata;
+import org.elasticsearch.cluster.routing.allocation.decider.Decision;
+import org.elasticsearch.common.Strings;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.common.unit.ByteSizeUnit;
 import org.elasticsearch.core.TimeValue;
@@ -41,6 +45,7 @@ import java.util.Collections;
 import java.util.List;
 import java.util.Locale;
 import java.util.Map;
+import java.util.Set;
 import java.util.concurrent.TimeUnit;
 import java.util.stream.Collectors;
 import java.util.stream.IntStream;
@@ -1025,4 +1030,65 @@ public class RestoreSnapshotIT extends AbstractSnapshotIntegTestCase {
             mockLog.assertAllExpectationsMatched();
         }
     }
+
+    public void testExplainUnassigableDuringRestore() {
+        final String repoName = "repo-" + randomIdentifier();
+        createRepository(repoName, FsRepository.TYPE);
+        final String indexName = "index-" + randomIdentifier();
+        createIndexWithContent(indexName);
+        final String snapshotName = "snapshot-" + randomIdentifier();
+        createSnapshot(repoName, snapshotName, List.of(indexName));
+        assertAcked(indicesAdmin().prepareDelete(indexName));
+
+        final RestoreSnapshotResponse restoreSnapshotResponse = clusterAdmin().prepareRestoreSnapshot(
+            TEST_REQUEST_TIMEOUT,
+            repoName,
+            snapshotName
+        )
+            .setIndices(indexName)
+            .setRestoreGlobalState(false)
+            .setWaitForCompletion(true)
+            .setIndexSettings(
+                Settings.builder().put(IndexMetadata.INDEX_ROUTING_REQUIRE_GROUP_PREFIX + "._name", "not-a-node-" + randomIdentifier())
+            )
+            .get();
+
+        logger.info("--> restoreSnapshotResponse: {}", Strings.toString(restoreSnapshotResponse, true, true));
+        assertThat(restoreSnapshotResponse.getRestoreInfo().failedShards(), greaterThan(0));
+
+        final var clusterExplainResponse1 = client().execute(
+            TransportClusterAllocationExplainAction.TYPE,
+            new ClusterAllocationExplainRequest(TEST_REQUEST_TIMEOUT).setIndex(indexName).setShard(0).setPrimary(true)
+        ).actionGet();
+
+        logger.info("--> clusterExplainResponse1: {}", Strings.toString(clusterExplainResponse1, true, true));
+        for (var nodeDecision : clusterExplainResponse1.getExplanation()
+            .getShardAllocationDecision()
+            .getAllocateDecision()
+            .getNodeDecisions()) {
+            assertEquals(
+                Set.of("restore_in_progress", "filter"),
+                nodeDecision.getCanAllocateDecision().getDecisions().stream().map(Decision::label).collect(Collectors.toSet())
+            );
+        }
+
+        updateIndexSettings(Settings.builder().putNull(IndexMetadata.INDEX_ROUTING_REQUIRE_GROUP_PREFIX + "._name"), indexName);
+
+        final var clusterExplainResponse2 = client().execute(
+            TransportClusterAllocationExplainAction.TYPE,
+            new ClusterAllocationExplainRequest(TEST_REQUEST_TIMEOUT).setIndex(indexName).setShard(0).setPrimary(true)
+        ).actionGet();
+
+        logger.info("--> clusterExplainResponse2: {}", Strings.toString(clusterExplainResponse2, true, true));
+        for (var nodeDecision : clusterExplainResponse2.getExplanation()
+            .getShardAllocationDecision()
+            .getAllocateDecision()
+            .getNodeDecisions()) {
+            assertEquals(
+                Set.of("restore_in_progress"),
+                nodeDecision.getCanAllocateDecision().getDecisions().stream().map(Decision::label).collect(Collectors.toSet())
+            );
+        }
+
+    }
 }

Note that the shard couldn't be allocated because of the filter, but then when the filter is removed we still mustn't allocate the shard because the restore is no longer in progress.

Contributor

The message in this latter case is:

shard has failed to be restored from the snapshot [default:repo-sdvofzvibks:snapshot-iqqndiwoes/HWiV1JnTSRKor0KieINqWA] - manually close or delete the index [index-djtuceqgi] in order to retry to restore the snapshot again or use the reroute API to force the allocation of an empty primary shard. Details: [restore_source[repo-sdvofzvibks/snapshot-iqqndiwoes]]

The issue is that this message sends folks off looking for problems with the snapshot repository when in fact we never even tried to contact the snapshot repository; the obstacle was elsewhere and may no longer exist. Moreover, the Details: bit at the end is pointless; it's actually less detailed than the snapshot ID default:repo-sdvofzvibks:snapshot-iqqndiwoes/HWiV1JnTSRKor0KieINqWA logged earlier.

Contributor Author

I think I understand now -- this is more of an issue with how a cluster state change that voids the context of the restore (the index underneath it changes) cascades through the allocator.

I'm seeing where this happens in RestoreService.updateRestoreStateWithDeletedIndices. And where failed/completed restores are cleaned out in RestoreService.removeCompletedRestoresFromClusterState.

Contributor Author

It seems like it might be more accurate in the allocation failure message to state that some aspect of the cluster state context has changed.

- decider says yes for irrelevant cases
- tests are updated
- comment explains the state being checked at the end
Contributor

@mhl-b mhl-b left a comment

LGTM, thanks for the thorough explanation, it helps. I would change the last YES message. No need for a second review from me.

source.snapshot(),
shardRouting.getIndexName(),
shardRouting.unassignedInfo().details()
"shard was prevented from being allocated on all nodes because of other allocation deciders"
Contributor

That's a strange way to say YES, and it's not always true. Looking at the code above, it should be "YES / shard successfully restored".

Comment on lines +59 to +60
UnassignedInfo unassignedInfo = shardRouting.unassignedInfo();
if (unassignedInfo.failedAllocations() > 0) {
Contributor

Looking good; matches what David said in #100233.


/**
* POST: the RestoreInProgress.ShardRestoreStatus is either failed or succeeded. This section turns a
* turn a shard failure into a NO decision to allocate. See {@link AllocationService.applyFailedShards}
Contributor

"turns a turn a shard..." typo in javadoc?

Contributor

@nicktindall nicktindall left a comment

This LGTM, pending clarification on the message

return allocation.decision(
Decision.NO,
NAME,
"shard was prevented from being allocated on all nodes because of other allocation deciders in previous rounds"
Contributor

Could this be something like
"Restore from snapshot failed because the configured constraints prevented allocation on any of the available nodes. Please check constraints applied in index and cluster settings, then retry the restore"

Just to avoid the talk about other allocation deciders (which seems like an abstract concept, though I might be misunderstanding who the audience is here). Also, the tip to check the constraints and retry might be helpful?

Happy to go with consensus here
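
Roughly, something like this (wording purely illustrative, not a final proposal):

// Rough illustration only; message text is up for discussion.
return allocation.decision(
    Decision.NO,
    NAME,
    "restore from snapshot [%s] failed because the configured constraints prevented allocation on any of the "
        + "available nodes; please check constraints applied in index and cluster settings, then retry the restore",
    source.snapshot()
);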

Contributor Author

That sounds like a better message.

I'm also wondering if there's something like "go look for this earlier in the logs?"

Contributor

Perhaps we could instruct them to use the allocation explain API (even though the reason the allocation was prevented might have since disappeared).

Contributor Author

I can reference the allocation explain API... the reason disappearing seems like a lousy way to respond

Contributor

Yes it does, maybe let's not be explicit about that. The allocation explain API may shed some light in many cases though I think?

Contributor Author

I'm sure -- maybe we can create a separate task for later to persist the reasoning from RestoreInProgress somewhere in UnassignedInfo/ClusterState.

@schase-es schase-es dismissed DaveCTurner’s stale review September 27, 2025 00:34

Comment on decider has been resolved -- unsure how to remove this otherwise.

@schase-es schase-es merged commit 2787546 into elastic:main Sep 27, 2025
34 checks passed
szybia added a commit to szybia/elasticsearch that referenced this pull request Sep 29, 2025

Successfully merging this pull request may close these issues.

RestoreInProgressAllocationDecider doesn't really explain how to investigate why the shard was not restored