
Conversation

@schase-es
Contributor

@schase-es schase-es commented Aug 1, 2025

The RestoreInProgressAllocationDecider can issue a grave message about shard restoration failure when declining a shard allocation. Sometimes this is because the restore actually failed, but sometimes it is because another decider declined the allocation. This change adds a check of UnassignedInfo so the message is only issued when it is appropriate.

Fixes ES-11809
Fixes #100233

@schase-es schase-es requested a review from nicktindall August 1, 2025 04:47
@schase-es schase-es added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Aug 1, 2025
@elasticsearchmachine elasticsearchmachine added v9.2.0 Team:Distributed Coordination Meta label for Distributed Coordination team labels Aug 1, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@schase-es
Contributor Author

I'm not sure about the first test case... it issues a message saying that the other allocation deciders did it. But there are no other allocation deciders...

Comment on lines 67 to 71
return allocation.decision(
Decision.NO,
NAME,
"shard was prevented from being allocated on all nodes because of other allocation deciders"
);
Contributor

@mhl-b mhl-b Aug 6, 2025

I don't think a decider should assume anything about other deciders; by design, each decider speaks for itself. If you assume there is at least one other decider that says NO, then return YES here.

Also, it's a bit hard to follow all the prerequisites for the last "else" statement. As I understand it, it could be a restoreInProgress with state.completed == true and/or failed allocations == 0. Why should it say NO in that case?

A comment line would help.
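Roughly the shape I have in mind, just as a sketch (restoreHasFailed here is a made-up placeholder for whatever check this decider actually owns, not a real method; the real branches differ):

// Sketch only: the decider answers for its own concern and otherwise steps aside.
@Override
public Decision canAllocate(ShardRouting shardRouting, RoutingAllocation allocation) {
    if (shardRouting.recoverySource() == null
        || shardRouting.recoverySource().getType() != RecoverySource.Type.SNAPSHOT) {
        // Not a snapshot restore at all: not this decider's concern, so say YES
        // and let the other deciders speak for themselves.
        return allocation.decision(Decision.YES, NAME, "shard is not being restored from a snapshot");
    }
    if (restoreHasFailed(shardRouting, allocation)) {
        // This is the one thing this decider is responsible for.
        return allocation.decision(Decision.NO, NAME, "shard restore from snapshot has failed");
    }
    // Restore still in progress or already finished: nothing for this decider to block.
    return allocation.decision(Decision.YES, NAME, "shard restore is in progress or has completed");
}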

Contributor

I agree, I'm not 100% clear on why we should be returning NO if the specific thing we're responsible for is not in play here.

Am I right in understanding that the shard might have exhausted its retries because either

  • There was some failure to restore the shard (e.g. file corruption)
  • The shard could not be assigned anywhere (i.e. this case)

And we want to distinguish the former from the latter? i.e. the RestoreInProgressAllocationDecider is right to say no, because we have stopped attempting to restore the shard, but there were no failed allocations so we shouldn't indicate that there will be failures in the logs.

I think there's a lot of assumed knowledge about the restore-in-progress lifecycle here and it would benefit from some comments or potentially restructuring.

Contributor Author

Thanks for these comments -- they are helpful in clarifying what the issues are.

This feature is rather small, but because it manages to touch snapshot restores, cluster state, allocation, and index state changes it is a lot of reading to understand how everything connects together.

I agree that the decider should say yes when there is not a failure. The issue of speaking for other allocation deciders came from two places. First, I was trying to preserve the decision of the allocation decider; the first patch changed the message but not the decision. Second, I was trying to interpret David's comments without understanding the lifecycle of a restoration or that much about deciders.

Here is what I've worked out about the state changes.

ShardRouting and RestoreInProgress change during snapshot recovery with four events:

  • startRestore: RestoreService.startRestore starts a recovery (API-driven).
  • allocation: BalancedShardAllocator.allocateUnassigned assigns a shard (routingNodes.initializeShard).
  • failure: AllocationService.applyFailedShards fails a shard.
  • restoreComplete: AllocationService.applyStartedShards completes a shard recovery.

After startRestore, the ShardRoutingState is UNASSIGNED and the RestoreInProgress state is INIT.
After allocation, the ShardRoutingState is INITIALIZING and the RestoreInProgress state is STARTED.
After failure, the ShardRoutingState is UNASSIGNED and the RestoreInProgress state is FAILURE. The failedAllocations count on UnassignedInfo is incremented to at least one, and a message and exception are attached.
After restoreComplete, the ShardRoutingState is STARTED and the RestoreInProgress state is SUCCESS.

The decider is supposed to refuse to allocate a shard anywhere once snapshot recovery fails.

  • in startRestore and allocation, the decision is YES: the shardRestoreStatus is set, and it's not completed (not SUCCESS or FAILURE).
  • in failure, UnassignedInfo's failedAllocations count triggers a NO.
  • in restoreComplete (ShardRouting.moveToStarted), this is screened out in the first test because the recovery source is null.

I'm unclear on when this last YES will be reached.

One detail I noticed: in the failure transition, AllocationService.applyFailedShards does not update the cluster state, and the RestoreInProgress passed through to allocation, until after it runs AllocationService.reroute. So if the ShardRoutingState is newly UNASSIGNED but the RestoreInProgress is still a stale STARTED (left over from the allocation transition), the shard will continue to be considered for assignment for one more allocation round.
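
To keep the mapping straight, here's a rough sketch of how the states above could translate into decisions (illustrative only: statusFor is a made-up helper for pulling the ShardRestoreStatus out of the RestoreInProgress custom, and the real decider's branches differ in the details):

// Illustrative sketch of the state-to-decision mapping above; statusFor() is hypothetical.
RestoreInProgress.ShardRestoreStatus shardRestoreStatus = statusFor(shardRouting, allocation);
UnassignedInfo unassignedInfo = shardRouting.unassignedInfo();

if (shardRestoreStatus == null) {
    // No restore tracked for this shard (e.g. the entry was cleaned up after restoreComplete).
    return allocation.decision(Decision.YES, NAME, "shard is not being restored from a snapshot");
}
if (shardRestoreStatus.state().completed() == false) {
    // startRestore (INIT) or allocation (STARTED): the restore is still running, allow allocation.
    return allocation.decision(Decision.YES, NAME, "shard restore from snapshot is in progress");
}
if (unassignedInfo != null && unassignedInfo.failedAllocations() > 0) {
    // failure: the restore itself failed at least once, so refuse to allocate.
    return allocation.decision(Decision.NO, NAME, "shard has failed to be restored from the snapshot");
}
// Completed but with no failed allocations: something other than a restore failure kept the shard
// unassigned, so this decider should not claim a restore failure.
return allocation.decision(Decision.YES, NAME, "restore is not in progress; other deciders decide allocation");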

Contributor

Ah I missed this conversation. Here's a test that shows the problem:

diff --git a/server/src/internalClusterTest/java/org/elasticsearch/snapshots/RestoreSnapshotIT.java b/server/src/internalClusterTest/java/org/elasticsearch/snapshots/RestoreSnapshotIT.java
index 953cddba0ab7..3741562deb40 100644
--- a/server/src/internalClusterTest/java/org/elasticsearch/snapshots/RestoreSnapshotIT.java
+++ b/server/src/internalClusterTest/java/org/elasticsearch/snapshots/RestoreSnapshotIT.java
@@ -12,6 +12,8 @@ package org.elasticsearch.snapshots;
 import org.apache.logging.log4j.Level;
 import org.elasticsearch.action.ActionFuture;
 import org.elasticsearch.action.ActionRequestBuilder;
+import org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainRequest;
+import org.elasticsearch.action.admin.cluster.allocation.TransportClusterAllocationExplainAction;
 import org.elasticsearch.action.admin.cluster.snapshots.restore.RestoreSnapshotResponse;
 import org.elasticsearch.action.admin.indices.settings.get.GetSettingsResponse;
 import org.elasticsearch.action.admin.indices.template.get.GetIndexTemplatesResponse;
@@ -20,6 +22,8 @@ import org.elasticsearch.client.internal.Client;
 import org.elasticsearch.cluster.block.ClusterBlocks;
 import org.elasticsearch.cluster.metadata.IndexMetadata;
 import org.elasticsearch.cluster.metadata.MappingMetadata;
+import org.elasticsearch.cluster.routing.allocation.decider.Decision;
+import org.elasticsearch.common.Strings;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.common.unit.ByteSizeUnit;
 import org.elasticsearch.core.TimeValue;
@@ -41,6 +45,7 @@ import java.util.Collections;
 import java.util.List;
 import java.util.Locale;
 import java.util.Map;
+import java.util.Set;
 import java.util.concurrent.TimeUnit;
 import java.util.stream.Collectors;
 import java.util.stream.IntStream;
@@ -1025,4 +1030,65 @@ public class RestoreSnapshotIT extends AbstractSnapshotIntegTestCase {
             mockLog.assertAllExpectationsMatched();
         }
     }
+
+    public void testExplainUnassigableDuringRestore() {
+        final String repoName = "repo-" + randomIdentifier();
+        createRepository(repoName, FsRepository.TYPE);
+        final String indexName = "index-" + randomIdentifier();
+        createIndexWithContent(indexName);
+        final String snapshotName = "snapshot-" + randomIdentifier();
+        createSnapshot(repoName, snapshotName, List.of(indexName));
+        assertAcked(indicesAdmin().prepareDelete(indexName));
+
+        final RestoreSnapshotResponse restoreSnapshotResponse = clusterAdmin().prepareRestoreSnapshot(
+            TEST_REQUEST_TIMEOUT,
+            repoName,
+            snapshotName
+        )
+            .setIndices(indexName)
+            .setRestoreGlobalState(false)
+            .setWaitForCompletion(true)
+            .setIndexSettings(
+                Settings.builder().put(IndexMetadata.INDEX_ROUTING_REQUIRE_GROUP_PREFIX + "._name", "not-a-node-" + randomIdentifier())
+            )
+            .get();
+
+        logger.info("--> restoreSnapshotResponse: {}", Strings.toString(restoreSnapshotResponse, true, true));
+        assertThat(restoreSnapshotResponse.getRestoreInfo().failedShards(), greaterThan(0));
+
+        final var clusterExplainResponse1 = client().execute(
+            TransportClusterAllocationExplainAction.TYPE,
+            new ClusterAllocationExplainRequest(TEST_REQUEST_TIMEOUT).setIndex(indexName).setShard(0).setPrimary(true)
+        ).actionGet();
+
+        logger.info("--> clusterExplainResponse1: {}", Strings.toString(clusterExplainResponse1, true, true));
+        for (var nodeDecision : clusterExplainResponse1.getExplanation()
+            .getShardAllocationDecision()
+            .getAllocateDecision()
+            .getNodeDecisions()) {
+            assertEquals(
+                Set.of("restore_in_progress", "filter"),
+                nodeDecision.getCanAllocateDecision().getDecisions().stream().map(Decision::label).collect(Collectors.toSet())
+            );
+        }
+
+        updateIndexSettings(Settings.builder().putNull(IndexMetadata.INDEX_ROUTING_REQUIRE_GROUP_PREFIX + "._name"), indexName);
+
+        final var clusterExplainResponse2 = client().execute(
+            TransportClusterAllocationExplainAction.TYPE,
+            new ClusterAllocationExplainRequest(TEST_REQUEST_TIMEOUT).setIndex(indexName).setShard(0).setPrimary(true)
+        ).actionGet();
+
+        logger.info("--> clusterExplainResponse2: {}", Strings.toString(clusterExplainResponse2, true, true));
+        for (var nodeDecision : clusterExplainResponse2.getExplanation()
+            .getShardAllocationDecision()
+            .getAllocateDecision()
+            .getNodeDecisions()) {
+            assertEquals(
+                Set.of("restore_in_progress"),
+                nodeDecision.getCanAllocateDecision().getDecisions().stream().map(Decision::label).collect(Collectors.toSet())
+            );
+        }
+
+    }
 }

Note that the shard couldn't be allocated because of the filter, but then when the filter is removed we still mustn't allocate the shard because the restore is no longer in progress.

Contributor

The message in this latter case is:

shard has failed to be restored from the snapshot [default:repo-sdvofzvibks:snapshot-iqqndiwoes/HWiV1JnTSRKor0KieINqWA] - manually close or delete the index [index-djtuceqgi] in order to retry to restore the snapshot again or use the reroute API to force the allocation of an empty primary shard. Details: [restore_source[repo-sdvofzvibks/snapshot-iqqndiwoes]]

The issue is that this message sends folks off looking for problems with the snapshot repository when in fact we never even tried to contact the snapshot repository; the obstacle was elsewhere and may no longer exist. Moreover, the Details: bit at the end is pointless; it's actually less detailed than the snapshot ID default:repo-sdvofzvibks:snapshot-iqqndiwoes/HWiV1JnTSRKor0KieINqWA logged earlier.

Contributor Author

I think I understand now -- this is more of an issue with how a cluster state change that voids the context of the restore (the index underneath it changes) cascades through the allocator.

I'm seeing where this happens in RestoreService.updateRestoreStateWithDeletedIndices. And where failed/completed restores are cleaned out in RestoreService.removeCompletedRestoresFromClusterState.

Contributor Author

It seems like it might be more accurate in the allocation failure message to state that some aspect of the cluster state context has changed.

- decider says yes for irrelevant cases
- tests are updated
- comment explains the state being checked at the end
Contributor

@mhl-b mhl-b left a comment

LGTM, thanks for the thorough explanation, it helps. I would change the last YES message. No need for a second review from me.

source.snapshot(),
shardRouting.getIndexName(),
shardRouting.unassignedInfo().details()
"shard was prevented from being allocated on all nodes because of other allocation deciders"
Contributor

That's a strange way to say YES, and it's not always true. Looking at the code above, it should be "YES / shard successfully restored".

Comment on lines +59 to +60
UnassignedInfo unassignedInfo = shardRouting.unassignedInfo();
if (unassignedInfo.failedAllocations() > 0) {
Contributor

Looking good; matches what David said in #100233.


/**
* POST: the RestoreInProgress.ShardRestoreStatus is either failed or succeeded. This section turns a
* turn a shard failure into a NO decision to allocate. See {@link AllocationService.applyFailedShards}
Contributor

"turns a turn a shard..." typo in javadoc?

Contributor

@nicktindall nicktindall left a comment

This LGTM, pending clarification on the message

return allocation.decision(
Decision.NO,
NAME,
"shard was prevented from being allocated on all nodes because of other allocation deciders in previous rounds"
Contributor

Could this be something like
"Restore from snapshot failed because the configured constraints prevented allocation on any of the available nodes. Please check constraints applied in index and cluster settings, then retry the restore"

Just to avoid the talk about other allocation deciders (which seems like an abstract concept, though I might be misunderstanding who the audience is here). Also, the tip to check the constraints and retry might be helpful?

Happy to go with consensus here
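
Roughly, something like this (wording purely illustrative, not a final proposal):

// Rough illustration only; message text is up for discussion.
return allocation.decision(
    Decision.NO,
    NAME,
    "restore from snapshot [%s] failed because the configured constraints prevented allocation on any of the "
        + "available nodes; please check constraints applied in index and cluster settings, then retry the restore",
    source.snapshot()
);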

Contributor Author

That sounds like a better message.

I'm also wondering if there's something like "go look for this earlier in the logs?"

Contributor

Perhaps we could instruct them to use the allocation explain API (even though the reason the allocation was prevented might have since disappeared).

Contributor Author

I can reference the allocation explain API... the reason disappearing seems like a lousy way to respond

Contributor

Yes it does, maybe let's not be explicit about that. The allocation explain API may shed some light in many cases though I think?

Contributor Author

I'm sure -- maybe we can create a separate task for later to persist the reasoning from RestoreInProgress somewhere in UnassignedInfo/ClusterState.

@schase-es schase-es dismissed DaveCTurner’s stale review September 27, 2025 00:34

Comment on decider has been resolved -- unsure how to remove this otherwise.

@schase-es schase-es merged commit 2787546 into elastic:main Sep 27, 2025
34 checks passed
szybia added a commit to szybia/elasticsearch that referenced this pull request Sep 29, 2025

Successfully merging this pull request may close these issues.

RestoreInProgressAllocationDecider doesn't really explain how to investigate why the shard was not restored