Elasticsearch Version
8.15.3
Installed Plugins
No response
Java Version
bundled
OS Version
ESS
Problem Description
Related to:
- Allow changing index settings of a searchable snapshot in the frozen tier #90871
- Ignore total_shards_per_node setting on searchable snapshots in frozen #97979
The allocate action appears to take place after the searchable_snapshot action. This means a setting like total_shards_per_node (e.g. set at index creation time) can cause an index to become stuck in the cold phase, unable to complete its searchable_snapshot action, if the number of cold nodes can't accommodate total_shards_per_node. In the repro below, the restored index has 3 primaries with total_shards_per_node=1 but only a single cold node, so at most one shard can ever be assigned and the restore never goes green.
Steps to Reproduce
Set up a 4x node cluster: 3x hot nodes + 1x cold node.
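# Optional sanity check: confirm the tier layout (node.role shows h for data_hot, c for data_cold)
GET _cat/nodes?v&h=name,node.role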
# Set ILM poll interval to 10s for faster testing
PUT _cluster/settings
{
"persistent": {
"indices.lifecycle.poll_interval": "10s"
}
}
# Create ILM policy
PUT _ilm/policy/test-policy
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_docs": 3
},
"set_priority": {
"priority": 100
}
},
"min_age": "0ms"
},
"cold": {
"min_age": "15s",
"actions": {
"allocate": {
"total_shards_per_node": 3
},
"searchable_snapshot": {
"snapshot_repository": "found-snapshots"
}
}
}
}
}
}
# Create index template w/ ILM policy attached.
PUT _index_template/test-template
{
"template": {
"settings": {
"index": {
"lifecycle": {
"name": "test-policy",
"rollover_alias": "test"
},
"number_of_replicas": "0",
"number_of_shards": "3",
"routing.allocation.total_shards_per_node": "1"
}
}
},
"index_patterns": [
"test-*"
],
"composed_of": []
}
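# Optional: preview the settings an index matching test-* would get from this template
# (note routing.allocation.total_shards_per_node=1, which is what triggers the bug)
POST _index_template/_simulate_index/test-000001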
# Bootstrap the first index
PUT test-000001
{
"aliases": {
"test": {
"is_write_index": true
}
}
}
# Index some data
POST _bulk?refresh=wait_for
{ "index" : { "_index" : "test" } }
{ "field" : "Hello World!" }
{ "index" : { "_index" : "test" } }
{ "field" : "Hello World!" }
{ "index" : { "_index" : "test" } }
{ "field" : "Hello World!" }
# Confirm rollover after ~10s
GET _cat/indices/*test*?v
# Check allocation of our searchable snapshot
GET _cat/shards/restored-test-000001?v
index shard prirep state docs store dataset ip node
restored-test-000001 0 p STARTED 0 227b 227b 10.46.66.98 instance-0000000003
restored-test-000001 1 p UNASSIGNED
restored-test-000001 2 p UNASSIGNED
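# Optional: ILM explain should show the restored index still working through the
# cold phase's searchable_snapshot action
GET restored-test-000001/_ilm/explain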
# Check allocation explain
GET _cluster/allocation/explain
{
"index": "restored-test-000001",
"shard": 1,
"primary": true
}
# Response:
{
"index": "restored-test-000001",
"shard": 1,
"primary": true,
"current_state": "unassigned",
"unassigned_info": {
"reason": "NEW_INDEX_RESTORED",
"at": "2024-10-23T22:00:03.477Z",
"details": "restore_source[found-snapshots/2024.10.23-test-000001-test-policy-74cxzb2lsrimys49turtmg]",
"last_allocation_status": "no"
},
"can_allocate": "no",
"allocate_explanation": "Elasticsearch isn't allowed to allocate this shard to any of the nodes in the cluster. Choose a node to which you expect this shard to be allocated, find this node in the node-by-node explanation, and address the reasons which prevent Elasticsearch from allocating this shard there.",
"node_allocation_decisions": [
{
"node_id": "29eoo3ieTSCKiO7UtSxMmA",
"node_name": "instance-0000000002",
"transport_address": "10.46.65.154:19583",
"node_attributes": {
"logical_availability_zone": "zone-2",
"availability_zone": "us-east4-b",
"instance_configuration": "gcp.es.datahot.n2.68x32x45",
"region": "unknown-region",
"server_name": "instance-0000000002.0fef76a1c07c49d29a1b3dd34e5fae4d",
"transform.config_version": "10.0.0",
"xpack.installed": "true",
"ml.config_version": "12.0.0",
"data": "hot"
},
"roles": [
"data_content",
"data_hot",
"ingest",
"master",
"remote_cluster_client",
"transform"
],
"node_decision": "no",
"deciders": [
{
"decider": "data_tier",
"decision": "NO",
"explanation": "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
}
]
},
{
"node_id": "3idS6nroSrGp0AHT3HVsAg",
"node_name": "instance-0000000001",
"transport_address": "10.46.64.24:19733",
"node_attributes": {
"logical_availability_zone": "zone-1",
"availability_zone": "us-east4-a",
"instance_configuration": "gcp.es.datahot.n2.68x32x45",
"region": "unknown-region",
"server_name": "instance-0000000001.0fef76a1c07c49d29a1b3dd34e5fae4d",
"xpack.installed": "true",
"transform.config_version": "10.0.0",
"ml.config_version": "12.0.0",
"data": "hot"
},
"roles": [
"data_content",
"data_hot",
"ingest",
"master",
"remote_cluster_client",
"transform"
],
"node_decision": "no",
"deciders": [
{
"decider": "data_tier",
"decision": "NO",
"explanation": "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
}
]
},
{
"node_id": "aLXOkK2dShSR9AzI94EoGQ",
"node_name": "instance-0000000003",
"transport_address": "10.46.66.98:19272",
"node_attributes": {
"logical_availability_zone": "zone-0",
"availability_zone": "us-east4-a",
"instance_configuration": "gcp.es.datacold.n2.68x10x190",
"region": "unknown-region",
"server_name": "instance-0000000003.0fef76a1c07c49d29a1b3dd34e5fae4d",
"transform.config_version": "10.0.0",
"xpack.installed": "true",
"ml.config_version": "12.0.0",
"data": "cold"
},
"roles": [
"data_cold",
"remote_cluster_client"
],
"node_decision": "no",
"deciders": [
{
"decider": "shards_limit",
"decision": "NO",
"explanation": "too many shards [1] allocated to this node for index [restored-test-000001], index setting [index.routing.allocation.total_shards_per_node=1]"
}
]
},
{
"node_id": "tKgoJAjoTTar3do03IeJxw",
"node_name": "instance-0000000000",
"transport_address": "10.46.66.23:19721",
"node_attributes": {
"logical_availability_zone": "zone-0",
"availability_zone": "us-east4-c",
"instance_configuration": "gcp.es.datahot.n2.68x32x45",
"region": "unknown-region",
"server_name": "instance-0000000000.0fef76a1c07c49d29a1b3dd34e5fae4d",
"transform.config_version": "10.0.0",
"xpack.installed": "true",
"ml.config_version": "12.0.0",
"data": "hot"
},
"roles": [
"data_content",
"data_hot",
"ingest",
"master",
"remote_cluster_client",
"transform"
],
"node_decision": "no",
"deciders": [
{
"decider": "data_tier",
"decision": "NO",
"explanation": "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
}
]
}
]
}
We can unstick the index by clearing total_shards_per_node:
PUT restored-test-000001/_settings
{
"index.routing.allocation.total_shards_per_node": null
}
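# Optional: confirm the remaining shards are assigned once the limit is cleared
GET _cat/shards/restored-test-000001?v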
We can also add an allocate action that sets total_shards_per_node in a warm phase as a workaround, as sketched below.
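A sketch based on the repro policy above (the warm min_age and the value 3 are illustrative):
# Relax the per-node shard limit before the cold phase's searchable_snapshot action runs
PUT _ilm/policy/test-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_docs": 3 },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "10s",
        "actions": {
          "allocate": { "total_shards_per_node": 3 }
        }
      },
      "cold": {
        "min_age": "15s",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "found-snapshots" }
        }
      }
    }
  }
}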
Logs (if relevant)
No response