
Conversation

@ywangd
Member

@ywangd ywangd commented Jun 19, 2025

The RunningSnapshotIT upgrade test adds shutdown markers to all nodes and removes them once all nodes are upgraded. If an index gets created in a mixed cluster, for example by ILM or deprecation messages, the index cannot be allocated because all nodes are shutting down. Since the cluster ready check between node upgrades expects a yellow cluster, the unassigned index prevents the ready check from succeeding, and the check eventually times out. This PR fixes that by removing the shutdown marker from the first upgraded node so that it can host new indices.

Resolves: #129644
Resolves: #129645
Resolves: #129646
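
For context, here is a minimal sketch of how a test can add and later remove a shutdown marker through the node shutdown API. This is not the actual RunningSnapshotIT code; the host, node id, and reason are illustrative assumptions.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class ShutdownMarkerSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200)).build()) {
            String nodeId = "node-0"; // hypothetical node id

            // Register a shutdown marker: while it is present, new shards
            // are not allocated to this node.
            Request put = new Request("PUT", "/_nodes/" + nodeId + "/shutdown");
            put.setJsonEntity("{\"type\": \"restart\", \"reason\": \"upgrade test\"}");
            client.performRequest(put);

            // Remove the marker so the node can host newly created indices
            // again, e.g. indices created by ILM or deprecation messages.
            client.performRequest(new Request("DELETE", "/_nodes/" + nodeId + "/shutdown"));
        }
    }
}
```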

@ywangd ywangd requested a review from nicktindall June 19, 2025 04:21
@ywangd ywangd added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.1.0 labels Jun 19, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Jun 19, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Contributor

@nicktindall nicktindall left a comment

Nice, so I assume this still doesn't allow the snapshot to complete during the upgrade, because there will be 2 shards that can't be assigned, since only 1 node has no shutdown marker?

@ywangd
Member Author

ywangd commented Jun 19, 2025

there will be 2 shards that can't be assigned

Those two shards remain on their initial nodes. They are not unassigned because they are not new shards. The snapshot cannot complete because:

  1. We still have 2 shutting-down nodes hosting shards
  2. The 2 shards cannot move anywhere because the index is created with 1 shard per node

So yes, the snapshot can complete only when all nodes are upgraded and the remaining shutdown markers are removed.
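
To make point 2 concrete, here is a hedged sketch of an index whose shards cannot relocate. The index name and settings are assumptions for illustration, not the test's actual values: on a 3-node cluster, 3 shards with at most 1 shard per node means every node already holds a shard, so there is no spare node to move a shard to.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class OneShardPerNodeSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200)).build()) {
            // 3 primaries, no replicas, at most 1 shard per node: on a
            // 3-node cluster each node hosts exactly one shard, so a shard
            // on a shutting-down node has nowhere to relocate.
            Request create = new Request("PUT", "/test-index"); // hypothetical index name
            create.setJsonEntity("""
                {
                  "settings": {
                    "index.number_of_shards": 3,
                    "index.number_of_replicas": 0,
                    "index.routing.allocation.total_shards_per_node": 1
                  }
                }
                """);
            client.performRequest(create);
        }
    }
}
```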

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jun 19, 2025
@ywangd
Member Author

ywangd commented Jun 19, 2025

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit 6858c32 into elastic:main Jun 19, 2025
27 checks passed
@ywangd ywangd deleted the es-129644-fix branch June 19, 2025 10:13
kderusso pushed a commit to kderusso/elasticsearch that referenced this pull request Jun 23, 2025
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 25, 2025
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Jul 30, 2025
elasticsearchmachine pushed a commit that referenced this pull request Jul 31, 2025
This is the same failure as observed in #129644, for which the original fix #129680 did not really work because of the ordering of checks: the shutdown marker is removed only after the cluster passes the ready check, so that new shards can be allocated, but the cluster cannot pass the ready check before those shards are allocated. Hence the circular dependency. In hindsight, there is no need to put a shutdown record on all nodes. It is only needed on the node that upgrades last, to prevent the snapshot from completing during the upgrade process. This PR does that, which ensures there are always 2 nodes available to host new shards.

Resolves: #132135
Resolves: #132136
Resolves: #132137
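
A minimal sketch of the revised approach, assuming hypothetical node ids and upgrade order: only the node that upgrades last carries a shutdown marker, so the other nodes stay eligible for new shards and the yellow-cluster ready check between node upgrades can pass.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

import java.util.List;

public class LastNodeShutdownSketch {
    public static void main(String[] args) throws Exception {
        List<String> upgradeOrder = List.of("node-0", "node-1", "node-2"); // hypothetical order
        String lastToUpgrade = upgradeOrder.get(upgradeOrder.size() - 1);

        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200)).build()) {
            // A single marker on the last-to-upgrade node is enough to keep
            // the snapshot from completing during the upgrade, while the
            // other nodes remain free to host newly created indices.
            Request put = new Request("PUT", "/_nodes/" + lastToUpgrade + "/shutdown");
            put.setJsonEntity("{\"type\": \"restart\", \"reason\": \"hold snapshot during upgrade\"}");
            client.performRequest(put);

            // The ready check between node upgrades can now succeed because
            // new shards always have an eligible node.
            Request health = new Request("GET", "/_cluster/health");
            health.addParameter("wait_for_status", "yellow");
            client.performRequest(health);
        }
    }
}
```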
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Jul 31, 2025
elasticsearchmachine pushed a commit that referenced this pull request Jul 31, 2025
…132233)

(cherry picked from commit f39ccb5)

# Conflicts:
#	muted-tests.yml
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Jul 31, 2025
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Jul 31, 2025