
Conversation

@ywangd
Member

@ywangd ywangd commented Jun 19, 2025

The RunningSnapshotIT upgrade test adds shutdown markers to all nodes and removes them once all nodes are upgraded. If an index gets created in a mixed cluster, for example by ILM or deprecation messages, the index cannot be allocated because all nodes are shutting down. Since the cluster ready check between node upgrades expects a yellow cluster, the unassigned index prevents the ready check from succeeding, and the check eventually times out. This PR fixes that by removing the shutdown marker from the first upgraded node so that it can host new indices.

Resolves: #129644
Resolves: #129645
Resolves: #129646
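
For context, here is a minimal sketch of how a test can add and later remove a shutdown marker through the node shutdown API. This is not the actual RunningSnapshotIT code; the host, node id, and reason are illustrative assumptions.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class ShutdownMarkerSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200)).build()) {
            String nodeId = "node-0"; // hypothetical node id

            // Register a shutdown marker: while it is present, new shards
            // are not allocated to this node.
            Request put = new Request("PUT", "/_nodes/" + nodeId + "/shutdown");
            put.setJsonEntity("{\"type\": \"restart\", \"reason\": \"upgrade test\"}");
            client.performRequest(put);

            // Remove the marker so the node can host newly created indices
            // again, e.g. indices created by ILM or deprecation messages.
            client.performRequest(new Request("DELETE", "/_nodes/" + nodeId + "/shutdown"));
        }
    }
}
```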

@ywangd ywangd requested a review from nicktindall June 19, 2025 04:21
@ywangd ywangd added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.1.0 labels Jun 19, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Jun 19, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Contributor

@nicktindall nicktindall left a comment

Nice, so I assume this still doesn't allow the snapshot to complete during the upgrade, because there will be 2 shards that can't be assigned, since only 1 node has no shutdown marker?

@ywangd
Member Author

ywangd commented Jun 19, 2025

there will be 2 shards that can't be assigned

Those two shards remain on their initial nodes. They are not unassigned because they are not new shards. The snapshot cannot complete because:

  1. We still have 2 shutting-down nodes hosting shards
  2. The 2 shards cannot move anywhere because the index is created with 1 shard per node

So yes, the snapshot can complete only when all nodes are upgraded and the remaining shutdown markers are removed.
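
To make point 2 concrete, here is a hedged sketch of an index whose shards cannot relocate. The index name and settings are assumptions for illustration, not the test's actual values: on a 3-node cluster, 3 shards with at most 1 shard per node means every node already holds a shard, so there is no spare node to move a shard to.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class OneShardPerNodeSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200)).build()) {
            // 3 primaries, no replicas, at most 1 shard per node: on a
            // 3-node cluster each node hosts exactly one shard, so a shard
            // on a shutting-down node has nowhere to relocate.
            Request create = new Request("PUT", "/test-index"); // hypothetical index name
            create.setJsonEntity("""
                {
                  "settings": {
                    "index.number_of_shards": 3,
                    "index.number_of_replicas": 0,
                    "index.routing.allocation.total_shards_per_node": 1
                  }
                }
                """);
            client.performRequest(create);
        }
    }
}
```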

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jun 19, 2025
@ywangd
Member Author

ywangd commented Jun 19, 2025

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit 6858c32 into elastic:main Jun 19, 2025
27 checks passed
@ywangd ywangd deleted the es-129644-fix branch June 19, 2025 10:13
kderusso pushed a commit to kderusso/elasticsearch that referenced this pull request Jun 23, 2025
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 25, 2025
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Jul 30, 2025
elasticsearchmachine pushed a commit that referenced this pull request Jul 31, 2025
This is the same failure as observed in #129644, for which the original fix #129680 did not really work because of the ordering of checks: the shutdown marker is removed only after the cluster passes the ready check, so that new shards can be allocated, but the cluster cannot pass the ready check before those shards are allocated. Hence the circular dependency. In hindsight, there is no need to put a shutdown record on all nodes. It is only needed on the node that upgrades last, to prevent the snapshot from completing during the upgrade process. This PR does that, which ensures there are always 2 nodes available to host new shards.

Resolves: #132135
Resolves: #132136
Resolves: #132137
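
A minimal sketch of the revised approach, assuming hypothetical node ids and upgrade order: only the node that upgrades last carries a shutdown marker, so the other nodes stay eligible for new shards and the yellow-cluster ready check between node upgrades can pass.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

import java.util.List;

public class LastNodeShutdownSketch {
    public static void main(String[] args) throws Exception {
        List<String> upgradeOrder = List.of("node-0", "node-1", "node-2"); // hypothetical order
        String lastToUpgrade = upgradeOrder.get(upgradeOrder.size() - 1);

        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200)).build()) {
            // A single marker on the last-to-upgrade node is enough to keep
            // the snapshot from completing during the upgrade, while the
            // other nodes remain free to host newly created indices.
            Request put = new Request("PUT", "/_nodes/" + lastToUpgrade + "/shutdown");
            put.setJsonEntity("{\"type\": \"restart\", \"reason\": \"hold snapshot during upgrade\"}");
            client.performRequest(put);

            // The ready check between node upgrades can now succeed because
            // new shards always have an eligible node.
            Request health = new Request("GET", "/_cluster/health");
            health.addParameter("wait_for_status", "yellow");
            client.performRequest(health);
        }
    }
}
```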
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Jul 31, 2025
elasticsearchmachine pushed a commit that referenced this pull request Jul 31, 2025
…132233)

(cherry picked from commit f39ccb5)

# Conflicts:
#	muted-tests.yml
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Jul 31, 2025
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Jul 31, 2025