
Conversation

schase-es (Contributor)

The concurrent rebalance limit prevents too many shard relocations from happening at once. This works well for limiting active shards on the hot tier, reserving compute for indexing. Given that the frozen tier is an archive, compute separation is less of a concern there, and more rebalancing activity can occur without degradation. This change introduces a separate setting for managing the frozen tier's limit.

Fixes: ES-11303
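
For illustration, a minimal sketch of configuring the two limits side by side (the frozen setting name is taken from the diff below; the values mirror the javadoc snippet in this PR and are not authoritative):

import org.elasticsearch.common.settings.Settings;

// Sketch only: the existing limit now governs non-frozen shards, while the new
// setting governs shards on dedicated frozen nodes; -1 means unlimited.
Settings settings = Settings.builder()
    .put("cluster.routing.allocation.cluster_concurrent_rebalance", 2)
    .put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", 10)
    .build();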

@schase-es schase-es requested a review from a team as a code owner September 23, 2025 01:04
@schase-es schase-es added the :Distributed Coordination/Allocation, Team:Distributed Coordination, and v9.2.0 labels on Sep 23, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@nicktindall (Contributor) left a comment


Looking good. I think the approach is sound: using the different limits depending on the type of node.

// Above allocator is used in DesiredBalanceComputer. Since we do not move actual shard data during calculation
// it is possible to artificially set above setting to 2 to avoid unnecessary moves in desired balance.
// Separately: keep overall limit in simulation to two including frozen shards
return allocation.decision(Decision.THROTTLE, NAME, "allocation should move one shard at the time when simulating");
Contributor

As discussed, this whole branch might be redundant (?) since #134786

Contributor Author

This is helpful -- can we leave this to a separate PR? This may not be the only decider that needs to change.

.put("cluster.routing.allocation.cluster_concurrent_rebalance", 3)
.build()
);
boolean testFrozen = randomBoolean();
Contributor

- per-shard decision has been implemented, using the cluster-wide logic from before
- cluster-wide decision has been changed to allow progress whenever there is any budget
@schase-es (Contributor Author)

Thanks Nick -- this was really helpful.

I updated the overall algorithm shape to match what we talked about: the shard-level decision implements the cluster-wide logic from before, but specific to frozen or non-frozen counts. The cluster-level algorithm checks for room in either limit, including one detail we didn't mention but that I realized matters: either limit can be configured as "unlimited".

I have some more to do on optimizing the counts for frozen. I need to have a look at this and decide what to do, but I think I need to keep a shardId set in routing nodes, or set a flag on ShardRouting.

There are a few other comments you made that I didn't get to quite yet.
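
Roughly, the cluster-level check described above amounts to something like this (a sketch only; the names are illustrative and -1 is assumed to mean unlimited, as in the existing setting):

// Sketch: allow cluster-level rebalancing while either bucket has budget left.
private static boolean anyBudgetRemaining(int frozenLimit, int frozenRelocations, int nonFrozenLimit, int nonFrozenRelocations) {
    boolean frozenHasRoom = frozenLimit == -1 || frozenRelocations < frozenLimit;
    boolean nonFrozenHasRoom = nonFrozenLimit == -1 || nonFrozenRelocations < nonFrozenLimit;
    return frozenHasRoom || nonFrozenHasRoom;
}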

@schase-es (Contributor Author)

Hi Nick -- I added the count inside RoutingNodes, as discussed. It wasn't as bad as I thought.

I am wondering what you think about the cluster-wide logic. The check for unlimited makes things a little different than the addition we talked about, but I think what I have is simple enough.

It could do with a Decision refactor I guess.
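
The shape of the count inside RoutingNodes is roughly this (a sketch under assumptions: the onRelocationStarted method name and exact fields are mine, not the PR's):

// Sketch: track frozen relocations as a subset of the existing overall count,
// incremented where relocations start (and decremented where they finish).
private int relocatingShards = 0;        // existing overall count
private int frozenRelocatingShards = 0;  // subset relocating off dedicated frozen nodes

private void onRelocationStarted(ShardRouting shard) {
    relocatingShards++;
    if (isDedicatedFrozenNode(shard.currentNodeId())) {
        frozenRelocatingShards++;
    }
}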

return true;
}
return false;
}
@nicktindall (Contributor), Sep 25, 2025

We should check with wiser heads on this logic for determining what constitutes a frozen shard. We are considering only shards moving off of a dedicated frozen node as being "frozen" shards.

I think this is an acceptable approximation because:

  • We recommend self-managed deployments use dedicated frozen nodes, and we only have dedicated frozen nodes in ECH (I believe), so "dedicated frozen node" is equivalent to just "frozen node" if things are configured as we recommend. I think it's OK to not allow more rebalancing in the event someone has gone outside our recommendations and created non-dedicated frozen nodes.
    • If I recall correctly, part of the justification for the change was that the frozen and non-frozen nodes are mutually exclusive.
    • There is an edge case where we'd count shards relocating from dedicated to non-dedicated frozen nodes, but not vice versa, but again that seems outside of our recommendations and not worth going to great effort to remedy. If we want to be strict about that, we could also check the relocatingNodeId for dedicated frozen-ness.
  • To determine whether a shard is a frozen shard, we need to look at the index metadata, which we could do in canRebalance(shardRouting, allocation) via the cluster state by checking IndexMetadata#getTierPreference(), but it looks like knowledge of specific data tiers lives in x-pack as it stands, and I wonder if we want that to leak out. Tier preference seems largely opaque to the rest of the code base, though I may have misinterpreted that.
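
For concreteness, a rough sketch of the role-based check (assuming "dedicated" means data_frozen is the node's only data role; the PR's actual implementation may differ):

import java.util.Set;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.cluster.node.DiscoveryNodeRole;

// Sketch: a node is "dedicated frozen" if it carries the data_frozen role and
// no other data role.
static boolean isDedicatedFrozenNode(DiscoveryNode node) {
    Set<DiscoveryNodeRole> roles = node.getRoles();
    return roles.contains(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)
        && roles.stream().noneMatch(role -> role.canContainData() && role != DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE);
}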

Contributor Author

I would like to check with someone else, too.

I wrote the frozen/non-frozen counts so that frozen is a subset of all relocations. If a relocation isn't counted as frozen, then it's counted as non-frozen. So if something is misclassified, it's just subjected to the tighter constraint.

I can add a flag to ShardRouting, but as we discussed (and you're reminding me here) it may be too far for the data tier concept to propagate.

return relocatingShards;
}

private boolean isFrozenNode(String nodeId) {
Contributor

Nit: Perhaps we should be explicit that we're looking for "dedicated" frozen nodes in the method name?

- removed extraneous count method in RoutingNode
- renamed RoutingNodes.isFrozenNode -> isDedicatedFrozenNode
- centralized frozen node check: the decider had its own
- reuse original "reached the limit" log message language; this got mangled
@nicktindall (Contributor) left a comment

Looking good, mostly cosmetic comments

assertThat(clusterState.routingTable().index("test").shard(i).shard(0).state(), equalTo(UNASSIGNED));
assertThat(clusterState.routingTable().index("test").shard(i).shard(1).state(), equalTo(UNASSIGNED));
assertThat(clusterState.routingTable().index("test").shard(i).shard(0).currentNodeId(), nullValue());
assertThat(clusterState.routingTable().index("test").shard(i).shard(1).currentNodeId(), nullValue());
Contributor

Nit: could store clusterState.routingTable().index("test") in a variable?

testClusterConcurrentInternal(true);
}

void testClusterConcurrentInternal(boolean testFrozen) {
Contributor

Nit: I don't love the boolean flag parameter here; I wonder if it would be neater to pass in e.g. a cluster settings, a node factory, and an index settings provider or something?

This could also just be a preference of mine that you or others might disagree with.

.add(newNode("node9", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)))
.add(newNode("node10", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)))
)
.build();
Contributor

I wonder if it'd be better to randomise the numbers here to make it clear that 8 isn't in any way significant

Settings.builder()
.put("cluster.routing.allocation.node_concurrent_recoveries", 10)
.put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", -1)
.build()
Contributor

Would it be better to set non-frozen recoveries to something very restrictive (like 1) so it's clear those limits are not in play here?

Contributor

Nudge on this one: could we maybe set it to zero to disable non-frozen movement altogether?

.add(newNode("node7", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)))
.add(newNode("node8", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)))
.add(newNode("node9", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)))
.add(newNode("node10", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)))
Contributor

Nit: Perhaps de-dup Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)

for (int i = 0; i < clusterState.routingTable().index("test_frozen").size(); i++) {
assertThat(clusterState.routingTable().index("test_frozen").shard(i).size(), equalTo(2));
assertThat(clusterState.routingTable().index("test_frozen").shard(i).primaryShard().state(), equalTo(STARTED));
assertThat(clusterState.routingTable().index("test_frozen").shard(i).replicaShards().get(0).state(), equalTo(INITIALIZING));
Contributor

Nit: try and reduce repetition in these blocks (e.g. clusterState.routingTable().index("test_frozen").shard(i))

assertThat(shardsWithState(clusterState.getRoutingNodes(), "test", RELOCATING).size(), equalTo(3));

assertThat(shardsWithState(clusterState.getRoutingNodes(), "test_frozen", STARTED).size(), equalTo(10));
assertThat(shardsWithState(clusterState.getRoutingNodes(), "test_frozen", RELOCATING).size(), equalTo(0));
Contributor

Could we have these frozen and hot rebalances happening concurrently to make it clear that the limits can both be maxed out concurrently perhaps? Or is there a reason they're done sequentially?

Contributor Author

It turns out I was incorrectly using DATA_ROLE at the top instead of DATA_HOT_ROLE.

@schase-es (Contributor Author) commented Oct 8, 2025

Thanks for the comments on refactoring the tests, Nick. Sometimes I find boilerplate more readable, but in some of those situations it was getting hard to read.

The issue with the test was a good one to point out. I looked over the shard assignments earlier and thought everything was fine. It turns out the test was still able to mix frozen -> data nodes (but not data -> frozen), so running those separately was the only way things worked. With the right node role, it works concurrently.

I am wondering a little about the default value of the frozen tier concurrent setting. It's at 10 now. Should I keep it at 2 or 3? That way it will behave the same as it did before (well, maybe not if the cluster has a different-than-default concurrent rebalance limit).

The right behavior in my mind would be to default the frozen limit to the non-frozen limit's current value, then let users modify it from there.

@nicktindall (Contributor)

> I am wondering a little about the default value of the frozen tier concurrent setting. It's at 10 now. Should I keep it at 2 or 3? That way it will behave the same as it did before (well, maybe not if the cluster has a different-than-default concurrent rebalance limit).
>
> The right behavior in my mind would be to default the frozen limit to the non-frozen limit's current value, then let users modify it from there.

Yeah I think defaulting frozen to the same value as non-frozen makes sense. If we decide it needs to be higher in the future we can adjust that.
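
If we go that way, Setting already supports falling back to another setting for its default; a sketch (the key matches the diff, while the constant name and properties are assumptions):

// Sketch: default the frozen limit to the non-frozen limit's current value,
// following the node_concurrent_incoming_recoveries fallback pattern.
public static final Setting<Integer> CLUSTER_CONCURRENT_FROZEN_REBALANCE_SETTING = Setting.intSetting(
    "cluster.routing.allocation.cluster_concurrent_frozen_rebalance",
    CLUSTER_ROUTING_ALLOCATION_CLUSTER_CONCURRENT_REBALANCE_SETTING, // fallback default
    -1, // minimum; -1 disables the limit
    Setting.Property.Dynamic,
    Setting.Property.NodeScope
);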

"below threshold [%d] for concurrent rebalances, current rebalance shard count [%d]",
relocatingShards,
CLUSTER_ROUTING_ALLOCATION_CLUSTER_CONCURRENT_REBALANCE_SETTING.getKey(),
clusterConcurrentRebalance
Contributor

I think the string template doesn't match the parameters now? We used to have three placeholders; now we only have two. I wonder if we should add a unit test for the decider; there doesn't seem to be one already, but I have seen them implemented for other deciders, and it would make it easy to test all these branches. See org.elasticsearch.cluster.routing.allocation.decider.WriteLoadConstraintDeciderTests#testWriteLoadDeciderCanAllocate for example?
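
For instance, the message could carry all three arguments again, something like this (the exact wording is a guess):

// Sketch: three placeholders to match the three arguments.
return allocation.decision(
    Decision.YES,
    NAME,
    "below threshold [%d] set by [%s], current rebalance shard count [%d]",
    clusterConcurrentRebalance,
    CLUSTER_ROUTING_ALLOCATION_CLUSTER_CONCURRENT_REBALANCE_SETTING.getKey(),
    relocatingShards
);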

}

public void testClusterConcurrentRebalanceFrozen() {
AllocationService strategy = createAllocationService(
Contributor

Nit: can we call the variable allocationService? I think the strategy naming is a hangover from a previous refactor?

@nicktindall (Contributor) left a comment

This LGTM with some nits

* re-balance (shard relocation) operations and restricts node allocations
* if the configured threshold is reached. Frozen and non-frozen shards are
* considered separately. The default number of concurrent rebalance operations
* is set to {@code 2} for non-frozen shards, and {@code 10} for frozen shards.
Contributor

This doc is out of date with the implementation.

*/
public class ConcurrentRebalanceAllocationDeciderTests extends ESAllocationTestCase {

public void testConcurrentUnlimited() {
Contributor

Nit: the naming suggests this test is for unlimited only, but it seems to cover both when we're unlimited and when we're below the limit?

);
}

public void testFrozenConcurrentUnlimited() {
Contributor

Also this name

settings = Settings.builder()
.put("cluster.routing.allocation.cluster_concurrent_rebalance", 6)
.put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", 0)
.build();
Contributor

Nit: can we use a random value for the rebalance limit, with a random number of relocations (less than that value) in progress?
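
Something along these lines, perhaps (reusing setupConcurrentRelocations from this test; the exact shape is a suggestion):

// Sketch: a random limit with strictly fewer relocations already in flight,
// so the decider should always say yes.
int limit = randomIntBetween(1, 10);
ClusterState clusterState = setupConcurrentRelocations(randomIntBetween(0, limit - 1));
Settings settings = Settings.builder()
    .put("cluster.routing.allocation.cluster_concurrent_rebalance", limit)
    .build();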

settings = Settings.builder()
.put("cluster.routing.allocation.cluster_concurrent_rebalance", 0)
.put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", 6)
.build();
Contributor

And please randomise this one too.

ClusterState clusterState = setupConcurrentRelocations(2);

Settings settings = Settings.builder()
.put("cluster.routing.allocation.cluster_concurrent_rebalance", 2)
Contributor

random please

Settings.builder()
.put("cluster.routing.allocation.node_concurrent_recoveries", 10)
.put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", -1)
.build()
Contributor

Nudge on this one: could we maybe set it to zero to disable non-frozen movement altogether?

logger.info("start the replica shards, rebalancing should start, but with a limit " + nodeCount + " should be rebalancing");
clusterState = startInitializingShardsAndReroute(allocationService, clusterState);

// we only allow any number of relocations at a time
Contributor

Nit: comment wording? Perhaps mention that 8 will be migrating off to spread shards evenly.

assertThat(shardRouting.size(), equalTo(2));
assertThat(shardRouting.primaryShard().state(), equalTo(STARTED));
assertThat(shardRouting.replicaShards().get(0).state(), equalTo(INITIALIZING));
}
Contributor

This loop looks repeated a lot; I wonder if we could make a test utility to do these repeated assertions?
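
For example (the helper name and exact shape are assumed):

// Sketch: collapse the repeated per-shard assertions into one helper.
private static void assertPrimariesStartedReplicasInitializing(IndexRoutingTable index) {
    for (int i = 0; i < index.size(); i++) {
        IndexShardRoutingTable shardRouting = index.shard(i);
        assertThat(shardRouting.size(), equalTo(2));
        assertThat(shardRouting.primaryShard().state(), equalTo(STARTED));
        assertThat(shardRouting.replicaShards().get(0).state(), equalTo(INITIALIZING));
    }
}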


Labels: :Distributed Coordination/Allocation, >non-issue, Team:Distributed Coordination, v9.3.0
