
Conversation

schase-es (Contributor)

The concurrent rebalance limit prevents too many shard relocations from happening at once. This works well for limiting active shards on the hot tier, reserving compute for indexing. Given that the frozen tier is an archive, compute separation is less of a concern there, and more rebalancing activity can occur without degradation. This change introduces a separate setting for managing the frozen tier's limit.

Fixes: ES-11303
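
For illustration, a minimal sketch of configuring the two limits side by side (the frozen setting name is taken from the diff below; the values mirror the javadoc snippet in this PR and are not authoritative):

import org.elasticsearch.common.settings.Settings;

// Sketch only: the existing limit now governs non-frozen shards, while the new
// setting governs shards on dedicated frozen nodes; -1 means unlimited.
Settings settings = Settings.builder()
    .put("cluster.routing.allocation.cluster_concurrent_rebalance", 2)
    .put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", 10)
    .build();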

@schase-es schase-es requested a review from a team as a code owner September 23, 2025 01:04
@schase-es schase-es added the :Distributed Coordination/Allocation, Team:Distributed Coordination, and v9.2.0 labels on Sep 23, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@nicktindall (Contributor) left a comment


Looking good. I think the approach is sound: using the different limits depending on the type of node.

// Above allocator is used in DesiredBalanceComputer. Since we do not move actual shard data during calculation
// it is possible to artificially set above setting to 2 to avoid unnecessary moves in desired balance.
// Separately: keep overall limit in simulation to two including frozen shards
return allocation.decision(Decision.THROTTLE, NAME, "allocation should move one shard at the time when simulating");
Contributor

As discussed, this whole branch might be redundant (?) since #134786

Contributor Author

This is helpful -- can we leave this to a separate PR? This may not be the only decider that needs to change.

.put("cluster.routing.allocation.cluster_concurrent_rebalance", 3)
.build()
);
boolean testFrozen = randomBoolean();
Contributor

- per-shard decision has been implemented, using the cluster-wide logic from before
- cluster-wide decision has been changed to allow progress whenever there is any budget
@schase-es (Contributor Author)

Thanks Nick -- this was really helpful.

I updated the overall algorithm shape to match what we talked about: the shard-level decision implements the cluster-wide logic from before, but specific to frozen or non-frozen counts. The cluster-level algorithm checks for room in either limit, including one detail we didn't mention but that I realized matters: either limit can be configured as "unlimited".

I have some more to do on optimizing the counts for frozen. I need to have a look at this and decide what to do, but I think I need to keep a shardId set in routing nodes, or set a flag on ShardRouting.

There are a few other comments you made that I didn't get to quite yet.
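
Roughly, the cluster-level check described above amounts to something like this (a sketch only; the names are illustrative and -1 is assumed to mean unlimited, as in the existing setting):

// Sketch: allow cluster-level rebalancing while either bucket has budget left.
private static boolean anyBudgetRemaining(int frozenLimit, int frozenRelocations, int nonFrozenLimit, int nonFrozenRelocations) {
    boolean frozenHasRoom = frozenLimit == -1 || frozenRelocations < frozenLimit;
    boolean nonFrozenHasRoom = nonFrozenLimit == -1 || nonFrozenRelocations < nonFrozenLimit;
    return frozenHasRoom || nonFrozenHasRoom;
}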

@schase-es (Contributor Author)

Hi Nick -- I added the count inside RoutingNodes, as discussed. It wasn't as bad as I thought.

I am wondering what you think about the cluster-wide logic. The check for unlimited makes things a little different than the addition we talked about, but I think what I have is simple enough.

It could do with a Decision refactor I guess.
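
The shape of the count inside RoutingNodes is roughly this (a sketch under assumptions: the onRelocationStarted method name and exact fields are mine, not the PR's):

// Sketch: track frozen relocations as a subset of the existing overall count,
// incremented where relocations start (and decremented where they finish).
private int relocatingShards = 0;        // existing overall count
private int frozenRelocatingShards = 0;  // subset relocating off dedicated frozen nodes

private void onRelocationStarted(ShardRouting shard) {
    relocatingShards++;
    if (isDedicatedFrozenNode(shard.currentNodeId())) {
        frozenRelocatingShards++;
    }
}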

return true;
}
return false;
}
@nicktindall (Contributor), Sep 25, 2025

We should check with wiser heads on this logic for determining what constitutes a frozen shard. We are considering only shards moving off of a dedicated frozen node as being "frozen" shards.

I think this is an acceptable approximation because:

  • We recommend self-managed deployments use dedicated frozen nodes, and we only have dedicated frozen nodes in ECH (I believe), so "dedicated frozen node" is equivalent to just "frozen node" if things are configured as we recommend. I think it's OK to not allow more rebalancing in the event someone has gone outside our recommendations and created non-dedicated frozen nodes.
    • If I recall correctly, part of the justification for the change was that the frozen and non-frozen nodes are mutually exclusive.
    • There is an edge case where we'd count shards relocating from dedicated to non-dedicated frozen nodes, but not vice versa, but again that seems outside of our recommendations and not worth going to great effort to remedy. If we want to be strict about that, we could also check the relocatingNodeId for dedicated frozen-ness.
  • To determine whether a shard is a frozen shard, we need to look at the index metadata, which we could do in canRebalance(shardRouting, allocation) via the cluster state by checking IndexMetadata#getTierPreference(), but it looks like knowledge of specific data tiers lives in x-pack as it stands, and I wonder if we want that to leak out. Tier preference seems largely opaque to the rest of the code base, though I may have misinterpreted that.
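
For concreteness, a rough sketch of the role-based check (assuming "dedicated" means data_frozen is the node's only data role; the PR's actual implementation may differ):

import java.util.Set;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.cluster.node.DiscoveryNodeRole;

// Sketch: a node is "dedicated frozen" if it carries the data_frozen role and
// no other data role.
static boolean isDedicatedFrozenNode(DiscoveryNode node) {
    Set<DiscoveryNodeRole> roles = node.getRoles();
    return roles.contains(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)
        && roles.stream().noneMatch(role -> role.canContainData() && role != DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE);
}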

Contributor Author

I would like to check with someone else, too.

I wrote the frozen/non-frozen counts so that frozen is a subset of all relocations. If a relocation isn't counted as frozen, then it's counted as non-frozen. So if something is misclassified, it's just subjected to the tighter constraint.

I can add a flag to ShardRouting, but as we discussed (and you're reminding me here) it may be too far for the data tier concept to propagate.

return relocatingShards;
}

private boolean isFrozenNode(String nodeId) {
Contributor

Nit: Perhaps we should be explicit that we're looking for "dedicated" frozen nodes in the method name?

- removed extraneous count method in RoutingNode
- renamed RoutingNodes.isFrozenNode -> isDedicatedFrozenNode
- centralized frozen node check: the decider had its own
- reuse original "reached the limit" log message language; this got mangled
@nicktindall (Contributor) left a comment

Looking good, mostly cosmetic comments

assertThat(clusterState.routingTable().index("test").shard(i).shard(0).state(), equalTo(UNASSIGNED));
assertThat(clusterState.routingTable().index("test").shard(i).shard(1).state(), equalTo(UNASSIGNED));
assertThat(clusterState.routingTable().index("test").shard(i).shard(0).currentNodeId(), nullValue());
assertThat(clusterState.routingTable().index("test").shard(i).shard(1).currentNodeId(), nullValue());
Contributor

Nit: could store clusterState.routingTable().index("test") in a variable?

testClusterConcurrentInternal(true);
}

void testClusterConcurrentInternal(boolean testFrozen) {
Contributor

Nit: I don't love the boolean flag parameter here; I wonder if it would be neater to pass in e.g. a cluster settings, a node factory, and an index settings provider or something?

This could also just be a preference of mine that you or others might disagree with.

.add(newNode("node9", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)))
.add(newNode("node10", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)))
)
.build();
Contributor

I wonder if it'd be better to randomise the numbers here to make it clear that 8 isn't in any way significant

Settings.builder()
.put("cluster.routing.allocation.node_concurrent_recoveries", 10)
.put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", -1)
.build()
Contributor

Would it be better to set non-frozen recoveries to something very restrictive (like 1) so it's clear those limits are not in play here?

Contributor

Nudge on this one: could we maybe set it to zero to disable non-frozen movement altogether?

.add(newNode("node7", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)))
.add(newNode("node8", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)))
.add(newNode("node9", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)))
.add(newNode("node10", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)))
Contributor

Nit: Perhaps de-dup Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)

for (int i = 0; i < clusterState.routingTable().index("test_frozen").size(); i++) {
assertThat(clusterState.routingTable().index("test_frozen").shard(i).size(), equalTo(2));
assertThat(clusterState.routingTable().index("test_frozen").shard(i).primaryShard().state(), equalTo(STARTED));
assertThat(clusterState.routingTable().index("test_frozen").shard(i).replicaShards().get(0).state(), equalTo(INITIALIZING));
Contributor

Nit: try and reduce repetition in these blocks (e.g. clusterState.routingTable().index("test_frozen").shard(i))

assertThat(shardsWithState(clusterState.getRoutingNodes(), "test", RELOCATING).size(), equalTo(3));

assertThat(shardsWithState(clusterState.getRoutingNodes(), "test_frozen", STARTED).size(), equalTo(10));
assertThat(shardsWithState(clusterState.getRoutingNodes(), "test_frozen", RELOCATING).size(), equalTo(0));
Contributor

Could we have these frozen and hot rebalances happening concurrently to make it clear that the limits can both be maxed out concurrently perhaps? Or is there a reason they're done sequentially?

Contributor Author

It turns out I was incorrectly using DATA_ROLE at the top instead of DATA_HOT_ROLE.

@schase-es (Contributor Author) commented Oct 8, 2025

Thanks for the comments on refactoring the tests, Nick. Sometimes I find boilerplate more readable, but in some of those situations it was getting hard to read.

The issue with the test was a good one to point out. I looked over the shard assignments earlier and thought everything was fine. It turns out the test was still able to mix frozen -> data nodes (but not data -> frozen), so running those separately was the only way things worked. With the right node role, it works concurrently.

I am wondering a little about the default value of the frozen tier concurrent setting. It's at 10 now. Should I keep it at 2 or 3? That way it will behave the same as it did before (well, maybe not if the cluster has a different-than-default concurrent rebalance limit).

The right behavior in my mind would be to default the frozen limit to the non-frozen limit's current value, then let users modify it from there.

@nicktindall (Contributor)

> I am wondering a little about the default value of the frozen tier concurrent setting. It's at 10 now. Should I keep it at 2 or 3? That way it will behave the same as it did before (well, maybe not if the cluster has a different-than-default concurrent rebalance limit).
>
> The right behavior in my mind would be to default the frozen limit to the non-frozen limit's current value, then let users modify it from there.

Yeah I think defaulting frozen to the same value as non-frozen makes sense. If we decide it needs to be higher in the future we can adjust that.
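
If we go that way, Setting already supports falling back to another setting for its default; a sketch (the key matches the diff, while the constant name and properties are assumptions):

// Sketch: default the frozen limit to the non-frozen limit's current value,
// following the node_concurrent_incoming_recoveries fallback pattern.
public static final Setting<Integer> CLUSTER_CONCURRENT_FROZEN_REBALANCE_SETTING = Setting.intSetting(
    "cluster.routing.allocation.cluster_concurrent_frozen_rebalance",
    CLUSTER_ROUTING_ALLOCATION_CLUSTER_CONCURRENT_REBALANCE_SETTING, // fallback default
    -1, // minimum; -1 disables the limit
    Setting.Property.Dynamic,
    Setting.Property.NodeScope
);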

"below threshold [%d] for concurrent rebalances, current rebalance shard count [%d]",
relocatingShards,
CLUSTER_ROUTING_ALLOCATION_CLUSTER_CONCURRENT_REBALANCE_SETTING.getKey(),
clusterConcurrentRebalance
Contributor

I think the string template doesn't match the parameters now? We used to have three placeholders; now we only have two. I wonder if we should add a unit test for the decider; there doesn't seem to be one already, but I have seen them implemented for other deciders, and it would make it easy to test all these branches. See org.elasticsearch.cluster.routing.allocation.decider.WriteLoadConstraintDeciderTests#testWriteLoadDeciderCanAllocate for example?
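
For instance, the message could carry all three arguments again, something like this (the exact wording is a guess):

// Sketch: three placeholders to match the three arguments.
return allocation.decision(
    Decision.YES,
    NAME,
    "below threshold [%d] set by [%s], current rebalance shard count [%d]",
    clusterConcurrentRebalance,
    CLUSTER_ROUTING_ALLOCATION_CLUSTER_CONCURRENT_REBALANCE_SETTING.getKey(),
    relocatingShards
);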

}

public void testClusterConcurrentRebalanceFrozen() {
AllocationService strategy = createAllocationService(
Contributor

Nit: can we call the variable allocationService? I think the strategy naming is a hangover from a previous refactor?

@nicktindall (Contributor) left a comment

This LGTM with some nits

* re-balance (shard relocation) operations and restricts node allocations
* if the configured threshold is reached. Frozen and non-frozen shards are
* considered separately. The default number of concurrent rebalance operations
* is set to {@code 2} for non-frozen shards, and {@code 10} for frozen shards.
Contributor

This doc is out of date with the implementation.

*/
public class ConcurrentRebalanceAllocationDeciderTests extends ESAllocationTestCase {

public void testConcurrentUnlimited() {
Contributor

Nit: the naming suggests this test is for unlimited only, but it seems to cover both when we're unlimited and when we're below the limit?

);
}

public void testFrozenConcurrentUnlimited() {
Contributor

Also this name

settings = Settings.builder()
.put("cluster.routing.allocation.cluster_concurrent_rebalance", 6)
.put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", 0)
.build();
Contributor

Nit: can we use a random value for the rebalance limit, with a random number of relocations (less than that value) in progress?
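
Something along these lines, perhaps (reusing setupConcurrentRelocations from this test; the exact shape is a suggestion):

// Sketch: a random limit with strictly fewer relocations already in flight,
// so the decider should always say yes.
int limit = randomIntBetween(1, 10);
ClusterState clusterState = setupConcurrentRelocations(randomIntBetween(0, limit - 1));
Settings settings = Settings.builder()
    .put("cluster.routing.allocation.cluster_concurrent_rebalance", limit)
    .build();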

settings = Settings.builder()
.put("cluster.routing.allocation.cluster_concurrent_rebalance", 0)
.put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", 6)
.build();
Contributor

And please randomise this one too.

ClusterState clusterState = setupConcurrentRelocations(2);

Settings settings = Settings.builder()
.put("cluster.routing.allocation.cluster_concurrent_rebalance", 2)
Contributor

random please

Settings.builder()
.put("cluster.routing.allocation.node_concurrent_recoveries", 10)
.put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", -1)
.build()
Contributor

Nudge on this one: could we maybe set it to zero to disable non-frozen movement altogether?

logger.info("start the replica shards, rebalancing should start, but with a limit " + nodeCount + " should be rebalancing");
clusterState = startInitializingShardsAndReroute(allocationService, clusterState);

// we only allow any number of relocations at a time
Contributor

Nit: comment wording? Perhaps mention that 8 will be migrating off to spread shards evenly.

assertThat(shardRouting.size(), equalTo(2));
assertThat(shardRouting.primaryShard().state(), equalTo(STARTED));
assertThat(shardRouting.replicaShards().get(0).state(), equalTo(INITIALIZING));
}
Contributor

This loop looks repeated a lot; I wonder if we could make a test utility to do these repeated assertions?
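
For example (the helper name and exact shape are assumed):

// Sketch: collapse the repeated per-shard assertions into one helper.
private static void assertPrimariesStartedReplicasInitializing(IndexRoutingTable index) {
    for (int i = 0; i < index.size(); i++) {
        IndexShardRoutingTable shardRouting = index.shard(i);
        assertThat(shardRouting.size(), equalTo(2));
        assertThat(shardRouting.primaryShard().state(), equalTo(STARTED));
        assertThat(shardRouting.replicaShards().get(0).state(), equalTo(INITIALIZING));
    }
}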


Labels: :Distributed Coordination/Allocation, >non-issue, Team:Distributed Coordination, v9.3.0
