allocation: create separate limit for frozen tier concurrent rebalance #135243
Conversation
The concurrent rebalance limit prevents too many shard relocations from happening at once. This works well for limiting active shards on the hot tier, so that compute is reserved for indexing. Because the frozen tier is an archive, compute separation is less of a concern there, and more rebalancing activity can occur without degradation. This change introduces a separate setting for managing the frozen tier limit.
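For context, a minimal sketch of how the two limits could sit side by side in cluster settings; the frozen setting key is taken from the test snippets later in this conversation, and the values shown are placeholders rather than the PR's chosen defaults.

import org.elasticsearch.common.settings.Settings;

// Minimal sketch: the existing cluster-wide rebalance limit alongside the
// proposed frozen-tier limit. The values are placeholders for illustration.
public class ConcurrentRebalanceSettingsSketch {
    public static Settings exampleSettings() {
        return Settings.builder()
            .put("cluster.routing.allocation.cluster_concurrent_rebalance", 2)        // non-frozen shards
            .put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", 10) // frozen shards
            .build();
    }
}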
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
Looking good, I think the approach is sound, using the different limits depending on the type of node.
// Above allocator is used in DesiredBalanceComputer. Since we do not move actual shard data during calculation
// it is possible to artificially set above setting to 2 to avoid unnecessary moves in desired balance.
// Separately: keep overall limit in simulation to two including frozen shards
return allocation.decision(Decision.THROTTLE, NAME, "allocation should move one shard at the time when simulating");
As discussed, this whole branch might be redundant (?) since #134786
This is helpful -- can we leave this to a separate PR? This may not be the only decider that needs to change.
.put("cluster.routing.allocation.cluster_concurrent_rebalance", 3) | ||
.build() | ||
); | ||
boolean testFrozen = randomBoolean(); |
it's probably worth having a separate method to test frozen? see https://github.com/elastic/elasticsearch/blob/main/TESTING.asciidoc#use-randomized-testing-for-coverage
- per-shard decision has been implemented, using the cluster-wide logic from before
- cluster-wide decision has been changed to allow progress whenever there is any budget

Thanks Nick -- this was really helpful. I updated the overall algorithm shape to match what we talked about: the shard-level decision implements the cluster-wide logic from before, but specific to frozen or non-frozen counts. The cluster-level algorithm checks for room in either limit, including one bit we didn't mention but I realize mattered: one limit having "unlimited" as a configuration status. I have some more to do on optimizing the counts for frozen. I need to have a look at this and decide what to do, but I think I need to keep a shardId set in routing nodes, or set a flag on ShardRouting. There are a few other comments you made that I didn't get to quite yet.
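For readers following along, a simplified, standalone sketch of that decision shape; the names below are illustrative rather than the PR's actual code, and treating a limit of -1 as "unlimited" is an assumption carried over from the existing rebalance setting.

// Sketch only: shard-level check applies the per-tier limit, cluster-level
// check allows progress whenever either budget still has room.
final class RebalanceBudgetSketch {

    static boolean withinLimit(int relocating, int limit) {
        return limit < 0 || relocating < limit; // a negative limit means unlimited
    }

    // Shard-level check: apply the limit for the tier the shard is moving off of.
    static boolean canRebalanceShard(boolean frozenShard,
                                     int frozenRelocating, int frozenLimit,
                                     int nonFrozenRelocating, int nonFrozenLimit) {
        return frozenShard
            ? withinLimit(frozenRelocating, frozenLimit)
            : withinLimit(nonFrozenRelocating, nonFrozenLimit);
    }

    // Cluster-level check: allow progress whenever either budget still has room.
    static boolean canRebalanceCluster(int frozenRelocating, int frozenLimit,
                                       int nonFrozenRelocating, int nonFrozenLimit) {
        return withinLimit(frozenRelocating, frozenLimit)
            || withinLimit(nonFrozenRelocating, nonFrozenLimit);
    }
}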
- Implemented count of relocatingShards inside RoutingNodes
Hi Nick -- I added the count inside RoutingNodes, as discussed. It wasn't as bad as I thought. I am wondering what you think about the cluster-wide logic. The check for unlimited makes things a little different than the addition we talked about, but I think what I have is simple enough. It could do with a Decision refactor I guess.
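A rough sketch of how such counts can be kept in sync incrementally as relocations start and finish; the names here are hypothetical and this is not the PR's RoutingNodes code.

// Illustrative only: the frozen count is maintained as a subset of the
// overall relocating-shard count.
final class RelocationCountsSketch {
    private int relocatingShards;        // all relocating shards
    private int relocatingFrozenShards;  // subset relocating off dedicated frozen nodes

    void relocationStarted(boolean fromDedicatedFrozenNode) {
        relocatingShards++;
        if (fromDedicatedFrozenNode) {
            relocatingFrozenShards++;
        }
    }

    void relocationFinished(boolean fromDedicatedFrozenNode) {
        relocatingShards--;
        if (fromDedicatedFrozenNode) {
            relocatingFrozenShards--;
        }
    }

    int relocatingFrozenShards() {
        return relocatingFrozenShards;
    }

    int relocatingNonFrozenShards() {
        return relocatingShards - relocatingFrozenShards;
    }
}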
return true;
}
return false;
}
We should check this logic for determining what constitutes a frozen shard with wiser heads. We are counting only shards moving off of a dedicated frozen node as "frozen" shards.
I think this is an acceptable approximation because:
- We recommend self-managed deployments use dedicated frozen nodes, and we only have dedicated frozen nodes in ECH (I believe), so "dedicated frozen node" is equivalent to just "frozen node" if things are configured as we recommend. I think it's OK to not allow more rebalancing in the event someone has gone outside our recommendations and created non-dedicated frozen nodes.
- If I recall correctly, part of the justification for the change was that the frozen and non-frozen nodes are mutually exclusive
- There is an edge case where we'd count shards relocating from dedicated to non-dedicated frozen nodes, but not vice versa; again, that seems outside of our recommendations and not worth going to great effort to remedy. If we want to be strict about that, we could also check the relocatingNodeId for dedicated frozen-ness.
- To determine whether a shard is a frozen shard we need to look at the index metadata, which we could do in canRebalance(shardRouting, allocation) via the cluster state by checking IndexMetadata#getTierPreference(), but it looks like knowledge of specific data tiers lives in x-pack as it stands, and I wonder if we want that to leak out. It looks like tier preference is largely opaque to the rest of the code base; I may have misinterpreted that.
I would like to check with someone else, too.
I wrote the frozen/non-frozen counts so that frozen is a subset count of all relocations. If a relocation isn't counted as frozen, then it's counted as non-frozen. So if something is misclassified, it's just subject to the tighter constraint.
I can add a flag to ShardRouting, but as we discussed (and you're reminding me here), that may be propagating the data tier concept too far.
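A sketch of the "dedicated frozen node" approximation discussed in this thread, expressed against node roles; the helper name and the use of DiscoveryNodeRole#canContainData are assumptions, not the PR's actual implementation.

import java.util.Set;

import org.elasticsearch.cluster.node.DiscoveryNodeRole;

// Sketch: a node is "dedicated frozen" if it carries the frozen data role and
// no other data role.
final class DedicatedFrozenCheckSketch {
    static boolean isDedicatedFrozenNode(Set<DiscoveryNodeRole> roles) {
        return roles.contains(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)
            && roles.stream()
                .filter(DiscoveryNodeRole::canContainData)
                .allMatch(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE::equals);
    }
}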
return relocatingShards;
}

private boolean isFrozenNode(String nodeId) {
Nit: Perhaps we should be explicit that we're looking for "dedicated" frozen nodes in the method name?
- removed extraneous count method in RoutingNode
- renamed RoutingNodes.isFrozenNode -> isDedicatedFrozenNode
- centralized frozen node check: decider had its own
- reuse original "reached the limit" log message language; this got mangled
Looking good, mostly cosmetic comments
assertThat(clusterState.routingTable().index("test").shard(i).shard(0).state(), equalTo(UNASSIGNED));
assertThat(clusterState.routingTable().index("test").shard(i).shard(1).state(), equalTo(UNASSIGNED));
assertThat(clusterState.routingTable().index("test").shard(i).shard(0).currentNodeId(), nullValue());
assertThat(clusterState.routingTable().index("test").shard(i).shard(1).currentNodeId(), nullValue());
Nit: could store clusterState.routingTable().index("test") in a variable?
testClusterConcurrentInternal(true);
}

void testClusterConcurrentInternal(boolean testFrozen) {
Nit: I don't love the boolean flag parameter here, I wonder if it would be neater to pass in e.g. a cluster settings, a node factory and an index settings provider or something?
This could also just be a preference of mine that you or others might disagree with.
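One possible shape for that, as a fragment of the test class; the helper name runClusterConcurrentRebalanceTest and its parameters are hypothetical, not part of the PR.

// Hypothetical refactor: each public test stays a one-liner, and the shared
// body takes the tier-specific settings and node role instead of a boolean flag.
public void testClusterConcurrentRebalance() {
    runClusterConcurrentRebalanceTest(
        Settings.builder().put("cluster.routing.allocation.cluster_concurrent_rebalance", 3).build(),
        DiscoveryNodeRole.DATA_HOT_NODE_ROLE
    );
}

public void testClusterConcurrentRebalanceFrozen() {
    runClusterConcurrentRebalanceTest(
        Settings.builder().put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", 3).build(),
        DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE
    );
}

private void runClusterConcurrentRebalanceTest(Settings clusterSettings, DiscoveryNodeRole dataRole) {
    // build the nodes with the given role, apply the given settings, and run the shared assertions here
}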
.add(newNode("node9", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE))) | ||
.add(newNode("node10", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE))) | ||
) | ||
.build(); |
I wonder if it'd be better to randomise the numbers here to make it clear that 8 isn't in any way significant
Settings.builder()
.put("cluster.routing.allocation.node_concurrent_recoveries", 10)
.put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", -1)
.build()
Would it be better to set non-frozen recoveries to something very restrictive (like 1) so it's clear those limits are not in play here?
Nudge on this one: could set it to zero maybe to disable non-frozen movement altogether?
.add(newNode("node7", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE))) | ||
.add(newNode("node8", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE))) | ||
.add(newNode("node9", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE))) | ||
.add(newNode("node10", Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE))) |
Nit: Perhaps de-dup Collections.singleton(DiscoveryNodeRole.DATA_FROZEN_NODE_ROLE)
for (int i = 0; i < clusterState.routingTable().index("test_frozen").size(); i++) {
assertThat(clusterState.routingTable().index("test_frozen").shard(i).size(), equalTo(2));
assertThat(clusterState.routingTable().index("test_frozen").shard(i).primaryShard().state(), equalTo(STARTED));
assertThat(clusterState.routingTable().index("test_frozen").shard(i).replicaShards().get(0).state(), equalTo(INITIALIZING));
Nit: try and reduce repetition in these blocks (e.g. clusterState.routingTable().index("test_frozen").shard(i))
assertThat(shardsWithState(clusterState.getRoutingNodes(), "test", RELOCATING).size(), equalTo(3));

assertThat(shardsWithState(clusterState.getRoutingNodes(), "test_frozen", STARTED).size(), equalTo(10));
assertThat(shardsWithState(clusterState.getRoutingNodes(), "test_frozen", RELOCATING).size(), equalTo(0));
Could we have these frozen and hot rebalances happening concurrently to make it clear that the limits can both be maxed out concurrently perhaps? Or is there a reason they're done sequentially?
It turns out I was incorrectly using DATA_ROLE at the top instead of DATA_HOT_ROLE.
Thanks for the comments on refactoring the tests, Nick. Sometimes I find boilerplate more readable, but in some of those situations it was getting hard to read. The issue with the test was a good one to point out. I looked over the shard assignments earlier and thought everything was fine. It turns out it was still able to mix frozen -> data nodes (but not data -> frozen), so doing those runs separately was the only way things worked. It works concurrently with the right data label. I am wondering a little about the default config value of the frozen tier concurrent setting. It's at 10 now. Should I set it to 2 or 3 instead? That way it will work the same as it did before (well, maybe not if the cluster has a different-than-default concurrent rebalance limit). The right behavior in my mind would be to default the frozen limit to the non-frozen limit's current value, then let users modify it from there.
Yeah I think defaulting frozen to the same value as non-frozen makes sense. If we decide it needs to be higher in the future we can adjust that.
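A sketch of what "default frozen to the non-frozen value" could look like as a setting declaration, assuming the Setting.intSetting overload that accepts a fallback setting; the key is taken from the tests in this PR, and -1 meaning unlimited is an assumption rather than a confirmed choice.

import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;

import static org.elasticsearch.cluster.routing.allocation.decider.ConcurrentRebalanceAllocationDecider.CLUSTER_ROUTING_ALLOCATION_CLUSTER_CONCURRENT_REBALANCE_SETTING;

// Sketch: declare the frozen limit with the existing non-frozen setting as
// its fallback, so it defaults to whatever that limit is currently set to.
final class FrozenRebalanceSettingSketch {
    static final Setting<Integer> CLUSTER_CONCURRENT_FROZEN_REBALANCE = Setting.intSetting(
        "cluster.routing.allocation.cluster_concurrent_frozen_rebalance",
        CLUSTER_ROUTING_ALLOCATION_CLUSTER_CONCURRENT_REBALANCE_SETTING, // fallback: non-frozen limit
        -1,
        Property.Dynamic,
        Property.NodeScope
    );
}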
"below threshold [%d] for concurrent rebalances, current rebalance shard count [%d]", | ||
relocatingShards, | ||
CLUSTER_ROUTING_ALLOCATION_CLUSTER_CONCURRENT_REBALANCE_SETTING.getKey(), | ||
clusterConcurrentRebalance |
I think the string template doesn't match the parameters now? We used to have three placeholders; now we only have two. I wonder if we should add a unit test for the decider. There doesn't seem to be one already, but I have seen them implemented for other deciders, and it would make it easy to test all these branches. See org.elasticsearch.cluster.routing.allocation.decider.WriteLoadConstraintDeciderTests#testWriteLoadDeciderCanAllocate for example?
}

public void testClusterConcurrentRebalanceFrozen() {
AllocationService strategy = createAllocationService(
Nit: can we call the variable allocationService? I think the strategy naming is a hangover from a previous refactor?
This LGTM with some nits
* re-balance (shard relocation) operations and restricts node allocations
* if the configured threshold is reached. Frozen and non-frozen shards are
* considered separately. The default number of concurrent rebalance operations
* is set to {@code 2} for non-frozen shards, and {@code 10} for frozen shards.
This doc is out of date with the implementation.
*/
public class ConcurrentRebalanceAllocationDeciderTests extends ESAllocationTestCase {

public void testConcurrentUnlimited() {
Nit: naming suggests this test is for unlimited only, but it seems to be when we're unlimited and when we're below the limit?
);
}

public void testFrozenConcurrentUnlimited() {
Also this name
settings = Settings.builder()
.put("cluster.routing.allocation.cluster_concurrent_rebalance", 6)
.put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", 0)
.build();
Nit: can we do random value for rebalance with random less than that value in progress?
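A sketch of that randomization, written as a fragment to sit inside the test method; randomIntBetween comes from ESTestCase, and the variable names are illustrative.

// Pick a limit at random and a strictly smaller number of in-flight
// relocations, so the "below the limit" branch holds for any chosen values.
int concurrentRebalanceLimit = randomIntBetween(2, 10);
int relocationsInProgress = randomIntBetween(0, concurrentRebalanceLimit - 1);

Settings settings = Settings.builder()
    .put("cluster.routing.allocation.cluster_concurrent_rebalance", concurrentRebalanceLimit)
    .put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", 0)
    .build();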
settings = Settings.builder()
.put("cluster.routing.allocation.cluster_concurrent_rebalance", 0)
.put("cluster.routing.allocation.cluster_concurrent_frozen_rebalance", 6)
.build();
And randomise this one please.
ClusterState clusterState = setupConcurrentRelocations(2);

Settings settings = Settings.builder()
.put("cluster.routing.allocation.cluster_concurrent_rebalance", 2)
Randomise please.
logger.info("start the replica shards, rebalancing should start, but with a limit " + nodeCount + " should be rebalancing");
clusterState = startInitializingShardsAndReroute(allocationService, clusterState);

// we only allow any number of relocations at a time
Nit: comment wording? Perhaps mention that 8 will be migrating off to spread shards evenly.
assertThat(shardRouting.size(), equalTo(2));
assertThat(shardRouting.primaryShard().state(), equalTo(STARTED));
assertThat(shardRouting.replicaShards().get(0).state(), equalTo(INITIALIZING));
}
This loop looks repeated a lot, I wonder if we could make a test utility to do these repeated assertions?
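A sketch of such a helper, written as a private method to add to the test class; the helper name is hypothetical, and the assertion statics (assertThat, equalTo, STARTED, INITIALIZING) are the same ones the existing test already imports.

// Collapses the repeated primary-started / replica-initializing assertions
// into one call per shard routing table entry.
private static void assertPrimaryStartedReplicaInitializing(IndexShardRoutingTable shardRouting) {
    assertThat(shardRouting.size(), equalTo(2));
    assertThat(shardRouting.primaryShard().state(), equalTo(STARTED));
    assertThat(shardRouting.replicaShards().get(0).state(), equalTo(INITIALIZING));
}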
Fixes: ES-11303