Implement move non preferred phase in allocator #134429

nicktindall · 2025-09-10T07:21:24Z

This is an attempt at implementing the moveNonPreferred option for dealing with moving hot-spots.

I opted to make the iteration order pluggable instead of providing a pluggable prioritisation mechanism. It strikes me that, given we can only make a single move at a time, we probably want that to be as cheap as possible.

If we work out all the shards that could move and then ask which one to move, we do a lot of work up front for a single movement, especially in a large cluster. If instead (as in this PR) we iterate through the shards in priority order, we can stop as soon as we find a move that we can make, having hopefully only assessed a few shards for canRemain and move-ability.
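To make the cost argument concrete, here is a minimal sketch of the stop-at-first-match shape. Shards are plain strings and the `canMove` predicate is a hypothetical stand-in for the canRemain/decideMove checks; none of these names come from the PR.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical, simplified model: a "shard" is just a name, and canMove stands in
// for the (expensive) canRemain/decideMove checks. None of these names are from the PR.
class EarlyExitSketch {

    /** Walk shards in priority order and stop at the first one that can move. */
    static Optional<String> firstMovable(Iterable<String> shardsInPriorityOrder, Predicate<String> canMove) {
        for (String shard : shardsInPriorityOrder) {
            if (canMove.test(shard)) {
                return Optional.of(shard);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> shards = List.of("s1", "s2", "s3", "s4", "s5");
        AtomicInteger evaluations = new AtomicInteger();
        Predicate<String> canMove = shard -> {
            evaluations.incrementAndGet(); // count how many shards we had to assess
            return shard.equals("s2");
        };
        System.out.println(firstMovable(shards, canMove) + " after " + evaluations.get() + " evaluations");
    }
}
```

In this toy run only the first two shards are assessed; the remaining three are never looked at, which is the saving the paragraph above is after.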

The DefaultNonPreferredShardIteratorFactory is very much over-fitted to the problem of moving shards off of hot-spotted nodes; we can make that more general when we have other reasons for being NOT_PREFERRED. It lazily populates the iterator, returning the shards from a single hot-spotted node at a time, because we take the first move-able shard and then discard the rest of the iterator.


nicktindall · 2025-09-11T01:29:18Z

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

+            Decision canRemain = allocation.deciders().canRemain(shardRouting, routingNode, allocation);
+            if ((canRemain.type() == Type.NOT_PREFERRED || canRemain.type() == Type.NO) == false) {
+                return MoveDecision.remain(canRemain);
+            }


We will consider NO and NOT_PREFERRED here, because it may be that a NO is really a NOT_PREFERRED that's also a NO.

I don't quite follow here. I'd appreciate if you could help me understand it better. Do you mean the decision could be an overall NO because some other decider may say NO while the writeLoad decider says NOT_PREFERRED? Since we run moveShards first, do we still need to consider NO here?

In Allocation#withDeciders we return the "most negative" decision, which will be NO when one decider says NOT_PREFERRED and another says NO. Because we're iterating in most-desirable-to-move-first order, if we see either of these values returned it makes sense to assume there was a NOT_PREFERRED in there and make the move anyway. The alternative would be to assume there was no NOT_PREFERRED when there is a NO, and potentially move a less-preferred shard.
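A small sketch of why a NO can hide a NOT_PREFERRED once the decider results are combined. The three-value enum, its severity ordering and the method names are assumptions for illustration only, not the real Decision API.

```java
import java.util.List;

// Simplified stand-in for the decision types discussed in this thread. The severity
// ordering and method names below are assumptions for illustration, not the real Decision API.
class MostNegativeSketch {

    enum Type { YES, NOT_PREFERRED, NO } // declared from least to most negative (assumed)

    /** Combine individual decider results by keeping the most negative one. */
    static Type mostNegative(List<Type> deciderResults) {
        Type result = Type.YES;
        for (Type t : deciderResults) {
            if (t.compareTo(result) > 0) {
                result = t;
            }
        }
        return result;
    }

    /** Mirrors the check in the diff: only NOT_PREFERRED or NO makes us consider a move. */
    static boolean considerMoving(Type combined) {
        return combined == Type.NOT_PREFERRED || combined == Type.NO;
    }

    public static void main(String[] args) {
        Type combined = mostNegative(List.of(Type.NOT_PREFERRED, Type.NO));
        // The NOT_PREFERRED is hidden behind the NO once the results are combined.
        System.out.println(combined + " -> consider moving: " + considerMoving(combined));
    }
}
```

With only the combined value available, the caller cannot tell "NO alone" from "NO plus NOT_PREFERRED", which is why the check in the diff accepts both.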

This will come into play now: as @DiannaHohensee and I discussed this morning, it's probably better to run moveNonPreferred first, because otherwise we risk moving a sub-optimal shard when NO and NOT_PREFERRED intersect.

Can you elaborate on that argument about moving non-preferred first. I would naively think we want to ensure we move all hard-rules first - to vacate nodes - and then move the non-preferred after.

I think that also avoids this slightly confusing check.

I'll note that moveShards should try to move shards to places where canAllocate says YES over places where it says NOT_PREFERRED. Which seems to solve the sub-optimal shard movement issue?

Can you elaborate on that argument about moving non-preferred first.

The ShardMovementWriteLoadSimulator will simulate the end of a hot-spot as soon as a single shard leaves the node that is hot-spotting. So if moveShards runs first, it could eliminate the hot-spot before we reach moveNonPreferred and have the opportunity to select a sensible shard.

I would naively think we want to ensure we move all hard-rules first - to vacate nodes - and then move the non-preferred after.

The Balanced/DesiredBalanceShardsAllocators do not pick order of shard movement. The Reconciler does that -- the allocator and reconciler happen to have the same order, but I think the priority for the allocator is to make the best choices, not consider shard movement priority. The reconciler behavior is actually in my balancer changes patch.

The exception is allocateUnassigned for primaries, for which there's an early exit from the allocators to publish the DesiredBalance ASAP.

nicktindall · 2025-09-11T01:31:18Z

...org/elasticsearch/cluster/routing/allocation/allocator/NonPreferredShardIteratorFactory.java

+     * @return An iterator containing shards we'd like to move to a preferred allocation
+     */
+    Iterator<ShardRouting> createNonPreferredShardIterator(RoutingAllocation allocation);
+}


We could also just use Function<RoutingAllocation, Iterable<ShardRouting>> but having a specific interface gives us somewhere to put the documentation and NOOP implementation? Not particularly tied to the existence of this interface, but that's why I've put it here.


mhl-b · 2025-09-11T01:53:04Z

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

+            boolean movedAShard;
+            do {
+                // Any time we move a shard, we need to update the cluster info and ask again for the non-preferred shards
+                // as they may have changed
+                movedAShard = false;
+                for (Iterator<ShardRouting> nonPreferredShards = nonPreferredShardIteratorFactory.createNonPreferredShardIterator(
+                    allocation
+                ); nonPreferredShards.hasNext();) {
+                    if (tryMoveShardIfNonPreferred(nonPreferredShards.next())) {
+                        movedAShard = true;
+                        break;
+                    }
+                }
+                // TODO: Update cluster info
+            } while (movedAShard);


maybe return Iterable, then whole thing would be:

for (var shard : nonPreferredIterable(allocation)) { if (tryMoveShardIfNonPreferred(shard)) { return; } }

The code here is what the ultimate solution should look like, where we are able to do multiple moves in a single allocate call. We stop iterating when no moves are made, but each time we make a move we refresh the shard iterator, because we may have resolved a hot-spot, which could change the order or contents of the list.

For example, if there are two hot-spotted nodes (N and M), the first time we call for the iterator it will be:

N1, N2, N3, M1, M2, M3, ...

then if we successfully move N2 and it resolves the hot-spot, we'll ask again and get

M1, M2, M3, ...

Hence the nested loop, but true, the inner loop could be tidier with an Iterable

Switched to Iterable in cebf3a9
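For readers following along, the refresh-after-every-move loop can be sketched with a toy model. The node names, the "hot" flag and the rule that one move resolves a hot-spot are simplifying assumptions, not the real allocator or simulator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the loop shape described above. Node names, the "hot" flag and the rule that a
// single move resolves a hot-spot are simplifying assumptions, not the real allocator.
class RefreshAfterMoveSketch {

    final Map<String, Boolean> hotSpotted = new LinkedHashMap<>();
    final Map<String, List<String>> shardsByNode = new LinkedHashMap<>();
    final List<String> moves = new ArrayList<>();

    /** Rebuilt after every move: shards of currently hot-spotted nodes, node by node. */
    List<String> nonPreferredShards() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Boolean> node : hotSpotted.entrySet()) {
            if (node.getValue()) {
                result.addAll(shardsByNode.get(node.getKey()));
            }
        }
        return result;
    }

    /** Pretend only shards ending in "2" can be relocated; moving one resolves its node's hot-spot. */
    boolean tryMove(String shard) {
        if (shard.endsWith("2") == false) {
            return false;
        }
        String node = shard.substring(0, 1);
        shardsByNode.get(node).remove(shard);
        hotSpotted.put(node, false);
        moves.add(shard);
        return true;
    }

    void moveNonPreferred() {
        boolean movedAShard;
        do {
            movedAShard = false;
            for (String shard : nonPreferredShards()) { // fresh list on every pass
                if (tryMove(shard)) {
                    movedAShard = true;
                    break;
                }
            }
        } while (movedAShard);
    }

    public static void main(String[] args) {
        RefreshAfterMoveSketch sketch = new RefreshAfterMoveSketch();
        for (String node : List.of("N", "M")) {
            sketch.hotSpotted.put(node, true);
            sketch.shardsByNode.put(node, new ArrayList<>(List.of(node + "1", node + "2", node + "3")));
        }
        sketch.moveNonPreferred();
        System.out.println(sketch.moves);
    }
}
```

The outer loop ends on the first pass that makes no move, so it terminates once no shard in the refreshed list can be relocated.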

mhl-b · 2025-09-11T02:01:34Z

...sticsearch/cluster/routing/allocation/allocator/DefaultNonPreferredShardIteratorFactory.java

+                hotSpottedNodes.add(new NodeShardIterable(allocation, node, writeThreadPoolStats.maxThreadPoolQueueLatencyMillis()));
+            }
+        }
+        return new NodeShardIterator(hotSpottedNodes.iterator());


hotSpottedNodes seems to contain all nodes with stats available, not necessarily hot-spotted ones.
Is a maxQueueLatency threshold check missing?

No this is intentional and perhaps more of a naming issue, ideally I think this iterator factory just produces the iterator and doesn't do any filtering at all (that's the job of the deciders). It just returns shards in an order where the most desirable to move are presented first.

I realise I am doing some filtering by excluding nodes with no utilisation and shards with no write load, but that goes to my earlier comment about it being over-fitted to the write load use case. We can change that if things change, but currently there's no sense in investigating those shards.

mhl-b · 2025-09-11T02:12:14Z

...sticsearch/cluster/routing/allocation/allocator/DefaultNonPreferredShardIteratorFactory.java

+        private Iterator<ShardRouting> createShardIterator() {
+            final var shardWriteLoads = allocation.clusterInfo().getShardWriteLoads();
+            final List<ShardRouting> sortedRoutings = new ArrayList<>();
+            double totalWriteLoad = 0;
+            for (ShardRouting shard : routingNode) {
+                Double shardWriteLoad = shardWriteLoads.get(shard.shardId());
+                if (shardWriteLoad != null) {
+                    sortedRoutings.add(shard);
+                    totalWriteLoad += shardWriteLoad;
+                }
+            }
+            // TODO: Work out what this order should be
+            // Sort by distance-from-mean-write-load
+            double meanWriteLoad = totalWriteLoad / sortedRoutings.size();
+            sortedRoutings.sort(Comparator.comparing(sr -> Math.abs(shardWriteLoads.get(sr.shardId()) - meanWriteLoad)));
+            return sortedRoutings.iterator();
+        }


If I recall correctly, Henning mentioned picking a shard somewhere in the middle. I think we don't need a sort (strong order) but a set of average shards. For example, create two partitions: preferable and not. Everything that is 0.5-0.8 of maxShardLoad goes to preferable.

Yeah, we'd like to do that and I find that slightly harder with returning a list (though possibly doable).

I was thinking about this, an iterator that returns average shards first, if nothing worked then heavy shards(>0.8), then light (<0.5). In worst case traverse shards 4 times: find max load, then any average, then any heavy, then light.

private Stream<ShardRouting> shardsStream() {
    return StreamSupport.stream(routingNode.spliterator(), false);
}
...
var maxLoad = shardsStream().mapToDouble(ShardRouting::load).max().orElse(1.0);
var avg = shardsStream().filter(s -> s.load() / maxLoad >= 0.5 && s.load() / maxLoad <= 0.8);
var heavy = shardsStream().filter(s -> s.load() / maxLoad > 0.8);
var light = shardsStream().filter(s -> s.load() / maxLoad < 0.5);
return concat(concat(avg, heavy), light).iterator(); // Stream.concat()

PS there is no ShardRouting::load, used here for brevity
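A runnable variant of the same idea, for anyone who wants to try the ordering out. Shards are plain names and write loads come from a map, since ShardRouting has no load accessor; the class and method names, and the guard for an all-zero maximum, are additions of this sketch.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical model of the average / heavy / light ordering suggested above.
class TieredShardOrderSketch {

    static Iterator<String> tieredIterator(List<String> shards, Map<String, Double> writeLoads) {
        double maxLoad = shards.stream().mapToDouble(s -> writeLoads.get(s)).max().orElse(1.0);
        if (maxLoad <= 0.0) {
            return shards.iterator(); // nothing to rank by, keep the given order
        }
        // A fresh stream per tier, because a Stream can only be consumed once.
        Stream<String> average = shards.stream().filter(s -> {
            double ratio = writeLoads.get(s) / maxLoad;
            return ratio >= 0.5 && ratio <= 0.8;
        });
        Stream<String> heavy = shards.stream().filter(s -> writeLoads.get(s) / maxLoad > 0.8);
        Stream<String> light = shards.stream().filter(s -> writeLoads.get(s) / maxLoad < 0.5);
        // The tiers are concatenated without sorting, so order inside a tier is the input order.
        return Stream.concat(Stream.concat(average, heavy), light).iterator();
    }

    public static void main(String[] args) {
        List<String> shards = List.of("a", "b", "c", "d", "e");
        Map<String, Double> loads = Map.of("a", 10.0, "b", 6.0, "c", 2.0, "d", 9.0, "e", 7.0);
        tieredIterator(shards, loads).forEachRemaining(System.out::println);
    }
}
```

Computing the maximum still needs one full pass up front; after that the three filtered tiers are only evaluated as the iterator is consumed.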

Attempted in de052a3

:'-) ohboi. I would still opt for a lazy sequence rather than a sorted list. I believe the expected case is to have some average shards to move, hence allocating and sorting a list seems redundant; a single pass with a filter should suffice. Especially in the context of a 10k-shard node, there is a high probability of having a good average shard.

Yes, good point, made lazier in d59ea2b

Also didn't bother sorting inside the low/medium/high ranges, but can easily add that if we think it's worth it for the determinism.

DiannaHohensee

I took a spin through and left some thoughts.

DiannaHohensee · 2025-09-12T18:36:17Z

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

+        );
        balancer.allocateUnassigned();
        balancer.moveShards();
+        balancer.moveNonPreferred();


What are your thoughts on this ordering? I figured we'd need to run the new logic before moveShards, since moveShards could trigger the simulator to consider the hot-spot addressed, before we check in moveNonPreferred.

I think it depends on whether you think fixing hot-spots or shutting down nodes is more important. The naming would suggest it's more important to move canRemain=NO than canRemain=NOT_PREFERRED shards, therefore moveShards should get first priority at movement, but like you say, because moveNonPreferred prioritises movements, we may make sub-optimal moves in the event of an intersection between NO and NOT_PREFERRED. If there is an intersection, that would (most likely) suggest currently that there is a shutting down node that is also hot-spotting. In which case we have to evacuate all the shards either way.

I don't have a strong preference here because I hope it's rare enough to not matter. I'm inclined to follow the naming and prioritise moving NOs; perhaps we need to apply our prioritisation there too?

Since this currently breaks the shard moves simulator, can we run moveNonPreferred before moveShards? Otherwise we could have eliminated the hot-spot in the simulator, with moveShard shard relocations, by the time the code gets here.

I don't think the ordering will have a significant impact on shard movement. Especially since, to fix a hot-spot, we're moving a single shard per node per 30 second stats refresh cycle: I'd expect it to be irrelevant noise compared to the number of shards moved away from a shutting down node. This is also the allocator, the decisions we make here don't affect the order of shard movement. You'd have to do something with the Reconciler if you wanted to affect shard movement order.


DiannaHohensee · 2025-09-12T22:25:32Z

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

+        }
+
+        private boolean moveASingleNonPreferredShard() {
+            for (ShardRouting shardRouting : nonPreferredShardIteratorFactory.createNonPreferredShardIterator(allocation)) {


I expect you might be trying to be too generic. We know that we're dealing with the write load decider, and every shard will return not-preferred when there is a hot-spot. So we want a list of shards on a particular node (that's hot-spotting) ordered by write load estimate.

Not a concrete suggestion, rather a general thought on the implementation approach.

I was more explicit about that in my first attempt at this, but I don't think the approach was well received as it's something of a departure from the way deciders work currently.

What you suggest does actually happen in the DefaultNonPreferredShardIteratorFactory, up-front we build a list of nodes ordered by queue latency, we then return the shards from those nodes ordered by preference-for-moving. At this stage it's distance from mean write load, but that's subject to change.

I build this list lazily because the hope is we don't have to iterate too far through it to find a shard that's movable.

So the IteratorFactory interface is generic, but the default implementation is very tailored to what we know about the hot-spot decider.
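A rough sketch of that lazy shape, assuming nodes arrive already ordered and each node's shards are only sorted when the iterator reaches that node. Nodes and shards are plain names and the class is hypothetical; it is not the DefaultNonPreferredShardIteratorFactory from the PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical model: nodes and shards are names, write loads are doubles.
class LazyPerNodeShardIterator implements Iterator<String> {

    private final Iterator<String> nodesInPriorityOrder;
    private final Map<String, Map<String, Double>> writeLoadsByNode;
    private Iterator<String> currentNodeShards = Collections.emptyIterator();
    int nodesExpanded = 0; // exposed so the laziness can be observed

    LazyPerNodeShardIterator(List<String> nodesInPriorityOrder, Map<String, Map<String, Double>> writeLoadsByNode) {
        this.nodesInPriorityOrder = nodesInPriorityOrder.iterator();
        this.writeLoadsByNode = writeLoadsByNode;
    }

    /** One node's shards, closest to that node's mean write load first; ties broken by name. */
    static List<String> orderByDistanceFromMean(Map<String, Double> writeLoads) {
        double mean = writeLoads.values().stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        List<String> shards = new ArrayList<>(writeLoads.keySet());
        shards.sort(
            Comparator.comparingDouble((String s) -> Math.abs(writeLoads.get(s) - mean))
                .thenComparing(Comparator.naturalOrder())
        );
        return shards;
    }

    @Override
    public boolean hasNext() {
        // Only sort the next node's shards once the current node is exhausted.
        while (currentNodeShards.hasNext() == false && nodesInPriorityOrder.hasNext()) {
            nodesExpanded++;
            currentNodeShards = orderByDistanceFromMean(writeLoadsByNode.get(nodesInPriorityOrder.next())).iterator();
        }
        return currentNodeShards.hasNext();
    }

    @Override
    public String next() {
        if (hasNext() == false) {
            throw new NoSuchElementException();
        }
        return currentNodeShards.next();
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> loads = Map.of(
            "A", Map.of("a1", 1.0, "a2", 5.0, "a3", 9.0),
            "B", Map.of("b1", 2.0)
        );
        LazyPerNodeShardIterator shards = new LazyPerNodeShardIterator(List.of("A", "B"), loads);
        System.out.println(shards.next() + ", nodes expanded so far: " + shards.nodesExpanded);
    }
}
```

If the first shard returned turns out to be movable, node B is never expanded or sorted at all.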

I haven't taken a good look at your first attempt. My first thought would be to avoid using the deciders until we've got a list of ordered shards for a hot node to try to relocate. Start by filtering down to the nodes exceeding the queue latency threshold, discard the other nodes. Then the allocation deciders only come into play to select a new node assignment.

We would be ignoring the WriteLoadDecider's canRemain method... It's not obvious to me how to not ignore it 🤔 To move moveNonPreferred before moveShards, we'd have NO answers covering NOT_PREFERRED answers, which is another problem with using canRemain.

up-front we build a list of nodes ordered by queue latency

We only need to look at hot-spotting nodes, and there's no need to create a relative order for the nodes.

I think my initial thought would be to run through all the nodes, call canRemain on all shards from that node and collect those with NOT_PREFERRED result that have a YES result elsewhere. Then call the strategy to pick the one shard to move.

I think we've discussed this, but maybe it was discarded?

I think this prepares us better for multiple deciders saying not-preferred.
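For comparison, the collect-first shape suggested here might look roughly like the sketch below. The two predicates stand in for "canRemain is NOT_PREFERRED" and "canAllocate says YES on some other node", and the comparator plays the role of the strategy; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Sketch of the "collect first, then pick" shape. Shards are plain names; the predicates
// and the comparator are hypothetical stand-ins for the deciders and the strategy.
class CollectThenPickSketch {

    static List<String> collectCandidates(
        List<String> allShards,
        Predicate<String> remainIsNotPreferred,
        Predicate<String> hasYesElsewhere
    ) {
        List<String> candidates = new ArrayList<>();
        for (String shard : allShards) {
            // Every shard is assessed up front, whether or not it ends up being chosen.
            if (remainIsNotPreferred.test(shard) && hasYesElsewhere.test(shard)) {
                candidates.add(shard);
            }
        }
        return candidates;
    }

    /** The strategy sees the full candidate set and returns the one shard to move. */
    static Optional<String> pickOne(List<String> candidates, Comparator<String> preference) {
        return candidates.stream().min(preference);
    }

    public static void main(String[] args) {
        List<String> shards = List.of("n1-s1", "n1-s2", "n2-s1", "n2-s2");
        List<String> candidates = collectCandidates(shards, s -> s.startsWith("n1"), s -> s.endsWith("s1") == false);
        System.out.println(pickOne(candidates, Comparator.naturalOrder()));
    }
}
```

The appeal is that the strategy sees every movable shard at once; the cost is that both checks run for every shard before a single move is chosen.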

Sorry @henningandersen I did deviate a little from what we discussed, but thinking only from an optimisation perspective, I figured it would be conceptually the same structure.

i.e. my implementation does:

1. build an iterator according to the prioritisation logic
2. work our way down it, calling canRemain/decideMove to determine the first shard that can move, and then move it
3. repeat

whereas, if I understand correctly, you've advocated for:

1. call canRemain/decideMove to determine the set of shards that want to move, and can move
2. pick one using the prioritisation logic and move it
3. repeat

My thinking was that the latter approach would do loads of work up front (e.g. in a cluster with ~10,000 shards on each of multiple hot-spotted nodes) only to then move a single shard. The decideMove logic is ~O(n^2), whereas the prioritisation logic is almost certainly cheaper than that (just a sort) and currently able to be performed lazily one node at a time.

I think you mentioned there may be cases where we could implement special logic if we knew the full set of shards that were moveable in that prioritisation logic, but it seems to me we should defer that cost until we identify some such scenarios?

I think my initial thought would be to run through all the nodes, call canRemain on all shards from that node and collect those with NOT_PREFERRED result that have a YES result elsewhere. Then call the strategy to pick the one shard to move.

If a node is write load hot-spotting, then canRemain will return NOT_PREFERRED for every shard because queue_latency > queue_latency_threshold will always be true. Any not-preferred decider will run on node-level resources and so return the same canRemain answer for all shards, I think?

canAllocate YES sounds like a nice filter.

I think this prepares us better for multiple deciders saying not-preferred.

I can't see a way for the balancer not to know about individual deciders for not-preferred / hotspots. Suppose the heap usage returned not-preferred (it doesn't, but for sake of discussion).

If the balancer checks all the deciders for canRemain NOT_PREFERRED, and finds a hot-spot, we move on to correcting the hot-spot. However, to correct the hot-spot, we need to know which resource is hot spotting because the shard order prioritization will be different for write load vs heap usage.

I think the balancer needs to know about individual deciders to address hot-spots, in order to prioritize the shards for relocation. Alternatively, a decider would need to be responsible for providing a strategy for ordering shards -- the AllocationDeciders would return a list of strategies, and the balancer runs a strategy per resource hot-spot.

I think the main case I want to add next is the index anti-affinity and there I think the strategy of picking a relevant loaded shard of the candidates is still good. But I agree we may want a more advanced strategy. It could however also look at the base data again, determining out of the moveable shards which one to pick based on the known dimensions. That could be as simple as "if the node has a queue latency go by write-load, otherwise pick one, does not matter which (some determinism may be preferable though)".
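The "go by write-load if the node has a queue latency, otherwise pick deterministically" rule could be sketched as below. Picking the highest write load and using name order as the fallback are assumptions of this sketch; the thread does not fix either choice.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical strategy over plain shard names. "Highest write load first" and
// "name order as the deterministic fallback" are assumptions made for illustration.
class PickByKnownDimensionsSketch {

    static Optional<String> pick(List<String> movableShards, Map<String, Double> writeLoads, long nodeQueueLatencyMillis) {
        if (nodeQueueLatencyMillis > 0) {
            // The node shows queue latency, so rank the movable shards by write load.
            return movableShards.stream()
                .max(
                    Comparator.comparingDouble((String s) -> writeLoads.getOrDefault(s, 0.0))
                        .thenComparing(Comparator.naturalOrder())
                );
        }
        // No write-load signal: any shard will do, but stay deterministic.
        return movableShards.stream().min(Comparator.naturalOrder());
    }

    public static void main(String[] args) {
        List<String> movable = List.of("c", "a", "b");
        Map<String, Double> loads = Map.of("a", 1.0, "b", 3.0, "c", 2.0);
        System.out.println(pick(movable, loads, 100) + " / " + pick(movable, loads, 0));
    }
}
```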

DiannaHohensee · 2025-09-12T22:27:02Z

...sticsearch/cluster/routing/allocation/allocator/DefaultNonPreferredShardIteratorFactory.java

+        }
+
+        private Iterator<ShardRouting> createShardIterator() {
+            final var shardWriteLoads = allocation.clusterInfo().getShardWriteLoads();


It looks like you’re creating a list of shards across all nodes. I wonder if instead, we could first collect a list of nodes that are hot spotting, then create separate lists of shards (with their write loads, skip any shards with 0 load) for each hot spotting node from the allocation.clusterInfo().getShardWriteLoads(), and finally sort and iterate each shard list in the order we prefer, checking whether we can move each shard until we find one that’s movable for each node. Still need an iterator to sort and manage a list of shards, but it might be simpler just iterating at that level? Then the nodes don't need iterators.

We only want to move one shard per node. Not obvious to me how to easily achieve that when iterating all shards at once.

The approach here is it's an iterator that returns the shards we'd like to move next, in order of preference. Once we move a shard we ask again for this list. We have to do this because every time we move a shard it can change the list of shards we want to move (e.g. if a shard movement resolves a hot-spot, the shards from that node might appear further down the list in the subsequent iterator, and a lesser-hot-spotted node might appear at the front of it instead).

I tried to not do any filtering here, because it's supposed to be the prioritisation logic, where the deciders themselves decide whether we canRemain (it would seem to be duplicating logic to do it also here).

If we go through one of these iterators and don't find any shard we want to move, we break out of the loop and continue to balancing.

We have to do this because every time we move a shard it can change the list of shards we want to move

But moving a single shard resolves the hot spot. Even if we move one shard off of a NodeA, the priority order for further shards to move away from NodeA shouldn't be dynamic 🤔

if a shard movement resolves a hot-spot, the shards from that node might appear further down the list in the subsequent iterator, and a lesser-hot-spotted node might appear at the front of it instead

IIUC, you're trying to fairly spread node hot-spot resolution? Like pick a shard for NodeA, then pick a shard for NodeB, before coming back to NodeA. I don't think that matters for the allocator, which comes up with the final allocation, not the plan for which shards to move first. NodeA is hot-spotting, and we can focus on NodeA's shards to resolve the hot spot, before moving on to NodeB's shards. We wouldn't be assigning any of NodeA or NodeB's shards to NodeA or NodeB because they are hot / not-preferred, so there's no interaction there, and no need for evenness / fairness in selection order.

IIUC, you're trying to fairly spread node hot-spot resolution?

No, as discussed on zoom the iterator represents our preference for the next move. e.g. if there are three nodes (M, N, O) with queue latencies (100, 50, 0) the shards will be iterated in the order

M1, M2, M3, M4, N1, N2, N3, O1, O2

where Mx denotes the shard on node M that is the xth most desirable to move.

So we'll iterate through that list finding the first of those shards that can move somewhere, then execute the move, then we'll ask for that list again in the next iteration.

Say we moved a shard from M to O and now our latencies for (M, N, O) are (0, 50, 0), the next iterator will look like

N1, N2, N3, M1, M2, M3, O1, O2, O3

because N is now the most likely to be hot-spotted, so it goes to the front of the list.

Then we move a shard off of N to O and the latencies change to (M, N, O) = (0, 0, 0).

Then the iterator would look something like (although M, N, O could be in any order because they're all equal):

N1, N2, M1, M2, M3, O1, O2, O3, O4

Which we'd iterate through and find no shard with canRemain = NOT_PREFERRED so we'd make no movements and move on to the next phase.
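The node ordering in this walk-through can be reproduced with a small sketch. It assumes nodes are ordered by queue latency, highest first, with name order as an arbitrary tie-break; shard "Mx" is simply the x-th shard of node M, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical reconstruction of the ordering described above, using node names,
// queue latencies in milliseconds and per-node shard counts.
class LatencyOrderSketch {

    /** Nodes by queue latency, highest first; equal latencies fall back to name order (an arbitrary choice). */
    static List<String> nodeOrder(Map<String, Long> queueLatencyMillis) {
        List<String> nodes = new ArrayList<>(queueLatencyMillis.keySet());
        nodes.sort(
            Comparator.comparingLong((String node) -> queueLatencyMillis.get(node))
                .reversed()
                .thenComparing(Comparator.naturalOrder())
        );
        return nodes;
    }

    /** Shards of the highest-latency node come first, then the next node's shards, and so on. */
    static List<String> iterationOrder(Map<String, Long> queueLatencyMillis, Map<String, Integer> shardCounts) {
        List<String> order = new ArrayList<>();
        for (String node : nodeOrder(queueLatencyMillis)) {
            for (int x = 1; x <= shardCounts.get(node); x++) {
                order.add(node + x);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Latencies (M, N, O) = (100, 50, 0) with 4, 3 and 2 shards respectively.
        System.out.println(iterationOrder(Map.of("M", 100L, "N", 50L, "O", 0L), Map.of("M", 4, "N", 3, "O", 2)));
        // After M's hot-spot is resolved, N has the highest latency and leads the order.
        System.out.println(nodeOrder(Map.of("M", 0L, "N", 50L, "O", 0L)));
    }
}
```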

This ties in with my prior comment about actually calling canRemain first.

DiannaHohensee

I think a theme I see is the idea of trying to prioritize movements. But the allocator does not control that. The Reconciler controls move prioritization. The exception is unassigned shards, which has an early exit out of the DesiredBalanceShardsAllocator. Otherwise, in the existing code, it makes sense to fix the NO decisions of the allocators before rebalancing while continuing to obey NO decisions: addressing NOs after rebalancing would potentially unbalance the cluster, undoing that work. From this perspective, we just need moveNonPreferred to be in a place that it can run properly.

It'd be great to discuss my comment threads with you verbally when you have time, to fast-track the thread resolution.


DiannaHohensee · 2025-09-15T20:11:03Z

...sticsearch/cluster/routing/allocation/allocator/DefaultNonPreferredShardIteratorFactory.java

+        }
+
+        private Iterator<ShardRouting> createShardIterator() {
+            final var shardWriteLoads = allocation.clusterInfo().getShardWriteLoads();


We have to do this because every time we move a shard it can change the list of shards we want to move

But moving a single shard resolves the hot spot. Even if we move one shard off of a NodeA, the priority order for further shards to move away from NodeA shouldn't be dynamic 🤔

if a shard movement resolves a hot-spot, the shards from that node might appear further down the list in the subsequent iterator, and a lesser-hot-spotted node might appear at the front of it instead

IIUC, you're trying to fairly spread node hot-spot resolution? Like pick a shard for NodeA, then pick a shard for NodeB, before coming back to NodeA. I don't think that matters for the allocator, which comes up with the final allocation, not the plan for which shards to move first. NodeA is hot-spotting, and we can focus on NodeA's shards to resolve the hot spot, before moving on to NodeB's shards. We wouldn't be assigning any of NodeA or NodeB's shards to NodeA or NodeB because they are hot / not-preferred, so there's no interaction there, and no need for evenness / fairness in selection order.

DiannaHohensee · 2025-09-15T20:18:43Z

server/src/main/java/org/elasticsearch/cluster/ClusterModule.java

+                balancerSettings,
+                writeLoadForecaster,
+                balancingWeightsFactory,
+                nonPreferredShardIteratorFactory


Rather than passing the factory implementation through the BalancedShardsAllocator and Balancer constructors, could we directly add the logic to the Balancer in the first place? Avoid the factory. The other objects passed through the constructors are usually shared with other components, whereas the new logic only runs in the Balancer.

The moveNonPreferred could be gated by the WRITE_LOAD_DECIDER_ENABLED_SETTING. An alternative to the NOOP implementation. I don’t think tests would even be able to exercise moveNonPreferred without some hot-spot mocking to get to a 5 second queue latency, even if the new logic were enabled by default.

Though perhaps there was some other reason for the NOOP / adding it here that I'm missing. Factories seem to come into play often for stateful vs stateless impls, but we don't have an alternative real implementation.

The idea here is just to put a boundary on the responsibilities of the two classes, the BalancedShardsAllocator doesn't care about the iteration order of the shards - as long as the iterator contains all the shards this logic will work.

Similarly to how the BalancedShardsAllocator doesn't care what the individual deciders do, it just knows about YES/NO/THROTTLE/NOT_PREFERRED.

In my opinion the interface delineates responsibilities, and allows the reader to not concern themselves with the implementation details of the iteration order when grok-ing the BalancedShardsAllocator. It also frees us up to bake all kinds of knowledge about the configured deciders into our implementation without that knowledge leaking into the BalancedShardsAllocator.

The default implementation could equally be

allocation -> allocation.routingNodes().nodeInterleavedShardIterator()
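As a sketch of that boundary: a single-method interface lets the caller loop over whatever order it is handed, with a NOOP constant and lambdas as interchangeable implementations. `Allocation` and `ShardOrder` here are simplified, hypothetical stand-ins, not RoutingAllocation or the interface in this PR.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical, stripped-down version of the pluggable-order boundary described above.
class PluggableOrderSketch {

    /** Stand-in for the allocation state: just a list of shard names. */
    static final class Allocation {
        final List<String> shards;

        Allocation(List<String> shards) {
            this.shards = shards;
        }
    }

    @FunctionalInterface
    interface ShardOrder {
        /** Yields nothing, so the non-preferred phase does no work. */
        ShardOrder NOOP = allocation -> Collections.emptyList();

        Iterable<String> shardsInPreferenceOrder(Allocation allocation);
    }

    /** The caller only iterates; it has no knowledge of how the order was produced. */
    static int countVisited(ShardOrder order, Allocation allocation) {
        int visited = 0;
        for (String ignored : order.shardsInPreferenceOrder(allocation)) {
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        Allocation allocation = new Allocation(List.of("s1", "s2", "s3"));
        ShardOrder allShards = a -> a.shards; // any ordering can be plugged in as a lambda
        System.out.println(countVisited(ShardOrder.NOOP, allocation) + " vs " + countVisited(allShards, allocation));
    }
}
```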

DiannaHohensee · 2025-09-15T21:18:19Z

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

+        }
+
+        private boolean moveASingleNonPreferredShard() {
+            for (ShardRouting shardRouting : nonPreferredShardIteratorFactory.createNonPreferredShardIterator(allocation)) {


I haven't taken a good look at your first attempt. My first thought would be to avoid using the deciders until we've got a list of ordered shards for a hot node to try to relocate. Start by filtering down to the nodes exceeding the queue latency threshold, discard the other nodes. Then the allocation deciders only come into play to select a new node assignment.

We would be ignoring the WriteLoadDecider's canRemain method... It's not obvious to me how to not ignore it 🤔 To move moveNonPreferred before moveShards, we'd have NO answers covering NOT_PREFERRED answers, which is another problem with using canRemain.

up-front we build a list of nodes ordered by queue latency

We only need to look at hot-spotting nodes, and there's no need to create a relative order for the nodes.


henningandersen

Left a few initial comments. Maybe we need to POC the direction of doing canRemain first in a separate PR to figure out the direction (or maybe I am the only one thinking that is how it should work)?

henningandersen · 2025-09-16T12:02:08Z

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

-        collectAndRecordNodeWeightStats(balancer, balancingWeights, allocation);
+        try {
+            balancer.allocateUnassigned();
+            if (balancer.moveNonPreferred()) {


I think this should go after moveShards (but before balance)? It seems more important to use the incoming recovery budget on a target node for handling shutting down nodes or other rules than non-preference?

henningandersen · 2025-09-16T12:07:31Z

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

+        }
+
+        private boolean moveASingleNonPreferredShard() {
+            for (ShardRouting shardRouting : nonPreferredShardIteratorFactory.createNonPreferredShardIterator(allocation)) {


I think my initial thought would be to run through all the nodes, call canRemain on all shards from that node and collect those with NOT_PREFERRED result that have a YES result elsewhere. Then call the strategy to pick the one shard to move.

I think we've discussed this, but maybe it was discarded?

I think this prepares us better for multiple deciders saying not-preferred.

henningandersen · 2025-09-16T12:11:17Z

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

+            Decision canRemain = allocation.deciders().canRemain(shardRouting, routingNode, allocation);
+            if ((canRemain.type() == Type.NOT_PREFERRED || canRemain.type() == Type.NO) == false) {
+                return MoveDecision.remain(canRemain);
+            }


Can you elaborate on that argument about moving non-preferred first. I would naively think we want to ensure we move all hard-rules first - to vacate nodes - and then move the non-preferred after.

I think that also avoids this slightly confusing check.

henningandersen · 2025-09-16T12:12:22Z

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

+            Decision canRemain = allocation.deciders().canRemain(shardRouting, routingNode, allocation);
+            if ((canRemain.type() == Type.NOT_PREFERRED || canRemain.type() == Type.NO) == false) {
+                return MoveDecision.remain(canRemain);
+            }


I'll note that moveShards should try to move shards to places where canAllocate says YES over places where it says NOT_PREFERRED. Which seems to solve the sub-optimal shard movement issue?

henningandersen · 2025-09-16T12:13:20Z

...sticsearch/cluster/routing/allocation/allocator/DefaultNonPreferredShardIteratorFactory.java

+        }
+
+        private Iterator<ShardRouting> createShardIterator() {
+            final var shardWriteLoads = allocation.clusterInfo().getShardWriteLoads();


This ties in with my prior comment about actually calling canRemain first.

henningandersen · 2025-09-16T12:14:09Z

...sticsearch/cluster/routing/allocation/allocator/DefaultNonPreferredShardIteratorFactory.java

+        private Iterator<ShardRouting> createShardIterator() {
+            final var shardWriteLoads = allocation.clusterInfo().getShardWriteLoads();
+            final List<ShardRouting> sortedRoutings = new ArrayList<>();
+            double totalWriteLoad = 0;
+            for (ShardRouting shard : routingNode) {
+                Double shardWriteLoad = shardWriteLoads.get(shard.shardId());
+                if (shardWriteLoad != null) {
+                    sortedRoutings.add(shard);
+                    totalWriteLoad += shardWriteLoad;
+                }
+            }
+            // TODO: Work out what this order should be
+            // Sort by distance-from-mean-write-load
+            double meanWriteLoad = totalWriteLoad / sortedRoutings.size();
+            sortedRoutings.sort(Comparator.comparing(sr -> Math.abs(shardWriteLoads.get(sr.shardId()) - meanWriteLoad)));
+            return sortedRoutings.iterator();
+        }


Yeah, we'd like to do that and I find that slightly harder with returning a list (though possibly doable).
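For context on the list-versus-iterator trade-off in this thread: the PR description says the iterator is populated lazily, one hot-spotted node at a time, because the allocator takes the first movable shard and discards the rest. A rough sketch of that idea follows. The class name, `lazyIterator`, and the `shardsForNode` function are hypothetical and do not come from the PR.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

final class LazyShardIterationSketch {
    /**
     * Walks nodes in the given order and only computes a node's shard list
     * (for example, sorting it by write load) when iteration reaches that node.
     */
    static <N, S> Iterator<S> lazyIterator(List<N> nodes, Function<N, List<S>> shardsForNode) {
        return new Iterator<S>() {
            private final Iterator<N> nodeIterator = nodes.iterator();
            private Iterator<S> current = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                // Advance to the next node only once the current one is exhausted.
                while (!current.hasNext() && nodeIterator.hasNext()) {
                    current = shardsForNode.apply(nodeIterator.next()).iterator();
                }
                return current.hasNext();
            }

            @Override
            public S next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }
}
```

A caller that stops at the first shard it can move only pays for the per-node work on the nodes it actually reached. Any canRemain check would then run per shard as the caller consumes the iterator.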

…n_preferred_iteration

nicktindall · 2025-09-21T23:42:40Z

Superseded by #135058

nicktindall added 11 commits September 8, 2025 15:53

Implement AllocationDeciders#findNonPreferred

53bfa9b

Merge branch 'main' into ES-12739_select_hot_shard_to_move_off_data_node

971a395

Fix assertion

f0f9f77

Implement prioritisable problems

71232d5

Javadoc

c31b05a

Example for write load constraint decider

967e76e

Fix text

d147a18

Tidy

4f7b519

Fix boolean logic

91ee197

Introduce pluggable non-preferred iteration

1d3b08e

Implement NonPreferredShardIteratorFactory for resolving hot-spots

687c6e2

elasticsearchmachine added the v9.2.0 label Sep 10, 2025

nicktindall added 8 commits September 10, 2025 17:23

Remove unused default implementation

1a4c85a

Merge remote-tracking branch 'origin/main' into ES-12739_pluggable_non_preferred_iteration

a14af62

Get rid of remnants of prior approach

3196d21

Remove cruft

d63012c

Improve naming/javadoc

7918b3b

Improve wiring

d349668

Fix infinite loop

0c34875

Merge branch 'main' into ES-12739_pluggable_non_preferred_iteration

22ad4d9

nicktindall commented Sep 11, 2025

nicktindall added the >non-issue and :Distributed Coordination/Allocation labels Sep 11, 2025

nicktindall requested review from ywangd, henningandersen, DiannaHohensee and mhl-b September 11, 2025 01:39

nicktindall commented Sep 11, 2025

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java (outdated)

mhl-b reviewed Sep 11, 2025

nicktindall added 6 commits September 11, 2025 14:36

Test/fix iterator logic

b7fcc4a

Test shard iteration order

ab73569

Use Iterable instead of Iterator

cebf3a9

Comment

e215a55

Test when decider not fully enabled

53c7b75

Naming

0d4a5c7

DiannaHohensee reviewed Sep 12, 2025

DiannaHohensee reviewed Sep 15, 2025

Only move a single non-preferred shard, do move non-preferred before moveShards

af93b87

ywangd reviewed Sep 16, 2025

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

henningandersen reviewed Sep 16, 2025

nicktindall added 13 commits September 17, 2025 18:57

Sort shards correctly

de052a3

Use streams instead of sorting shards up-front

d59ea2b

Merge remote-tracking branch 'origin/main' into ES-12739_pluggable_non_preferred_iteration

ab127a9

Fix javadoc

6093a7a

Use record class

cf03477

Test that all shards are returned

e79a84a

Merge remote-tracking branch 'origin/main' into ES-12739_pluggable_non_preferred_iteration

a433918

Add NODE_INTERLEAVED as an iteration order

ac0ec29

Javadoc for NonPreferredShardIteratorFactory

c2ee39f

Javadoc

69a545a

Try to simplify condition

c76af05

in-line tryMoveShardIfNonPreferred

630d06d

Move new behaviour together

5103231

nicktindall mentioned this pull request Sep 18, 2025

Allow deciders to nominate shards to move #134280

Closed

Comment on NOOP default

e888265

nicktindall closed this Sep 21, 2025

nicktindall deleted the ES-12739_pluggable_non_preferred_iteration branch October 7, 2025 00:07
