
Conversation


@nicktindall nicktindall commented Oct 23, 2025

When shards have been in an undesired allocation for longer than some configurable threshold, log the results of canAllocate for every node in the desired balance to aid in troubleshooting.

Implemented by recording a relative timestamp when we first notice a shard is in an undesired allocation; the timestamp is cleared when the shard is relocated or the allocation it's in becomes "desired" again.

Relates ES-11928
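
A minimal sketch of that tracking idea, with simplified names; the actual class merged in this PR differs (its trackUndesiredAllocation/removeTracking methods are discussed in the review below), so treat this purely as an illustration of the timestamp-per-allocation approach:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

import org.elasticsearch.cluster.routing.ShardRouting;

// Sketch: allocation id -> relative timestamp of the first time the shard was observed
// in an undesired allocation. The entry is removed when the shard relocates or its
// current allocation becomes desired again.
class UndesiredAllocationsSketch {
    private final Map<String, Long> firstSeenUndesiredMillis = new HashMap<>();
    private final LongSupplier relativeTimeMillis;

    UndesiredAllocationsSketch(LongSupplier relativeTimeMillis) {
        this.relativeTimeMillis = relativeTimeMillis;
    }

    void trackUndesiredAllocation(ShardRouting shardRouting) {
        assert shardRouting.unassigned() == false : "unassigned shards have no allocation id to track";
        // only the first observation is recorded, so the value measures how long the shard has been undesired
        firstSeenUndesiredMillis.putIfAbsent(shardRouting.allocationId().getId(), relativeTimeMillis.getAsLong());
    }

    void removeTracking(ShardRouting shardRouting) {
        firstSeenUndesiredMillis.remove(shardRouting.allocationId().getId());
    }

    boolean undesiredForLongerThan(ShardRouting shardRouting, long thresholdMillis) {
        Long firstSeen = firstSeenUndesiredMillis.get(shardRouting.allocationId().getId());
        return firstSeen != null && relativeTimeMillis.getAsLong() - firstSeen >= thresholdMillis;
    }
}
```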


@ywangd ywangd left a comment


Is it viable to take a similar approach to how unassigned shards are logged in the computer? That is, keep track of one undesired shard (or 3 if you prefer, but 1 is probably good enough?), log when it is first observed, and don't log again while it keeps being undesired and not moving. Also log a message when the tracked shard is moved and start tracking another one?
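
For concreteness, a hypothetical sketch of that single-tracked-shard variant (all names here are invented, not from the PR):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.cluster.routing.ShardRouting;

// Hypothetical: track at most one undesired shard; log when it is first observed,
// stay silent while it remains undesired, and log again once it finally moves.
class SingleUndesiredShardTracker {
    private static final Logger logger = LogManager.getLogger(SingleUndesiredShardTracker.class);
    private String trackedAllocationId; // null when nothing is tracked

    void onUndesired(ShardRouting shard) {
        if (trackedAllocationId == null) {
            trackedAllocationId = shard.allocationId().getId();
            logger.info("shard {} is in an undesired allocation", shard.shardId());
        }
        // already tracking a shard (this one or another): no additional logging
    }

    void onMovedOrDesired(ShardRouting shard) {
        if (shard.allocationId().getId().equals(trackedAllocationId)) {
            logger.info("previously undesired shard {} has moved or become desired", shard.shardId());
            trackedAllocationId = null; // free to start tracking another shard
        }
    }
}
```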

@nicktindall nicktindall changed the title Log when no progress is made towards the desired balance for some time Log when a shard is immovable for some time Oct 24, 2025
@nicktindall nicktindall changed the title Log when a shard is immovable for some time Log when a shard is immovable for longer than a configured threshold Oct 24, 2025
@nicktindall nicktindall requested a review from ywangd October 24, 2025 01:08

@ywangd ywangd left a comment


I had more thoughts on this and I'd like to check whether we are doing the right thing, or maybe we have somewhat different perspectives.

In the context of the write load decider, I think what we are mostly interested in is moveShards instead of balance. That is, if the balancer decides a shard allocation is NOT_PREFERRED and also finds it a new target node (the canAllocate decision should be YES), we expect the Reconciler to be able to actually move this shard to fix the hotspot. But if the reconciler cannot move the shard for any reason that is not concurrent-recovery related (here and here), it's worth reporting, since it is both unexpected and invalidates our assumption that moving 1 shard is sufficient to mitigate the hotspot until the next ClusterInfo poll. Similar reasoning applies to a canRemain = NO and canAllocate = YES/NOT_PREFERRED shard as well, for which we also want to report if the reconciler cannot move it.

I think the balance part is less interesting in this context. It might be useful for broader tracking but seems not really relevant for hotspot mitigation?
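
To make those combinations concrete, a hypothetical helper matching the reasoning above (the Decision.Type values are the allocation-decider ones mentioned in this thread; the helper itself is illustrative, not code from the PR):

```java
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

final class HotspotReporting {
    // Illustrative only: the cases where the reconciler failing to move a shard is unexpected.
    // THROTTLE (e.g. concurrent recoveries) is expected and transient, so it is excluded.
    static boolean worthReporting(Decision.Type canRemain, Decision.Type canAllocateElsewhere, Decision.Type reconcilerMoveDecision) {
        boolean balancerWantsToMove =
            (canRemain == Decision.Type.NOT_PREFERRED && canAllocateElsewhere == Decision.Type.YES)
                || (canRemain == Decision.Type.NO
                    && (canAllocateElsewhere == Decision.Type.YES || canAllocateElsewhere == Decision.Type.NOT_PREFERRED));
        boolean reconcilerBlockedUnexpectedly =
            reconcilerMoveDecision != Decision.Type.YES && reconcilerMoveDecision != Decision.Type.THROTTLE;
        return balancerWantsToMove && reconcilerBlockedUnexpectedly;
    }
}
```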

@nicktindall
Contributor Author

In the context of the write load decider, I think what we are mostly interested in is moveShards instead of balance.

Yes, definitely. In general I think we are interested in both, though. I.e. if we are persistently unable to move a shard, whether that's to balance the cluster or to move a shard that cannot remain, and the reason is something other than THROTTLE, then it's worth reporting.

Do you think the approach in general is OK (keeping a map of immovable shards and when they started being immovable, and clearing them when they move)? If so I'll add the additional logic to cover the moveShards cases and write some tests.


ywangd commented Oct 24, 2025

Can we report as soon as we observe an unmovable shard for the first time? It should not happen unless it's throttled. Do we really need to track it and report only after a certain threshold? We can frequency-cap the logging, but it seems like we don't need the map for tracking, which is simpler.
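
A rough sketch of that alternative, with invented names: report on the observation itself and frequency-cap the messages, with no tracking map at all:

```java
import java.util.function.LongSupplier;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.cluster.routing.ShardRouting;

// Hypothetical: log the first time an unmovable shard is seen, at most once per interval.
class ImmovableShardLogger {
    private static final Logger logger = LogManager.getLogger(ImmovableShardLogger.class);
    private final long minIntervalMillis;
    private final LongSupplier relativeTimeMillis;
    private long lastLoggedMillis;

    ImmovableShardLogger(long minIntervalMillis, LongSupplier relativeTimeMillis) {
        this.minIntervalMillis = minIntervalMillis;
        this.relativeTimeMillis = relativeTimeMillis;
        this.lastLoggedMillis = relativeTimeMillis.getAsLong() - minIntervalMillis; // first observation always logs
    }

    void onImmovableShard(ShardRouting shard, String explanation) {
        long now = relativeTimeMillis.getAsLong();
        if (now - lastLoggedMillis >= minIntervalMillis) { // frequency cap: at most one message per interval
            lastLoggedMillis = now;
            logger.warn("cannot move shard {}: {}", shard.shardId(), explanation);
        }
    }
}
```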

@nicktindall
Contributor Author

Can we report as soon as we observe an unmovable shard for the first time? It should not happen unless it's throttled. Do we really need to track it and report only after a certain threshold? We can frequency-cap the logging, but it seems like we don't need the map for tracking, which is simpler.

We could, but I think it might be potentially noisy. For example, if the WriteLoadDecider returns NOT_PREFERRED for a shard and it gets moved by the DesiredBalanceAllocator, a snapshot might kick off, which might delay the move for a few reconciler iterations, I think? But we are only interested if that immovability persists for a long time, aren't we?


ywangd commented Oct 24, 2025

The snapshot decider returns THROTTLE, while the current PR tracks only NO decisions, right? So reporting straight away should only be for NO decisions, which are genuinely exceptional? Or do you think we also want to wait for NO decisions?

I am a little concerned about tracking all unmovable shards. It's unclear how large the map could get, and it could also leak if an unmovable shard is removed. In summary, I am thinking about something like the following:

  1. Report straight away for one or a few shards that are unmovable due to a NO or null decision, with explanations. We can frequency-cap it.
  2. Publish metrics on the number of unmovable shards, labelled with decision type and source node (and maybe shard role?). This can be a separate PR.

But I could also be OK with that if we bound the map and ensure entries eventually get deleted.

PS: A slightly wild idea is to track the number of unmovable attempts on the ShardRouting itself. We have other metadata attached to it as well, e.g. RelocationFailureInfo (basically a counter) and expectedShardSize.

@nicktindall nicktindall force-pushed the log_on_no_balancing_progress branch from 55384bd to 5fd0770 Compare November 6, 2025 01:25
@nicktindall nicktindall marked this pull request as ready for review November 6, 2025 05:05
@nicktindall nicktindall requested a review from a team as a code owner November 6, 2025 05:05
@elasticsearchmachine elasticsearchmachine added the needs:triage (Requires assignment of a team area label) label Nov 6, 2025
@nicktindall nicktindall added the >non-issue and :Distributed Coordination/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes) labels and removed the needs:triage label Nov 6, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (Meta label for Distributed Coordination team) label Nov 6, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@nicktindall nicktindall requested a review from ywangd November 6, 2025 05:06

@ywangd ywangd left a comment


LGTM

* Track an allocation as being undesired
*/
public void trackUndesiredAllocation(ShardRouting shardRouting) {
    assert shardRouting.unassigned() == false : "Shouldn't record unassigned shards as undesired allocations";
Member

Nit: I think we can assert shardRouting.started()

Contributor Author

This was really just to protect against getting a null allocation ID (we need it to track the allocation). I think it might be possible to have a shard that's undesired while in an initializing or relocating state? I think started might be overly restrictive.

Comment on lines 125 to 130
long earliestUndesiredTimestamp = Long.MAX_VALUE;
for (var allocation : undesiredAllocations) {
    if (allocation.value < earliestUndesiredTimestamp) {
        earliestUndesiredTimestamp = allocation.value;
    }
}
Member

OK. Can we wrap it inside undesiredAllocationDurationLogInterval.maybeExecute? It still seems a bit wasteful to me if we end up not logging it at all?
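
Assuming undesiredAllocationDurationLogInterval behaves like the frequency-capped log actions used elsewhere in the reconciler (i.e. it exposes a maybeExecute(Runnable)), the suggestion would look roughly like this; maybeLogUndesiredAllocations is a hypothetical stand-in for whatever currently consumes the timestamp:

```java
// Illustrative: only pay for the scan when the frequency cap would allow a log message at all.
undesiredAllocationDurationLogInterval.maybeExecute(() -> {
    long earliestUndesiredTimestamp = Long.MAX_VALUE;
    for (var allocation : undesiredAllocations) {
        if (allocation.value < earliestUndesiredTimestamp) {
            earliestUndesiredTimestamp = allocation.value;
        }
    }
    maybeLogUndesiredAllocations(earliestUndesiredTimestamp); // hypothetical consumer of the result
});
```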

Comment on lines +102 to +108
if (undesiredAllocations.size() < maxUndesiredAllocationsToTrack) {
    final var allocationId = shardRouting.allocationId().getId();
    if (undesiredAllocations.containsKey(allocationId) == false) {
        undesiredAllocations.put(
            allocationId,
            new UndesiredAllocation(shardRouting.shardId(), timeProvider.relativeTimeInMillis())
        );
Member

I wonder whether there is a need to prioritize primary shards, in case the map somehow fills up with search shards, which are still interesting but less so for the time being. We can defer it until we collect the initial logs.
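
If it ever turns out to be needed, one hypothetical shape for that prioritisation; it assumes each tracked entry also remembers whether its shard was a primary, which the PR's UndesiredAllocation record does not currently do:

```java
import java.util.Map;
import java.util.Optional;

import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.index.shard.ShardId;

// Hypothetical: when the bounded tracking map is full, let a primary displace one non-primary entry.
final class PrimaryPreferringTracking {
    // extends the tracked value with a primary flag (an invented field, not in the PR)
    record TrackedShard(ShardId shardId, boolean primary, long firstSeenMillis) {}

    static void track(ShardRouting shard, Map<String, TrackedShard> tracked, int maxTracked, long nowMillis) {
        String allocationId = shard.allocationId().getId();
        if (tracked.containsKey(allocationId)) {
            return;
        }
        if (tracked.size() >= maxTracked) {
            if (shard.primary() == false) {
                return; // full, and the new shard is not a primary
            }
            Optional<String> nonPrimaryKey = tracked.entrySet()
                .stream()
                .filter(e -> e.getValue().primary() == false)
                .map(Map.Entry::getKey)
                .findFirst();
            if (nonPrimaryKey.isEmpty()) {
                return; // already full of primaries
            }
            tracked.remove(nonPrimaryKey.get()); // evict one non-primary entry to make room
        }
        tracked.put(allocationId, new TrackedShard(shard.shardId(), shard.primary(), nowMillis));
    }
}
```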

Comment on lines 47 to 51
public static final Setting<TimeValue> UNDESIRED_ALLOCATION_DURATION_LOG_THRESHOLD_SETTING = Setting.timeSetting(
    "cluster.routing.allocation.desired_balance.undesired_duration_logging.threshold",
    FIVE_MINUTES,
    Setting.Property.Dynamic,
    Setting.Property.NodeScope
Member

Should this have a reasonable min?
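
For reference, Setting.timeSetting has an overload that takes a minimum value, so the guard could look something like the sketch below; the one-minute floor is purely illustrative, and the value actually chosen lives in the commit referenced in the next reply:

```java
// Illustrative: same setting with a minimum enforced (the concrete minimum here is a guess)
public static final Setting<TimeValue> UNDESIRED_ALLOCATION_DURATION_LOG_THRESHOLD_SETTING = Setting.timeSetting(
    "cluster.routing.allocation.desired_balance.undesired_duration_logging.threshold",
    FIVE_MINUTES,
    TimeValue.timeValueMinutes(1), // minimum value; hypothetical choice
    Setting.Property.Dynamic,
    Setting.Property.NodeScope
);
```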

Contributor Author

Added in 566f5d5

Comment on lines +163 to +176
// move to started
shardRouting = shardRouting.moveToStarted(randomNonNegativeLong());
undesiredAllocationsTracker.trackUndesiredAllocation(shardRouting);
assertEquals(1, undesiredAllocationsTracker.getUndesiredAllocations().size());

// start a relocation
shardRouting = shardRouting.relocate(randomIdentifier(), randomNonNegativeLong());
undesiredAllocationsTracker.trackUndesiredAllocation(shardRouting);
assertEquals(1, undesiredAllocationsTracker.getUndesiredAllocations().size());

// cancel that relocation
shardRouting = shardRouting.cancelRelocation();
undesiredAllocationsTracker.removeTracking(shardRouting);
assertEquals(0, undesiredAllocationsTracker.getUndesiredAllocations().size());
Member

I see the point of the test. But in practice this should not happen, right? If a tracked shard moves, it should be removed from the tracking before the change?

Contributor Author

Ah yes, I believe that is true for the scenarios I used in the test, but it was really just because the equals method for ShardRouting includes everything. This was just to demonstrate that identity/tracking is tied to the allocationId and not all the other metadata in the ShardRouting.

reconcileAndBuildNewState(
    reconciler,
    initialClusterState,
    new DesiredBalance(1, allShardsDesiredOnDataNode1),
Member

We could add one more variant where the desired balance is not computed for the shards (or some of the shards), and we should see no log.

Contributor Author

@nicktindall nicktindall Nov 6, 2025

Added in feaca10

@nicktindall nicktindall force-pushed the log_on_no_balancing_progress branch from f8b0913 to feaca10 Compare November 6, 2025 23:14
@nicktindall nicktindall merged commit 4c38246 into elastic:main Nov 7, 2025
35 checks passed
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Nov 10, 2025
@nicktindall nicktindall deleted the log_on_no_balancing_progress branch November 12, 2025 04:34