
Conversation


@nicktindall nicktindall commented Oct 23, 2025

When shards have been in an undesired allocation for longer than some configurable threshold, log the results of canAllocate for every node in the desired balance to aid in troubleshooting.

Implemented by recording a relative timestamp when we first notice a shard is in an undesired allocation; the timestamp is cleared when the shard is relocated or the allocation it's in becomes "desired" again.

Relates ES-11928
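
A minimal sketch of that tracking idea, with simplified names; the actual class merged in this PR differs (its trackUndesiredAllocation/removeTracking methods are discussed in the review below), so treat this purely as an illustration of the timestamp-per-allocation approach:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

import org.elasticsearch.cluster.routing.ShardRouting;

// Sketch: allocation id -> relative timestamp of the first time the shard was observed
// in an undesired allocation. The entry is removed when the shard relocates or its
// current allocation becomes desired again.
class UndesiredAllocationsSketch {
    private final Map<String, Long> firstSeenUndesiredMillis = new HashMap<>();
    private final LongSupplier relativeTimeMillis;

    UndesiredAllocationsSketch(LongSupplier relativeTimeMillis) {
        this.relativeTimeMillis = relativeTimeMillis;
    }

    void trackUndesiredAllocation(ShardRouting shardRouting) {
        assert shardRouting.unassigned() == false : "unassigned shards have no allocation id to track";
        // only the first observation is recorded, so the value measures how long the shard has been undesired
        firstSeenUndesiredMillis.putIfAbsent(shardRouting.allocationId().getId(), relativeTimeMillis.getAsLong());
    }

    void removeTracking(ShardRouting shardRouting) {
        firstSeenUndesiredMillis.remove(shardRouting.allocationId().getId());
    }

    boolean undesiredForLongerThan(ShardRouting shardRouting, long thresholdMillis) {
        Long firstSeen = firstSeenUndesiredMillis.get(shardRouting.allocationId().getId());
        return firstSeen != null && relativeTimeMillis.getAsLong() - firstSeen >= thresholdMillis;
    }
}
```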


@ywangd ywangd left a comment


Is it viable to take a similar approach to how unassigned shards are logged in the computer? That is, keep track of one undesired shard (or 3 if you prefer, but 1 is probably good enough?), log when it is first observed, and don't log again while it keeps being undesired and not moving. Also log a message when the tracked shard is moved and start tracking another one?
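
For concreteness, a hypothetical sketch of that single-tracked-shard variant (all names here are invented, not from the PR):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.cluster.routing.ShardRouting;

// Hypothetical: track at most one undesired shard; log when it is first observed,
// stay silent while it remains undesired, and log again once it finally moves.
class SingleUndesiredShardTracker {
    private static final Logger logger = LogManager.getLogger(SingleUndesiredShardTracker.class);
    private String trackedAllocationId; // null when nothing is tracked

    void onUndesired(ShardRouting shard) {
        if (trackedAllocationId == null) {
            trackedAllocationId = shard.allocationId().getId();
            logger.info("shard {} is in an undesired allocation", shard.shardId());
        }
        // already tracking a shard (this one or another): no additional logging
    }

    void onMovedOrDesired(ShardRouting shard) {
        if (shard.allocationId().getId().equals(trackedAllocationId)) {
            logger.info("previously undesired shard {} has moved or become desired", shard.shardId());
            trackedAllocationId = null; // free to start tracking another shard
        }
    }
}
```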

@nicktindall nicktindall changed the title Log when no progress is made towards the desired balance for some time Log when a shard is immovable for some time Oct 24, 2025
@nicktindall nicktindall changed the title Log when a shard is immovable for some time Log when a shard is immovable for longer than a configured threshold Oct 24, 2025
@nicktindall nicktindall requested a review from ywangd October 24, 2025 01:08

@ywangd ywangd left a comment


I had more thoughts on this and I'd like to check whether we are doing the right thing, or maybe we have somewhat different perspectives.

In the context of the write load decider, I think what we are mostly interested in is moveShards instead of balance. That is, if the balancer decides a shard allocation is NOT_PREFERRED and also finds it a new target node (the canAllocate decision should be YES), we expect the Reconciler to be able to actually move this shard to fix the hotspot. But if the reconciler cannot move the shard for any reason that is not concurrent-recovery related (here and here), it's worth reporting, since it is both unexpected and invalidates our assumption that moving 1 shard is sufficient to mitigate the hotspot until the next ClusterInfo poll. Similar reasoning applies to a canRemain = NO and canAllocate = YES/NOT_PREFERRED shard as well, for which we also want to report if the reconciler cannot move it.

I think the balance part is less interesting in this context. It might be useful for broader tracking but seems not really relevant for hotspot mitigation?
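
To make those combinations concrete, a hypothetical helper matching the reasoning above (the Decision.Type values are the allocation-decider ones mentioned in this thread; the helper itself is illustrative, not code from the PR):

```java
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

final class HotspotReporting {
    // Illustrative only: the cases where the reconciler failing to move a shard is unexpected.
    // THROTTLE (e.g. concurrent recoveries) is expected and transient, so it is excluded.
    static boolean worthReporting(Decision.Type canRemain, Decision.Type canAllocateElsewhere, Decision.Type reconcilerMoveDecision) {
        boolean balancerWantsToMove =
            (canRemain == Decision.Type.NOT_PREFERRED && canAllocateElsewhere == Decision.Type.YES)
                || (canRemain == Decision.Type.NO
                    && (canAllocateElsewhere == Decision.Type.YES || canAllocateElsewhere == Decision.Type.NOT_PREFERRED));
        boolean reconcilerBlockedUnexpectedly =
            reconcilerMoveDecision != Decision.Type.YES && reconcilerMoveDecision != Decision.Type.THROTTLE;
        return balancerWantsToMove && reconcilerBlockedUnexpectedly;
    }
}
```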

@nicktindall
Contributor Author

In the context of the write load decider, I think what we are mostly interested in is moveShards instead of balance.

Yes, definitely. In general I think we are interested in both, though. I.e. if we are persistently unable to move a shard, whether that's to balance the cluster or to move a shard that cannot remain, and the reason is something other than THROTTLE, then it's worth reporting.

Do you think the approach in general is OK (keeping a map of immovable shards and when they started being immovable, and clearing them when they move)? If so I'll add the additional logic to cover the moveShards cases and write some tests.


ywangd commented Oct 24, 2025

Can we report as soon as we observe an unmovable shard for the first time? It should not happen unless it's throttled. Do we really need to track it and report only after a certain threshold? We can frequency-cap the logging, but it seems like we don't need the map for tracking, which is simpler.
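
A rough sketch of that alternative, with invented names: report on the observation itself and frequency-cap the messages, with no tracking map at all:

```java
import java.util.function.LongSupplier;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.cluster.routing.ShardRouting;

// Hypothetical: log the first time an unmovable shard is seen, at most once per interval.
class ImmovableShardLogger {
    private static final Logger logger = LogManager.getLogger(ImmovableShardLogger.class);
    private final long minIntervalMillis;
    private final LongSupplier relativeTimeMillis;
    private long lastLoggedMillis;

    ImmovableShardLogger(long minIntervalMillis, LongSupplier relativeTimeMillis) {
        this.minIntervalMillis = minIntervalMillis;
        this.relativeTimeMillis = relativeTimeMillis;
        this.lastLoggedMillis = relativeTimeMillis.getAsLong() - minIntervalMillis; // first observation always logs
    }

    void onImmovableShard(ShardRouting shard, String explanation) {
        long now = relativeTimeMillis.getAsLong();
        if (now - lastLoggedMillis >= minIntervalMillis) { // frequency cap: at most one message per interval
            lastLoggedMillis = now;
            logger.warn("cannot move shard {}: {}", shard.shardId(), explanation);
        }
    }
}
```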

@nicktindall
Contributor Author

Can we report as soon as we observe an unmovable shard for the first time? It should not happen unless it's throttled. Do we really need to track it and report only after a certain threshold? We can frequency-cap the logging, but it seems like we don't need the map for tracking, which is simpler.

We could, but I think it might be potentially noisy. For example, if the WriteLoadDecider returns NOT_PREFERRED for a shard and it gets moved by the DesiredBalanceAllocator, a snapshot might kick off, which might delay the move for a few reconciler iterations, I think? But we are only interested if that immovability persists for a long time, aren't we?


ywangd commented Oct 24, 2025

The snapshot decider returns THROTTLE, while the current PR tracks only NO decisions, right? So reporting straight away should only be for NO decisions, which are genuinely exceptional? Or do you think we also want to wait for NO decisions?

I am a little concerned about tracking all unmovable shards. It's unclear how large the map could get, and it could also leak if an unmovable shard is removed. In summary, I am thinking about something like the following:

  1. Report straight away for one or a few shards that are unmovable due to a NO or null decision, with explanations. We can frequency-cap it.
  2. Publish metrics on the number of unmovable shards, labelled with decision type and source node (and maybe shard role?). This can be a separate PR.

But I could also be OK with that if we bound the map and ensure entries eventually get deleted.

PS: A slightly wild idea is to track the number of unmovable attempts on the ShardRouting itself. We have other metadata attached to it as well, e.g. RelocationFailureInfo (basically a counter) and expectedShardSize.

@nicktindall nicktindall force-pushed the log_on_no_balancing_progress branch from 55384bd to 5fd0770 Compare November 6, 2025 01:25
@nicktindall nicktindall marked this pull request as ready for review November 6, 2025 05:05
@nicktindall nicktindall requested a review from a team as a code owner November 6, 2025 05:05
@elasticsearchmachine elasticsearchmachine added the needs:triage (Requires assignment of a team area label) label Nov 6, 2025
@nicktindall nicktindall added the >non-issue and :Distributed Coordination/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes) labels and removed the needs:triage label Nov 6, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (Meta label for Distributed Coordination team) label Nov 6, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@nicktindall nicktindall requested a review from ywangd November 6, 2025 05:06

@ywangd ywangd left a comment


LGTM

* Track an allocation as being undesired
*/
public void trackUndesiredAllocation(ShardRouting shardRouting) {
    assert shardRouting.unassigned() == false : "Shouldn't record unassigned shards as undesired allocations";
Member

Nit: I think we can assert shardRouting.started()

Contributor Author

This was really just to protect against getting a null allocation ID (we need it to track the allocation). I think it might be possible to have a shard that's undesired while in an initializing or relocating state? I think started might be overly restrictive.

Comment on lines 125 to 130
long earliestUndesiredTimestamp = Long.MAX_VALUE;
for (var allocation : undesiredAllocations) {
    if (allocation.value < earliestUndesiredTimestamp) {
        earliestUndesiredTimestamp = allocation.value;
    }
}
Member

OK. Can we wrap it inside undesiredAllocationDurationLogInterval.maybeExecute? It still seems a bit wasteful to me if we end up not logging it at all?
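
Assuming undesiredAllocationDurationLogInterval behaves like the frequency-capped log actions used elsewhere in the reconciler (i.e. it exposes a maybeExecute(Runnable)), the suggestion would look roughly like this; maybeLogUndesiredAllocations is a hypothetical stand-in for whatever currently consumes the timestamp:

```java
// Illustrative: only pay for the scan when the frequency cap would allow a log message at all.
undesiredAllocationDurationLogInterval.maybeExecute(() -> {
    long earliestUndesiredTimestamp = Long.MAX_VALUE;
    for (var allocation : undesiredAllocations) {
        if (allocation.value < earliestUndesiredTimestamp) {
            earliestUndesiredTimestamp = allocation.value;
        }
    }
    maybeLogUndesiredAllocations(earliestUndesiredTimestamp); // hypothetical consumer of the result
});
```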

Comment on lines +102 to +108
if (undesiredAllocations.size() < maxUndesiredAllocationsToTrack) {
    final var allocationId = shardRouting.allocationId().getId();
    if (undesiredAllocations.containsKey(allocationId) == false) {
        undesiredAllocations.put(
            allocationId,
            new UndesiredAllocation(shardRouting.shardId(), timeProvider.relativeTimeInMillis())
        );
Member

I wonder whether there is a need to prioritize primary shards, in case the map somehow fills up with search shards, which are still interesting but less so for the time being. We can defer it until we collect the initial logs.
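
If it ever turns out to be needed, one hypothetical shape for that prioritisation; it assumes each tracked entry also remembers whether its shard was a primary, which the PR's UndesiredAllocation record does not currently do:

```java
import java.util.Map;
import java.util.Optional;

import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.index.shard.ShardId;

// Hypothetical: when the bounded tracking map is full, let a primary displace one non-primary entry.
final class PrimaryPreferringTracking {
    // extends the tracked value with a primary flag (an invented field, not in the PR)
    record TrackedShard(ShardId shardId, boolean primary, long firstSeenMillis) {}

    static void track(ShardRouting shard, Map<String, TrackedShard> tracked, int maxTracked, long nowMillis) {
        String allocationId = shard.allocationId().getId();
        if (tracked.containsKey(allocationId)) {
            return;
        }
        if (tracked.size() >= maxTracked) {
            if (shard.primary() == false) {
                return; // full, and the new shard is not a primary
            }
            Optional<String> nonPrimaryKey = tracked.entrySet()
                .stream()
                .filter(e -> e.getValue().primary() == false)
                .map(Map.Entry::getKey)
                .findFirst();
            if (nonPrimaryKey.isEmpty()) {
                return; // already full of primaries
            }
            tracked.remove(nonPrimaryKey.get()); // evict one non-primary entry to make room
        }
        tracked.put(allocationId, new TrackedShard(shard.shardId(), shard.primary(), nowMillis));
    }
}
```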

Comment on lines 47 to 51
public static final Setting<TimeValue> UNDESIRED_ALLOCATION_DURATION_LOG_THRESHOLD_SETTING = Setting.timeSetting(
    "cluster.routing.allocation.desired_balance.undesired_duration_logging.threshold",
    FIVE_MINUTES,
    Setting.Property.Dynamic,
    Setting.Property.NodeScope
Member

Should this have a reasonable min?
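
For reference, Setting.timeSetting has an overload that takes a minimum value, so the guard could look something like the sketch below; the one-minute floor is purely illustrative, and the value actually chosen lives in the commit referenced in the next reply:

```java
// Illustrative: same setting with a minimum enforced (the concrete minimum here is a guess)
public static final Setting<TimeValue> UNDESIRED_ALLOCATION_DURATION_LOG_THRESHOLD_SETTING = Setting.timeSetting(
    "cluster.routing.allocation.desired_balance.undesired_duration_logging.threshold",
    FIVE_MINUTES,
    TimeValue.timeValueMinutes(1), // minimum value; hypothetical choice
    Setting.Property.Dynamic,
    Setting.Property.NodeScope
);
```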

Contributor Author

Added in 566f5d5

Comment on lines +163 to +176
// move to started
shardRouting = shardRouting.moveToStarted(randomNonNegativeLong());
undesiredAllocationsTracker.trackUndesiredAllocation(shardRouting);
assertEquals(1, undesiredAllocationsTracker.getUndesiredAllocations().size());

// start a relocation
shardRouting = shardRouting.relocate(randomIdentifier(), randomNonNegativeLong());
undesiredAllocationsTracker.trackUndesiredAllocation(shardRouting);
assertEquals(1, undesiredAllocationsTracker.getUndesiredAllocations().size());

// cancel that relocation
shardRouting = shardRouting.cancelRelocation();
undesiredAllocationsTracker.removeTracking(shardRouting);
assertEquals(0, undesiredAllocationsTracker.getUndesiredAllocations().size());
Member

I see the point of the test. But in practice this should not happen, right? If a tracked shard moves, it should be removed from the tracking before the change?

Contributor Author

Ah yes, I believe that is true for the scenarios I used in the test, but it was really just because the equals method for ShardRouting includes everything. This was just to demonstrate that identity/tracking is tied to the allocationId and not all the other metadata in the ShardRouting.

reconcileAndBuildNewState(
    reconciler,
    initialClusterState,
    new DesiredBalance(1, allShardsDesiredOnDataNode1),
Member

We could add one more variant where the desired balance is not computed for the shards (or some of the shards), and we should see no log.

Contributor Author

@nicktindall nicktindall Nov 6, 2025

Added in feaca10

@nicktindall nicktindall force-pushed the log_on_no_balancing_progress branch from f8b0913 to feaca10 Compare November 6, 2025 23:14
@nicktindall nicktindall merged commit 4c38246 into elastic:main Nov 7, 2025
35 checks passed
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Nov 10, 2025
@nicktindall nicktindall deleted the log_on_no_balancing_progress branch November 12, 2025 04:34