Log when a shard is immovable for longer than a configured threshold #136997
Conversation
ywangd
left a comment
Is it viable to take a similar approach to how unassigned shards are logged in the computer? That is, keep track of one undesired shard (or 3 if you prefer, but 1 is probably good enough?), log when it is first observed, and don't log again while it stays undesired and unmoved. Also log a message when the tracked shard is moved, then start tracking another one?
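A minimal sketch of that single-tracked-shard idea, assuming illustrative class and method names (this is not the PR's implementation):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.index.shard.ShardId;

class SingleUndesiredShardTracker {
    private static final Logger logger = LogManager.getLogger(SingleUndesiredShardTracker.class);

    private ShardId trackedShard; // null when nothing is currently tracked

    // Called whenever the reconciler observes a shard in an undesired location.
    void onUndesiredShardObserved(ShardId shardId) {
        if (trackedShard == null) {
            trackedShard = shardId;
            logger.info("shard [{}] is in an undesired location", shardId);
        }
        // no further logging while the same shard stays undesired and unmoved
    }

    // Called when a shard finishes relocating (or becomes desired where it is).
    void onShardMovedOrDesired(ShardId shardId) {
        if (shardId.equals(trackedShard)) {
            logger.info("previously undesired shard [{}] has moved or become desired", shardId);
            trackedShard = null; // free the slot so the next undesired shard gets tracked
        }
    }
}
```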
ywangd
left a comment
I had more thoughts on this and I'd like to check whether we are doing the right thing, or whether maybe we have somewhat different perspectives.
In the context of the write load decider, I think what we are mostly interested in is moveShards rather than balance. That is, if the balancer decides a shard allocation is NOT_PREFERRED and also finds it a new target node (the canAllocate decision should be YES), we expect the Reconciler to be able to actually move this shard to fix the hotspot. But if the reconciler cannot move the shard for any reason that is not concurrent-recovery related (here and here), it's worth reporting, since that is both unexpected and invalidates our assumption that moving 1 shard is sufficient to mitigate the hotspot until the next ClusterInfo poll. Similar reasoning applies to a canRemain = NO and canAllocate = YES/NOT_PREFERRED shard, for which we also want to report if the reconciler cannot move it.
I think the balance part is less interesting in this context. It might be useful for broader tracking but seems not really relevant for hotspot mitigation?
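To make that condition concrete, here is an illustrative sketch (not the PR's code) of when a failed move would be worth reporting, assuming the Decision.Type values referenced in this thread (YES/NO/THROTTLE/NOT_PREFERRED); the method and parameter names are made up:

```java
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

class ImmovableShardReporting {
    // Report only when the balancer wants the shard moved, a target looked viable,
    // and the reconciler was blocked for a reason other than recovery throttling.
    static boolean shouldReport(Decision.Type canRemain, Decision.Type canAllocateToTarget, Decision.Type moveResult) {
        boolean balancerWantsMove = canRemain == Decision.Type.NO || canRemain == Decision.Type.NOT_PREFERRED;
        boolean targetLooksViable = canAllocateToTarget == Decision.Type.YES
            || canAllocateToTarget == Decision.Type.NOT_PREFERRED;
        boolean blockedUnexpectedly = moveResult == Decision.Type.NO; // THROTTLE is expected and not reported
        return balancerWantsMove && targetLooksViable && blockedUnexpectedly;
    }
}
```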
Yes, definitely. In general I think we are interested in both though, i.e. if we are persistently unable to move a shard, whether that is to balance the cluster or to move shards that cannot remain, and the reason is something other than …
Do you think the approach in general is OK (keeping a map of immovable shards and when they started being immovable, and clearing them when they move)? If so I'll add the additional logic to cover the …
Can we report already when we observe an unmovable shard the first time? It should not happen unless it's throttled. Do we really need to track it and report after a certain threshold? We can frequency-cap the logging, but it seems like we don't need the map for tracking, which is simpler?
We could, but I think it might be potentially noisy? For example if …
Snapshot decider returns …
I am a little concerned about tracking all unmovable shards. It's unclear how large it could get. It could also leak if an unmovable shard is removed? In summary, I am thinking about something like the following:
But I could also be OK if we bound the map and ensure entries eventually get deleted. PS: a slightly wild idea is to track the number of unmovable attempts on the ShardRouting itself; we have other metadata attached to it as well, e.g. …
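A minimal sketch of the "bound the map and ensure entries get cleaned up" option, assuming an illustrative class name, field names, and bound of 100 (not what the PR does):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BoundedUndesiredAllocationMap {
    private static final int MAX_TRACKED = 100; // illustrative bound

    // Insertion-ordered map with a hard size cap via removeEldestEntry
    private final Map<String, Long> firstSeenMillisByAllocationId = new LinkedHashMap<String, Long>(16, 0.75f, false) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
            return size() > MAX_TRACKED;
        }
    };

    void track(String allocationId, long nowMillis) {
        // only record the first time we see the allocation as undesired
        firstSeenMillisByAllocationId.putIfAbsent(allocationId, nowMillis);
    }

    void stopTracking(String allocationId) {
        // called when the shard moves, becomes desired again, or is removed,
        // so that entries cannot leak indefinitely
        firstSeenMillisByAllocationId.remove(allocationId);
    }
}
```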
Force-pushed from 55384bd to 5fd0770
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
ywangd
left a comment
LGTM
 * Track an allocation as being undesired
 */
public void trackUndesiredAllocation(ShardRouting shardRouting) {
    assert shardRouting.unassigned() == false : "Shouldn't record unassigned shards as undesired allocations";
Nit: I think we can assert shardRouting.started()
This was really just to protect against getting a null allocation ID (we need it to track the allocation). I think it might be possible to have a shard in an initializing state that's undesired, or relocating? I think started might be overly restrictive.
long earliestUndesiredTimestamp = Long.MAX_VALUE;
for (var allocation : undesiredAllocations) {
    if (allocation.value < earliestUndesiredTimestamp) {
        earliestUndesiredTimestamp = allocation.value;
    }
}
OK. Can we wrap it inside undesiredAllocationDurationLogInterval.maybeExecute? It still seems a bit wasteful to me if we end up not logging it at all?
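Sketched below is roughly what that suggestion could look like: do the min-timestamp scan only inside the frequency-capped action, so the work is skipped when nothing would be logged anyway. The FrequencyCapped interface here is a stand-in for the maybeExecute helper mentioned above; all names and the println are illustrative assumptions.

```java
import java.util.Map;

class UndesiredDurationLogging {
    // Stand-in for the frequency-capping helper referenced in the comment above.
    interface FrequencyCapped {
        void maybeExecute(Runnable action);
    }

    static void maybeLog(FrequencyCapped logInterval, Map<String, Long> firstUndesiredMillis, long nowMillis, long thresholdMillis) {
        logInterval.maybeExecute(() -> {
            // the scan only runs when the frequency cap would actually allow a log message
            long earliest = Long.MAX_VALUE;
            for (long firstSeen : firstUndesiredMillis.values()) {
                earliest = Math.min(earliest, firstSeen);
            }
            if (earliest != Long.MAX_VALUE && nowMillis - earliest > thresholdMillis) {
                System.out.printf("shards have been in undesired allocations for more than %d ms%n", thresholdMillis);
            }
        });
    }
}
```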
if (undesiredAllocations.size() < maxUndesiredAllocationsToTrack) {
    final var allocationId = shardRouting.allocationId().getId();
    if (undesiredAllocations.containsKey(allocationId) == false) {
        undesiredAllocations.put(
            allocationId,
            new UndesiredAllocation(shardRouting.shardId(), timeProvider.relativeTimeInMillis())
        );
I wonder whether there is a need to prioritize primary shards, in case the map somehow gets filled by search shards, which are still interesting but less so for the time being. We can defer it until we collect the initial logs.
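If that deferred "prefer primaries" idea is ever picked up, it could look roughly like the following sketch; the record, eviction strategy, and names are assumptions for illustration only.

```java
import org.elasticsearch.cluster.routing.ShardRouting;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class PrimaryPreferringTracker {
    private record Tracked(boolean primary, long firstSeenMillis) {}

    private final Map<String, Tracked> tracked = new HashMap<>();
    private final int maxTracked;

    PrimaryPreferringTracker(int maxTracked) {
        this.maxTracked = maxTracked;
    }

    void track(ShardRouting shardRouting, long nowMillis) {
        final String allocationId = shardRouting.allocationId().getId();
        if (tracked.containsKey(allocationId)) {
            return; // already tracked
        }
        if (tracked.size() < maxTracked) {
            tracked.put(allocationId, new Tracked(shardRouting.primary(), nowMillis));
        } else if (shardRouting.primary()) {
            // map is full: evict one tracked non-primary, if any, to make room for the primary
            Iterator<Map.Entry<String, Tracked>> it = tracked.entrySet().iterator();
            while (it.hasNext()) {
                if (it.next().getValue().primary() == false) {
                    it.remove();
                    tracked.put(allocationId, new Tracked(true, nowMillis));
                    break;
                }
            }
        }
    }
}
```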
public static final Setting<TimeValue> UNDESIRED_ALLOCATION_DURATION_LOG_THRESHOLD_SETTING = Setting.timeSetting(
    "cluster.routing.allocation.desired_balance.undesired_duration_logging.threshold",
    FIVE_MINUTES,
    Setting.Property.Dynamic,
    Setting.Property.NodeScope
Should this have a reasonable min?
Added in 566f5d5
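For reference, a minimum can be expressed with the Setting.timeSetting overload that takes a floor value; the one-minute floor below is an illustrative choice, not necessarily what 566f5d5 uses.

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.core.TimeValue;

public class ExampleSettings {
    public static final Setting<TimeValue> UNDESIRED_ALLOCATION_DURATION_LOG_THRESHOLD_SETTING = Setting.timeSetting(
        "cluster.routing.allocation.desired_balance.undesired_duration_logging.threshold",
        TimeValue.timeValueMinutes(5), // default, matching FIVE_MINUTES in the diff above
        TimeValue.timeValueMinutes(1), // illustrative minimum
        Setting.Property.Dynamic,
        Setting.Property.NodeScope
    );
}
```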
// move to started
shardRouting = shardRouting.moveToStarted(randomNonNegativeLong());
undesiredAllocationsTracker.trackUndesiredAllocation(shardRouting);
assertEquals(1, undesiredAllocationsTracker.getUndesiredAllocations().size());

// start a relocation
shardRouting = shardRouting.relocate(randomIdentifier(), randomNonNegativeLong());
undesiredAllocationsTracker.trackUndesiredAllocation(shardRouting);
assertEquals(1, undesiredAllocationsTracker.getUndesiredAllocations().size());

// cancel that relocation
shardRouting = shardRouting.cancelRelocation();
undesiredAllocationsTracker.removeTracking(shardRouting);
assertEquals(0, undesiredAllocationsTracker.getUndesiredAllocations().size());
I see the point of the test. But in practice this should not happen, right? If a tracked shard moves, it should be removed from the tracking before the change?
Ah yes, I believe that is true for the scenarios I used in the test, but it was really just because the equals method for ShardRouting includes everything. This was just to demonstrate that identity/tracking is tied to the allocationId and not all the other metadata in the ShardRouting.
reconcileAndBuildNewState(
    reconciler,
    initialClusterState,
    new DesiredBalance(1, allShardsDesiredOnDataNode1),
We could add one more variant where the desired balance is not computed for the shards (or some of the shards), and we should see no log.
Added in feaca10
Force-pushed from f8b0913 to feaca10
When shards have been in an undesired allocation for longer than some configurable threshold, log the results of canAllocate for every node in the desired balance to aid in troubleshooting.
Implemented by recording a relative timestamp when we first notice a shard is in an undesired allocation; the timestamp is cleared when the shard is relocated or the allocation it's in becomes "desired" again.
Relates ES-11928
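A condensed sketch of the mechanism described in this summary (the class and method names are assumptions, not the PR's actual code): record when a shard is first seen in an undesired allocation, clear that record when the shard relocates or its allocation becomes desired again, and check the threshold before logging the per-node allocation decisions.

```java
import java.util.HashMap;
import java.util.Map;

class UndesiredAllocationDurationTracker {
    private final Map<String, Long> firstUndesiredMillis = new HashMap<>(); // keyed by allocation id

    void onUndesired(String allocationId, long nowMillis) {
        // relative timestamp recorded only the first time the allocation is seen as undesired
        firstUndesiredMillis.putIfAbsent(allocationId, nowMillis);
    }

    void onRelocatedOrDesiredAgain(String allocationId) {
        firstUndesiredMillis.remove(allocationId);
    }

    boolean exceedsThreshold(String allocationId, long nowMillis, long thresholdMillis) {
        // when this returns true, the PR logs the canAllocate result for every node to aid troubleshooting
        Long firstSeen = firstUndesiredMillis.get(allocationId);
        return firstSeen != null && nowMillis - firstSeen > thresholdMillis;
    }
}
```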