Limit number of shard snapshot in INIT state per node #131592
Conversation
With the limit enabled, shard snapshots that would have been in the INIT state are put into an assigned-QUEUED state when the limit is reached on the node. This new state transitions into INIT when a shard snapshot completes and gives capacity back to the node. The completing shard snapshot can be from either the same snapshot or a different snapshot, in either the same or a different repo.
* Treat queued-with-gen as started, so that it blocks deletions from running.
* An aborted queued-with-gen is failed with a separate cluster state update, to simulate how abort works for INIT.
Generation can still be null
Some other tweaks. Still WIP
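The per-node capacity bookkeeping behind the QUEUED-to-INIT transition described above could be sketched roughly as follows. This is a simplified, self-contained model for illustration only; `PerNodeCounter` and its methods are hypothetical stand-ins, not the actual Elasticsearch code:

```java
import java.util.*;

// Illustrative model of the per-node INIT-capacity bookkeeping (hypothetical,
// not the real SnapshotsInProgress implementation).
class PerNodeCounter {
    private final int maxInitPerNode; // 0 means unlimited
    private final Map<String, Integer> initCounts = new HashMap<>();

    PerNodeCounter(int maxInitPerNode) {
        this.maxInitPerNode = maxInitPerNode;
    }

    // A shard snapshot may enter INIT only if the node has spare capacity;
    // otherwise the caller keeps it in the assigned-QUEUED state.
    boolean tryStartInit(String nodeId) {
        int current = initCounts.getOrDefault(nodeId, 0);
        if (maxInitPerNode > 0 && current >= maxInitPerNode) {
            return false;
        }
        initCounts.put(nodeId, current + 1);
        return true;
    }

    // Completion (success, failure or abort) releases one unit of capacity,
    // letting an assigned-QUEUED shard from any snapshot or repo move to INIT.
    void completeInit(String nodeId) {
        initCounts.merge(nodeId, -1, Integer::sum);
    }
}
```

With a limit of 1, a second `tryStartInit` on the same node fails until `completeInit` frees the slot, mirroring the transition described in the PR summary.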
Hi @ywangd, I've created a changelog YAML for you.
DaveCTurner
left a comment
Great stuff, I much prefer this approach.
But the change is not propagated to following snapshots if all running shard snapshots come after the snapshot that contains the assigned-queued shard.
```java
final SnapshotsInProgress.Entry abortedEntry = existing.abort(
    currentState.nodes().getLocalNodeId(),
    ((shardId, shardSnapshotStatus) -> completeAbortedAssignedQueuedRunnables.add(
        () -> innerUpdateSnapshotState(existing.snapshot(), shardId, shardSnapshotStatus)
```
Apply updates as ShardSnapshotUpdate tasks. I think this is a better approach than trying to update them locally, which would require duplicating logic from the update task executors. We already have some duplication in processExternalChanges, and there is a TODO suggesting refactoring it away.
```java
// We cannot directly update its status here because there may be another snapshot for
// the same shard that is QUEUED which must be updated as well, i.e. a vertical update.
// So we submit the status update to let it be processed in a future cluster state update.
shardStatusUpdateConsumer.apply(entry.snapshot(), shardId, newShardSnapshotStatus);
```
Similarly, we send out a shard snapshot update so the state change is propagated correctly by the task executor.
```java
// Check horizontally within the snapshot to see whether any previously limited shard snapshots can now start
maybeStartAssignedQueuedShardSnapshots(
```
This has a side effect: a snapshot will kick off assigned-queued shards belonging to the same snapshot before starting a shard snapshot of the same shard that belongs to a later snapshot. I think this is desirable. It means that once a shard finishes its snapshot, it does not immediately get another snapshot in the INIT state, i.e. something we previously wanted to prevent by limiting the number of concurrent snapshots. With a limit on the number of INIT shards this is less important now, but it is still nice to have.
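The "horizontal first" behaviour described here could be sketched as follows. This is an illustrative, self-contained model; `QueuedShard` and `HorizontalScan` are hypothetical names, not the real implementation:

```java
import java.util.*;

// Hypothetical sketch: when capacity frees up on a node, promote the same
// snapshot's assigned-queued shards on that node first, up to the freed capacity.
record QueuedShard(String shardId, String nodeId) {}

class HorizontalScan {
    // Returns the assigned-queued shards of this snapshot, in iteration order,
    // that fit the freed capacity on the given node.
    static List<QueuedShard> promotable(List<QueuedShard> assignedQueued, String freedNodeId, int capacity) {
        List<QueuedShard> result = new ArrayList<>();
        for (QueuedShard q : assignedQueued) {
            if (result.size() == capacity) {
                break;
            }
            if (q.nodeId().equals(freedNodeId)) {
                result.add(q);
            }
        }
        return result;
    }
}
```

Because the scan runs within one snapshot before later snapshots of the same shard are considered, a just-completed shard is not immediately put back into INIT by a later snapshot.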
```java
for (var notUpdatedRepo : Sets.difference(existing.repos(), updatesByRepo.keySet())) {
    if (perNodeShardSnapshotCounter.hasCapacityOnAnyNode() == false) {
        break;
    }
    updated = maybeStartAssignedQueuedShardSnapshotsForRepo(
```
A possible optimization is to order the snapshots by start time and start assigned-queued shards for the earlier entries first.
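The suggested ordering could be sketched as a simple sort by start time before scanning for assigned-queued shards. `SnapshotEntry` here is an illustrative stand-in, not the real `SnapshotsInProgress.Entry`:

```java
import java.util.*;

// Hypothetical sketch of the suggested optimization: visit snapshot entries in
// start-time order so earlier snapshots receive freed capacity first.
record SnapshotEntry(String name, long startTimeMillis) {}

class StartOrder {
    static List<SnapshotEntry> byStartTime(Collection<SnapshotEntry> entries) {
        return entries.stream()
            .sorted(Comparator.comparingLong(SnapshotEntry::startTimeMillis))
            .toList();
    }
}
```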
server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java (outdated review comment, resolved)
@DaveCTurner The stress IT has been running successfully for more than 24 hours after fixing the last bug. I have also pondered the new logic for quite some time. I think it is sufficiently correct and we can start polishing for a production version. One important thing is to enhance the stress IT to enable relocation and maybe to dynamically change the limits. But before I proceed further, what would your suggestion be for this PR? Should we treat it as a PoC, close it (and the associated ticket), and create a new task for the production PR? Or should we keep working on it towards the production version?

@elasticmachine update branch
DaveCTurner
left a comment
I'm still too uncomfortable with the assigned-queued state, but it's taken me a while to really get to the heart of that discomfort. Fundamentally I think it's that there should be no need to keep track of these assignments in the cluster state itself, and that using the cluster state for this is going to be something we have to maintain forevermore.
I recognise the value of tracking the assignments somewhere: if we don't, we have to do a rather naive search for shards to activate on each cluster state update. Could we explore tracking these things using an in-memory data structure on the elected master node instead? On a master failover, the new master can populate its own local tracking from the routing table, but from then on it should be able to maintain it alongside the cluster state.
More generally, that's the direction I think we should take this area of the system: we're already doing rather a lot of work in the cluster state updates that relate to snapshots, because of the tension between wanting a normalized data structure for efficient replication and wanting a denormalized data structure for efficient update computations. This seems like a good opportunity to resolve that tension by separating those two concerns. In the short term I think we can continue to compute updates to both data structures on the master thread, but in future we may want to do more background work instead.
I opened #134795 to describe my thinking about the purpose of SnapshotsInProgress in its Javadocs.
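The master-local tracking idea sketched in this comment could look roughly like the following. All names here (`MasterLocalTracker`, `ShardSnapshot`) are hypothetical stand-ins for Elasticsearch internals; the point is that the counts live only in memory on the elected master and are rebuilt from the replicated snapshot entries after a failover:

```java
import java.util.*;

// Hedged sketch: per-node INIT counts kept in master-local memory instead of
// the cluster state, rebuilt from scratch on master failover and then
// maintained incrementally alongside cluster state updates.
record ShardSnapshot(String nodeId, String state) {}

class MasterLocalTracker {
    private final Map<String, Integer> initCountsByNode = new HashMap<>();

    // On failover, the new master scans the snapshot entries it already has
    // in its cluster state and recomputes the counts.
    void rebuild(List<ShardSnapshot> shards) {
        initCountsByNode.clear();
        for (ShardSnapshot s : shards) {
            if ("INIT".equals(s.state())) {
                initCountsByNode.merge(s.nodeId(), 1, Integer::sum);
            }
        }
    }

    // Incremental maintenance while this node remains master.
    void onShardStarted(String nodeId) {
        initCountsByNode.merge(nodeId, 1, Integer::sum);
    }

    void onShardCompleted(String nodeId) {
        initCountsByNode.merge(nodeId, -1, Integer::sum);
    }

    int initCount(String nodeId) {
        return initCountsByNode.getOrDefault(nodeId, 0);
    }
}
```

The trade-off discussed in the thread is that this avoids replicating assignment state to every node, at the cost of a rebuild step and careful bookkeeping on the master.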
Thanks for the feedback, David. I really appreciate it. Your suggestion makes sense to me. In practice, it feels like more or less a middle ground between the current "assigned-queued" approach and treating them entirely as "queued", and therefore potentially has benefits from both sides. Your articulation of mixing data structures for two separate concerns is a great one. I hadn't thought of it that way. Thanks for sharing your insights. I'll go ahead and experiment with your suggestion. I may raise it as a new PR, since it's rather difficult to shift direction on this one and it might still be useful as a comparison once the other one is up.
I have experimented with this idea in this draft PR (#134977). It does not work yet, but I am afraid it might not be the direction that we want to take. Though it does feel theoretically great to keep the tracking outside, it has quite a few downsides:

With this PR, I also started by trying to treat the node-capacity-limited shards entirely as … In summary, I think the initial version of the …
@DaveCTurner What is your take on my last comment? I wonder whether we should continue to allocate some time for this issue in the upcoming iteration, or whether we should reconsider it. Thanks!
I think these are valid points but none are strong enough to be a blocker IMO.
I think it would be best to take the other approach and make stateless snapshots more stateless instead.
David and I synced and agreed to not move further with this PR and instead take the other approach for stateless snapshots. See also ES-12377 for details. |
This PR adds a new feature to allow configuring the max number of shard snapshots in `INIT` state per node. Since shard snapshots in `INIT` state prevent relocation, limiting their number allows relocation for more shards. Some important implementation details are:

* The `QUEUED` state is reused for the new state where a shard should have been in `INIT` state but is limited due to the new configuration. To differentiate it from the existing unassigned-queued state, the new state has a non-null `nodeId`, similar to `INIT`. This new state is called assigned-queued.
* The new setting is dynamically configurable with a default of `0`, meaning no limit, i.e. no behaviour change by default. The stress test has been updated to randomly (1) configure and update the limit and (2) force relocation.

Resolves: ES-12377
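The "`0` means no limit" semantics of the dynamic setting could be modelled as below. `InitLimitSetting` and its methods are illustrative names only, not the actual setting infrastructure:

```java
// Minimal model of the dynamically updatable limit with "0 = unlimited"
// semantics (hypothetical helper, not the real Elasticsearch Setting code).
class InitLimitSetting {
    // volatile so a dynamic settings update is visible to reader threads
    private volatile int maxShardSnapshotsInInitPerNode = 0; // default: no limit

    void update(int newLimit) {
        if (newLimit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        maxShardSnapshotsInInitPerNode = newLimit;
    }

    // A node has capacity if the limit is disabled (0) or not yet reached.
    boolean hasCapacity(int currentInitCountOnNode) {
        int limit = maxShardSnapshotsInInitPerNode;
        return limit == 0 || currentInitCountOnNode < limit;
    }
}
```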