-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add new recovery source for reshard split target shards #129159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 3 commits
078e6c6
2c520c8
b928faf
27f905c
42dd895
7e18c51
58112aa
de01dfd
691ec62
4af45d9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -30,6 +30,7 @@ | |
| import org.elasticsearch.cluster.node.DiscoveryNode; | ||
| import org.elasticsearch.cluster.node.DiscoveryNodes; | ||
| import org.elasticsearch.cluster.routing.IndexShardRoutingTable; | ||
| import org.elasticsearch.cluster.routing.RecoverySource; | ||
| import org.elasticsearch.cluster.routing.RecoverySource.Type; | ||
| import org.elasticsearch.cluster.routing.RoutingNode; | ||
| import org.elasticsearch.cluster.routing.RoutingTable; | ||
|
|
@@ -701,6 +702,13 @@ private void createShard(ShardRouting shardRouting, ClusterState state) { | |
| logger.trace("ignoring initializing shard {} - no source node can be found.", shardId); | ||
| return; | ||
| } | ||
| } else if (shardRouting.recoverySource().getType() == Type.RESHARD_SPLIT_TARGET) { | ||
|
||
| ShardId sourceShardId = ((RecoverySource.SplitTargetRecoverySource) shardRouting.recoverySource()).getSourceShardId(); | ||
| sourceNode = findSourceNodeForSplitTargetRecovery(state.routingTable(project.id()), state.nodes(), sourceShardId); | ||
| if (sourceNode == null) { | ||
| logger.trace("ignoring initializing reshard target shard {} - no source node can be found.", shardId); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we assert false here? I think this situation should be invalid, the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Want to make sure i understand - are you saying that if routing was updated for the target shard then the same cluster state should contain routing for the source shard as well?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've added asserts inside
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is mostly that having a target shard recovery without an active source shard is not sound. If we fail the source shard we should probably also fail the target shard. At least that is how peer recovery works, if the primary fails, we fail the replicas too in the same cluster state update. I am also not fond of the way we just ignore recovery here. If we accept this case we need a test case to verify that we resume it later. Hence the asserts to clarify the situation. We may refine later. |
||
| return; | ||
| } | ||
| } else { | ||
| sourceNode = null; | ||
| } | ||
|
|
@@ -988,6 +996,29 @@ private static DiscoveryNode findSourceNodeForPeerRecovery(RoutingTable routingT | |
| return sourceNode; | ||
| } | ||
|
|
||
| private static DiscoveryNode findSourceNodeForSplitTargetRecovery( | ||
| RoutingTable routingTable, | ||
| DiscoveryNodes nodes, | ||
| ShardId sourceShardId | ||
| ) { | ||
| ShardRouting sourceShardRouting = routingTable.shardRoutingTable(sourceShardId).primaryShard(); | ||
|
|
||
| if (sourceShardRouting.active() == false) { | ||
| logger.trace("can't find reshard split source node because source shard {} is not active.", sourceShardRouting); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should assert false here and below as well.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If it fails we can leave it out for now, but then we should add a jira to ensure we have these properties in the routing table. |
||
| return null; | ||
| } | ||
|
|
||
| DiscoveryNode sourceNode = nodes.get(sourceShardRouting.currentNodeId()); | ||
| if (sourceNode == null) { | ||
| logger.trace( | ||
| "can't find reshard split source node because source shard {} is assigned to an unknown node.", | ||
| sourceShardRouting | ||
| ); | ||
| return null; | ||
| } | ||
| return sourceNode; | ||
| } | ||
|
|
||
| private record PendingShardCreation(String clusterStateUUID, long startTimeMillis) {} | ||
|
|
||
| private class RecoveryListener implements PeerRecoveryTargetService.RecoveryListener { | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK master nodes are upgraded last meaning that data nodes will already be able to read the new value. Let me know if i am wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand how this would work in general. Couldn't you have a cluster where all the indexing nodes are also master-eligible?
I do think we intend to fail resharding actions on a cluster until all nodes support it though. I thought we had a ticket for that but I don't see it now. I've stubbed in ES-12048 for this.
At any rate I don't think we have a problem on read, since by definition we know about the new source, and the other node either also does or simply won't send this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, i agree. If this logic existed it should have been on the writer but i think that's addressed by the check when enabling resharding that you mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not rely on the master being upgraded last, but we can rely on the feature not being available. I suppose we have no rolling upgrade tests yet with the feature enabled, hence not having the version check should work just fine. So no change needed, just wanted to ensure we would not rely on the master being upgraded last.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the consensus (from my perspective) is that there will be a feature flag that enables resharding as a whole.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a feature flag, but a guard on all nodes in the cluster having a transport version that speaks resharding. This covers a case where a cluster is being upgraded from pre-resharding to post after a resharding feature flag has been set.