Skip to content

Conversation

@lkts
Copy link
Contributor

@lkts lkts commented Jun 9, 2025

This PR introduces new recovery source for shards that are targets of a reshard split. This recovery source contains metadata necessary for the recovery process of split target shards to work. It also serves as an abstraction layer and helps recovery code path to avoid reasoning about data that is very specific to resharding like IndexReshardingMetadata.

These changes are tested in scope of resharding tests in the linked PR.

}

public static RecoverySource readFrom(StreamInput in) throws IOException {
// TODO is transport version check needed?
Copy link
Contributor Author

@lkts lkts Jun 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK master nodes are upgraded last meaning that data nodes will already be able to read the new value. Let me know if i am wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand how this would work in general. Couldn't you have a cluster where all the indexing nodes are also master-eligible?

I do think we intend to fail resharding actions on a cluster until all nodes support it though. I thought we had a ticket for that but I don't see it now. I've stubbed in ES-12048 for this.

At any rate I don't think we have a problem on read, since by definition we know about the new source, and the other node either also does or simply won't send this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, i agree. If this logic existed it should have been on the writer but i think that's addressed by the check when enabling resharding that you mentioned.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not rely on the master being upgraded last, but we can rely on the feature not being available. I suppose we have no rolling upgrade tests yet with the feature enabled, hence not having the version check should work just fine. So no change needed, just wanted to ensure we would not rely on the master being upgraded last.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the consensus (from my perspective) is that there will be a feature flag that enables resharding as a whole.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a feature flag, but a guard on all nodes in the cluster having a transport version that speaks resharding. This covers a case where a cluster is being upgraded from pre-resharding to post after a resharding feature flag has been set.

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Jun 9, 2025
@lkts lkts requested review from bcully and henningandersen June 9, 2025 21:19
@lkts lkts marked this pull request as ready for review June 9, 2025 21:19
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jun 9, 2025
* Recovery of a shard that is created as a result of a resharding split.
* Not to be confused with _split API.
*/
public static class ReshardSplitTargetRecoverySource extends RecoverySource {
Copy link
Contributor Author

@lkts lkts Jun 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be worrying too much about naming but i don't want to call it Split because of _split API and i want Target in there to disambiguate between source and target. So i came up with this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like ReshardSplit. I'm not sure what Target is disambiguating though. I think recovery is implicitly in the context of the target? We don't have two different peer recovery sources for instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking there could be some custom logic for the reshard source but indeed peer recovery does not have that.

@lkts lkts added >non-issue :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. labels Jun 9, 2025
@elasticsearchmachine elasticsearchmachine added Team:Distributed Indexing Meta label for Distributed Indexing team and removed needs:triage Requires assignment of a team area label labels Jun 9, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

Copy link
Contributor

@bcully bcully left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a naming question really.

}

public static RecoverySource readFrom(StreamInput in) throws IOException {
// TODO is transport version check needed?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand how this would work in general. Couldn't you have a cluster where all the indexing nodes are also master-eligible?

I do think we intend to fail resharding actions on a cluster until all nodes support it though. I thought we had a ticket for that but I don't see it now. I've stubbed in ES-12048 for this.

At any rate I don't think we have a problem on read, since by definition we know about the new source, and the other node either also does or simply won't send this case.

* Recovery of a shard that is created as a result of a resharding split.
* Not to be confused with _split API.
*/
public static class ReshardSplitTargetRecoverySource extends RecoverySource {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like ReshardSplit. I'm not sure what Target is disambiguating though. I think recovery is implicitly in the context of the target? We don't have two different peer recovery sources for instance.

logger.trace("ignoring initializing shard {} - no source node can be found.", shardId);
return;
}
} else if (shardRouting.recoverySource().getType() == Type.RESHARD_SPLIT_TARGET) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a request for change, this isn't your API, but it would be nice if the ShardRouting API didn't make us get type and then cast in two steps. I could imagine something like a shardRouting.recoverySource().asReshardSplit() maybe that did the check and returned the cast object if it passed or null.

I'm just griping though, it's probably not worth doing now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could instanceof the shardRouting.recoverySource() instead. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that seems nicer

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

ShardId sourceShardId = reshardSplitRecoverySource.getSourceShardId();
sourceNode = findSourceNodeForReshardSplitRecovery(state.routingTable(project.id()), state.nodes(), sourceShardId);
if (sourceNode == null) {
logger.trace("ignoring initializing reshard target shard {} - no source node can be found.", shardId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we assert false here? I think this situation should be invalid, the shardRouting is taken from state further out - and we should assume that. I think we could add the same above, but would not want to complicate the work here with figuring that out.

Copy link
Contributor Author

@lkts lkts Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to make sure i understand - are you saying that if routing was updated for the target shard then the same cluster state should contain routing for the source shard as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added asserts inside findSourceNodeForReshardSplitRecovery.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is mostly that having a target shard recovery without an active source shard is not sound. If we fail the source shard we should probably also fail the target shard. At least that is how peer recovery works, if the primary fails, we fail the replicas too in the same cluster state update.

I am also not fond of the way we just ignore recovery here. If we accept this case we need a test case to verify that we resume it later.

Hence the asserts to clarify the situation. We may refine later.

ShardRouting sourceShardRouting = routingTable.shardRoutingTable(sourceShardId).primaryShard();

if (sourceShardRouting.active() == false) {
logger.trace("can't find reshard split source node because source shard {} is not active.", sourceShardRouting);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should assert false here and below as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it fails we can leave it out for now, but then we should add a jira to ensure we have these properties in the routing table.

}

public static RecoverySource readFrom(StreamInput in) throws IOException {
// TODO is transport version check needed?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not rely on the master being upgraded last, but we can rely on the feature not being available. I suppose we have no rolling upgrade tests yet with the feature enabled, hence not having the version check should work just fine. So no change needed, just wanted to ensure we would not rely on the master being upgraded last.

@lkts lkts merged commit b24bb35 into elastic:main Jun 13, 2025
18 checks passed
@lkts lkts deleted the split_recovery_source branch June 13, 2025 19:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. >non-issue serverless-linked Added by automation, don't add manually Team:Distributed Indexing Meta label for Distributed Indexing team v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants