Conversation

@ywangd
Member

@ywangd ywangd commented Jun 25, 2025

This PR makes snapshot service code and APIs multi-project compatible.

Resolves: ES-10225
Resolves: ES-10226

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Jun 25, 2025
@ywangd ywangd added >non-issue :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Jun 26, 2025
@ywangd ywangd requested review from DaveCTurner and pxsalehi June 26, 2025 04:39
@ywangd ywangd marked this pull request as ready for review June 26, 2025 04:40
@ywangd ywangd requested a review from a team as a code owner June 26, 2025 04:40
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Jun 26, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd
Member Author

ywangd commented Jun 26, 2025

Most of the changes are in SnapshotsService and BlobStoreRepository. The rest is mostly cascading. Initially I wanted separate PRs for the clone, cleanup and maybe get-status APIs, but that does not make much sense since there is a large amount of code overlap and reuse among all the APIs. It is actually better, and easier to reason about, to make the entirety of SnapshotsService project-aware. Restore is still left out. It is handled by different classes and will be addressed separately.

* @param projectId The project that the repository belongs to
* @param name Name of the repository
*/
public record ProjectRepo(ProjectId projectId, String name) implements Writeable {
Member Author

This is an existing class extracted from RepositoryOperation.

Comment on lines +1828 to +1835
final var projectMetadata = clusterMetadata.getProject(getProjectId());
executor.execute(ActionRunnable.run(allMetaListeners.acquire(), () -> {
if (finalizeSnapshotContext.serializeProjectMetadata()) {
PROJECT_METADATA_FORMAT.write(projectMetadata, blobContainer(), snapshotId.getUUID(), compress);
} else {
GLOBAL_METADATA_FORMAT.write(clusterMetadata, blobContainer(), snapshotId.getUUID(), compress);
}
}));
Member Author

This is where we conditionally write ProjectMetadata for multi-project snapshots. The Metadata in this case is a thin wrapper around ProjectMetadata to reuse existing finalization related classes.

Comment on lines +3855 to +3876
private void startExecutableClones(SnapshotsInProgress snapshotsInProgress) {
for (List<SnapshotsInProgress.Entry> entries : snapshotsInProgress.entriesByRepo()) {
startExecutableClones(entries);
}
}

/**
* Maybe kick off new shard clone operations for all repositories of the specified project
*/
private void startExecutableClones(SnapshotsInProgress snapshotsInProgress, ProjectId projectId) {
for (List<SnapshotsInProgress.Entry> entries : snapshotsInProgress.entriesByRepo(projectId)) {
startExecutableClones(entries);
}
}

/**
* Maybe kick off new shard clone operations for the single specified project repository
*/
private void startExecutableClones(SnapshotsInProgress snapshotsInProgress, ProjectRepo projectRepo) {
startExecutableClones(snapshotsInProgress.forRepo(Objects.requireNonNull(projectRepo)));
}

Member Author

Snapshotting is a state machine that triggers the next operation when the current operation finishes. In most cases, the triggering is confined to the same repository. This is the simplest case and is migrated as is. The other case is triggering across all repositories. In an MP setup, this could mean either across all repositories of all projects or across all repositories of a single project. This is the reason for the 3 variants of the same-named method here. The principles that I have applied are:

  1. If the scope was a single repository, keep it as is.
  2. If the scope was all repositories and the trigger is a reaction to cluster state changes, i.e. applyClusterState, it applies to all repositories across all projects.
  3. If the scope was all repositories and the trigger happens after completing a particular snapshot operation, e.g. deleting a snapshot entry, it applies to all repositories of the single project that the operation is associated with.
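
The three scopes can be sketched as follows. This is only an illustration under simplifying assumptions: `CloneScopes`, its `add` helper, the `started` list and the use of plain strings for project IDs and entries are hypothetical stand-ins, not the real SnapshotsService or SnapshotsInProgress code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for the real ProjectRepo record (project ID simplified to a String).
record ProjectRepo(String projectId, String name) {}

// Hypothetical model of the three trigger scopes; not the real SnapshotsService.
class CloneScopes {
    // In-progress entries grouped by (project, repository), in insertion order.
    private final Map<ProjectRepo, List<String>> entriesByRepo = new LinkedHashMap<>();
    // Records which entries were "started", for demonstration only.
    final List<String> started = new ArrayList<>();

    void add(String projectId, String repo, String entry) {
        entriesByRepo.computeIfAbsent(new ProjectRepo(projectId, repo), k -> new ArrayList<>()).add(entry);
    }

    // Principle 2: reacting to a cluster state change covers all repositories of all projects.
    void startExecutableClones() {
        for (List<String> entries : entriesByRepo.values()) {
            start(entries);
        }
    }

    // Principle 3: a follow-up to a finished operation covers all repositories of one project.
    void startExecutableClones(String projectId) {
        for (Map.Entry<ProjectRepo, List<String>> e : entriesByRepo.entrySet()) {
            if (e.getKey().projectId().equals(projectId)) {
                start(e.getValue());
            }
        }
    }

    // Principle 1: a single repository of a single project.
    void startExecutableClones(ProjectRepo projectRepo) {
        start(entriesByRepo.getOrDefault(Objects.requireNonNull(projectRepo), List.of()));
    }

    private void start(List<String> entries) {
        started.addAll(entries);
    }

    public static void main(String[] args) {
        CloneScopes scopes = new CloneScopes();
        scopes.add("p1", "repoA", "clone-1");
        scopes.add("p2", "repoA", "clone-2");
        scopes.startExecutableClones("p1"); // only touches project p1
        System.out.println(scopes.started);
    }
}
```

Keying everything on the (project, repository) pair means that two projects may use the same repository name without their entries being mixed up.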

Comment on lines +1838 to +1841
private static Tuple<ClusterState, List<SnapshotDeletionsInProgress.Entry>> readyDeletions(
ClusterState currentState,
@Nullable ProjectId projectId
) {
Member Author

This is another example of different scopes for triggering the next operation. In this case, there is no single-repository scope; it is either cluster-wide (when projectId == null) or a single project.
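
A minimal sketch of a nullable project scope, assuming that a null projectId selects the cluster-wide scope and a non-null one restricts the result to that project. The `Deletion` record and the `ReadyDeletions` class are hypothetical, and the real method also returns an updated ClusterState, which is left out here.

```java
import java.util.List;

// Hypothetical, simplified deletion entry; not the real SnapshotDeletionsInProgress.Entry.
record Deletion(String projectId, String repository, boolean ready) {}

class ReadyDeletions {
    // Assumption: projectId == null means "all projects"; otherwise only that project is considered.
    static List<Deletion> readyDeletions(List<Deletion> deletions, String projectId) {
        return deletions.stream()
            .filter(Deletion::ready)
            .filter(d -> projectId == null || d.projectId().equals(projectId))
            .toList();
    }

    public static void main(String[] args) {
        List<Deletion> all = List.of(
            new Deletion("p1", "repoA", true),
            new Deletion("p2", "repoA", true),
            new Deletion("p2", "repoB", false)
        );
        System.out.println(readyDeletions(all, null)); // cluster-wide scope
        System.out.println(readyDeletions(all, "p2")); // single-project scope
    }
}
```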

@ywangd
Member Author

ywangd commented Jun 27, 2025

Depending on how we handle project soft-deletion and/or clean-up, snapshots may see a project getting concurrently deleted and therefore fail. This PR does not attempt to handle it more gracefully since it would become noise and get buried in the large amount of namespacing changes. I can raise a separate ticket to track this work.

@pxsalehi
Member

Depending on how we handle project soft-deletion and/or clean-up, snapshots may see a project getting concurrently deleted and therefore fail. This PR does not attempt to handle it more gracefully since it would become noise and get buried in the large amount of namespacing changes. I can raise a separate ticket to track this work.

Yeah, we'd need a new ticket for this under the soft-deletion epic. We briefly mentioned in the design doc that once the project is marked for deletion, we should 1) prevent any new snapshots from being scheduled/requested (this partially goes back to making those internal actions check for the project deletion block), and 2) cancel any ongoing snapshotting for that project (I guess it's not that simple, but it should at least fail gracefully and not blow up). For 1, we have ES-12121. But I don't think 2 has a ticket yet.

Member

@pxsalehi pxsalehi left a comment

LGTM. To my eyes these all look like straightforward changes to trickle the project ID down everywhere necessary. I don't have strong opinions about the details. However, considering that the snapshotting code is convoluted and in a delicate state, I'm going to defer the final approval to David (or anyone else with more snapshotting experience).

Comment on lines +2092 to +2094
if (token == null) {
token = parser.nextToken();
}
Member

what's this all about?

Member Author

This is needed when the ProjectMetadata is parsed on its own, i.e. not as part of Metadata. In the latter case, parser.nextToken() has already been called when parsing the outer structure. In this PR, we need to de/serialize ProjectMetadata on its own, hence the need to call this method explicitly. We also use this pattern in other places such as here.
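
The priming pattern can be illustrated with a toy pull parser. `TinyParser` and `ProjectParsing` are hypothetical and only mimic the idea of a current token that stays null until the parser is first advanced. They are not the real XContentParser API.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical minimal pull parser over pre-split tokens.
class TinyParser {
    private final Iterator<String> tokens;
    private String current; // null until nextToken() is first called

    TinyParser(List<String> tokens) {
        this.tokens = tokens.iterator();
    }

    String currentToken() {
        return current;
    }

    String nextToken() {
        current = tokens.hasNext() ? tokens.next() : null;
        return current;
    }
}

class ProjectParsing {
    // Parses "{ name }" and returns the name. Works both standalone and
    // when an outer parser has already advanced to the opening "{".
    static String parseProject(TinyParser parser) {
        String token = parser.currentToken();
        if (token == null) {
            // Standalone parse: nothing has been read yet, so prime the parser.
            token = parser.nextToken();
        }
        if (!"{".equals(token)) {
            throw new IllegalStateException("expected { but got " + token);
        }
        String name = parser.nextToken();
        if (!"}".equals(parser.nextToken())) {
            throw new IllegalStateException("expected } after " + name);
        }
        return name;
    }

    public static void main(String[] args) {
        // Standalone: the parser has not been advanced before the call.
        System.out.println(parseProject(new TinyParser(List.of("{", "default", "}"))));
    }
}
```

Without the null check, the standalone call would see a null current token and fail the `{` check, while the nested call would work. With it, both call sites share one parsing method.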

SnapshotInfo snapshotInfo = null;
try {
- snapshotInfo = SNAPSHOT_FORMAT.read(metadata.name(), blobContainer(), snapshotId.getUUID(), namedXContentRegistry);
+ snapshotInfo = SNAPSHOT_FORMAT.read(getProjectRepo(), blobContainer(), snapshotId.getUUID(), namedXContentRegistry);
Member

Is getProjectRepo() now always using DEFAULT as the project name here for non-MP?

Member Author

That is correct.

@ywangd
Member Author

ywangd commented Jun 30, 2025

I don't think 2 has a ticket yet.

I raised ES-12217. Feel free to update it. Thanks!

@ywangd
Member Author

ywangd commented Jul 1, 2025

@DaveCTurner Could you please review this PR since you are the expert in this area? I went through the changes carefully and the test in the linked PR has been quite helpful in catching issues. We could use your blessing for the final approval. Thank you! 🙏

Contributor

@DaveCTurner DaveCTurner left a comment

This looks ok to me, although as with most changes to snapshot code that means more "no obvious bugs" rather than "obviously no bugs".

I'd rather there was more coverage of the multi-project cases in tests, in particular in SnapshotStressTestsIT and SnapshotResiliencyTests. AIUI that's still something we're working towards, particularly for the internal-cluster tests. I'm going to have to leave it to your judgement to decide whether it's safe to merge this without that test coverage in place.

}

public static String projectRepoString(ProjectId projectId, String repositoryName) {
return "[" + projectId + "][" + repositoryName + "]";
Contributor

I think this makes some of the user-facing messages a bit harder to understand (especially in a single-project world where everything is now prefixed with a mysterious [default]). Would it make more sense to say something like [repository=${repositoryName}/project=${projectId}]? Not completely sold on that either, but nor do I have a better suggestion right now.

Member Author

I changed it to be [project/repositoryName]. Using a slash as the separator is at least consistent with what we do for NodeIndicesStats. The difference is that NodeIndicesStats hides the default project when MP is disabled. That makes sense there since it is part of the API response. Our method here is used for logging, so I don't think we need this conditional logic. We'd also need a projectResolver to tell whether MP is disabled, which does not seem worth the complexity. Using a slash also keeps the same number of bracket pairs, which might be helpful if people have parsing scripts relying on it. In addition, I slightly prefer having the project first and then the repositoryName in the output, since in some places the entire sequence is "project, repository, snapshot", which seems to flow better than having the project in the middle.

Bottom line is that we can still change it in the future when necessary since it's only used for logging.
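
For reference, a sketch of the format described above, assuming the method simply joins the two parts with a slash inside one pair of brackets. The project ID is simplified to a String here, and the class name `ProjectRepoStrings` is hypothetical.

```java
class ProjectRepoStrings {
    // Project first, then repository, separated by a slash inside a single bracket pair.
    static String projectRepoString(String projectId, String repositoryName) {
        return "[" + projectId + "/" + repositoryName + "]";
    }

    public static void main(String[] args) {
        // prints "[default/my-repo]"
        System.out.println(projectRepoString("default", "my-repo"));
    }
}
```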

@ywangd
Member Author

ywangd commented Jul 3, 2025

Test coverage is a good point. I do plan to improve it in follow-ups. It is a valid thing to do for all MP changes. Since the snapshot state machine is particularly intricate, I think it is worth tracking separately, so I raised ES-12246.

I think it is safe and beneficial to merge this PR first. My reasoning is:

  1. The most important and immediate concern is to ensure nothing breaks in a single cluster setup. This is still guaranteed by all existing stress tests.
  2. Multi-project snapshots do not have any production usage yet. Less obvious bugs likely won't have serious impact. We still want to find and fix them; it's an ongoing effort. Any obvious bugs should be caught by the integration tests on the linked serverless PR, which run snapshot operations concurrently across multiple projects.
  3. OBS work will be unblocked once this PR is merged. It would be great to have the two teams work in parallel. This is also the reason why restore is not yet MP ready, since it is not needed by OBS.
  4. Making SnapshotStressTestsIT or SnapshotResiliencyTests MP ready requires considerable effort that is better handled separately. The former requires more general support from the InternalClusterTest infrastructure, which I also plan to work on to support ES-12053. The latter may be a bit simpler but still needs changes to DeterministicTaskQueue and/or DisruptableMockTransport to pass the necessary ThreadContext headers between nodes, as well as a few other things. Even when all of this is in place, the MP tests will be exercising a single active project with the other projects dormant. This is still valuable and is what most current MP tests do, e.g. the snapshot related YAML tests. But for snapshots, I think we get more value if all projects are exercised concurrently. See also ES-12246 for some more details.
  5. Restore is not yet MP ready. So it is not really feasible to update SnapshotStressTestsIT or SnapshotResiliencyTests fully at this stage.

In summary, my suggestion is to merge this PR as is. Multi-project snapshots is an active ongoing project. We will keep improving it in the coming iterations. I hope this makes sense. Thanks!

@ywangd ywangd requested a review from DaveCTurner July 3, 2025 03:22
Contributor

@DaveCTurner DaveCTurner left a comment

Thanks, yes this seems safe enough for now.

The latter may be a bit simpler but still needs changes to DeterministicTaskQueue and/or DisruptableMockTransport to pass necessary ThreadContext headers

Wow yeah I didn't notice the lack of thread context propagation here. I wonder how we have got away without that for all this time.

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jul 3, 2025
@elasticsearchmachine elasticsearchmachine merged commit f15ef7c into elastic:main Jul 3, 2025
32 checks passed
@ywangd ywangd deleted the ES-10225-snapshots-service-multi-project branch July 3, 2025 08:49
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jul 3, 2025
elasticsearchmachine pushed a commit that referenced this pull request Jul 25, 2025
This PR makes the restore process project aware and unmutes the relevant
tests. The latter requires TransportRecoveryAction to be project aware,
which is done in this PR as well.

Relates: #130000 Resolves: ES-10228
Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue serverless-linked Added by automation, don't add manually Team:Distributed Coordination Meta label for Distributed Coordination team v9.2.0

4 participants