Snapshots support multi-project #130000

ywangd · 2025-06-25T11:43:50Z

This PR makes snapshot service code and APIs multi-project compatible.

Resolves: ES-10225
Resolves: ES-10226

elasticsearchmachine · 2025-06-26T04:41:18Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

ywangd · 2025-06-26T08:09:52Z

Most of the changes are in SnapshotsService and BlobStoreRepository. The rest is mostly cascading. Initially I wanted to have separate PRs for the clone, cleanup and maybe get-status APIs. But that does not seem to make much sense since there is a large code overlap and reuse among all APIs, e.g. it is actually better and easier to reason about if the entirety of SnapshotsService is made project-aware. Restore is still left out. It is handled by different classes and will be addressed separately.

ywangd · 2025-06-26T08:10:16Z

server/src/main/java/org/elasticsearch/repositories/ProjectRepo.java

+ * @param projectId The project that the repository belongs to
+ * @param name      Name of the repository
+ */
+public record ProjectRepo(ProjectId projectId, String name) implements Writeable {


This is an existing class extracted from RepositoryOperation.

ywangd · 2025-06-27T02:01:20Z

server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java

+                    final var projectMetadata = clusterMetadata.getProject(getProjectId());
+                    executor.execute(ActionRunnable.run(allMetaListeners.acquire(), () -> {
+                        if (finalizeSnapshotContext.serializeProjectMetadata()) {
+                            PROJECT_METADATA_FORMAT.write(projectMetadata, blobContainer(), snapshotId.getUUID(), compress);
+                        } else {
+                            GLOBAL_METADATA_FORMAT.write(clusterMetadata, blobContainer(), snapshotId.getUUID(), compress);
+                        }
+                    }));


This is where we conditionally write ProjectMetadata for multi-project snapshots. The Metadata in this case is a thin wrapper around ProjectMetadata so that the existing finalization-related classes can be reused.

ywangd · 2025-06-27T02:44:16Z

server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java

+    private void startExecutableClones(SnapshotsInProgress snapshotsInProgress) {
+        for (List<SnapshotsInProgress.Entry> entries : snapshotsInProgress.entriesByRepo()) {
+            startExecutableClones(entries);
+        }
+    }
+
+    /**
+     * Maybe kick off new shard clone operations for all repositories of the specified project
+     */
+    private void startExecutableClones(SnapshotsInProgress snapshotsInProgress, ProjectId projectId) {
+        for (List<SnapshotsInProgress.Entry> entries : snapshotsInProgress.entriesByRepo(projectId)) {
+            startExecutableClones(entries);
        }
    }

+    /**
+     * Maybe kick off new shard clone operations for the single specified project repository
+     */
+    private void startExecutableClones(SnapshotsInProgress snapshotsInProgress, ProjectRepo projectRepo) {
+        startExecutableClones(snapshotsInProgress.forRepo(Objects.requireNonNull(projectRepo)));
+    }
+


Snapshotting is a state machine that triggers the next operation when the current operation finishes. In most cases, the triggering is confined to the same repository. This is the simplest case and is migrated as is. The other case is triggering across all repositories. In an MP setup, this could mean either across all repositories of all projects or across all repositories of a single project. This is the reason for the 3 variants of the same-named method here. The principles that I have applied are:

If the scope was a single repository, keep it as is.

If the scope was all repositories and reacting to cluster state changes, i.e. applyClusterState, it applies to all repositories across all projects.

If the scope was all repositories and the triggering happens after completing a particular snapshot operation, e.g. deleting a snapshot entry, it applies to all repositories of the single project that the operation is associated with.
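
The three scopes above can be sketched with a small, self-contained model. This is only an illustration under stated assumptions: `RepoKey`, `CloneEntry` and `InProgress` are hypothetical stand-ins (with a plain String as project ID), not the real `ProjectRepo`, `SnapshotsInProgress.Entry` or `SnapshotsInProgress` classes, and the real grouping logic may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-ins for the real Elasticsearch types; names are illustrative only.
record RepoKey(String projectId, String repoName) {}

record CloneEntry(RepoKey repo, String snapshot) {}

final class InProgress {
    // Entries grouped per (project, repository) pair, in insertion order.
    private final Map<RepoKey, List<CloneEntry>> byRepo = new LinkedHashMap<>();

    void add(CloneEntry entry) {
        byRepo.computeIfAbsent(entry.repo(), k -> new ArrayList<>()).add(entry);
    }

    // Scope 1: a single repository of a single project.
    List<CloneEntry> forRepo(RepoKey repo) {
        return byRepo.getOrDefault(Objects.requireNonNull(repo), List.of());
    }

    // Scope 2: all repositories across all projects,
    // e.g. when reacting to a cluster state change.
    List<List<CloneEntry>> entriesByRepo() {
        return new ArrayList<>(byRepo.values());
    }

    // Scope 3: all repositories of one project,
    // e.g. after an operation belonging to that project completes.
    List<List<CloneEntry>> entriesByRepo(String projectId) {
        List<List<CloneEntry>> result = new ArrayList<>();
        for (Map.Entry<RepoKey, List<CloneEntry>> e : byRepo.entrySet()) {
            if (e.getKey().projectId().equals(projectId)) {
                result.add(e.getValue());
            }
        }
        return result;
    }
}
```

Keying the map by the (project, repository) pair keeps the single-repository case unchanged, while the project-scoped variant is just a filter over the same grouping.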

ywangd · 2025-06-27T02:45:58Z

server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java

+    private static Tuple<ClusterState, List<SnapshotDeletionsInProgress.Entry>> readyDeletions(
+        ClusterState currentState,
+        @Nullable ProjectId projectId
+    ) {


This is another example of different scopes for triggering the next operation. In this case there is no single-repository scope; the scope is either cluster-wide (when projectId == null) or a single project.
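
A minimal sketch of the nullable-scope idea, assuming that a null project ID selects the cluster-wide scope and a non-null one restricts the result to that project. `Deletion` and `ReadyDeletionsSketch` are hypothetical names, and the project ID is modelled as a plain String; the real `readyDeletions` also updates the cluster state, which is omitted here.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a deletion entry; not the real SnapshotDeletionsInProgress.Entry.
record Deletion(String projectId, String repoName) {}

final class ReadyDeletionsSketch {
    // projectIdOrNull == null is assumed to mean "consider every project".
    static List<Deletion> inScope(List<Deletion> all, String projectIdOrNull) {
        return all.stream()
            .filter(d -> projectIdOrNull == null || projectIdOrNull.equals(d.projectId()))
            .collect(Collectors.toList());
    }
}
```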

ywangd · 2025-06-27T03:00:24Z

Depending on how we handle project soft-deletion and/or clean-up, snapshots may see a project getting concurrently deleted and therefore fail. This PR does not attempt to handle it more gracefully since it would become noise and get buried in the large amount of namespacing changes. I can raise a separate ticket to track this work.

pxsalehi · 2025-06-27T11:19:37Z

Depending on how we handle project soft-deletion and/or clean-up, snapshots may see a project getting concurrently deleted and therefore fail. This PR does not attempt to handle it more gracefully since it would become noise and get buried in the large amount of namespacing changes. I can raise a separate ticket to track this work.

Yeah, we'd need a new ticket for this under the soft-deletion epic. We briefly mentioned in the design doc that once the project is marked for deletion, we should 1) prevent any new snapshots from being scheduled/requested. This partially goes back to making those internal actions check for the project deletion block. 2) cancel any ongoing snapshotting for that project (I guess that is not so simple, but it should at least fail gracefully and not blow up). For 1, we have ES-12121. But I don't think 2 has a ticket yet.

pxsalehi

LGTM. To my eyes these all look like straightforward changes to trickle the project ID down everywhere necessary. I don't have strong opinions about the details. Although considering the snapshotting code is convoluted and in a delicate state, I'm gonna defer the final approval to David (or anyone else with more snapshotting experience).

pxsalehi · 2025-06-27T11:29:43Z

server/src/main/java/org/elasticsearch/cluster/metadata/ProjectMetadata.java

+            if (token == null) {
+                token = parser.nextToken();
+            }


What's this all about?

This is needed when the ProjectMetadata is parsed on its own, i.e. not as part of Metadata. In the latter case, the parser.nextToken() method has already been called while parsing the outer structure. In this PR, we need to de/serialize ProjectMetadata on its own. Hence the need for an explicit call to this method. We also use this pattern in other places such as here.
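
The pattern can be illustrated with a tiny pull-style cursor. This is a sketch under assumptions: `TokenCursor` and `ProjectParserSketch` are hypothetical and only mimic the `currentToken()` / `nextToken()` behaviour discussed here, where the current token is null until the parser has been advanced once.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical cursor loosely modelled on a pull parser:
// currentToken() returns null until nextToken() has been called.
final class TokenCursor {
    private final Iterator<String> tokens;
    private String current;

    TokenCursor(List<String> tokens) {
        this.tokens = tokens.iterator();
    }

    String currentToken() {
        return current;
    }

    String nextToken() {
        current = tokens.hasNext() ? tokens.next() : null;
        return current;
    }
}

final class ProjectParserSketch {
    // Works for a fresh cursor (standalone parsing) and for a cursor that an
    // outer parser has already positioned on the first token (nested parsing).
    static String firstToken(TokenCursor parser) {
        String token = parser.currentToken();
        if (token == null) {
            token = parser.nextToken();
        }
        return token;
    }
}
```

Without the null check, the standalone case would start with no current token; with an unconditional `nextToken()`, the nested case would skip the token the outer parser had already consumed.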

pxsalehi · 2025-06-27T11:37:57Z

server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java

            SnapshotInfo snapshotInfo = null;
            try {
-                snapshotInfo = SNAPSHOT_FORMAT.read(metadata.name(), blobContainer(), snapshotId.getUUID(), namedXContentRegistry);
+                snapshotInfo = SNAPSHOT_FORMAT.read(getProjectRepo(), blobContainer(), snapshotId.getUUID(), namedXContentRegistry);


Is getProjectRepo() now always using DEFAULT as the project name for non-MP here?

That is correct.

ywangd · 2025-06-30T02:44:48Z

I don't think 2 has a ticket yet.

I raised ES-12217. Feel free to update it. Thanks!

ywangd · 2025-07-01T00:36:54Z

@DaveCTurner Could you please review this PR since you are the expert in this area? I went through the changes carefully and the test in the linked PR has been quite helpful in catching issues. We could use your blessing for the final approval. Thank you! 🙏

DaveCTurner

This looks ok to me, although as with most changes to snapshot code that means more "no obvious bugs" rather than "obviously no bugs".

I'd rather there was more coverage of the multi-project cases in tests, in particular in SnapshotStressTestsIT and SnapshotResiliencyTests. AIUI that's still something we're working towards, particularly for the internal-cluster tests. I'm going to have to leave it to your judgement to decide whether it's safe to merge this without that test coverage in place.

DaveCTurner · 2025-07-02T12:36:10Z

server/src/main/java/org/elasticsearch/repositories/ProjectRepo.java

+    }
+
+    public static String projectRepoString(ProjectId projectId, String repositoryName) {
+        return "[" + projectId + "][" + repositoryName + "]";


I think this makes some of the user-facing messages a bit harder to understand (especially in a single-project world where everything is now prefixed with a mysterious [default]). Would it make more sense to say something like [repository=${repositoryName}/project=${projectId}]? Not completely sold on that either, but nor do I have a better suggestion right now.

I changed it to be [project/repositoryName]. Using a slash as the separator is at least consistent with what we do for NodeIndicesStats. The difference is that NodeIndicesStats hides the default project when MP is disabled. It makes sense there since it is part of the API response. Our method here is used for logging, so I'd think we don't need this conditional logic. Also, we'd need a projectResolver to tell whether MP is disabled, which does not seem worth the complexity. Using a slash also keeps the same number of bracket pairs, which might be helpful if people have parsing scripts relying on it. In addition, I slightly prefer having the project first and then the repositoryName in the output, since in some places the entire sequence is "project, repository, snapshot", which seems to flow better than having the project in the middle.

Bottom line is that we can still change it in the future when necessary since it is only used for logging.
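
For illustration, the slash-separated format described above could look like the following minimal sketch. It uses a plain String for the project ID instead of the real ProjectId type, and the record name `ProjectRepoSketch` is hypothetical; the actual code in ProjectRepo may differ.

```java
// Hypothetical sketch of the "[project/repositoryName]" log format; not the real ProjectRepo.
record ProjectRepoSketch(String projectId, String name) {

    // Project first, then repository, inside a single pair of brackets.
    static String projectRepoString(String projectId, String repositoryName) {
        return "[" + projectId + "/" + repositoryName + "]";
    }

    @Override
    public String toString() {
        return projectRepoString(projectId, name);
    }
}
```

With this shape, a single-project deployment would log something like `[default/my-repo]`, assuming the default project ID renders as `default`.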

ywangd · 2025-07-03T03:22:35Z

Test coverage is a good point. I do plan to improve it in follow-ups. It is a valid thing to do for all MP changes. Since the snapshot state machine is particularly intricate, I think it is worth tracking separately, so I raised ES-12246.

I think it is safe and beneficial to merge this PR first. My reasons are:

The most important and immediate concern is to ensure nothing breaks in single cluster setup. This is still guaranteed by all existing stress tests.
Multi-project snapshots do not have any production usage yet. Less obvious bugs likely won't have serious impact. We still want to find and fix them. It's an ongoing effort. Any obvious bugs should be caught by the IT tests on the linked serverless PR, which run snapshot operations concurrently across multiple projects.
OBS work will be unblocked once this PR is merged. It would be great to have two teams work in parallel. This is also the reason why restore is not yet MP ready, since it is not needed by OBS.
Making SnapshotStressTestsIT or SnapshotResiliencyTests MP ready requires considerable effort that feels better handled separately. The former requires more general support from the InternalClusterTest infrastructure, which I also plan to work on for supporting ES-12053. The latter may be a bit simpler but still needs changes to DeterministicTaskQueue and/or DisruptableMockTransport to pass the necessary ThreadContext headers between nodes, as well as a few other things. Even when all these are in place, the MP tests will be exercising a single active project with other projects being dormant. This is still valuable and is what most current MP tests do, e.g. the snapshot-related YAML tests. But for snapshots, I think we get more value if all projects are exercised concurrently. See also ES-12246 for some more details.
Restore is not yet MP ready. So it is not really feasible to update SnapshotStressTestsIT or SnapshotResiliencyTests fully at this stage.

In summary, my suggestion is to merge this PR as is. Multi-project snapshots are an active, ongoing project. We will keep improving them in the coming iterations. I hope this makes sense. Thanks!

DaveCTurner

Thanks, yes this seems safe enough for now.

The latter may be a bit simpler but still needs changes to DeterministicTaskQueue and/or DisruptableMockTransport to pass necessary ThreadContext headers

Wow yeah I didn't notice the lack of thread context propagation here. I wonder how we have got away without that for all this time.

This PR makes the restore process project aware and unmutes relevant tests. The latter requires TransportRecoveryAction to be project aware, which is done in this PR as well. Relates: #130000 Resolves: ES-10228

ywangd added 3 commits June 25, 2025 13:52

Make SnapshotsService and related APIs support multi-project

d4c3e22

Fix more issues and write the correct metadata

f3c531f

unmute

7655d7c

elasticsearchmachine added the v9.2.0 label Jun 25, 2025

tweak

f669e03

elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Jun 25, 2025

ywangd added 9 commits June 26, 2025 00:14

fix license header

a7454d6

Merge remote-tracking branch 'origin/main' into ES-10225-snapshots-se…

c7cb973

…rvice-multi-project

fix test and improve logging

d9248b4

fix tests

132401f

Merge remote-tracking branch 'origin/main' into ES-10225-snapshots-se…

e4f7a8d

…rvice-multi-project

fix more tests

0981ad4

Merge remote-tracking branch 'origin/main' into ES-10225-snapshots-se…

1f05ee1

…rvice-multi-project

tweak

6109e9d

tighten assertions

b846483

ywangd added >non-issue :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Jun 26, 2025

ywangd requested review from DaveCTurner and pxsalehi June 26, 2025 04:39

ywangd marked this pull request as ready for review June 26, 2025 04:40

ywangd requested a review from a team as a code owner June 26, 2025 04:40

elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Jun 26, 2025

ywangd commented Jun 26, 2025

ywangd commented Jun 27, 2025

tweak

89b4c00

Merge remote-tracking branch 'origin/main' into ES-10225-snapshots-se…

a04e77e

…rvice-multi-project

pxsalehi reviewed Jun 27, 2025

Merge remote-tracking branch 'origin/main' into ES-10225-snapshots-se…

59a3421

…rvice-multi-project

DaveCTurner reviewed Jul 2, 2025

ywangd added 2 commits July 3, 2025 12:30

use slash to separate project and repo name

e1e6ef2

Merge remote-tracking branch 'origin/main' into ES-10225-snapshots-se…

bf32f33

…rvice-multi-project

ywangd requested a review from DaveCTurner July 3, 2025 03:22

DaveCTurner approved these changes Jul 3, 2025

ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jul 3, 2025

elasticsearchmachine merged commit f15ef7c into elastic:main Jul 3, 2025
32 checks passed

ywangd deleted the ES-10225-snapshots-service-multi-project branch July 3, 2025 08:49

mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jul 3, 2025

Snapshots support multi-project (elastic#130000)

e75e94a

This PR makes snapshot service code and APIs multi-project compatible. Resolves: ES-10225 Resolves: ES-10226

pxsalehi mentioned this pull request Jul 3, 2025

[CI] BlobStoreCorruptionIT testCorruptionDetection failing #130536

Closed

ywangd mentioned this pull request Jul 22, 2025

Make restore support multi-project #131661

Merged
