- 
                Notifications
    You must be signed in to change notification settings 
- Fork 25.6k
Snapshots support multi-project #130000
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snapshots support multi-project #130000
Conversation
…rvice-multi-project
…rvice-multi-project
…rvice-multi-project
| Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) | 
| Most of the changes are for  | 
| * @param projectId The project that the repository belongs to | ||
| * @param name Name of the repository | ||
| */ | ||
| public record ProjectRepo(ProjectId projectId, String name) implements Writeable { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an existing class extracted from RepositoryOperation.
| final var projectMetadata = clusterMetadata.getProject(getProjectId()); | ||
| executor.execute(ActionRunnable.run(allMetaListeners.acquire(), () -> { | ||
| if (finalizeSnapshotContext.serializeProjectMetadata()) { | ||
| PROJECT_METADATA_FORMAT.write(projectMetadata, blobContainer(), snapshotId.getUUID(), compress); | ||
| } else { | ||
| GLOBAL_METADATA_FORMAT.write(clusterMetadata, blobContainer(), snapshotId.getUUID(), compress); | ||
| } | ||
| })); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where we conditionally write ProjectMetadata for multi-project snapshots. The Metadata in this case is a thin wrapper around ProjectMetadata to reuse existing finalization related classes.
| private void startExecutableClones(SnapshotsInProgress snapshotsInProgress) { | ||
| for (List<SnapshotsInProgress.Entry> entries : snapshotsInProgress.entriesByRepo()) { | ||
| startExecutableClones(entries); | ||
| } | ||
| } | ||
|  | ||
| /** | ||
| * Maybe kick off new shard clone operations for all repositories of the specified project | ||
| */ | ||
| private void startExecutableClones(SnapshotsInProgress snapshotsInProgress, ProjectId projectId) { | ||
| for (List<SnapshotsInProgress.Entry> entries : snapshotsInProgress.entriesByRepo(projectId)) { | ||
| startExecutableClones(entries); | ||
| } | ||
| } | ||
|  | ||
| /** | ||
| * Maybe kick off new shard clone operations for the single specified project repository | ||
| */ | ||
| private void startExecutableClones(SnapshotsInProgress snapshotsInProgress, ProjectRepo projectRepo) { | ||
| startExecutableClones(snapshotsInProgress.forRepo(Objects.requireNonNull(projectRepo))); | ||
| } | ||
|  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Snapshotting is state machine that triggers next operation when the current operation finishes. In most cases, the triggering is confined in the same repository. This is the simplest case and gets migrated as is. The other case is triggering across all repositories. In a MP setup, this could mean either across all repositories of all projects or across all repositories of a single project. This is the reason for the 3 variants of the same named method here. The principles that I have applied are:
- If the scope was a single repository, keep it as is.
- If the scope was all repositories and reacting to cluster state changes, i.e. applyClusterState, it applies to all repositories across all projects.
- If the scope was all repository and happening after completing a particular snapshot operation, e.g. deleting a snapshot entry, it applies to all repositories of a single project that the operation is associated with.
| private static Tuple<ClusterState, List<SnapshotDeletionsInProgress.Entry>> readyDeletions( | ||
| ClusterState currentState, | ||
| @Nullable ProjectId projectId | ||
| ) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another example of different scopes for triggering next operation. In this case, it does not have the single repository scope, but either cluster wide or single project (when projectId == null).
| Depending on how we handle project soft-deletion and/or clean-up, snapshots may see a project getting concurrently deleted and thereore fail. This PR does not attempt to handle it more gracefully since it would become noise and get burried in the large amount of namespacing changes. I can raise a separate ticket to track this work. | 
…rvice-multi-project
| 
 yeah, we'd need a new ticket for this under the soft-deletion epic. We briefly mentioned in the design doc that once the project is marked for deletion, we should 1) prevent any new snapshots being scheduled/requested. This partially goes back to making those internal actions aware of checking for the deletion project block. 2) any ongoing snapshotting should be cancelled for that project (I guess not that simple but somehow at least fail graciously and not blow up). For 1, we have ES-12121. But I don't think 2 has a ticket yet. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. To my eyes these all look like straight-forward changes to trickle down project ID everywhere necessary. I don't have a strong opinions about the details. Although considering snapshotting code is convoluted and in a delicate state, I'm gonna defer the final approval to David (or anyone else with more snapshotting experience).
| if (token == null) { | ||
| token = parser.nextToken(); | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's this all about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed when the ProjectMetadata is parsed on its own, i.e. not as part of Metadata. In the later case, the parser.nextToken() method is already called when parsing the outer structure. In this PR, we need to de/serialize ProjectMetadata on its own. Hence the need for active call to this method. We also use this pattern in other places such as here.
| SnapshotInfo snapshotInfo = null; | ||
| try { | ||
| snapshotInfo = SNAPSHOT_FORMAT.read(metadata.name(), blobContainer(), snapshotId.getUUID(), namedXContentRegistry); | ||
| snapshotInfo = SNAPSHOT_FORMAT.read(getProjectRepo(), blobContainer(), snapshotId.getUUID(), namedXContentRegistry); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is getProjectRepo() now for non-MP always uding DEFAULT as project name here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct.
| 
 I raised ES-12217. Feel free to update it. Thanks! | 
…rvice-multi-project
| @DaveCTurner Could you please review this PR since you are the expert in this area? I went through the changes carefully and the test in the linked PR has been quite helpful in catching issues. We could use your blessing for the final approval. Thank you! 🙏 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks ok to me, although as with most changes to snapshot code that means more "no obvious bugs" rather than "obviously no bugs".
I'd rather there was more coverage of the multi-project cases in tests, in particular in SnapshotStressTestsIT and SnapshotResiliencyTests. AIUI that's still something we're working towards, particularly for the internal-cluster tests. I'm going to have to leave it to your judgement to decide whether it's safe to merge this without that test coverage in place.
| } | ||
|  | ||
| public static String projectRepoString(ProjectId projectId, String repositoryName) { | ||
| return "[" + projectId + "][" + repositoryName + "]"; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes some of the user-facing messages a bit harder to understand (especially in a single-project world where everything is now prefixed with a mysterious [default]). Would it make more sense to say something like [repository=${repositoryName}/project=${projectId}]? Not completely sold on that either, but nor do I have a better suggestion right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to be [project/repositoryName]. Using slash as a separator is at least consistent with what we do for NodeIndicesStats. The difference is that NodeIndicesStats hides the default project when MP is disabled. It makes sense there since it is part of the API response. Our method here is used for loggings. So I'd think we don't need this conditional logic. Also we'd need a projectResolver to tell whether MP is disabled which seems not worth the complexity. Using a slash also maintains the same number of bracket pairs which might be helpful if people has some parsing script relying on it. In addition, I slightly prefer having project first then repositoryName in the output since in some places the entire sequence is "project, repository, snapshot" which seems to flow better than having the project in the middle.
Bottom line is that we can still change it in future when necessary since it's for logging usages.
| Test coverage is a good point. I do plan to improve them in follow-ups. It is a valid thing to do for all MP changes. Since snapshot state machine is particularly intricate, I think it worths to track separately and rasied ES-12246. I think it is safe and beneficial to merge this PR first. My reasonings are: 
 In summary, my suggestion is to merge this PR as is. Multi-project snapshots is an active ongoing project. We will keep improving it in the coming iterations. I hope this makes sense. Thanks! | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, yes this seems safe enough for now.
The later may be a bit simpler but still needs changes to DeterministicTaskQueue and/or DisruptableMockTransport to pass necessary ThreadContext headers
Wow yeah I didn't notice the lack of thread context propagation here. I wonder how we have got away without that for all this time.
This PR makes snapshot service code and APIs multi-project compatible. Resolves: ES-10225 Resolves: ES-10226
This PR makes the restore process project aware and unmute relevant tests. The later requires TransportRecoveryAction to be project aware which is done in this PR as well. Relates: #130000 Resolves: ES-10228
This PR makes snapshot service code and APIs multi-project compatible.
Resolves: ES-10225
Resolves: ES-10226