Add design for the OADP Virtual Machine Data Protection (VMDP) #1845


Status: Open. Wants to merge 2 commits into base: oadp-dev

@mpryc (Contributor) commented Jul 29, 2025

New design to allow user-based backup and restore through the Kopia server deployed as part of the OADP operator.

This design replaces the following two designs, which can be closed once this one is agreed on:
#1827
#1830

Why the changes were made

To enable backup and restore operations via a proxy service managed by the OADP Operator, improving flexibility and management of backup workflows.

How to test the changes made

Read the design.

AI assisted: Cursor (default model).

New design to allow user based backup and restore through
the kopia server deployed as part of the OADP operator.

Signed-off-by: Michal Pryc <[email protected]>
Assisted-by: Cursor
@openshift-ci bot added the "approved" label (indicates a PR has been approved by an approver from all required OWNERS files) on Jul 29, 2025

### Out of Scope

The following are explicitly out of scope for this project phase:

Contributor

Is there anything we can do to cap or limit the space used in S3? If not, let's mark it out of scope. Customers will want this, though.

Contributor Author

Will add to the design as out of scope:

No native mechanisms for enforcing storage quotas or caps at the repository level. This functionality may be addressed in future phases through cloud provider integration or Kopia enhancements.

- **Full VM Protection**: This design does not modify or replace OADP’s snapshot-based backup and restore for entire VMs or persistent volumes.
- **Application Quiescing**: The system will not implement application-level consistency mechanisms (e.g., pausing databases). Ensuring data consistency remains the user's responsibility.
- **Block-Level Operations**: Backup or restore of raw block devices, unmounted partitions, or disk images is not supported.
- **Graphical User Interface (UI)**: No integration with the OpenShift Console or other web-based user interfaces is included.
Contributor

I think the only reason we're avoiding a UI is the CVE impact, is that correct? I just want to make sure we're thinking about it in the same way.

Contributor Author

CVE impact is one factor, but not the only reason for limiting UI use.

  • A web UI may create repository management issues across users. The UI operates with repository-level access, not individual user-level access, so one user could disconnect or misconfigure a repository and affect all other users. Preventing that would require extra logic to disallow reconfiguration, which means more development and testing effort. We may plan for it after this initial tech preview.
  • Web UIs increase the attack surface and require ongoing security maintenance.
  • A CLI-first approach better supports self-service and automation, aligning with VM-level operations.

- **Client Lifecycle Management**: Automated deployment, installation, or upgrading of the in-guest client software is out of scope.
- **External VM Access**: The design does not introduce new mechanisms for cluster-initiated access to guest VMs (e.g., SSH injection, `virt`-based access).
- **New Storage Backends**: This effort does not introduce support for new storage backend types. Only those already supported by OADP will be used.
- **Advanced Observability**: Detailed metrics, dashboards, or alerting features for file-level backup operations are not addressed in this phase.
Contributor

@shubham-pampattiwar sorry, but with my PM hat on: I wonder if the DPT could be used to report on the size of S3 buckets, as a little poor man's S3 reporting. Just a thought, not a todo.

Contributor Author

Interesting. We could have some data, but I think it would belong in two places.

IMO this is outside the scope of this design, but it could be a nice enhancement to the overall status of the unified repo.


## Goals
Contributor

+1 this is nice


- A **Velero Backup Repository (BR)** stores its metadata (e.g., snapshots, indexes) directly within the backup storage backend (e.g., an S3 bucket). This means the repository metadata is part of the stored content and can be automatically synchronized or recovered when a new `BackupStorageLocation` (BSL) points to the same storage location.

- In contrast, a **BackupStorageLocationRepository (BSLR)** is an OpenShift-managed resource that represents a Kopia repository with metadata managed primarily inside OpenShift (via the CRD). The BSLR’s metadata is **not stored in the backup storage backend** (e.g., S3). As a result, repository metadata for BSLRs is **not automatically synced or restored** when creating or switching to a new BSL pointing to the same storage location.
Contributor

Hmm, I hadn't thought of the BSLR not being protected by the sync controller. Would we be leaking creds or exposing anything if (waves magic wand) BSLRs were added to the sync controller's list of objects?

3. **Custom Resource Definitions (CRDs) and Controller**
This component extends the existing OADP operator by adding a new controller and introducing two new CRDs:

- **BackupStorageLocationServer (BSLS)** — represents the deployment and configuration of a Kopia Backup Service instance within the OpenShift cluster. The BSLS points either directly to a Velero `BackupRepository` or to a new CRD `BackupStorageLocationRepository`.
Contributor

The BSLS config is also something that wouldn't be synced by the sync controller, right? I'm also interested in hearing thoughts about that.

@weshayutin (Contributor) left a comment

Looks great. A couple of questions about:

  • the sync controller and the new CRDs: does it make sense, and is it possible, to sync these objects?
  • maintenance jobs for BSLRs, which I would like to understand better. Are they the same as any other maintenance job, and does the OADP DPA config handle them?

@shawn-hurley (Contributor) left a comment

IIUC, any user who is able to exec into the VM would be able to back up and restore files through the BSLS by running the kopia commands.

Is that a fair assumption?

Have we considered using OAuth tokens to do the authn/authz with Kubernetes RBAC?

I am imagining something like: the user execs into the VM -> the user calls the kopia command and passes in a credential or authenticates with the cluster OAuth server -> the BSLS uses that credential to create a SubjectAccessReview -> we design some RBAC rules to verify that the authenticated user can perform the action.
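
To make the SubjectAccessReview step of that flow concrete, here is a minimal sketch of the request body the BSLS could send to the Kubernetes API, and of how it could read the answer. It is only an illustration under assumptions: the resource name `backupstoragelocationservers`, the `create` verb, the `openshift-adp` namespace, and the user and object names are hypothetical, and the user and groups are assumed to have been resolved from the token beforehand (for example through a TokenReview). Only the `authorization.k8s.io/v1` SubjectAccessReview shape itself is standard Kubernetes.

```python
import json
from typing import Iterable


def build_subject_access_review(
    user: str,
    groups: Iterable[str],
    namespace: str,
    verb: str,
    resource: str,
    name: str,
    api_group: str = "oadp.openshift.io",
) -> dict:
    """Build the body of an authorization.k8s.io/v1 SubjectAccessReview.

    The caller is expected to POST this body to the API server; this
    function performs no network access.
    """
    return {
        "apiVersion": "authorization.k8s.io/v1",
        "kind": "SubjectAccessReview",
        "spec": {
            "user": user,
            "groups": list(groups),
            "resourceAttributes": {
                "group": api_group,
                "resource": resource,  # hypothetical plural name of the CRD
                "name": name,
                "namespace": namespace,
                "verb": verb,  # hypothetical mapping of "backup" to an RBAC verb
            },
        },
    }


def is_allowed(review: dict) -> bool:
    """Return True only if the API server answered allowed and not denied."""
    status = review.get("status", {})
    return bool(status.get("allowed", False)) and not status.get("denied", False)


if __name__ == "__main__":
    body = build_subject_access_review(
        user="alice",
        groups=["system:authenticated"],
        namespace="openshift-adp",
        verb="create",
        resource="backupstoragelocationservers",
        name="vm-backups",
    )
    print(json.dumps(body, indent=2))
```

With this shape, the RBAC rules mentioned above would be ordinary Roles and RoleBindings on the chosen resource and verb, so no custom policy engine would be needed.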

openshift-ci bot commented Jul 31, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mpryc


@weshayutin (Contributor) commented

@shawn-hurley @shubham-pampattiwar any comments?

@shawn-hurley (Contributor) commented

Any comments I had have been addressed offline.

Overall, this is an excellent idea for Tech Preview, to let people kick the tires.

I am also thinking about a world where this is out of Tech Preview and BSLR might be an interesting concept to add to non-admin(?), but that is a wild thought I just had.

@weshayutin (Contributor) commented

@shawn-hurley thanks! I agree the BSLR concept is going to be a valuable tool imho, excited to see how we can utilize that.

@shubham-pampattiwar (Member) left a comment

Awesome work @mpryc ! Solid architecture and impressive design, this feature is gonna be GOLD. Added a couple of comments here and there. Thank you !!


- **BackupStorageLocationRepository (BSLR)** — represents the Kopia repository backing the file-level backups for a given Backup Storage Location. The BSLR points back to a `BackupStorageLocation` (BSL). The BSLR concept is similar to a Kopia BackupRepository but is managed by OADP.

This design separates the management of backup repositories from the storage locations themselves, making it easy to map Kopia repositories to OpenShift BackupStorageLocations in a flexible way. Administrators can control exactly how repositories are created, assigned, and managed for each user or VM. By building on top of the existing OADP operator and using familiar resources like `BackupStorageLocation` (BSL), the system fits naturally into OpenShift’s data protection workflows. This approach makes it straightforward to perform detailed, file-level backup and restore operations, while keeping the process consistent and easy to understand.
Member

s/OpenShift BackupStorageLocations/OADP-Velero BackupStorageLocations/


The solution consists of the following core components:

1. **In-Guest Backup Client**
Member

I notice client installation is currently out of scope. Would it make sense to include a brief installation framework in this design, or is this planned as a separate follow-up? Since the installation method can influence architectural decisions, considering both together might help avoid future design constraints.

1. **In-Guest Backup Client**
A lightweight, statically linked Go-based CLI tool, built from the standard Kopia CLI source code and fully compatible with Kopia APIs. It is designed to run entirely in user space (without requiring elevated privileges) and can be easily downloaded and executed—even outside standard installation paths. This client allows users to perform file-level backup and restore operations from within the guest VM, authenticating with the Kopia server using user-specific credentials.

2. **Backup Service (Kopia Server)**
Member

  1. OK, so this backup server acts as a proxy between the VM and the object storage, right?
  2. Also, I see that there are no constraints on the number of backup servers that can be deployed. IMO we should keep it simple (one per cluster, one per namespace, or one per VM); some kind of policy would help with monitoring and managing workflows.
  3. Where will the backup server be deployed? On which node and in which namespace?


Member

Hmm, so we are introducing the BSLR to overcome the limitation that a BR is namespace-scoped while a single namespace can contain multiple VMs? That would give users (VM admins/users) more fine-grained control.

Member

Have we explored the possibility of overcoming the limitations of the Velero BR? I know that would be an upstream task, but I wanted to know your thoughts. I am thinking along the lines of leveraging the Velero BR directly to enable file-level backup/restore on Velero backups.

```
vmUser --> kopiaService
```

### Backup and Restore Workflow Summary
Member

I think the design should clarify the duties of VM admins/users versus cluster admins:

  • Who manages what in the VMDP lifecycle
  • Permission boundaries between different admin roles
  • Operational responsibilities for troubleshooting
  • Self-service capabilities vs admin-required tasks

IMHO, adding a simple step-by-step workflow, from installation through end-to-end use cases, that clearly identifies the actor (VM user vs cluster admin) at each step would go a long way for this design doc 😄


#### Prerequisites

- Internal OpenShift networking must allow VMs to connect to the Backup Service deployed in the OADP namespace.
Member

Should we include preflight connectivity checks to validate VM-to-backup-server communication before backup operations? Consider:

  • Network connectivity tests (ping/telnet to backup server)
  • Service availability validation
  • Authentication pre-verification
  • Performance/bandwidth checks
  • DNS resolution validation

This would help troubleshoot deployment issues and give a better user experience when backup operations fail due to connectivity problems.

Hmm... This could be implemented as a vmdp test-connection command or as automated health checks.
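
As a rough illustration of what such a check could do, here is a minimal sketch covering only the DNS resolution and TCP reachability items from the list above. It is not an existing `vmdp` subcommand: the function names, the service host name, and the port are assumptions for illustration (51515 is Kopia's default server port, but the design may expose a different one).

```python
import socket


def check_dns(host: str) -> bool:
    """Return True if the host name (or IP literal) resolves."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(host: str, port: int) -> dict:
    """Run the checks in order; TCP is skipped when DNS already failed."""
    results = {"dns": check_dns(host)}
    results["tcp"] = results["dns"] and check_tcp(host, port)
    return results


if __name__ == "__main__":
    # Hypothetical in-cluster service name for the Backup Service.
    print(preflight("kopia-server.openshift-adp.svc", 51515))
```

Authentication pre-verification and bandwidth checks would need the Kopia client itself, so they are left out of this sketch.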

```yaml
# Example BSLS CRD specification
apiVersion: oadp.openshift.io/v1alpha1
kind: BackupStorageLocationServer
```
Member

What container image will be used for this backup server? The official Kopia image or a custom OADP-extended image?
I think we should also add an image override via the DPA for this server.


The **Virtual Machine Data Protection (VMDP)** feature introduces an OpenShift-native, client-server architecture that enables file-level backup and restore operations initiated from within OpenShift Virtual Machines. The system integrates seamlessly with OpenShift APIs and the existing OADP infrastructure, while preserving a clear separation of responsibilities between cluster administrators and VM users.

### Architecture Overview
Member

How do administrators enable VMDP? Should this integrate with DPA configuration or be standalone CRD management? Please clarify this aspect in the design.

```yaml
tlsFingerprint: sha256:abcd1234ef5678...
conditions: []
```
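
For readers wondering how a value like `tlsFingerprint` could be derived, here is a small sketch that computes a SHA-256 fingerprint over the DER encoding of a PEM certificate. The `sha256:` prefix mirrors the example status above; whether the controller computes the field exactly this way is an assumption, and the function names are hypothetical.

```python
import hashlib
import ssl


def der_fingerprint(der_bytes: bytes) -> str:
    """Return the hex SHA-256 digest of DER-encoded certificate bytes."""
    return hashlib.sha256(der_bytes).hexdigest()


def pem_fingerprint(pem_cert: str) -> str:
    """Convert a PEM certificate to DER and format it like the status field."""
    der_bytes = ssl.PEM_cert_to_DER_cert(pem_cert)
    return "sha256:" + der_fingerprint(der_bytes)
```

A VM user could compare this value with the fingerprint reported by the client when it connects, to confirm it is talking to the expected Backup Service.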

@shubham-pampattiwar (Member) commented Aug 7, 2025

Is this auth flow correct?
VM User -> [username/password] -> Backup Server -> [repo password] -> Repository

openshift-ci bot commented Aug 13, 2025

@mpryc: all tests passed!

