Integrate native KubeVirt backup API (backup.kubevirt.io/v1alpha1) #412

Open
aagarwal-apexanalytix wants to merge 3 commits into kubevirt:main from aagarwal-apexanalytix:feat/native-backup-integration

@aagarwal-apexanalytix

What this PR does / why we need it:

Adds optional support for the native KubeVirt backup API (backup.kubevirt.io/v1alpha1) alongside the existing CSI snapshot path. When enabled via Velero Backup labels, the plugin uses VirtualMachineBackup CRs for CBT-based, QEMU-aware backup with incremental support via VirtualMachineBackupTracker checkpoints.

Key capabilities:

  • Async native backup via Velero BackupItemAction v2 (ExecuteProgressCancel)
  • Incremental backup via tracker checkpoints with configurable forceFullEveryN
  • Guest agent auto-detection: skips quiesce when the QEMU guest agent is not connected
  • Scratch PVC pattern: native backup writes to a temporary PVC and Velero snapshots it, preventing a CSI double-snapshot on source PVCs
  • Graceful CSI fallback: falls back to the existing CSI path if the CRDs are not installed, the VM is not running, or any native operation fails
  • Cleanup on delete: DeleteItemAction removes VirtualMachineBackup CRs and scratch PVCs when the Velero backup is deleted
  • Atomic grouping: ItemBlockAction ensures the VM and related resources are backed up together

All behavior is opt-in via labels; existing CSI-only workflows are unaffected.
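
To illustrate the fallback behavior listed above, here is a minimal sketch of the decision, not the code in this PR. The names `BackupPath`, `vmState`, `choosePath`, and `errNativeFailed` are hypothetical, and `startNative` stands in for whatever creates the VirtualMachineBackup CR.

```go
package main

import (
	"errors"
	"fmt"
)

// BackupPath names the mechanism chosen for a VM (hypothetical type).
type BackupPath string

const (
	PathNative BackupPath = "native"
	PathCSI    BackupPath = "csi"
)

// vmState carries the inputs to the decision. All fields are illustrative.
type vmState struct {
	NativeRequested bool // native-backup label present on the Velero Backup
	CRDsInstalled   bool // backup.kubevirt.io CRDs were discovered
	Running         bool // the VM is currently running
}

// errNativeFailed is a stand-in error used by the demo below.
var errNativeFailed = errors.New("native backup failed")

// choosePath returns the path to use and, when it falls back to CSI,
// the reason for doing so. Any failure of startNative triggers fallback.
func choosePath(s vmState, startNative func() error) (BackupPath, string) {
	switch {
	case !s.NativeRequested:
		return PathCSI, "native backup not requested"
	case !s.CRDsInstalled:
		return PathCSI, "backup CRDs not installed"
	case !s.Running:
		return PathCSI, "VM not running"
	}
	if err := startNative(); err != nil {
		return PathCSI, "native operation failed: " + err.Error()
	}
	return PathNative, ""
}

func main() {
	ready := vmState{NativeRequested: true, CRDsInstalled: true, Running: true}
	stopped := vmState{NativeRequested: true, CRDsInstalled: true}

	path, reason := choosePath(ready, func() error { return nil })
	fmt.Println(path, reason)
	path, reason = choosePath(stopped, func() error { return nil })
	fmt.Println(path, reason)
	path, reason = choosePath(ready, func() error { return errNativeFailed })
	fmt.Println(path, reason)
}
```

Returning the reason alongside the path keeps the fallback visible in logs instead of silent.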

Which issue(s) this PR fixes:

Fixes #411

Special notes for your reviewer:

The implementation spans three commits for reviewability:

  1. feat: integrate native KubeVirt backup API — Core implementation: new pkg/util/nativebackup/ package (8 files), v2 upgrades to VM backup/restore actions, new DeleteItemAction and ItemBlockAction, RBAC, unit tests
  2. fix: annotation loss, nil panic, and silent cleanup errors — Bug fixes found during review: annotations set on VM struct instead of unstructured item, nil guard in volumeInDVTemplates, cleanup error logging
  3. improve: API timeouts, incremental counter, error propagation, agent check — Hardening: 30s context timeout on all K8s API calls, annotation-based incremental counter (replaces stub), ParseOperationID returns error, guest agent check returns error for caller differentiation
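
For the 30s timeout mentioned in commit 3, a minimal sketch of the pattern might look like the following. The `apiContext` name comes from the commit message, but its signature here is an assumption, and `slowCall` is a hypothetical stand-in for a Kubernetes API call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// apiTimeout mirrors the 30s bound described in the commit message.
const apiTimeout = 30 * time.Second

// apiContext returns a context that expires after apiTimeout.
// The signature is an assumption; callers must call cancel.
func apiContext() (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.Background(), apiTimeout)
}

// slowCall simulates an API call that honors context cancellation.
func slowCall(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := apiContext()
	defer cancel()
	_, hasDeadline := ctx.Deadline()
	fmt.Println("has deadline:", hasDeadline)

	// A short timeout shows the failure mode without waiting 30s.
	short, cancelShort := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancelShort()
	err := slowCall(short, time.Second)
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
}
```

Unlike `context.TODO()`, the returned context carries a deadline, so a hung API server surfaces as `context.DeadlineExceeded` rather than an indefinite block.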

Configuration is via labels on the Velero Backup object:

| Label | Purpose |
| --- | --- |
| `velero.kubevirt.io/native-backup` | Enable native backup for this Backup |
| `velero.kubevirt.io/incremental-backup` | Enable incremental via tracker |
| `velero.kubevirt.io/skip-quiesce` | Force skip filesystem quiesce |
| `velero.kubevirt.io/scratch-storage-class` | Override scratch PVC storage class |
| `velero.kubevirt.io/force-full-every-n` | Force full backup every N incrementals |
| `velero.kubevirt.io/native-backup-concurrency` | Max concurrent native backups |

Cluster-wide defaults via ConfigMap kubevirt-velero-plugin-config in the velero namespace.
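
As a rough sketch of how the labels could overlay cluster-wide defaults, consider the following. The `Config` struct, the `fromLabels` function, and the default values used in `main` are assumptions for illustration; only the label keys come from this PR.

```go
package main

import (
	"fmt"
	"strconv"
)

const labelPrefix = "velero.kubevirt.io/"

// Config holds per-backup settings. Field names are illustrative.
type Config struct {
	NativeBackup        bool
	Incremental         bool
	SkipQuiesce         bool
	ScratchStorageClass string
	ForceFullEveryN     int
	Concurrency         int
}

// fromLabels overlays Backup labels on top of defaults (for example,
// values loaded from the ConfigMap). Malformed values are reported
// rather than silently ignored.
func fromLabels(defaults Config, labels map[string]string) (Config, error) {
	cfg := defaults
	boolFields := map[string]*bool{
		"native-backup":      &cfg.NativeBackup,
		"incremental-backup": &cfg.Incremental,
		"skip-quiesce":       &cfg.SkipQuiesce,
	}
	for key, dst := range boolFields {
		if v, ok := labels[labelPrefix+key]; ok {
			b, err := strconv.ParseBool(v)
			if err != nil {
				return defaults, fmt.Errorf("label %s%s: %w", labelPrefix, key, err)
			}
			*dst = b
		}
	}
	intFields := map[string]*int{
		"force-full-every-n":        &cfg.ForceFullEveryN,
		"native-backup-concurrency": &cfg.Concurrency,
	}
	for key, dst := range intFields {
		if v, ok := labels[labelPrefix+key]; ok {
			n, err := strconv.Atoi(v)
			if err != nil || n < 0 {
				return defaults, fmt.Errorf("label %s%s: invalid value %q", labelPrefix, key, v)
			}
			*dst = n
		}
	}
	if v, ok := labels[labelPrefix+"scratch-storage-class"]; ok {
		cfg.ScratchStorageClass = v
	}
	return cfg, nil
}

func main() {
	defaults := Config{Concurrency: 2} // hypothetical ConfigMap defaults
	cfg, err := fromLabels(defaults, map[string]string{
		labelPrefix + "native-backup":      "true",
		labelPrefix + "force-full-every-n": "7",
	})
	fmt.Printf("%+v %v\n", cfg, err)
}
```

Labels that are absent leave the corresponding default untouched, which matches the opt-in behavior described above.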

Verified against KubeVirt v1.8.1 VirtualMachineBackup and VirtualMachineBackupTracker CRD schemas from a live cluster.

Release note:

Add optional native KubeVirt backup API integration (backup.kubevirt.io/v1alpha1) for CBT-based incremental backup with QEMU guest agent quiesce support. Enable via `velero.kubevirt.io/native-backup` label on Velero Backup objects. Falls back to CSI snapshots when native API is unavailable or fails.

Add support for the native KubeVirt backup API alongside the existing
CSI snapshot path. When enabled via Velero Backup labels, the plugin
uses VirtualMachineBackup CRs for CBT-based backup with QEMU guest
agent quiescing instead of CSI volume snapshots.

Key changes:

Phase 1a - v2 migration:
- Upgrade VMBackupItemAction to BackupItemAction v2 (async Progress/Cancel)
- Upgrade VMRestorePlugin to RestoreItemAction v2 (AreAdditionalItemsReady)
- VM restore now waits for PVCs to be bound before creating the VM

Phase 1b - Push mode full backup:
- New nativebackup package: config, detect, backup, scratch, progress,
  tracker, agent, volumes
- CRD feature detection with cached discovery
- Scratch PVC provisioning sized to VM disk capacity
- VirtualMachineBackup CR lifecycle (create, progress, cancel, cleanup)
- Graceful CSI fallback on any failure (stopped VM, missing CRD, errors)
- CSI double-snapshot prevention for native-backed PVCs
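
A sketch of what "CRD feature detection with cached discovery" could look like, under stated assumptions: the `crdDetector` type, the `probe` callback (standing in for a real discovery call), and the five-minute TTL are all hypothetical and not taken from this PR.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// defaultTTL is a hypothetical cache lifetime.
const defaultTTL = 5 * time.Minute

// crdDetector caches the result of a CRD lookup for ttl.
type crdDetector struct {
	mu      sync.Mutex
	probe   func() (bool, error) // stand-in for an API discovery call
	ttl     time.Duration
	checked time.Time
	present bool
	valid   bool
}

func newCRDDetector(probe func() (bool, error), ttl time.Duration) *crdDetector {
	return &crdDetector{probe: probe, ttl: ttl}
}

// Available reports whether the CRDs are installed, calling probe only
// when no fresh cached answer exists. Probe errors are not cached.
func (d *crdDetector) Available() (bool, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.valid && time.Since(d.checked) < d.ttl {
		return d.present, nil
	}
	present, err := d.probe()
	if err != nil {
		return false, err
	}
	d.present, d.checked, d.valid = present, time.Now(), true
	return present, nil
}

func main() {
	calls := 0
	d := newCRDDetector(func() (bool, error) { calls++; return true, nil }, defaultTTL)
	for i := 0; i < 3; i++ {
		ok, err := d.Available()
		fmt.Println(ok, err)
	}
	fmt.Println("probe calls:", calls)
}
```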

Phase 1c - Operational polish:
- Idempotent Execute() for Velero retries (AlreadyExists handling)
- Finalize phase guard (no async ops in Finalize)
- QEMU guest agent detection with auto skipQuiesce
- Backup metadata annotations (type, checkpoint, volumes)
- Scratch PVC TTL + garbage collection
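
The guest agent detection with automatic skipQuiesce might be sketched as below. The function names `isGuestAgentConnected` and `ShouldSkipQuiesce` appear in this PR, but the signatures, the `condition` type, and the `fetch` callback are assumptions. Skipping quiesce when the API call fails is also an assumption made for this sketch. `AgentConnected` is the KubeVirt VMI condition type for a connected guest agent.

```go
package main

import (
	"errors"
	"fmt"
)

// condition is a pared-down stand-in for a VMI status condition.
type condition struct {
	Type   string
	Status string
}

// errAPI is a stand-in error used by the demo below.
var errAPI = errors.New("API call failed")

// isGuestAgentConnected reports whether the AgentConnected condition is
// True. A failed fetch is returned as an error, not as "not connected".
func isGuestAgentConnected(fetch func() ([]condition, error)) (bool, error) {
	conds, err := fetch()
	if err != nil {
		return false, fmt.Errorf("fetching VMI conditions: %w", err)
	}
	for _, c := range conds {
		if c.Type == "AgentConnected" {
			return c.Status == "True", nil
		}
	}
	return false, nil
}

// ShouldSkipQuiesce returns the decision plus a reason suitable for logging.
func ShouldSkipQuiesce(forceSkip bool, fetch func() ([]condition, error)) (bool, string) {
	if forceSkip {
		return true, "skip-quiesce label set"
	}
	connected, err := isGuestAgentConnected(fetch)
	switch {
	case err != nil:
		return true, "agent check failed: " + err.Error()
	case !connected:
		return true, "guest agent not connected"
	default:
		return false, ""
	}
}

func main() {
	connected := func() ([]condition, error) {
		return []condition{{Type: "AgentConnected", Status: "True"}}, nil
	}
	failing := func() ([]condition, error) { return nil, errAPI }

	fmt.Println(ShouldSkipQuiesce(false, connected))
	fmt.Println(ShouldSkipQuiesce(false, failing))
}
```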

Phase 2 - Incremental backup:
- VirtualMachineBackupTracker lifecycle (create once, reuse)
- Checkpoint health check (redefinition after VM restart)
- Source resolution: VM for full, Tracker for incremental
- forceFullEveryN periodic full backup support
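
One way the forceFullEveryN logic could work with an annotation-based counter is sketched below. The helper names `getIncrementalCount` and `updateIncrementalCount` appear in this PR, but the annotation key, the signatures, and `nextBackup` are hypothetical. The sketch covers only the counter and ignores the case where no checkpoint exists yet.

```go
package main

import (
	"fmt"
	"strconv"
)

// countAnnotation is a hypothetical key; the real key is not stated here.
const countAnnotation = "velero.kubevirt.io/incremental-count"

// getIncrementalCount reads the counter, treating a missing or
// malformed value as zero.
func getIncrementalCount(annotations map[string]string) int {
	n, err := strconv.Atoi(annotations[countAnnotation])
	if err != nil || n < 0 {
		return 0
	}
	return n
}

// updateIncrementalCount writes the counter and returns the map,
// allocating one if the tracker had no annotations.
func updateIncrementalCount(annotations map[string]string, n int) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[countAnnotation] = strconv.Itoa(n)
	return annotations
}

// nextBackup decides whether the next backup must be full. After
// forceFullEveryN incrementals it forces a full backup and resets the
// counter; forceFullEveryN <= 0 disables the periodic full backup.
func nextBackup(annotations map[string]string, forceFullEveryN int) (bool, map[string]string) {
	count := getIncrementalCount(annotations)
	if forceFullEveryN > 0 && count >= forceFullEveryN {
		return true, updateIncrementalCount(annotations, 0)
	}
	return false, updateIncrementalCount(annotations, count+1)
}

func main() {
	var annotations map[string]string
	for i := 0; i < 6; i++ {
		var full bool
		full, annotations = nextBackup(annotations, 2)
		fmt.Println("run", i, "full:", full, "count:", getIncrementalCount(annotations))
	}
}
```

With N set to 2, the demo alternates two incrementals and then one full backup.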

Phase 3 - Atomicity + cleanup:
- VMDeleteItemAction: cleanup native artifacts on backup deletion
- VMItemBlockAction: atomic backup of VM + related resources
- RBAC ClusterRole for backup.kubevirt.io permissions

Configuration via Velero Backup labels:
  velero.kubevirt.io/native-backup: "true"
  velero.kubevirt.io/incremental-backup: "true"
  velero.kubevirt.io/skip-quiesce: "true"
  velero.kubevirt.io/scratch-storage-class: "<class>"
  velero.kubevirt.io/native-backup-concurrency: "5"
  velero.kubevirt.io/force-full-every-n: "7"

Or via ConfigMap (velero/kubevirt-velero-plugin-config) for defaults.

Signed-off-by: Akhilesh Agarwal <aagarwal@apexanalytix.com>
- Annotations now set on VM struct (not unstructured item) so they
  survive ToUnstructured conversion and are included in the backup
- Guard volumeInDVTemplates against nil DataVolume source to prevent
  panic on volumes without DataVolume (e.g. PVC, CloudInit)
- Log cleanup errors instead of silently discarding them

Signed-off-by: Akhilesh Agarwal <aagarwal@apexanalytix.com>
…check

1. Replace context.TODO() with apiContext() (30s timeout) across all
   Kubernetes API calls to prevent indefinite hangs.

2. Implement annotation-based incremental backup counter on tracker CR.
   getIncrementalCount/updateIncrementalCount read/write the annotation,
   making forceFullEveryN work correctly for any N.

3. ParseOperationID now returns an error for malformed input instead of
   silently returning partial results. All callers updated.

4. isGuestAgentConnected returns (bool, error) so ShouldSkipQuiesce can
   distinguish "agent not connected" from "API call failed" and log
   appropriately.

Signed-off-by: Akhilesh Agarwal <aagarwal@apexanalytix.com>
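
To illustrate the ParseOperationID change in item 3, here is a minimal sketch. The `<namespace>/<name>` format is an assumption made for this example; the actual operation ID layout is not described in this conversation.

```go
package main

import (
	"fmt"
	"strings"
)

// ParseOperationID splits an operation ID into namespace and name.
// The "<namespace>/<name>" layout is assumed for illustration. Malformed
// input yields an error instead of partial results.
func ParseOperationID(id string) (namespace, name string, err error) {
	parts := strings.Split(id, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("malformed operation ID %q: want <namespace>/<name>", id)
	}
	return parts[0], parts[1], nil
}

func main() {
	for _, id := range []string{"default/vm-backup-1", "default", "a/b/c"} {
		ns, name, err := ParseOperationID(id)
		fmt.Printf("%q -> ns=%q name=%q err=%v\n", id, ns, name, err)
	}
}
```

Callers can then fail the Progress or Cancel call early rather than acting on an empty namespace or name.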
@kubevirt-bot added the `dco-signoff: yes` label (Indicates the PR's author has DCO signed all their commits.) on Apr 3, 2026
@kubevirt-bot

Hi @aagarwal-apexanalytix. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign shellyka13 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aagarwal-apexanalytix (Author)

Hi @ShellyKa13 — friendly ping on this PR. Happy to address any feedback or questions about the design.

This adds optional support for the native KubeVirt backup API (backup.kubevirt.io/v1alpha1) with CBT/incremental backup, QEMU guest agent quiesce, and graceful CSI fallback. All behavior is opt-in via labels — existing workflows are unaffected.

Would appreciate a /test all when you get a chance so CI can validate. Thanks!

@weshayutin (Contributor) left a comment

please review the prior art

@kubevirt-bot

@weshayutin: changing LGTM is restricted to collaborators

In response to this:

> please review the prior art

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@weshayutin (Contributor) left a comment

This would need a design and discussion, and some delineation from the current OADP / Velero design: https://github.com/openshift/oadp-operator/blob/oadp-dev/docs/design/kubevirt-datamover.md

@kubevirt-bot

@weshayutin: changing LGTM is restricted to collaborators

In response to this:

> This would need a design and discussion, and some delineation from the current OADP / Velero design: https://github.com/openshift/oadp-operator/blob/oadp-dev/docs/design/kubevirt-datamover.md

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

dco-signoff: yes (Indicates the PR's author has DCO signed all their commits.), size/XXL

Development

Successfully merging this pull request may close these issues.

Feature: Integrate native KubeVirt backup API (backup.kubevirt.io/v1alpha1) with Velero

3 participants