Performance Tweaks for 1,000+ PV Snapshots (GCP) #9217
Replies: 2 comments 8 replies
I don't think the data mover will speed up your use case, since the data mover starts with a CSI snapshot, followed by copying the snapshot data to object storage, and then deleting the CSI snapshot. One way to confirm that the bottleneck is, in fact, Google would be to create a simple script outside of Velero that snapshots all of your disks and see what sort of throughput you get there vs. using Velero. If Google is the bottleneck, then I think this becomes a Google performance/tuning issue rather than a Velero one.
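Such an out-of-band check could be sketched like this (a hypothetical script: the `pvc-` name filter, the zone, and the disk count are assumptions to adjust for your environment):

```shell
#!/usr/bin/env bash
# Hedged sketch: time how long GCP takes to snapshot 100 PV disks in parallel,
# independent of Velero. The "pvc-" filter and ZONE are assumptions.
ZONE="europe-west3-a"   # assumed zone; change to yours

time (
  for disk in $(gcloud compute disks list \
      --filter="name~^pvc-" --zones="${ZONE}" \
      --format="value(name)" --limit=100); do
    # Launch one snapshot per disk, all concurrently; truncate the name
    # to stay under GCP's 63-character snapshot name limit.
    gcloud compute snapshots create "bench-${disk:0:50}" \
      --source-disk="${disk}" --source-disk-zone="${ZONE}" &
  done
  wait
)
```

Comparing this wall-clock time against Velero's for the same 100 disks would show whether the GCP snapshot API itself is the limiting factor.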
From this statement, it seems the bottleneck is indeed in CSI snapshot creation, so the data mover won't help, and neither will Velero, because Velero always calls the CSI snapshot mechanism provided by the infrastructure.
Hi,
I am relatively new to Velero, but my first impression is very good.
I currently have a challenging task where we need to back up 1,000+ persistent volumes. In the medium term, this number could rise to 50k. Each volume is around 200MB in size and contains almost identical data.
I am using the GCP plugin, the data-mover and a Google Bucket in the same region.
With my current test setup, I can manage about 100 snapshot backups + data mover exports to a Google Bucket in 35 minutes with Velero in GCP.
I've already played around with parallel file uploads, but the actual bottleneck seems to be the disk snapshots in Google. Velero apparently only does 3 in parallel here (or so it looks; I couldn't confirm that).
It also makes almost no difference whether I do 1 backup for 100 PVs or 100 backups for 1 PV each. The individual backups are only minimally slower.
I also tested doing backups with the data mover disabled, but there isn't any real performance increase.
This means that a backup of the same 100 PVs without data mover (only disk snapshots) also takes about 25 minutes. Not much faster than exporting data to the bucket.
I can't see any other obvious bottlenecks in the cluster (such as memory or CPU).
Now my question:
What other options and parameters are there to parallelize this process?
How does the data mover for snapshots actually work? It seems that deduplication takes place for all PVs. Does the data mover do this, and how does it scale with 1,000+ nearly identical disk snapshots?
I would really appreciate any input.
UPDATE: I tried adding a ConfigMap to increase `loadConcurrency` (`globalConfig`) to 8 according to the documentation (https://velero.io/docs/main/node-agent-concurrency/), but saw no change. I also added 10 additional worker nodes in the hope that the data mover scales with the nodes, but I still get only 2 data mover pods in `Running` state, while most of the time 2 other pods are in `Pending` state.
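For reference, the shape of the ConfigMap data that the linked node-agent-concurrency doc describes is roughly the following (a sketch of what I applied, with 8 as the global concurrency; `perNodeConfig` is an optional per-node override from the doc, not something I set):

```json
{
  "loadConcurrency": {
    "globalConfig": 8
  }
}
```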