Performance Tweaks for 1,000+ PV Snapshots (GCP) #9217
Replies: 2 comments 8 replies
I don't think the data mover will speed up your use case, since the data mover starts with a CSI snapshot, followed by copying the snapshot data to object storage, and then deleting the CSI snapshot. One way to confirm that the bottleneck is, in fact, Google would be to create a simple script outside of Velero that snapshots all of your disks and see what sort of throughput you get there vs. using Velero. If Google is the bottleneck, then I think this becomes a Google performance/tuning issue rather than a Velero one.
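Such an out-of-band check could be sketched like this (a hypothetical script: the `pvc-` name filter, the zone, and the disk count are assumptions to adjust for your environment):

```shell
#!/usr/bin/env bash
# Hedged sketch: time how long GCP takes to snapshot 100 PV disks in parallel,
# independent of Velero. The "pvc-" filter and ZONE are assumptions.
ZONE="europe-west3-a"   # assumed zone; change to yours

time (
  for disk in $(gcloud compute disks list \
      --filter="name~^pvc-" --zones="${ZONE}" \
      --format="value(name)" --limit=100); do
    # Launch one snapshot per disk, all concurrently; truncate the name
    # to stay under GCP's 63-character snapshot name limit.
    gcloud compute snapshots create "bench-${disk:0:50}" \
      --source-disk="${disk}" --source-disk-zone="${ZONE}" &
  done
  wait
)
```

Comparing this wall-clock time against Velero's for the same 100 disks would show whether the GCP snapshot API itself is the limiting factor.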
From this statement, it seems the bottleneck is indeed in CSI snapshot creation, so the data mover won't help, and neither will Velero, because Velero always calls the CSI snapshot mechanism provided by the infrastructure.
Hi,
I am relatively new to Velero, but my first impression is very good.
I currently have a challenging task where we need to back up 1,000+ persistent volumes. In the medium term, this number could rise to 50k. Each volume is around 200MB in size and contains almost identical data.
I am using the GCP plugin, the data-mover and a Google Bucket in the same region.
With my current test setup, I can manage about 100 snapshot backups + data mover exports to a Google Bucket in 35 minutes with Velero in GCP.
I've already played around with parallel file uploads, but the actual bottleneck seems to be the disk snapshots in Google. Velero apparently only does 3 in parallel here (or so it looks; I couldn't confirm that).
It also makes almost no difference whether I do 1 backup for 100 PVs or 100 backups for 1 PV each. The individual backups are only minimally slower.
I also tested doing backups with the data mover disabled, but there isn't any real performance increase.
This means that a backup of the same 100 PVs without data mover (only disk snapshots) also takes about 25 minutes. Not much faster than exporting data to the bucket.
I can't see any other obvious bottlenecks in the cluster (such as memory or CPU).
Now my question:
What other options and parameters are there to parallelize this process?
How does the data mover for snapshots actually work? It seems that deduplication takes place for all PVs. Does the data mover do this, and how does it scale with 1,000+ nearly identical disk snapshots?
I would really appreciate any input.
UPDATE: I tried adding a ConfigMap to increase `loadConcurrency` (`globalConfig`) to 8 according to the documentation (https://velero.io/docs/main/node-agent-concurrency/), but saw no change. I also added 10 additional worker nodes in the hope that the data mover scales with the nodes, but I still get only 2 data mover pods in `Running` state, while most of the time 2 other pods are in `Pending` state.
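For reference, the shape of the ConfigMap data that the linked node-agent-concurrency doc describes is roughly the following (a sketch of what I applied, with 8 as the global concurrency; `perNodeConfig` is an optional per-node override from the doc, not something I set):

```json
{
  "loadConcurrency": {
    "globalConfig": 8
  }
}
```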