CPU iowait > 10%. A high iowait means that you are disk or network bound. #8972
Replies: 10 comments 2 replies
-
This is unexpected. There should be only one running maintenance job for a specific BackupRepository.
-
Could you please help collect the debug bundle? Another thing: when there is no running repo maintenance job, do you still see high iowait?
-
I generated a bundle, but it is 71 MB in size and the limit is 25 MB. Is there some other way I can send this to you?
-
How about using Slack?
-
Thanks. I checked the debug bundle and found 8 BackupRepositories, so one repo maintenance job per BackupRepository is normal.
-
Is there anything we can do to stagger or throttle the jobs so they're not thrashing the disks all at once? Or maybe make them run in series instead of in parallel?
-
Actually, where are you seeing 8 backup repositories? We have five backup jobs, and they all go to the same storage location.
-
Please run this command to list all the Velero backup repositories: `kubectl -n velero get backuprepositories`. The number of BackupRepositories is not tied to the number of backups: Velero creates one BackupRepository per combination of backed-up namespace, backup storage location, and repository type. So, for example, a single backup covering three namespaces results in three BackupRepositories, and each BackupRepository requires its own repository maintenance jobs.
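To see why you end up with 8 of them, it can help to group the repositories by the namespace they cover. A sketch using the `spec.volumeNamespace`, `spec.backupStorageLocation`, and `spec.repositoryType` fields of the BackupRepository CR (requires access to your cluster; column names are illustrative):

```shell
# List each BackupRepository with the namespace/location/type combination
# that caused Velero to create it
kubectl -n velero get backuprepositories \
  -o custom-columns=NAME:.metadata.name,NAMESPACE:.spec.volumeNamespace,BSL:.spec.backupStorageLocation,TYPE:.spec.repositoryType
```

Each distinct row corresponds to one repository, and therefore one set of maintenance jobs.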
-
Repository maintenance can be resource-consuming. There are some configurations that can mitigate that.
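One such configuration is capping the resources each maintenance job may use. A hedged sketch, assuming a recent Velero (v1.13+) where the server accepts maintenance-job resource flags; check the flag names against your version's docs (newer releases also support a repo-maintenance-job ConfigMap instead):

```shell
# Add resource limits for maintenance jobs to the velero server args
# (flag names per Velero v1.13+; verify for your version)
kubectl -n velero patch deployment velero --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-",
   "value": "--maintenance-job-cpu-limit=1"},
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-",
   "value": "--maintenance-job-mem-limit=1Gi"}
]'
```

Limiting CPU indirectly throttles how fast the jobs can issue IO; it does not stagger them, but it reduces how hard they hit the disks at once.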
-
We're getting the following alert in Alertmanager on our Kubernetes cluster:
Investigating the issue, we see that it's the Velero maintenance job causing all the IO.
I understand Velero maintenance jobs are IO-intensive by nature. But is there some way to limit the number of jobs running concurrently, and/or can we limit how much IO they generate at once?
Or should I consider changing Alertmanager to exclude Velero processes from being monitored?
Thanks!
Brad
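For context on what the alert is measuring: system-wide iowait comes from the `cpu` line of `/proc/stat` (field 5, after user/nice/system/idle), so it is a whole-node figure and cannot be attributed to one process by itself. A minimal Linux-only sketch of computing the same percentage the alert is based on:

```python
import time

def cpu_times():
    # First line of /proc/stat: "cpu user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(x) for x in fields]

def iowait_percent(interval=1.0):
    """Percentage of CPU time spent in iowait over `interval` seconds."""
    a = cpu_times()
    time.sleep(interval)
    b = cpu_times()
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0  # index 4 = iowait

if __name__ == "__main__":
    print(f"iowait: {iowait_percent():.1f}%")
```

Because the metric is node-wide, throttling the maintenance jobs (rather than silencing the alert) is usually the better fix, since other workloads on the node suffer the same IO contention.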