Description
Hi,
We are using a VolumeSnapshotClass as below for Block volume snapshotting:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: oci-bv-snapshot-incremental
driver: blockvolume.csi.oraclecloud.com
parameters:
  backupType: incremental # No functional restore difference between full and incremental
deletionPolicy: Delete

This is integrated with CNPG for lower-environment database volume snapshots.
Occasionally (every few weeks), we find these backups failing with the error:
DeadlineExceeded desc = Timed out waiting for backup to become available
It looks like this is being thrown by the oci-bv csi here:
return nil, status.Errorf(codes.DeadlineExceeded, "Timed out waiting for backup to become available")
Which uses a timeout of 45 seconds as defined here:
volumeAvailableTimeoutCtx, cancel := context.WithTimeout(ctx, 45 * time.Second)
However, in practice a 45-second timeout is too tight. Looking in the logs, we see the following durations for snapshot creation in uk-london-1, measured from the com.oraclecloud.BlockVolumes.CreateVolumeBackup.begin event to the com.oraclecloud.BlockVolumes.CreateVolumeBackup.end event:
Over 9 samples: average: 37.4 seconds | min: 34 seconds | max: 41 seconds
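With a backupPollInterval of 5 seconds, a backup that finishes at around 41 seconds is only observed on the next poll, at roughly 45 seconds, so the CSI driver steps just outside the permitted 45-second timeout.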
https://github.com/oracle/oci-cloud-controller-manager/blob/master/pkg/oci/client/block_storage.go#L150C36-L150C60
I believe the solution for this would be to increase the available timeout to 60 seconds to align better with the expected response times from the API.
volumeAvailableTimeoutCtx, cancel := context.WithTimeout(ctx, 45 * time.Second)
Thanks!