
PVCs that are slow to provision cause the autoscaler to choke (for that volume) #33

@jacksontj

Description
Describe the bug
We are running in GCP and we had a workload spawn a 10GB PVC. The underlying storage controller was having issues provisioning (due to GCP ratelimits) which lasted for ~30m. During that time the volume-autoscaler noticed the disk and treated the disk as 0 size; from our slack:

@channel ERROR: <project> FAILED requesting to scale up <volume> by 20% from 0 to 2G, it was using more than 70% disk or inode space over the last 1380 seconds

From looking at the code, this appears to be due to here -- where any exception causes the volume to be treated as 0 size.
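To illustrate the failure mode, here is a minimal hypothetical sketch (the function and metric names are illustrative, not the actual volume-autoscaler code), assuming the size lookup swallows all exceptions and falls back to 0:

```python
# Hypothetical sketch of the failure mode described above; names are
# illustrative, not the actual volume-autoscaler code.
def get_volume_size_bytes(metrics, pvc_name):
    try:
        return metrics[pvc_name]["capacity_bytes"]
    except Exception:
        # Any failure (e.g. a still-provisioning PVC that reports no
        # metrics yet) is collapsed into "size 0", which later reads
        # as a volume that is over its usage threshold.
        return 0

metrics = {}  # PVC still provisioning: no stats reported yet
size = get_volume_size_bytes(metrics, "data-pvc")
print(size)  # 0 -> produces the "scale up ... from 0 to 2G" alert
```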

To Reproduce
Steps to reproduce the behavior

  1. create PVC where underlying disk won't be provisioned due to failure
  2. wait for volumeautoscaler to kick in
  3. profit

Expected behavior
In the event that the underlying disk doesn't exist, it seems more appropriate for volume-autoscaler to skip that PVC; if there is no underlying disk, we can't resize it anyway. So the correct behavior here would be to skip the PVC instead of assuming its size is 0.
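A minimal sketch of the proposed behavior, assuming hypothetical helper names (not the actual volume-autoscaler API): return a sentinel when the size is unknown and skip the PVC rather than comparing usage against 0 bytes.

```python
# Hypothetical sketch of the proposed fix; names are illustrative.
def get_volume_size_bytes(metrics, pvc_name):
    try:
        return metrics[pvc_name]["capacity_bytes"]
    except KeyError:
        return None  # unknown size: disk likely not provisioned yet

def should_scale(metrics, pvc_name, used_bytes=0, threshold=0.7):
    size = get_volume_size_bytes(metrics, pvc_name)
    if not size:
        # Skip this PVC: there is nothing to resize until the
        # underlying disk actually exists and reports a size.
        return False
    return used_bytes / size >= threshold

print(should_scale({}, "data-pvc"))  # False: PVC skipped, no alert
```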

Screenshots
n/a

Extra Information Requested

  • Kubernetes Version: v1.33.5-gke.1162000
  • Prometheus Version: GCP managed
