---
layout: blog
title: "Kubernetes v1.34: Recovery From Volume Expansion Failure (GA)"
date: 2025-0X-XXT09:00:00-08:00
draft: true
slug: kubernetes-v1-34-recover-expansion-failure
author: >
  [Hemant Kumar](https://github.com/gnufied) (Red Hat)
---

Have you ever made a typo when expanding your persistent volumes in Kubernetes? Meant to specify `2TB`
but specified `20TiB`? This seemingly innocuous problem was surprisingly hard to fix - it took the project almost 5 years to address.
[Automated recovery from storage expansion failure](/docs/concepts/storage/persistent-volumes/#recovering-from-failure-when-expanding-volumes) has been available in beta for a while; with the v1.34 release, we have graduated it to
**general availability**.

While it was always possible to recover from failed volume expansions manually, it usually required cluster-admin access and was tedious to do (see the aforementioned link for more information).

What if you make a mistake and then realize it immediately?
With Kubernetes v1.34, you can reduce the requested size of the PersistentVolumeClaim (PVC): as long as the expansion to the previously requested
size hasn't finished, you can amend the requested size and Kubernetes will
automatically work to correct it. Any quota consumed by the failed expansion will be returned to you, and the associated PersistentVolume will be resized to the
latest size you specified.

I'll walk through an example of how all of this works.

## Reducing PVC size to recover from failed expansion

Imagine that you are running out of disk space for one of your database servers, and you want to expand the PVC from the previously
specified `10TB` to `100TB` - but you make a typo and specify `1000TB`.

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1000TB # newly specified size - but incorrect!
```

Now, you may be out of disk space on your disk array, or you may simply have run out of allocated quota with your cloud provider. Either way, assume that the expansion to `1000TB` is never going to succeed.

In Kubernetes v1.34, you can simply correct your mistake and request a new PVC size
that is smaller than the mistaken one, provided it is still larger than the original size
of the actual PersistentVolume.

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100TB # Corrected size; has to be greater than 10TB.
                     # You cannot shrink the volume below its actual size.
```

This requires no admin intervention. Even better, any surplus Kubernetes quota that you temporarily consumed will be automatically returned.

This fault recovery mechanism does have a caveat: whatever new size you specify for the PVC **must** still be higher than the original size recorded in `.status.capacity`.
Since Kubernetes doesn't support shrinking your PV objects, you can never go below the size that was originally allocated for your PVC request.

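If you are unsure what that floor is, you can check the PVC's status. Below is a trimmed, hypothetical sketch of what `kubectl get pvc myclaim -o yaml` might report while the over-sized expansion is still pending; the field names are real, but the values are illustrative and will depend on your storage provider.

```yaml
# Hypothetical, trimmed PVC status while the 1000TB expansion is stuck.
# .status.capacity is the lower bound: any corrected request must stay above it.
status:
  phase: Bound
  capacity:
    storage: 10TB          # actual provisioned size - the size you cannot go below
  allocatedResources:
    storage: 1000TB        # size Kubernetes is (unsuccessfully) working towards
```
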
## Improved error handling and observability of volume expansion

Implementing what might look like a relatively minor change also required us to almost
completely rework how volume expansion operates under the hood in Kubernetes.
There are new API fields available on PVC objects which you can monitor to observe the progress of volume expansion.

### Improved observability of in-progress expansion

You can query `.status.allocatedResourceStatuses['storage']` of a PVC to monitor the progress of a volume expansion operation.
For a typical block volume, this should transition between `ControllerResizeInProgress`, `NodeResizePending` and `NodeResizeInProgress`, and become nil/empty when volume expansion has finished.

If for some reason volume expansion to the requested size is not feasible, the status should accordingly report states such as `ControllerResizeInfeasible` or `NodeResizeInfeasible`.

You can also observe the size towards which Kubernetes is working by watching `pvc.status.allocatedResources`.

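As a rough illustration, here is a hypothetical, trimmed status for the `myclaim` PVC while the corrected expansion is being processed; the field names and status values are the ones described above, but the exact sequence and sizes you observe will depend on your CSI driver.

```yaml
# Hypothetical snapshot of a PVC status mid-expansion.
status:
  capacity:
    storage: 10TB                          # current size; updated once the resize completes
  allocatedResources:
    storage: 100TB                         # target size Kubernetes is working towards
  allocatedResourceStatuses:
    storage: ControllerResizeInProgress    # later NodeResizePending / NodeResizeInProgress,
                                           # then removed once expansion has finished
```
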
### Improved error handling and reporting

Kubernetes now retries failed volume expansions at a slower rate, making fewer requests to both the storage system and the Kubernetes API server.

Errors observed during volume expansion are now reported as conditions on PVC objects and, unlike events, they persist. Kubernetes populates `pvc.status.conditions` with the error keys `ControllerResizeError` or `NodeResizeError` when volume expansion fails.

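For example, a failed controller-side expansion might surface as a condition roughly like the hypothetical one below; the condition type comes from the list above, while the timestamp and message text are invented and the real ones are supplied by Kubernetes and your storage driver.

```yaml
# Hypothetical condition on a PVC after a failed expansion attempt.
status:
  conditions:
  - type: ControllerResizeError
    status: "True"
    lastTransitionTime: "2025-01-01T00:00:00Z"                        # illustrative timestamp
    message: "resize failed: requested size exceeds provider quota"   # illustrative message
```
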
### Fixes for long-standing bugs in resizing workflows

This feature has also allowed us to fix long-standing bugs in the resizing workflow, such as [Kubernetes issue #115294](https://github.com/kubernetes/kubernetes/issues/115294).
If you observe anything broken, please report your bugs to [https://github.com/kubernetes/kubernetes/issues](https://github.com/kubernetes/kubernetes/issues/new/choose), along with details about how to reproduce the problem.

Working on this feature through its lifecycle was challenging, and it wouldn't have been possible to reach GA
without feedback from [@msau42](https://github.com/msau42), [@jsafrane](https://github.com/jsafrane) and [@xing-yang](https://github.com/xing-yang).

All of the contributors who worked on this also appreciate the input provided by [@thockin](https://github.com/thockin) and [@liggitt](https://github.com/liggitt) at various Kubernetes contributor summits.