Skip to content

Commit d93f2cb

Browse files
committed
Add new changes to recover from expansion KEP
Update the kep
1 parent a96ece4 commit d93f2cb

File tree

1 file changed

+107
-33
lines changed
  • keps/sig-storage/1790-recover-resize-failure

1 file changed

+107
-33
lines changed

keps/sig-storage/1790-recover-resize-failure/README.md

Lines changed: 107 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -13,8 +13,9 @@
1313
- [User flow stories](#user-flow-stories)
1414
- [Case 0 (default PVC creation):](#case-0-default-pvc-creation)
1515
- [Case 1 (controller+node expandable):](#case-1-controllernode-expandable)
16-
- [Case 2 (controller+node expandable with no GET_VOLUME capability):](#case-2-controllernode-expandable-with-no-get_volume-capability)
16+
- [Case 2 (node only expandable volume):](#case-2-node-only-expandable-volume)
1717
- [Case 3 (Malicious user)](#case-3-malicious-user)
18+
- [Case 4(Malicious User and rounding to GB/GiB bounaries)](#case-4malicious-user-and-rounding-to-gbgib-bounaries)
1819
- [Risks and Mitigations](#risks-and-mitigations)
1920
- [Graduation Criteria](#graduation-criteria)
2021
- [Test Plan](#test-plan)
@@ -31,6 +32,7 @@
3132
- [Alternatives](#alternatives)
3233
- [Why not use pvc.Status.Capacity for tracking quota?](#why-not-use-pvcstatuscapacity-for-tracking-quota)
3334
- [Allow admins to manually fix PVCs which are stuck in resizing condition](#allow-admins-to-manually-fix-pvcs-which-are-stuck-in-resizing-condition)
35+
- [Solving limitation of allowing restore all the way to original size](#solving-limitation-of-allowing-restore-all-the-way-to-original-size)
3436
- [Infrastructure Needed [optional]](#infrastructure-needed-optional)
3537
<!-- /toc -->
3638

@@ -77,7 +79,7 @@ the user is using less storage than he/she actually is.
7779
To solve this problem - we propose that although users are allowed to reduce size of their
7880
PVC (as long as requested size > `pvc.Status.Capacity`), quota calculation
7981
will use `max(pvc.Spec.Capacity, pvc.Status.AllocatedResources)` and reduction in `pvc.Status.AllocatedResources`
80-
is only done by resize-controller after verifying volume size using `ControllerGetVolume` csi RPC call.
82+
is only done by resize-controller after previously issued expansion has failed with some kind of terminal failure.
8183

8284
### Goals
8385

@@ -94,6 +96,7 @@ As part of this proposal, we are mainly proposing three changes:
9496

9597
1. Relax API validation on PVC update so as reducing `pvc.Spec.Resoures` is allowed, as long as requested size is `>pvc.Status.Capacity`.
9698
2. Add a new field in PVC API called - `pvc.Status.AllocatedResources` for the purpose of tracking used storage size. By default only api-server or resize-controller can set this field.
99+
3. Add a new field in PVC API called - `pvc.Status.ResizeStatus` for purpose of tracking status of volume expansion.
97100
3. Update quota code to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` when evaluating usage for PVC.
98101

99102
### Implementation
@@ -103,73 +106,139 @@ reduces the PVC request size, for both CSI and in-tree plugins they are designed
103106

104107
We however do have a problem with quota calculation because if a previously issued expansion is successful but is not recorded(or partially recorded) in api-server and user reduces requested size of the PVC, then quota controller will assume it as actual shrinking of volume and reduce used storage size by the user(incorrectly). Since we know actual size of the volume only after performing expansion(either on node or controller), allowing quota to be reduced on PVC size reduction will allow an user to abuse the quota system.
105108

106-
To solve aforementioned problem - we propose that, a new field will be added to PVC, called `pvc.Status.AllocatedResources`. When a PVC is created - this field defaults to `pvc.Spec.Resources` but when user expands the PVC,
107-
and when expansion-controller starts volume expansion - it will set `pvc.Status.AllocatedResources` to user requested value in `pvc.Spec.Resources` before performing expansion. The quota calculation will be updated to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` which will ensure that abusing quota will not be possible.
108-
109-
When user reduces `pvc.Spec.Resources`, expansion-controller will set `pvc.Status.AllocatedResources` to lower value (thereby giving quota back to the user) - only if current actual size of volume is less than or equal to `pvc.Spec.Resources` after entire control-plane and node side expansion is successfully finished. It will fetch actual size of the volume by using `ControllerGetVolume` CSI RPC call. It is possible to track completion of resizing operation in external-resizer via function - https://github.com/kubernetes-csi/external-resizer/blob/master/pkg/controller/controller.go#L394.
110-
111-
If CSI driver does not have `GET_VOLUME` controller capability(or `ControllerGetVolume` does not report volume size) and `pvc.Spec.Resources` < `pvc.Status.AllocatedResources` (i.e user is attempting to reduce size of a volume that expansion controller previously tried to expand) - then although expansion-controller will try volume expansion with value in `pvc.Spec.Resources` - it will not reduce reported value in `pvc.Status.AllocatedResources`, which will result in no quota being restored to the user. In other words - for CSI drivers that don't have `GET_VOLUME` controller capability - `pvc.Status.AllocatedResources` will report highest requested value and reducing `pvc.Spec.Resources` will not result in reduction of used quota.
112-
113-
*Note:* This proposal expects that users can not modify `pvc.Status` directly and cheat quota system. In general this should be fine because users should not have access to edit status of almost anything. Link to discussion on slack - https://kubernetes.slack.com/archives/CJUQN3E4T/p1620059624022100
114-
115-
*Note:* It is expected that `ControllerGetVolume` RPC call returns disk size not filesystem size of the volume. We will ensure that there are sanity test around this and this distinction is made more clear in the CSI spec.
109+
To solve aforementioned problem - we propose that, a new field will be added to PVC, called `pvc.Status.AllocatedResources`. When user expands the PVC, and when expansion-controller starts volume expansion - it will set `pvc.Status.AllocatedResources` to user requested value in `pvc.Spec.Resources` before performing expansion. The quota calculation will be updated to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` which will ensure that abusing quota will not be possible.
110+
111+
Resizing operation will always work towards full-filling size recorded in `pvc.Status.AllocatedResources` and only when previous operation has finished(i.e `pvc.Status.ResizeStatus` is nil) or when previous operation has failed with a terminal error - it will use new user requested value from `pvc.Spec.Resources`.
112+
113+
When user reduces `pvc.Spec.Resources`, expansion-controller will set `pvc.Status.AllocatedResources` to lower value only if:
114+
1. If `pvc.Status.ResizeStatus` is `ResizeFailedOnController` (indicating that previous expansion to last known `allocatedResources` failed with a final error).
115+
2. If `pvc.Status.ResizeStatus` is `ResizeFailedOnNode` and SP supports node-only expansion (indicating that previous expansion to last known `allocatedResources` failed on node with a final error).
116+
117+
One key exception is - `pvc.Status.Allocatedresources` will not be lowered if expansion succeeded on the controller but failed on the node with a final error. The implementation of new `allocatedResources` will look like:
118+
119+
120+
```golang
121+
func (ctrl *resizeController) getNewSize(pvc *v1.PersistentVolumeClaim, pv *v1.PersistentVolume) resource.Quantity {
122+
newSize := pvc.Spec.Resources.Requests.Storage()
123+
if pvc.Status.ResizeStatus == nil || *pvc.Status.ResizeStatus == "" {
124+
return *newSize
125+
}
126+
resizeStatus := *pvc.Status.ResizeStatus
127+
allocatedSize := pvc.Status.AllocatedResources.Storage()
128+
129+
switch resizeStatus {
130+
case v1.PersistentVolumeClaimResizeInProgress:
131+
// If resize was inprogress before this means - we should keep trying to finish
132+
// previously started expansion to last known `allocatedSize`.
133+
if allocatedSize != nil {
134+
newSize = allocatedSize
135+
}
136+
case v1.PersistentVolumeClaimResizeFailedOnController:
137+
// if expansion failed on controller with a final error, then we can safely use new user requested size
138+
newSize = pvc.Spec.Resources.Requests.Storage()
139+
case v1.PersistentVolumeClaimResizeFailedOnNode:
140+
if newSize.Cmp(*allocatedSize) > 0 {
141+
newSize = pvc.Spec.Resources.Requests.Storage()
142+
} else {
143+
// newSize is smaller than whatever we tried to expand before
144+
// and if expansion failed on the node, but it succeeded in control-plane
145+
// then it is not safe to use lower value from pvc.Spec
146+
if ctrl.resizer.SupportsControllerExpansion() {
147+
newSize = allocatedSize
148+
}
149+
}
150+
151+
}
152+
return *newSize
153+
}
154+
```
155+
156+
One more key design choice is - once expansion on node failed and user requested lower size (by modifying `pvc.Spec.Resources`) - kubelet must not retry expanding the same volume to previous `allocatedResources`, because external-resizer will consider `ResizeFailedOnNode` state as a terminal error and allow lowering of `allocatedResources` - and at the same time kubelet will try to expand the volume to higher `allocatedResources`. This will be solved in kubelet by following code:
157+
158+
```golang
159+
// read latest version of PVC from api-server
160+
if pvcStatusCap.Cmp(pvSpecCap) < 0 {
161+
// this is the if block where kubelet performs expansion
162+
if (pvc.Spec.Resources.Storage < pvc.Status.AllocatedResources ) && pvc.Status.ResizeStatus == ResizeFailedOnNode {
163+
return false, nil
164+
}
165+
}
166+
```
116167

117168
#### User flow stories
118169

119170
##### Case 0 (default PVC creation):
120171
- User creates a 10Gi PVC by setting - `pvc.spec.resources.requests["storage"] = "10Gi"`.
121-
- API server sets `pvc.Status.AllocatedResources` to "10Gi" on creation.
122172

123173
##### Case 1 (controller+node expandable):
124174
- User increases 10Gi PVC to 100Gi by changing - `pvc.spec.resources.requests["storage"] = "100Gi"`.
125-
- `pvc.Status.AllocatedResources` still reports `10Gi`.
126175
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
127176
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi`.
177+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
128178
- Expansion to 100Gi fails and hence `pv.Spec.Capacity` and `pvc.Status.Capacity `stays at 10Gi.
179+
- Expansion controller sets `pvc.Status.ResizeStatus` to `ResizeFailedOnController`.
129180
- User requests size to 20Gi.
130181
- Expansion controller tries expanding the PVC/PV to `20Gi` size.
182+
- Expansion controler notices that previous expansion to last known `allocatedresources` failed, so it sets new `allocatedResources` to `20G`
131183
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report new size as `20Gi`.
132-
- After completion, expansion controller notices that `pvc.Spec.Resources` < `pvc.Status.AllocatedResources` (meaning user tried to reduce size).
133-
- Expansion controller fetches actual size of the volume using `ControllerGetVolume` CSI RPC call.
134-
- Expansion controller sees that `actual_volume_size`(20Gi) < `pvc.Status.AllocatedResources` (100Gi), so it sets `pvc.Status.AllocatedResources` to 20Gi.
135184
- Quota controller sees a reduction in used quota because `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` is 20Gi.
136185

137-
##### Case 2 (controller+node expandable with no GET_VOLUME capability):
186+
##### Case 2 (node only expandable volume):
138187
- User increases 10Gi PVC to 100Gi by changing - `pvc.spec.resources.requests["storage"] = "100Gi"`
139-
- `pvc.Status.AllocatedResources` still reports `10Gi`.
140188
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
141189
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi`.
142-
- Expansion to 100Gi fails and hence `pv.Spec.Capacity` and `pvc.Status.Capacity `stays at 10Gi.
190+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
191+
- Since expansion operations in control-plane are NO-OP, expansion in control-plane succeeds and `pv.Spec` is set to `100G`.
192+
- Expansion starts on the node and it fails on the node.
193+
- Kubelet sets `pvc.Status.ResizeStatus` to `ResizeFailedOnNode`.
143194
- User requests size to 20Gi.
144-
- Expansion controller tries expanding the PVC/PV to `20Gi` size.
145-
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report new size as `20Gi`.
146-
- Expansion controller notices that `pvc.Spec.Resources` < `pvc.Status.AllocatedResources` (meaning user tried to reduce size).
147-
- Expansion controller sees that CSI driver does not have `GET_VOLUME` controller capability.
148-
- `pvc.Status.AllocatedResources` still reports `100Gi` and hence there is no change in used quota.
195+
- Kubelet notices that `pvc.Status.AllocatedResources > pvc.Spec.Resources` (that means user shrunk the volume) and last resize on node failed, it must not retry expansion to old `allocatedResources`.
196+
- At this point Kubelet will wait for `pvc.Status.AllocatedResources` to be updated.
197+
- Expansion controller starts expanding the volume and sees that last expansion failed on the node and driver does not have control-plane expansion.
198+
- Expansion controller sets `pvc.Status.AllocatedResources` to `20G`.
199+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
200+
- Since expansion operations in control-plane are NO-OP, expansion in control-plane succeeds and `pv.Spec` is set to `20G`.
201+
- Expansion succeed on the node with latest `allocatedResources` and `pvc.Status.Size` is set to `20G`.
202+
- Quota controller sees a reduction in used quota because `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` is 20Gi.
149203

150204

151205
##### Case 3 (Malicious user)
152206
- User increases 10Gi PVC to 100Gi by changing `pvc.spec.resources.requests["storage"] = "100Gi"`
153-
- `pvc.Status.AllocatedResources` still reports `10Gi`.
154207
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
155208
- Expansion controller slowly starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi` (before expanding).
209+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
156210
- At this point -`pv.Spec.Capacity` and `pvc.Status.Capacity` stays at 10Gi until the resize is finished.
157211
- While the storage backend is re-sizing the volume, user requests size 20Gi by changing `pvc.spec.resources.requests["storage"] = "20Gi"`
158-
- Expansion succeeds previous expansion and `pvc.Status.Capacity` and `pv.Spec.Capacity` report new size as `100Gi`.
159-
- Expansion controller notices that `pvc.Spec.Resources` < `pvc.Status.AllocatedResources` (meaning user tried to reduce size).
160-
- Expansion controller fetches actual size of the volume using `ControllerGetVolume` CSI RPC call.
161-
- Expansion controller sees that `actual_volume_size`(100Gi) >= `pvc.Status.AllocatedResources` (100Gi), so it does not make any modifictions to `pvc.Status.AllocatedResources`.
212+
- Expansion controller notices that previous expansion to last known `allocatedresources` is still in-progress.
213+
- Expansion controller starts expanding the volume but `allocatedResources` stays at `100G`.
214+
- Expansion to `100G` succeeds and `pv.Spec.Capacity` and `pvc.Status.Capacity` report new size as `100G`.
215+
- Although `pvc.Spec.Resources` reports size as `20G`, expansion to `20G` is never attempted.
216+
- Quota controller sees no change in storage usage by the PVC because `pvc.Status.AllocatedResources` is 100Gi.
217+
218+
#### Case 4(Malicious User and rounding to GB/GiB bounaries)
219+
- User requests a PVC of 10.1GB but underlying storage provides provisions a volume of 11GB after rounding.
220+
- Size recorded in `pvc.Status.Capacity` and `pv.Spec.Capacity` is however still 10.1GB.
221+
- User expands expands the PVC to 100GB.
222+
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `89.9GB` to used quota.
223+
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100GB` (before expanding).
224+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
225+
- At this point -`pv.Spec.Capacity` and `pvc.Status.Capacity` stays at 10.1GB until the resize is finished.
226+
- while resize was in progress - expansion controler crashes and loses state.
227+
- User reduces the size of PVC to 10.5GB.
228+
- Expansion controller notices that previous expansion to last known `allocatedresources` is still in-progress.
229+
- Expansion controller starts expanding the volume but `allocatedResources` stays at `100G`.
230+
- Expansion to `100G` succeeds and `pv.Spec.Capacity` and `pvc.Status.Capacity` report new size as `100G`.
231+
- Although `pvc.Spec.Resources` reports size as `10.5GB`, expansion to `10.5GB` is never attempted.
162232
- Quota controller sees no change in storage usage by the PVC because `pvc.Status.AllocatedResources` is 100Gi.
163233

164234
### Risks and Mitigations
165235

166236
- Once expansion is initiated, the lowering of requested size is only allowed upto a value *greater* than `pvc.Status`. It is not possible to entirely go back to previously requested size. This should not be a problem however in-practice because user can retry expansion with slightly higher value than `pvc.Status` and still recover from previously failing expansion request.
167-
- One risk as mentioned above is, if expansion failed and user retried expansion(successfully) with smaller value, the quota code will keep reporting higher value unless CSI driver in question has `GET_VOLUME` controller capability and `ControllerGetVolume` actually reports real size of the underlying volume.
168237

169238

170239
## Graduation Criteria
171240

172-
* *Alpha* in 1.22 behind `RecoverExpansionFailure` feature gate with set to a default of `false`. The limitation about quota and CSI capability should be clearly documented.
241+
* *Alpha* in 1.22 behind `RecoverExpansionFailure` feature gate with set to a default of `false`.
173242
* *Beta* in 1.23: Since this feature is part of general `ExpandPersistentVolumes` feature which is in beta, we are going to move this to beta with enhanced e2e and more stability improvements.
174243
* *GA* in 1.25 along with `ExpandPersistentvolumes` feature. The list of issues for volume expansion going GA can be found at - https://github.com/orgs/kubernetes-csi/projects/12.
175244

@@ -360,7 +429,7 @@ _This section must be completed when targeting beta graduation to a release._
360429

361430
## Drawbacks
362431

363-
- One drawback is while this KEP allows an user to reduce requested PVC size, it does not automatically reduce quota.
432+
- One drawback is while this KEP allows an user to reduce requested PVC size, it does not allow reduction in size all the way to same value as original size recorded in `pvc.Status.Cap`. The reason is currently resize controller and kubelet work on a volume one after the other and we need special logic in both kubelet and control-plane to process reduction in size all the way upto original value. This is somewhat racy and we need a signaling mechanism between control-plane and kubelet when `pvc.Status.AllocatedResources` needs to be adjusted. We plan to revisit and address this in next release.
364433

365434

366435
## Alternatives
@@ -378,4 +447,9 @@ We also considered if it is worth leaving this problem as it is and allow admins
378447
- A workflow that depends on restarting the pods after resizing is successful - can't complete if resizing is forever stuck. This is the case with current design of statefulset expansion KEP.
379448
- If a storage type only supports node expansion and it keeps failing then PVC could become unusable and prevent its usage.
380449

450+
### Solving limitation of allowing restore all the way to original size
451+
452+
As mentioned above - one limitation of this KEP is, users can only recover upto a size greater than previous value recorded in `pvc.Status.Cap`.
453+
454+
381455
## Infrastructure Needed [optional]

0 commit comments

Comments
 (0)