Skip to content

Commit 54d4ef8

Browse files
committed
Add pdf and svg document for explaining the flow
1 parent d93f2cb commit 54d4ef8

File tree

3 files changed

+52
-78
lines changed

3 files changed

+52
-78
lines changed

keps/sig-storage/1790-recover-resize-failure/README.md

Lines changed: 51 additions & 78 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@
1515
- [Case 1 (controller+node expandable):](#case-1-controllernode-expandable)
1616
- [Case 2 (node only expandable volume):](#case-2-node-only-expandable-volume)
1717
- [Case 3 (Malicious user)](#case-3-malicious-user)
18-
- [Case 4(Malicious User and rounding to GB/GiB bounaries)](#case-4malicious-user-and-rounding-to-gbgib-bounaries)
18+
- [Case 4(Malicious User and rounding to GB/GiB bounaries)](#case-4malicious-user-and-rounding-to-gbgib-bounaries)
1919
- [Risks and Mitigations](#risks-and-mitigations)
2020
- [Graduation Criteria](#graduation-criteria)
2121
- [Test Plan](#test-plan)
@@ -96,7 +96,14 @@ As part of this proposal, we are mainly proposing three changes:
9696

9797
1. Relax API validation on PVC update so as reducing `pvc.Spec.Resoures` is allowed, as long as requested size is `>pvc.Status.Capacity`.
9898
2. Add a new field in PVC API called - `pvc.Status.AllocatedResources` for the purpose of tracking used storage size. By default only api-server or resize-controller can set this field.
99-
3. Add a new field in PVC API called - `pvc.Status.ResizeStatus` for purpose of tracking status of volume expansion.
99+
3. Add a new field in PVC API called - `pvc.Status.ResizeStatus` for purpose of tracking status of volume expansion. Following are possible values of newly introduced `ResizeStatus` field:
100+
- nil // General default because type is a pointer.
101+
- empty // When expansion is complete, the empty string is set by resize controller or kubelet.
102+
- ControllerExpansionInProgress // state set when external resize controller starts expanding the volune.
103+
- ControllerExpansionFailed // state set when expansion has failed in external resize controller with a terminal error. Transient errors don't set ControllerExpansionFailed.
104+
- NodeExpansionPending // state set when external resize controller has finished expanding the volume but further expansion is needed on the node.
105+
- NodeExpansionInProgress // state set when kubelet starts expanding the volume.
106+
- NodeExpansionFailed // state set when expansion has failed in kubelet with a terminal error. Transient errors don't set NodeExpansionFailed.
100107
3. Update quota code to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` when evaluating usage for PVC.
101108

102109
### Implementation
@@ -106,64 +113,22 @@ reduces the PVC request size, for both CSI and in-tree plugins they are designed
106113

107114
We however do have a problem with quota calculation because if a previously issued expansion is successful but is not recorded(or partially recorded) in api-server and user reduces requested size of the PVC, then quota controller will assume it as actual shrinking of volume and reduce used storage size by the user(incorrectly). Since we know actual size of the volume only after performing expansion(either on node or controller), allowing quota to be reduced on PVC size reduction will allow an user to abuse the quota system.
108115

109-
To solve aforementioned problem - we propose that, a new field will be added to PVC, called `pvc.Status.AllocatedResources`. When user expands the PVC, and when expansion-controller starts volume expansion - it will set `pvc.Status.AllocatedResources` to user requested value in `pvc.Spec.Resources` before performing expansion. The quota calculation will be updated to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` which will ensure that abusing quota will not be possible.
110-
111-
Resizing operation will always work towards full-filling size recorded in `pvc.Status.AllocatedResources` and only when previous operation has finished(i.e `pvc.Status.ResizeStatus` is nil) or when previous operation has failed with a terminal error - it will use new user requested value from `pvc.Spec.Resources`.
112-
113-
When user reduces `pvc.Spec.Resources`, expansion-controller will set `pvc.Status.AllocatedResources` to lower value only if:
114-
1. If `pvc.Status.ResizeStatus` is `ResizeFailedOnController` (indicating that previous expansion to last known `allocatedResources` failed with a final error).
115-
2. If `pvc.Status.ResizeStatus` is `ResizeFailedOnNode` and SP supports node-only expansion (indicating that previous expansion to last known `allocatedResources` failed on node with a final error).
116-
117-
One key exception is - `pvc.Status.Allocatedresources` will not be lowered if expansion succeeded on the controller but failed on the node with a final error. The implementation of new `allocatedResources` will look like:
118-
119-
120-
```golang
121-
func (ctrl *resizeController) getNewSize(pvc *v1.PersistentVolumeClaim, pv *v1.PersistentVolume) resource.Quantity {
122-
newSize := pvc.Spec.Resources.Requests.Storage()
123-
if pvc.Status.ResizeStatus == nil || *pvc.Status.ResizeStatus == "" {
124-
return *newSize
125-
}
126-
resizeStatus := *pvc.Status.ResizeStatus
127-
allocatedSize := pvc.Status.AllocatedResources.Storage()
128-
129-
switch resizeStatus {
130-
case v1.PersistentVolumeClaimResizeInProgress:
131-
// If resize was inprogress before this means - we should keep trying to finish
132-
// previously started expansion to last known `allocatedSize`.
133-
if allocatedSize != nil {
134-
newSize = allocatedSize
135-
}
136-
case v1.PersistentVolumeClaimResizeFailedOnController:
137-
// if expansion failed on controller with a final error, then we can safely use new user requested size
138-
newSize = pvc.Spec.Resources.Requests.Storage()
139-
case v1.PersistentVolumeClaimResizeFailedOnNode:
140-
if newSize.Cmp(*allocatedSize) > 0 {
141-
newSize = pvc.Spec.Resources.Requests.Storage()
142-
} else {
143-
// newSize is smaller than whatever we tried to expand before
144-
// and if expansion failed on the node, but it succeeded in control-plane
145-
// then it is not safe to use lower value from pvc.Spec
146-
if ctrl.resizer.SupportsControllerExpansion() {
147-
newSize = allocatedSize
148-
}
149-
}
150-
151-
}
152-
return *newSize
153-
}
154-
```
155-
156-
One more key design choice is - once expansion on node failed and user requested lower size (by modifying `pvc.Spec.Resources`) - kubelet must not retry expanding the same volume to previous `allocatedResources`, because external-resizer will consider `ResizeFailedOnNode` state as a terminal error and allow lowering of `allocatedResources` - and at the same time kubelet will try to expand the volume to higher `allocatedResources`. This will be solved in kubelet by following code:
157-
158-
```golang
159-
// read latest version of PVC from api-server
160-
if pvcStatusCap.Cmp(pvSpecCap) < 0 {
161-
// this is the if block where kubelet performs expansion
162-
if (pvc.Spec.Resources.Storage < pvc.Status.AllocatedResources ) && pvc.Status.ResizeStatus == ResizeFailedOnNode {
163-
return false, nil
164-
}
165-
}
166-
```
116+
To solve aforementioned problem - we propose that, a new field will be added to PVC, called `pvc.Status.AllocatedResources`. When user expands the PVC, and when expansion-controller starts volume expansion - it will set `pvc.Status.AllocatedResources` to user requested value in `pvc.Spec.Resources` before performing expansion and it will set `pvc.Status.ResizeStatus` to `ControllerExpansionInProgress`. The quota calculation will be updated to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` which will ensure that abusing quota will not be possible.
117+
118+
Resizing operation in external resize controller will always work towards full-filling size recorded in `pvc.Status.AllocatedResources` and only when previous operation has finished(i.e `pvc.Status.ResizeStatus` is nil) or when previous operation has failed with a terminal error - it will use new user requested value from `pvc.Spec.Resources`.
119+
120+
Kubelet on the other hand will only expand volumes for which `pvc.Status.ResizeStatus` is in `NodeExpansionPending` or `NodeExpansionInProgress` state and `pv.Spec.Cap > pvc.Status.Cap`. If a volume expansion fails in kubelet with a terminal error(which will set `NodeExpansionFailed` state) - then it must wait for resize controller in external-resizer to reconcile the state and put it back in `NodeExpansionPending`.
121+
122+
When user reduces `pvc.Spec.Resources`, expansion-controller will set `pvc.Status.AllocatedResources` to lower value only if one of the following is true:
123+
124+
1. If `pvc.Status.ResizeStatus` is `ControllerExpansionFailed` (indicating that previous expansion to last known `allocatedResources` failed with a final error) and previous control-plane has not succeeded.
125+
2. If `pvc.Status.ResizeStatus` is `NodeExpansionFailed` and SP supports node-only expansion (indicating that previous expansion to last known `allocatedResources` failed on node with a final error).
126+
3. If `pvc.Status.ResizeStatus` is `nil` or `empty` and previous `ControllerExpandVolume** has not succeeded.
127+
128+
**Note**: Whenever resize controller or kubelet modifies `pvc.Status` (such as when setting both `AllocatedResources` and `ResizeStatus`) - it is expected that all changes to `pvc.Status` are submitted as part of same patch request to avoid race conditions.
129+
130+
The complete expansion and recovery flow of both control-plane and kubelet is documented in attached PDF. The attached pdf - documents complete volume expansion flow via state diagrams and is much more exhaustive than the text above.
131+
167132

168133
#### User flow stories
169134

@@ -174,11 +139,10 @@ if pvcStatusCap.Cmp(pvSpecCap) < 0 {
174139
- User increases 10Gi PVC to 100Gi by changing - `pvc.spec.resources.requests["storage"] = "100Gi"`.
175140
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
176141
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi`.
177-
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
142+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ControllerExpansionInProgress`.
178143
- Expansion to 100Gi fails and hence `pv.Spec.Capacity` and `pvc.Status.Capacity `stays at 10Gi.
179-
- Expansion controller sets `pvc.Status.ResizeStatus` to `ResizeFailedOnController`.
144+
- Expansion controller sets `pvc.Status.ResizeStatus` to `ControllerExpansionFailed`.
180145
- User requests size to 20Gi.
181-
- Expansion controller tries expanding the PVC/PV to `20Gi` size.
182146
- Expansion controler notices that previous expansion to last known `allocatedresources` failed, so it sets new `allocatedResources` to `20G`
183147
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report new size as `20Gi`.
184148
- Quota controller sees a reduction in used quota because `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` is 20Gi.
@@ -187,46 +151,51 @@ if pvcStatusCap.Cmp(pvSpecCap) < 0 {
187151
- User increases 10Gi PVC to 100Gi by changing - `pvc.spec.resources.requests["storage"] = "100Gi"`
188152
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
189153
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi`.
190-
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
154+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ControllerExpansionInProgress`.
191155
- Since expansion operations in control-plane are NO-OP, expansion in control-plane succeeds and `pv.Spec` is set to `100G`.
192-
- Expansion starts on the node and it fails on the node.
193-
- Kubelet sets `pvc.Status.ResizeStatus` to `ResizeFailedOnNode`.
156+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `NodeExpansionPending`.
157+
- Expansion starts on the node and kubelet sets `pvc.Status.ResizeStatus` to `NodeExpansionInProgress`.
158+
- Expansion fails on the node with a final error.
159+
- Kubelet sets `pvc.Status.ResizeStatus` to `NodeExpansionFailed`.
160+
- Since pvc has `pvc.Status.ResizeStatus` set to `NodeExpansionFailed` - kubelet will stop retrying node expansion.
161+
- At this point Kubelet will wait for `pvc.Status.ResizeStatus` to be `NodeExpansionPending`.
194162
- User requests size to 20Gi.
195-
- Kubelet notices that `pvc.Status.AllocatedResources > pvc.Spec.Resources` (that means user shrunk the volume) and last resize on node failed, it must not retry expansion to old `allocatedResources`.
196-
- At this point Kubelet will wait for `pvc.Status.AllocatedResources` to be updated.
197163
- Expansion controller starts expanding the volume and sees that last expansion failed on the node and driver does not have control-plane expansion.
198164
- Expansion controller sets `pvc.Status.AllocatedResources` to `20G`.
199-
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
165+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ControllerExpansionInProgress`.
200166
- Since expansion operations in control-plane are NO-OP, expansion in control-plane succeeds and `pv.Spec` is set to `20G`.
201167
- Expansion succeed on the node with latest `allocatedResources` and `pvc.Status.Size` is set to `20G`.
168+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `NodeExpansionPending`.
169+
- Kubelet can now retry expansion and expansion on node succeeds.
170+
- Kubelet sets `pvc.Status.ResizeStatus` to empty string and `pvc.Status.Capacity` to new value.
202171
- Quota controller sees a reduction in used quota because `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` is 20Gi.
203172

204173

205174
##### Case 3 (Malicious user)
206175
- User increases 10Gi PVC to 100Gi by changing `pvc.spec.resources.requests["storage"] = "100Gi"`
207176
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
208177
- Expansion controller slowly starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi` (before expanding).
209-
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
178+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ControllerExpansionInProgress`.
210179
- At this point -`pv.Spec.Capacity` and `pvc.Status.Capacity` stays at 10Gi until the resize is finished.
211180
- While the storage backend is re-sizing the volume, user requests size 20Gi by changing `pvc.spec.resources.requests["storage"] = "20Gi"`
212181
- Expansion controller notices that previous expansion to last known `allocatedresources` is still in-progress.
213-
- Expansion controller starts expanding the volume but `allocatedResources` stays at `100G`.
182+
- Expansion controller keeps expanding the volume to `allocatedResources`.
214183
- Expansion to `100G` succeeds and `pv.Spec.Capacity` and `pvc.Status.Capacity` report new size as `100G`.
215184
- Although `pvc.Spec.Resources` reports size as `20G`, expansion to `20G` is never attempted.
216185
- Quota controller sees no change in storage usage by the PVC because `pvc.Status.AllocatedResources` is 100Gi.
217186

218-
#### Case 4(Malicious User and rounding to GB/GiB bounaries)
187+
##### Case 4(Malicious User and rounding to GB/GiB bounaries)
219188
- User requests a PVC of 10.1GB but underlying storage provides provisions a volume of 11GB after rounding.
220189
- Size recorded in `pvc.Status.Capacity` and `pv.Spec.Capacity` is however still 10.1GB.
221190
- User expands expands the PVC to 100GB.
222191
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `89.9GB` to used quota.
223192
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100GB` (before expanding).
224-
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
193+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ControllerExpansionInProgress`.
225194
- At this point -`pv.Spec.Capacity` and `pvc.Status.Capacity` stays at 10.1GB until the resize is finished.
226195
- while resize was in progress - expansion controler crashes and loses state.
227196
- User reduces the size of PVC to 10.5GB.
228197
- Expansion controller notices that previous expansion to last known `allocatedresources` is still in-progress.
229-
- Expansion controller starts expanding the volume but `allocatedResources` stays at `100G`.
198+
- Expansion controller starts expanding the volume to last known `allocatedResources` (which is `100GB`).
230199
- Expansion to `100G` succeeds and `pv.Spec.Capacity` and `pvc.Status.Capacity` report new size as `100G`.
231200
- Although `pvc.Spec.Resources` reports size as `10.5GB`, expansion to `10.5GB` is never attempted.
232201
- Quota controller sees no change in storage usage by the PVC because `pvc.Status.AllocatedResources` is 100Gi.
@@ -238,9 +207,9 @@ if pvcStatusCap.Cmp(pvSpecCap) < 0 {
238207

239208
## Graduation Criteria
240209

241-
* *Alpha* in 1.22 behind `RecoverExpansionFailure` feature gate with set to a default of `false`.
242-
* *Beta* in 1.23: Since this feature is part of general `ExpandPersistentVolumes` feature which is in beta, we are going to move this to beta with enhanced e2e and more stability improvements.
243-
* *GA* in 1.25 along with `ExpandPersistentvolumes` feature. The list of issues for volume expansion going GA can be found at - https://github.com/orgs/kubernetes-csi/projects/12.
210+
* *Alpha* in 1.23 behind `RecoverExpansionFailure` feature gate with set to a default of `false`.
211+
* *Beta* in 1.24: Since this feature is part of general `ExpandPersistentVolumes` feature which is in beta, we are going to move this to beta with enhanced e2e and more stability improvements.
212+
* *GA* in 1.26 along with `ExpandPersistentvolumes` feature. The list of issues for volume expansion going GA can be found at - https://github.com/orgs/kubernetes-csi/projects/12.
244213

245214
### Test Plan
246215

@@ -259,7 +228,7 @@ if pvcStatusCap.Cmp(pvSpecCap) < 0 {
259228
* **How can this feature be enabled / disabled in a live cluster?**
260229
- [x] Feature gate (also fill in values in `kep.yaml`)
261230
- Feature gate name: `RecoverExpansionFailure`
262-
- Components depending on the feature gate: kube-apiserver, kube-controller-manager, csi-external-resizer
231+
- Components depending on the feature gate: kube-apiserver, kube-controller-manager, csi-external-resizer, kubelet
263232
- [ ] Other
264233
- Describe the mechanism:
265234
- Will enabling / disabling the feature require downtime of the control
@@ -272,6 +241,10 @@ if pvcStatusCap.Cmp(pvSpecCap) < 0 {
272241
so it should not break any of existing automation. This means that if `pvc.Status.AllocatedResources` is available it will be
273242
used for calculating quota.
274243

244+
One more thing to keep in mind is - enabling this feature in kubelet while keeping it disabled in external-resizer will cause
245+
all volume expansions operations to get stuck(similar thing will happen when feature moves to beta and kubelet is newer but external-resizer sidecar is older).
246+
This limitation will be documented in external-resizer.
247+
275248
* **Can the feature be disabled once it has been enabled (i.e. can we rollback
276249
the enablement)?**
277250
Yes the feature gate can be disabled once enabled. However quota resources present in the cluster will be based off a field(`pvc.Status.AllocatedResources`) which is no longer updated.

keps/sig-storage/1790-recover-resize-failure/control_plane_expansion.svg

Lines changed: 1 addition & 0 deletions
Loading
Binary file not shown.

0 commit comments

Comments
 (0)