You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
@@ -77,7 +79,7 @@ the user is using less storage than he/she actually is.
77
79
To solve this problem - we propose that although users are allowed to reduce size of their
78
80
PVC (as long as requested size > `pvc.Status.Capacity`), quota calculation
79
81
will use `max(pvc.Spec.Capacity, pvc.Status.AllocatedResources)` and reduction in `pvc.Status.AllocatedResources`
80
-
is only done by resize-controller after verifying volume size using `ControllerGetVolume` csi RPC call.
82
+
is only done by resize-controller after previously issued expansion has failed with some kind of terminal failure.
81
83
82
84
### Goals
83
85
@@ -94,6 +96,7 @@ As part of this proposal, we are mainly proposing three changes:
94
96
95
97
1. Relax API validation on PVC update so as reducing `pvc.Spec.Resoures` is allowed, as long as requested size is `>pvc.Status.Capacity`.
96
98
2. Add a new field in PVC API called - `pvc.Status.AllocatedResources` for the purpose of tracking used storage size. By default only api-server or resize-controller can set this field.
99
+
3. Add a new field in PVC API called - `pvc.Status.ResizeStatus` for purpose of tracking status of volume expansion.
97
100
3. Update quota code to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` when evaluating usage for PVC.
98
101
99
102
### Implementation
@@ -103,73 +106,139 @@ reduces the PVC request size, for both CSI and in-tree plugins they are designed
103
106
104
107
We however do have a problem with quota calculation because if a previously issued expansion is successful but is not recorded(or partially recorded) in api-server and user reduces requested size of the PVC, then quota controller will assume it as actual shrinking of volume and reduce used storage size by the user(incorrectly). Since we know actual size of the volume only after performing expansion(either on node or controller), allowing quota to be reduced on PVC size reduction will allow an user to abuse the quota system.
105
108
106
-
To solve aforementioned problem - we propose that, a new field will be added to PVC, called `pvc.Status.AllocatedResources`. When a PVC is created - this field defaults to `pvc.Spec.Resources` but when user expands the PVC,
107
-
and when expansion-controller starts volume expansion - it will set `pvc.Status.AllocatedResources` to user requested value in `pvc.Spec.Resources` before performing expansion. The quota calculation will be updated to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` which will ensure that abusing quota will not be possible.
108
-
109
-
When user reduces `pvc.Spec.Resources`, expansion-controller will set `pvc.Status.AllocatedResources` to lower value (thereby giving quota back to the user) - only if current actual size of volume is less than or equal to `pvc.Spec.Resources` after entire control-plane and node side expansion is successfully finished. It will fetch actual size of the volume by using `ControllerGetVolume` CSI RPC call. It is possible to track completion of resizing operation in external-resizer via function - https://github.com/kubernetes-csi/external-resizer/blob/master/pkg/controller/controller.go#L394.
110
-
111
-
If CSI driver does not have `GET_VOLUME` controller capability(or `ControllerGetVolume` does not report volume size) and `pvc.Spec.Resources` < `pvc.Status.AllocatedResources` (i.e user is attempting to reduce size of a volume that expansion controller previously tried to expand) - then although expansion-controller will try volume expansion with value in `pvc.Spec.Resources` - it will not reduce reported value in `pvc.Status.AllocatedResources`, which will result in no quota being restored to the user. In other words - for CSI drivers that don't have `GET_VOLUME` controller capability - `pvc.Status.AllocatedResources` will report highest requested value and reducing `pvc.Spec.Resources` will not result in reduction of used quota.
112
-
113
-
*Note:* This proposal expects that users can not modify `pvc.Status` directly and cheat quota system. In general this should be fine because users should not have access to edit status of almost anything. Link to discussion on slack - https://kubernetes.slack.com/archives/CJUQN3E4T/p1620059624022100
114
-
115
-
*Note:* It is expected that `ControllerGetVolume` RPC call returns disk size not filesystem size of the volume. We will ensure that there are sanity test around this and this distinction is made more clear in the CSI spec.
109
+
To solve aforementioned problem - we propose that, a new field will be added to PVC, called `pvc.Status.AllocatedResources`. When user expands the PVC, and when expansion-controller starts volume expansion - it will set `pvc.Status.AllocatedResources` to user requested value in `pvc.Spec.Resources` before performing expansion. The quota calculation will be updated to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` which will ensure that abusing quota will not be possible.
110
+
111
+
Resizing operation will always work towards full-filling size recorded in `pvc.Status.AllocatedResources` and only when previous operation has finished(i.e `pvc.Status.ResizeStatus` is nil) or when previous operation has failed with a terminal error - it will use new user requested value from `pvc.Spec.Resources`.
112
+
113
+
When user reduces `pvc.Spec.Resources`, expansion-controller will set `pvc.Status.AllocatedResources` to lower value only if:
114
+
1. If `pvc.Status.ResizeStatus` is `ResizeFailedOnController` (indicating that previous expansion to last known `allocatedResources` failed with a final error).
115
+
2. If `pvc.Status.ResizeStatus` is `ResizeFailedOnNode` and SP supports node-only expansion (indicating that previous expansion to last known `allocatedResources` failed on node with a final error).
116
+
117
+
One key exception is - `pvc.Status.Allocatedresources` will not be lowered if expansion succeeded on the controller but failed on the node with a final error. The implementation of new `allocatedResources` will look like:
// If resize was inprogress before this means - we should keep trying to finish
132
+
// previously started expansion to last known `allocatedSize`.
133
+
if allocatedSize != nil {
134
+
newSize = allocatedSize
135
+
}
136
+
case v1.PersistentVolumeClaimResizeFailedOnController:
137
+
// if expansion failed on controller with a final error, then we can safely use new user requested size
138
+
newSize = pvc.Spec.Resources.Requests.Storage()
139
+
case v1.PersistentVolumeClaimResizeFailedOnNode:
140
+
if newSize.Cmp(*allocatedSize) > 0 {
141
+
newSize = pvc.Spec.Resources.Requests.Storage()
142
+
} else {
143
+
// newSize is smaller than whatever we tried to expand before
144
+
// and if expansion failed on the node, but it succeeded in control-plane
145
+
// then it is not safe to use lower value from pvc.Spec
146
+
if ctrl.resizer.SupportsControllerExpansion() {
147
+
newSize = allocatedSize
148
+
}
149
+
}
150
+
151
+
}
152
+
return *newSize
153
+
}
154
+
```
155
+
156
+
One more key design choice is - once expansion on node failed and user requested lower size (by modifying `pvc.Spec.Resources`) - kubelet must not retry expanding the same volume to previous `allocatedResources`, because external-resizer will consider `ResizeFailedOnNode` state as a terminal error and allow lowering of `allocatedResources` - and at the same time kubelet will try to expand the volume to higher `allocatedResources`. This will be solved in kubelet by following code:
157
+
158
+
```golang
159
+
// read latest version of PVC from api-server
160
+
if pvcStatusCap.Cmp(pvSpecCap) < 0 {
161
+
// this is the if block where kubelet performs expansion
162
+
if (pvc.Spec.Resources.Storage < pvc.Status.AllocatedResources ) && pvc.Status.ResizeStatus == ResizeFailedOnNode {
163
+
returnfalse, nil
164
+
}
165
+
}
166
+
```
116
167
117
168
#### User flow stories
118
169
119
170
##### Case 0 (default PVC creation):
120
171
- User creates a 10Gi PVC by setting - `pvc.spec.resources.requests["storage"] = "10Gi"`.
121
-
- API server sets `pvc.Status.AllocatedResources` to "10Gi" on creation.
122
172
123
173
##### Case 1 (controller+node expandable):
124
174
- User increases 10Gi PVC to 100Gi by changing - `pvc.spec.resources.requests["storage"] = "100Gi"`.
125
-
-`pvc.Status.AllocatedResources` still reports `10Gi`.
126
175
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
127
176
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi`.
177
+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
128
178
- Expansion to 100Gi fails and hence `pv.Spec.Capacity` and `pvc.Status.Capacity `stays at 10Gi.
179
+
- Expansion controller sets `pvc.Status.ResizeStatus` to `ResizeFailedOnController`.
129
180
- User requests size to 20Gi.
130
181
- Expansion controller tries expanding the PVC/PV to `20Gi` size.
182
+
- Expansion controler notices that previous expansion to last known `allocatedresources` failed, so it sets new `allocatedResources` to `20G`
131
183
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report new size as `20Gi`.
132
-
- After completion, expansion controller notices that `pvc.Spec.Resources` < `pvc.Status.AllocatedResources` (meaning user tried to reduce size).
133
-
- Expansion controller fetches actual size of the volume using `ControllerGetVolume` CSI RPC call.
134
-
- Expansion controller sees that `actual_volume_size`(20Gi) < `pvc.Status.AllocatedResources` (100Gi), so it sets `pvc.Status.AllocatedResources` to 20Gi.
135
184
- Quota controller sees a reduction in used quota because `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` is 20Gi.
136
185
137
-
##### Case 2 (controller+node expandable with no GET_VOLUME capability):
186
+
##### Case 2 (node only expandable volume):
138
187
- User increases 10Gi PVC to 100Gi by changing - `pvc.spec.resources.requests["storage"] = "100Gi"`
139
-
-`pvc.Status.AllocatedResources` still reports `10Gi`.
140
188
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
141
189
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi`.
142
-
- Expansion to 100Gi fails and hence `pv.Spec.Capacity` and `pvc.Status.Capacity `stays at 10Gi.
190
+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
191
+
- Since expansion operations in control-plane are NO-OP, expansion in control-plane succeeds and `pv.Spec` is set to `100G`.
192
+
- Expansion starts on the node and it fails on the node.
193
+
- Kubelet sets `pvc.Status.ResizeStatus` to `ResizeFailedOnNode`.
143
194
- User requests size to 20Gi.
144
-
- Expansion controller tries expanding the PVC/PV to `20Gi` size.
145
-
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report new size as `20Gi`.
146
-
- Expansion controller notices that `pvc.Spec.Resources` < `pvc.Status.AllocatedResources` (meaning user tried to reduce size).
147
-
- Expansion controller sees that CSI driver does not have `GET_VOLUME` controller capability.
148
-
-`pvc.Status.AllocatedResources` still reports `100Gi` and hence there is no change in used quota.
195
+
- Kubelet notices that `pvc.Status.AllocatedResources > pvc.Spec.Resources` (that means user shrunk the volume) and last resize on node failed, it must not retry expansion to old `allocatedResources`.
196
+
- At this point Kubelet will wait for `pvc.Status.AllocatedResources` to be updated.
197
+
- Expansion controller starts expanding the volume and sees that last expansion failed on the node and driver does not have control-plane expansion.
198
+
- Expansion controller sets `pvc.Status.AllocatedResources` to `20G`.
199
+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
200
+
- Since expansion operations in control-plane are NO-OP, expansion in control-plane succeeds and `pv.Spec` is set to `20G`.
201
+
- Expansion succeed on the node with latest `allocatedResources` and `pvc.Status.Size` is set to `20G`.
202
+
- Quota controller sees a reduction in used quota because `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` is 20Gi.
149
203
150
204
151
205
##### Case 3 (Malicious user)
152
206
- User increases 10Gi PVC to 100Gi by changing `pvc.spec.resources.requests["storage"] = "100Gi"`
153
-
-`pvc.Status.AllocatedResources` still reports `10Gi`.
154
207
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
155
208
- Expansion controller slowly starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi` (before expanding).
209
+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
156
210
- At this point -`pv.Spec.Capacity` and `pvc.Status.Capacity` stays at 10Gi until the resize is finished.
157
211
- While the storage backend is re-sizing the volume, user requests size 20Gi by changing `pvc.spec.resources.requests["storage"] = "20Gi"`
158
-
- Expansion succeeds previous expansion and `pvc.Status.Capacity` and `pv.Spec.Capacity` report new size as `100Gi`.
159
-
- Expansion controller notices that `pvc.Spec.Resources` < `pvc.Status.AllocatedResources` (meaning user tried to reduce size).
160
-
- Expansion controller fetches actual size of the volume using `ControllerGetVolume` CSI RPC call.
161
-
- Expansion controller sees that `actual_volume_size`(100Gi) >= `pvc.Status.AllocatedResources` (100Gi), so it does not make any modifictions to `pvc.Status.AllocatedResources`.
212
+
- Expansion controller notices that previous expansion to last known `allocatedresources` is still in-progress.
213
+
- Expansion controller starts expanding the volume but `allocatedResources` stays at `100G`.
214
+
- Expansion to `100G` succeeds and `pv.Spec.Capacity` and `pvc.Status.Capacity` report new size as `100G`.
215
+
- Although `pvc.Spec.Resources` reports size as `20G`, expansion to `20G` is never attempted.
216
+
- Quota controller sees no change in storage usage by the PVC because `pvc.Status.AllocatedResources` is 100Gi.
217
+
218
+
#### Case 4(Malicious User and rounding to GB/GiB bounaries)
219
+
- User requests a PVC of 10.1GB but underlying storage provides provisions a volume of 11GB after rounding.
220
+
- Size recorded in `pvc.Status.Capacity` and `pv.Spec.Capacity` is however still 10.1GB.
221
+
- User expands expands the PVC to 100GB.
222
+
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `89.9GB` to used quota.
223
+
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100GB` (before expanding).
224
+
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ResizeInProgress`.
225
+
- At this point -`pv.Spec.Capacity` and `pvc.Status.Capacity` stays at 10.1GB until the resize is finished.
226
+
- while resize was in progress - expansion controler crashes and loses state.
227
+
- User reduces the size of PVC to 10.5GB.
228
+
- Expansion controller notices that previous expansion to last known `allocatedresources` is still in-progress.
229
+
- Expansion controller starts expanding the volume but `allocatedResources` stays at `100G`.
230
+
- Expansion to `100G` succeeds and `pv.Spec.Capacity` and `pvc.Status.Capacity` report new size as `100G`.
231
+
- Although `pvc.Spec.Resources` reports size as `10.5GB`, expansion to `10.5GB` is never attempted.
162
232
- Quota controller sees no change in storage usage by the PVC because `pvc.Status.AllocatedResources` is 100Gi.
163
233
164
234
### Risks and Mitigations
165
235
166
236
- Once expansion is initiated, the lowering of requested size is only allowed upto a value *greater* than `pvc.Status`. It is not possible to entirely go back to previously requested size. This should not be a problem however in-practice because user can retry expansion with slightly higher value than `pvc.Status` and still recover from previously failing expansion request.
167
-
- One risk as mentioned above is, if expansion failed and user retried expansion(successfully) with smaller value, the quota code will keep reporting higher value unless CSI driver in question has `GET_VOLUME` controller capability and `ControllerGetVolume` actually reports real size of the underlying volume.
168
237
169
238
170
239
## Graduation Criteria
171
240
172
-
**Alpha* in 1.22 behind `RecoverExpansionFailure` feature gate with set to a default of `false`. The limitation about quota and CSI capability should be clearly documented.
241
+
**Alpha* in 1.22 behind `RecoverExpansionFailure` feature gate with set to a default of `false`.
173
242
**Beta* in 1.23: Since this feature is part of general `ExpandPersistentVolumes` feature which is in beta, we are going to move this to beta with enhanced e2e and more stability improvements.
174
243
**GA* in 1.25 along with `ExpandPersistentvolumes` feature. The list of issues for volume expansion going GA can be found at - https://github.com/orgs/kubernetes-csi/projects/12.
175
244
@@ -360,7 +429,7 @@ _This section must be completed when targeting beta graduation to a release._
360
429
361
430
## Drawbacks
362
431
363
-
- One drawback is while this KEP allows an user to reduce requested PVC size, it does not automatically reduce quota.
432
+
- One drawback is while this KEP allows an user to reduce requested PVC size, it does not allow reduction in size all the way to same value as original size recorded in `pvc.Status.Cap`. The reason is currently resize controller and kubelet work on a volume one after the other and we need special logic in both kubelet and control-plane to process reduction in size all the way upto original value. This is somewhat racy and we need a signaling mechanism between control-plane and kubelet when `pvc.Status.AllocatedResources` needs to be adjusted. We plan to revisit and address this in next release.
364
433
365
434
366
435
## Alternatives
@@ -378,4 +447,9 @@ We also considered if it is worth leaving this problem as it is and allow admins
378
447
- A workflow that depends on restarting the pods after resizing is successful - can't complete if resizing is forever stuck. This is the case with current design of statefulset expansion KEP.
379
448
- If a storage type only supports node expansion and it keeps failing then PVC could become unusable and prevent its usage.
380
449
450
+
### Solving limitation of allowing restore all the way to original size
451
+
452
+
As mentioned above - one limitation of this KEP is, users can only recover upto a size greater than previous value recorded in `pvc.Status.Cap`.
0 commit comments