You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
@@ -54,10 +53,9 @@ expand the volume and keeps failing. We want to make it easier for users to reco
54
53
expansion failures, so that user can retry volume expansion with values that may succeed.
55
54
56
55
To enable recovery, this proposal relaxes API validation for PVC objects to allow reduction in
57
-
requested size. This KEP also adds a new field called `pvc.Spec.AllocatedResources` to allow resource quota
56
+
requested size. This KEP also adds a new field called `pvc.Status.AllocatedResources` to allow resource quota
58
57
to be tracked accurately if user reduces size of their PVCs.
59
58
60
-
61
59
## Motivation
62
60
63
61
Sometimes a user may expand a PVC to a size which may not be supported by
@@ -71,11 +69,12 @@ this feature to abuse the quota system. They can do so by rapidly
71
69
expanding and then shrinking the PVC. In general - both CSI
72
70
and in-tree volume plugins have been designed to never perform actual
73
71
shrink on underlying volume, but it can fool the quota system in believing
74
-
the user is using less storage than he/she actually is. To solve this
75
-
problem - we propose that although users are allowed to reduce size of their
72
+
the user is using less storage than he/she actually is.
73
+
74
+
To solve this problem - we propose that although users are allowed to reduce size of their
76
75
PVC (as long as requested size >= `pvc.Status.Capacity`), quota calculation
77
-
will only use previously requested higher value. This is covered in more
78
-
detail in [Implementation](#implementation).
76
+
will use `max(pvc.Spec.Capacity, pvc.Status.AllocatedResources)` and reduction in `pvc.Status.AllocatedResources`
77
+
is only done by resize-controller after verifying volume size using `GET_VOLUME` csi RPC call.
79
78
80
79
81
80
### Goals
@@ -92,8 +91,8 @@ detail in [Implementation](#implementation).
92
91
As part of this proposal, we are mainly proposing three changes:
93
92
94
93
1. Relax API validation on PVC update so as reducing `pvc.Spec.Resoures` is allowed, as long as requested size is `>=pvc.Status.Capacity`.
95
-
2. Add a new field in PVC API called - `pvc.Spec.AllocatedResources` for the purpose of tracking used storage size. This field is only allowed to increase.
96
-
3. Update quota code to use `max(pvc.Spec.Resources, pvc.Spec.AllocatedResources)` when evaluating usage for PVC.
94
+
2. Add a new field in PVC API called - `pvc.Status.AllocatedResources` for the purpose of tracking used storage size. By default only api-server or resize-controller can set this field.
95
+
3. Update quota code to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` when evaluating usage for PVC.
97
96
98
97
### Implementation
99
98
@@ -102,64 +101,67 @@ reduces the PVC request size, for both CSI and in-tree plugins they are designed
102
101
103
102
We however do have a problem with quota calculation because if a previously issued expansion is successful but is not recorded(or partially recorded) in api-server and user reduces requested size of the PVC, then quota controller will assume it as actual shrinking of volume and reduce used storage size by the user(incorrectly). Since we know actual size of the volume only after performing expansion(either on node or controller), allowing quota to be reduced on PVC size reduction will allow an user to abuse the quota system.
104
103
105
-
To solve aforementioned problem - we propose that, a new field will be added to PVC, called `pvc.Spec.AllocatedResources`. This field is only allowed to increase and will be set by the api-server to value in `pvc.Spec.Resources` as long as `pvc.Spec.Resources > pvc.Spec.AllocatedResources`. Quota controller code will be updated to use `max(pvc.Spec.Resources, pvc.Spec.AllocatedResources)` when calculating usage of the PVC under questions.
104
+
To solve aforementioned problem - we propose that, a new field will be added to PVC, called `pvc.Status.AllocatedResources`. When a PVC is created - this field defaults to `pvc.Spec.Resources` but when user expands the PVC,
105
+
and when expansion-controller starts volume expansion - it will set `pvc.Status.AllocatedResources` to user requested value in `pvc.Spec.Resources` before performing expansion. The quota calculation will be updated to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` which will ensure that abusing quota will not be possible.
106
106
107
-
#### On using pvc.Spec.AllocatedResources
107
+
When user reduces `pvc.Spec.Resources`, expansion-controller will set `pvc.Status.AllocatedResources` to lower value (thereby giving quota back to the user) - only if current actual size of volume is less than or equal to `pvc.Spec.Resources`. It will do that only after fetching current size of the volume by using `ControllerGetVolume` CSI RPC call.
108
108
109
-
We chose to use `pvc.Spec.AllocatedResources`for storing maximum of user requested pvc capacity because once user requests a new size in `pvc.Spec.Resources`even if she cancels the operation later on, resize-controller may have already started working on reconciling newly requested size.
109
+
If CSI driver does not have `GET_VOLUME` controller capability and `pvc.Spec.Resources` < `pvc.Status.AllocatedResources`(i.e user is attempting to reduce size of a volume that expansion controller previously tried to expand) - then although expansion-controller will try volume expansion with value in `pvc.Spec.Resources`- it will not reduce reported value in `pvc.Status.AllocatedResources`, which will result in no quota being restored to the user. In other words - for CSI drivers that don't have `GET_VOLUME` controller capability - `pvc.Status.AllocatedResources` will report highest requested value and reducing `pvc.Spec.Resources` will not result in reduction of used quota.
110
110
111
-
AllocatedResources is not volume size but more like whatever user has requested and towards which resize-controller was working to reconcile. It is possible that user has requested smaller size since then but that does not changes the fact that resize-controller has already tried to expand to AllocatedResources and might have partially succeeded. So AllocatedResources is maximum user requested size for this volume and does not reflect actual volume size of the PV and hence is not recorded in `pvc.Status.Capacity`.
111
+
#### User flow stories
112
112
113
-
This also falls inline with how in-place update of Pod is handled(https://github.com/kubernetes/enhancements/pull/1342/files) and allocated resources is recorded in `pod.Spec.Containers[i].ResourcesAllocated`.
114
113
115
-
#### User flow stories
114
+
##### Case 0 (default PVC creation):
115
+
- User creates a 10Gi PVC by setting - `pvc.spec.resources.requests["storage"] = "10Gi"`.
116
+
- API server sets `pvc.Status.AllocatedResources` to "10Gi" on creation.
116
117
117
118
###### Case 1 (controller+node expandable):
118
-
119
119
- User increases 10Gi PVC to 100Gi by changing - `pvc.spec.resources.requests["storage"] = "100Gi"`.
120
-
- API Server sets `pvc.spec.allocatedResources` to 100Gi via pvc storage strategy.
121
-
- Quota controller uses `max(pvc.Spec.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
120
+
-`pvc.Status.AllocatedResources` still reports `10Gi`.
121
+
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
122
+
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi`.
122
123
- Expansion to 100Gi fails and hence `pv.Spec.Capacity` and `pvc.Status.Capacity `stays at 10Gi.
123
124
- User requests size to 20Gi.
124
-
- Quota controller sees no change in storage usage by the PVC because `pvc.Spec.AllocatedResources` is `100Gi`.
125
+
- Expansion controller notices that `pvc.Spec.Resources` < `pvc.Status.AllocatedResources` (meaning user tried to reduce size).
126
+
- Expansion controller fetches actual size of the volume using `ControllerGetVolume` CSI RPC call.
127
+
- Expansion controller sees that `actual_volume_size`(10Gi) < `pvc.Spec.Resources` (20Gi), so it sets `pvc.Status.AllocatedResources` to 20Gi.
128
+
- Expansion controller continues with rest of the expansion flow.
129
+
- Quota controller sees a reduction in used quota because `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` is 20Gi.
125
130
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report new size as `20Gi`.
- User increases 10Gi PVC to 100Gi by changing `pvc.spec.resources.requests["storage"] = "100Gi"`
147
-
- API server sets `pvc.spec.allocatedResources` to 100Gi via PVC storage strategy.
148
-
- Quota controller uses `max(pvc.Spec.AllocatedResources, pvc.Spec.Resources)` and adds 90Gi to used quota. (100Gi is the total space taken by the PVC)
149
-
- Volume plugin starts slowly expanding to 100Gi. `pv.Spec.Capacity` and `pvc.Status.Capacity` stays at 10Gi until the resize is finished.
148
+
-`pvc.Status.AllocatedResources` still reports `10Gi`.
149
+
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
150
+
- Expansion controller slow starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi` (before expanding).
151
+
- At this point -`pv.Spec.Capacity` and `pvc.Status.Capacity` stays at 10Gi until the resize is finished.
150
152
- While the storage backend is re-sizing the volume, user requests size 20Gi by changing `pvc.spec.resources.requests["storage"] = "20Gi"`
151
-
- Quota controller sees no change in storage usage by the PVC because `pvc.Spec.AllocatedResources` is 100Gi.
152
-
- Expansion succeeds and pvc.Status.Capacity and pv.Spec.Capacity report new size as 100Gi, as that's what the volume plugin did.
153
+
- Quota controller sees no change in storage usage by the PVC because `pvc.Status.AllocatedResources` is 100Gi.
154
+
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report new size as `100Gi`, as that's what the volume plugin did.
153
155
154
156
### Risks and Mitigations
155
157
156
-
- One risk as mentioned above is, if expansion failed and user retried expansion(successfully) with smaller value, the quota code will keep reporting higher value. In practice though - this should be acceptable since such expansion failures should be rare and admin can unblock the user by increasing the quota or rebuilding PVC if needed. We will emit events on PV and PVC to alerts admins and users.
158
+
- One risk as mentioned above is, if expansion failed and user retried expansion(successfully) with smaller value, the quota code will keep reporting higher value unless CSI driver in question has `GET_VOLUME` controller capability.
157
159
158
160
## Graduation Criteria
159
161
160
-
**Alpha* in 1.19 behind `RecoverExpansionFailure` feature gate with set to a default of `false`. The limitation about quota should be clearly documented.
161
-
**Beta* in 1.20: Since this feature is part of general `ExpandPersistentVolumes` feature which is in beta, we are going to move this to beta with confirmed production usage.
162
-
*GA in 1.21 along with `ExpandPersistentvolumes` feature. The list of issues for volume expansion going GA can be found at - https://github.com/orgs/kubernetes-csi/projects/12.
162
+
**Alpha* in 1.22 behind `RecoverExpansionFailure` feature gate with set to a default of `false`. The limitation about quota and CSI capability should be clearly documented.
163
+
**Beta* in 1.23: Since this feature is part of general `ExpandPersistentVolumes` feature which is in beta, we are going to move this to beta with confirmed production usage.
164
+
**GA* in 1.24 along with `ExpandPersistentvolumes` feature. The list of issues for volume expansion going GA can be found at - https://github.com/orgs/kubernetes-csi/projects/12.
163
165
164
166
### Test Plan
165
167
@@ -188,16 +190,16 @@ This also falls inline with how in-place update of Pod is handled(https://github
188
190
189
191
***Does enabling the feature change any default behavior?**
190
192
Allow users to reduce size of pvc in `pvc.spec.resources`. In general this was not permitted before
191
-
so it should not break any of existing automation. This means that if `pvc.Spec.AllocatedResources` is available it will be
193
+
so it should not break any of existing automation. This means that if `pvc.Status.AllocatedResources` is available it will be
192
194
used for calculating quota.
193
195
194
196
***Can the feature be disabled once it has been enabled (i.e. can we rollback
195
197
the enablement)?**
196
198
Yes the feature gate can be disabled once enabled. The quota code will use new field to calculate quota if available. This will allow quota code to use
197
-
`pvc.Spec.AllocatedResources` even if feature gate is disabled.
199
+
`pvc.Status.AllocatedResources` even if feature gate is disabled.
198
200
199
201
***What happens if we reenable the feature if it was previously rolled back?**
200
-
This KEP proposes a new field in pvc and it will be used if available for quota calculation. So when feature is rolled back, it will be possible to reduce pvc capacity and quota will use `pvc.Spec.AllocatedResources`.
202
+
This KEP proposes a new field in pvc and it will be used if available for quota calculation. So when feature is rolled back, it will be possible to reduce pvc capacity and quota will use `pvc.Status.AllocatedResources`.
201
203
202
204
***Are there any tests for feature enablement/disablement?**
203
205
The e2e framework does not currently support enabling and disabling feature
@@ -357,14 +359,6 @@ There were several alternatives considered before proposing this KEP.
357
359
358
360
`pvc.Status.Capacity` can't be used for tracking quota because pvc.Status.Capacity is calculated after binding happens, which could be when pod is started. This would allow an user to overcommit because quota won't reflect accurate value until PVC is bound to a PV.
359
361
360
-
### Fix Quota when resize succeeds with reduced size
361
-
362
-
The first thing we considered was doing the *right* thing and ensuring that quota always reflects correct volume size. So if user increases a 10GiB PVC to 1TB and that fails but user retries expansion by reducing the size to 100GiB and it succeeds then quota will report 100GiB rather than 1TB.
363
-
364
-
There are several problems with this approach though:
365
-
- Currently Kubernetes always calculates quota based on user requested size rather than actual volume size. So if a 10GiB PVC is bound to 100GiB PV, then used quota is 10GiB and hence special care must be taken when reconciling the quota with actual volume size because we could break existing conventions around quota.
366
-
- The second problem is - using actual volume size for computing quota results in too many problems because there are multiple sources of truth. For example - `ControllerExpandVolume` RPC call of CSI driver may report different size than `NodeExpandVolume` RPC call. So we must fix all existing CSI drivers to report sizes that are consistent. For example - `NodeExpandVolume` must return a size greater or equal to what `ControllerExpandVolume` returns, otherwise it could result in infinte expansion loop on node. The other problem is - `NodeExpandVolume` RPC call in CSI may not report size at all and hence CSI spec must be fixed.
367
-
368
362
### Allow admins to manually fix PVCs which are stuck in resizing condition
369
363
370
364
We also considered if it is worth leaving this problem as it is and allow admins to fix it manually since the problem that this KEP fixes is not very frequent(there have not been any reports from end-users about this bug). But there are couple of problems with this:
0 commit comments