Skip to content

Commit f70342e

Browse files
committed
Fix recover from resizing kep based on pod resizing design
1 parent 2340226 commit f70342e

File tree

1 file changed

+48
-54
lines changed
  • keps/sig-storage/1790-recover-resize-failure

1 file changed

+48
-54
lines changed

keps/sig-storage/1790-recover-resize-failure/README.md

Lines changed: 48 additions & 54 deletions
Original file line numberDiff line numberDiff line change
@@ -10,8 +10,8 @@
1010
- [Non-Goals](#non-goals)
1111
- [Proposal](#proposal)
1212
- [Implementation](#implementation)
13-
- [On using pvc.Spec.AllocatedResources](#on-using-pvcspecallocatedresources)
1413
- [User flow stories](#user-flow-stories)
14+
- [Case 0 (default PVC creation):](#case-0-default-pvc-creation)
1515
- [Risks and Mitigations](#risks-and-mitigations)
1616
- [Graduation Criteria](#graduation-criteria)
1717
- [Test Plan](#test-plan)
@@ -27,7 +27,6 @@
2727
- [Drawbacks](#drawbacks)
2828
- [Alternatives](#alternatives)
2929
- [Why not use pvc.Status.Capacity for tracking quota?](#why-not-use-pvcstatuscapacity-for-tracking-quota)
30-
- [Fix Quota when resize succeeds with reduced size](#fix-quota-when-resize-succeeds-with-reduced-size)
3130
- [Allow admins to manually fix PVCs which are stuck in resizing condition](#allow-admins-to-manually-fix-pvcs-which-are-stuck-in-resizing-condition)
3231
- [Infrastructure Needed [optional]](#infrastructure-needed-optional)
3332
<!-- /toc -->
@@ -54,10 +53,9 @@ expand the volume and keeps failing. We want to make it easier for users to reco
5453
expansion failures, so that user can retry volume expansion with values that may succeed.
5554

5655
To enable recovery, this proposal relaxes API validation for PVC objects to allow reduction in
57-
requested size. This KEP also adds a new field called `pvc.Spec.AllocatedResources` to allow resource quota
56+
requested size. This KEP also adds a new field called `pvc.Status.AllocatedResources` to allow resource quota
5857
to be tracked accurately if user reduces size of their PVCs.
5958

60-
6159
## Motivation
6260

6361
Sometimes a user may expand a PVC to a size which may not be supported by
@@ -71,11 +69,12 @@ this feature to abuse the quota system. They can do so by rapidly
7169
expanding and then shrinking the PVC. In general - both CSI
7270
and in-tree volume plugins have been designed to never perform actual
7371
shrink on underlying volume, but it can fool the quota system in believing
74-
the user is using less storage than he/she actually is. To solve this
75-
problem - we propose that although users are allowed to reduce size of their
72+
the user is using less storage than he/she actually is.
73+
74+
To solve this problem - we propose that although users are allowed to reduce size of their
7675
PVC (as long as requested size >= `pvc.Status.Capacity`), quota calculation
77-
will only use previously requested higher value. This is covered in more
78-
detail in [Implementation](#implementation).
76+
will use `max(pvc.Spec.Capacity, pvc.Status.AllocatedResources)` and reduction in `pvc.Status.AllocatedResources`
77+
is only done by resize-controller after verifying volume size using `GET_VOLUME` csi RPC call.
7978

8079

8180
### Goals
@@ -92,8 +91,8 @@ detail in [Implementation](#implementation).
9291
As part of this proposal, we are mainly proposing three changes:
9392

9493
1. Relax API validation on PVC update so as reducing `pvc.Spec.Resoures` is allowed, as long as requested size is `>=pvc.Status.Capacity`.
95-
2. Add a new field in PVC API called - `pvc.Spec.AllocatedResources` for the purpose of tracking used storage size. This field is only allowed to increase.
96-
3. Update quota code to use `max(pvc.Spec.Resources, pvc.Spec.AllocatedResources)` when evaluating usage for PVC.
94+
2. Add a new field in PVC API called - `pvc.Status.AllocatedResources` for the purpose of tracking used storage size. By default only api-server or resize-controller can set this field.
95+
3. Update quota code to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` when evaluating usage for PVC.
9796

9897
### Implementation
9998

@@ -102,64 +101,67 @@ reduces the PVC request size, for both CSI and in-tree plugins they are designed
102101

103102
We however do have a problem with quota calculation because if a previously issued expansion is successful but is not recorded(or partially recorded) in api-server and user reduces requested size of the PVC, then quota controller will assume it as actual shrinking of volume and reduce used storage size by the user(incorrectly). Since we know actual size of the volume only after performing expansion(either on node or controller), allowing quota to be reduced on PVC size reduction will allow an user to abuse the quota system.
104103

105-
To solve aforementioned problem - we propose that, a new field will be added to PVC, called `pvc.Spec.AllocatedResources`. This field is only allowed to increase and will be set by the api-server to value in `pvc.Spec.Resources` as long as `pvc.Spec.Resources > pvc.Spec.AllocatedResources`. Quota controller code will be updated to use `max(pvc.Spec.Resources, pvc.Spec.AllocatedResources)` when calculating usage of the PVC under questions.
104+
To solve aforementioned problem - we propose that, a new field will be added to PVC, called `pvc.Status.AllocatedResources`. When a PVC is created - this field defaults to `pvc.Spec.Resources` but when user expands the PVC,
105+
and when expansion-controller starts volume expansion - it will set `pvc.Status.AllocatedResources` to user requested value in `pvc.Spec.Resources` before performing expansion. The quota calculation will be updated to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` which will ensure that abusing quota will not be possible.
106106

107-
#### On using pvc.Spec.AllocatedResources
107+
When user reduces `pvc.Spec.Resources`, expansion-controller will set `pvc.Status.AllocatedResources` to lower value (thereby giving quota back to the user) - only if current actual size of volume is less than or equal to `pvc.Spec.Resources`. It will do that only after fetching current size of the volume by using `ControllerGetVolume` CSI RPC call.
108108

109-
We chose to use `pvc.Spec.AllocatedResources` for storing maximum of user requested pvc capacity because once user requests a new size in `pvc.Spec.Resources` even if she cancels the operation later on, resize-controller may have already started working on reconciling newly requested size.
109+
If CSI driver does not have `GET_VOLUME` controller capability and `pvc.Spec.Resources` < `pvc.Status.AllocatedResources` (i.e user is attempting to reduce size of a volume that expansion controller previously tried to expand) - then although expansion-controller will try volume expansion with value in `pvc.Spec.Resources` - it will not reduce reported value in `pvc.Status.AllocatedResources`, which will result in no quota being restored to the user. In other words - for CSI drivers that don't have `GET_VOLUME` controller capability - `pvc.Status.AllocatedResources` will report highest requested value and reducing `pvc.Spec.Resources` will not result in reduction of used quota.
110110

111-
AllocatedResources is not volume size but more like whatever user has requested and towards which resize-controller was working to reconcile. It is possible that user has requested smaller size since then but that does not changes the fact that resize-controller has already tried to expand to AllocatedResources and might have partially succeeded. So AllocatedResources is maximum user requested size for this volume and does not reflect actual volume size of the PV and hence is not recorded in `pvc.Status.Capacity`.
111+
#### User flow stories
112112

113-
This also falls inline with how in-place update of Pod is handled(https://github.com/kubernetes/enhancements/pull/1342/files) and allocated resources is recorded in `pod.Spec.Containers[i].ResourcesAllocated`.
114113

115-
#### User flow stories
114+
##### Case 0 (default PVC creation):
115+
- User creates a 10Gi PVC by setting - `pvc.spec.resources.requests["storage"] = "10Gi"`.
116+
- API server sets `pvc.Status.AllocatedResources` to "10Gi" on creation.
116117

117118
###### Case 1 (controller+node expandable):
118-
119119
- User increases 10Gi PVC to 100Gi by changing - `pvc.spec.resources.requests["storage"] = "100Gi"`.
120-
- API Server sets `pvc.spec.allocatedResources` to 100Gi via pvc storage strategy.
121-
- Quota controller uses `max(pvc.Spec.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
120+
- `pvc.Status.AllocatedResources` still reports `10Gi`.
121+
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
122+
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi`.
122123
- Expansion to 100Gi fails and hence `pv.Spec.Capacity` and `pvc.Status.Capacity `stays at 10Gi.
123124
- User requests size to 20Gi.
124-
- Quota controller sees no change in storage usage by the PVC because `pvc.Spec.AllocatedResources` is `100Gi`.
125+
- Expansion controller notices that `pvc.Spec.Resources` < `pvc.Status.AllocatedResources` (meaning user tried to reduce size).
126+
- Expansion controller fetches actual size of the volume using `ControllerGetVolume` CSI RPC call.
127+
- Expansion controller sees that `actual_volume_size`(10Gi) < `pvc.Spec.Resources` (20Gi), so it sets `pvc.Status.AllocatedResources` to 20Gi.
128+
- Expansion controller continues with rest of the expansion flow.
129+
- Quota controller sees a reduction in used quota because `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` is 20Gi.
125130
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report new size as `20Gi`.
126-
- `pvc.Spec.AllocatedResources` however keeps reporting `100Gi`.
127-
128-
129-
###### Case 2 (controller+node expandable):
130131

132+
###### Case 2 (controller+node expandable with no GET_VOLUME capability):
131133
- User increases 10Gi PVC to 100Gi by changing - `pvc.spec.resources.requests["storage"] = "100Gi"`
132-
- Api Server sets `pvc.spec.allocatedResources` to 100Gi via pvc storage strategy.
133-
- Quota controller uses `max(pvc.Spec.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
134+
- `pvc.Status.AllocatedResources` still reports `10Gi`.
135+
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
136+
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi`.
134137
- Expansion to 100Gi fails and hence `pv.Spec.Capacity` and `pvc.Status.Capacity `stays at 10Gi.
135138
- User requests size to 20Gi.
136-
- Quota controller sees no change in storage usage by the PVC because `pvc.Spec.AllocatedResources` is `100Gi`.
139+
- Expansion controller notices that `pvc.Spec.Resources` < `pvc.Status.AllocatedResources` (meaning user tried to reduce size).
140+
- Expansion controller sees that CSI driver does not have `GET_VOLUME` controller capability.
141+
- Expansion controller tries expansion to new value in `pvc.Spec.Resources` anyways (`20Gi`).
142+
- `pvc.Status.AllocatedResources` still reports `100Gi` and hence there is no change in used quota.
137143
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report new size as `20Gi`.
138-
- `pvc.Spec.AllocatedResources` however keeps reporting `100Gi`.
139-
- User expands PVC to 120Gi
140-
- Api Server sets `pvc.spec.allocatedResources` to 120Gi.
141-
- Quota controller uses `max(pvc.Spec.AllocatedResources, pvc.Spec.Resources)` and adds `20Gi` to used quota.
142-
- Expansion to 120Gi fails and hence `pv.Spec.Capacity` and `pvc.Status.Capacity `stays at 20Gi.
143-
144-
###### Case 3
144+
- `pvc.Status.AllocatedResources` however keeps reporting `100Gi`.
145145

146+
###### Case 3 (Malicious user)
146147
- User increases 10Gi PVC to 100Gi by changing `pvc.spec.resources.requests["storage"] = "100Gi"`
147-
- API server sets `pvc.spec.allocatedResources` to 100Gi via PVC storage strategy.
148-
- Quota controller uses `max(pvc.Spec.AllocatedResources, pvc.Spec.Resources)` and adds 90Gi to used quota. (100Gi is the total space taken by the PVC)
149-
- Volume plugin starts slowly expanding to 100Gi. `pv.Spec.Capacity` and `pvc.Status.Capacity` stays at 10Gi until the resize is finished.
148+
- `pvc.Status.AllocatedResources` still reports `10Gi`.
149+
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
150+
- Expansion controller slow starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi` (before expanding).
151+
- At this point -`pv.Spec.Capacity` and `pvc.Status.Capacity` stays at 10Gi until the resize is finished.
150152
- While the storage backend is re-sizing the volume, user requests size 20Gi by changing `pvc.spec.resources.requests["storage"] = "20Gi"`
151-
- Quota controller sees no change in storage usage by the PVC because `pvc.Spec.AllocatedResources` is 100Gi.
152-
- Expansion succeeds and pvc.Status.Capacity and pv.Spec.Capacity report new size as 100Gi, as that's what the volume plugin did.
153+
- Quota controller sees no change in storage usage by the PVC because `pvc.Status.AllocatedResources` is 100Gi.
154+
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report new size as `100Gi`, as that's what the volume plugin did.
153155

154156
### Risks and Mitigations
155157

156-
- One risk as mentioned above is, if expansion failed and user retried expansion(successfully) with smaller value, the quota code will keep reporting higher value. In practice though - this should be acceptable since such expansion failures should be rare and admin can unblock the user by increasing the quota or rebuilding PVC if needed. We will emit events on PV and PVC to alerts admins and users.
158+
- One risk as mentioned above is, if expansion failed and user retried expansion(successfully) with smaller value, the quota code will keep reporting higher value unless CSI driver in question has `GET_VOLUME` controller capability.
157159

158160
## Graduation Criteria
159161

160-
* *Alpha* in 1.19 behind `RecoverExpansionFailure` feature gate with set to a default of `false`. The limitation about quota should be clearly documented.
161-
* *Beta* in 1.20: Since this feature is part of general `ExpandPersistentVolumes` feature which is in beta, we are going to move this to beta with confirmed production usage.
162-
* GA in 1.21 along with `ExpandPersistentvolumes` feature. The list of issues for volume expansion going GA can be found at - https://github.com/orgs/kubernetes-csi/projects/12.
162+
* *Alpha* in 1.22 behind `RecoverExpansionFailure` feature gate with set to a default of `false`. The limitation about quota and CSI capability should be clearly documented.
163+
* *Beta* in 1.23: Since this feature is part of general `ExpandPersistentVolumes` feature which is in beta, we are going to move this to beta with confirmed production usage.
164+
* *GA* in 1.24 along with `ExpandPersistentvolumes` feature. The list of issues for volume expansion going GA can be found at - https://github.com/orgs/kubernetes-csi/projects/12.
163165

164166
### Test Plan
165167

@@ -188,16 +190,16 @@ This also falls inline with how in-place update of Pod is handled(https://github
188190

189191
* **Does enabling the feature change any default behavior?**
190192
Allow users to reduce size of pvc in `pvc.spec.resources`. In general this was not permitted before
191-
so it should not break any of existing automation. This means that if `pvc.Spec.AllocatedResources` is available it will be
193+
so it should not break any of existing automation. This means that if `pvc.Status.AllocatedResources` is available it will be
192194
used for calculating quota.
193195

194196
* **Can the feature be disabled once it has been enabled (i.e. can we rollback
195197
the enablement)?**
196198
Yes the feature gate can be disabled once enabled. The quota code will use new field to calculate quota if available. This will allow quota code to use
197-
`pvc.Spec.AllocatedResources` even if feature gate is disabled.
199+
`pvc.Status.AllocatedResources` even if feature gate is disabled.
198200

199201
* **What happens if we reenable the feature if it was previously rolled back?**
200-
This KEP proposes a new field in pvc and it will be used if available for quota calculation. So when feature is rolled back, it will be possible to reduce pvc capacity and quota will use `pvc.Spec.AllocatedResources`.
202+
This KEP proposes a new field in pvc and it will be used if available for quota calculation. So when feature is rolled back, it will be possible to reduce pvc capacity and quota will use `pvc.Status.AllocatedResources`.
201203

202204
* **Are there any tests for feature enablement/disablement?**
203205
The e2e framework does not currently support enabling and disabling feature
@@ -357,14 +359,6 @@ There were several alternatives considered before proposing this KEP.
357359

358360
`pvc.Status.Capacity` can't be used for tracking quota because pvc.Status.Capacity is calculated after binding happens, which could be when pod is started. This would allow an user to overcommit because quota won't reflect accurate value until PVC is bound to a PV.
359361

360-
### Fix Quota when resize succeeds with reduced size
361-
362-
The first thing we considered was doing the *right* thing and ensuring that quota always reflects correct volume size. So if user increases a 10GiB PVC to 1TB and that fails but user retries expansion by reducing the size to 100GiB and it succeeds then quota will report 100GiB rather than 1TB.
363-
364-
There are several problems with this approach though:
365-
- Currently Kubernetes always calculates quota based on user requested size rather than actual volume size. So if a 10GiB PVC is bound to 100GiB PV, then used quota is 10GiB and hence special care must be taken when reconciling the quota with actual volume size because we could break existing conventions around quota.
366-
- The second problem is - using actual volume size for computing quota results in too many problems because there are multiple sources of truth. For example - `ControllerExpandVolume` RPC call of CSI driver may report different size than `NodeExpandVolume` RPC call. So we must fix all existing CSI drivers to report sizes that are consistent. For example - `NodeExpandVolume` must return a size greater or equal to what `ControllerExpandVolume` returns, otherwise it could result in infinte expansion loop on node. The other problem is - `NodeExpandVolume` RPC call in CSI may not report size at all and hence CSI spec must be fixed.
367-
368362
### Allow admins to manually fix PVCs which are stuck in resizing condition
369363

370364
We also considered if it is worth leaving this problem as it is and allow admins to fix it manually since the problem that this KEP fixes is not very frequent(there have not been any reports from end-users about this bug). But there are couple of problems with this:

0 commit comments

Comments
 (0)