Skip to content

Commit eff73b4

Browse files
authored
Merge pull request kubernetes#2680 from deepakkinni/issue2644
Honor Persistent Volume Reclaim Policy
2 parents ab2a626 + ef508ca commit eff73b4

File tree

3 files changed

+358
-0
lines changed

3 files changed

+358
-0
lines changed
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
kep-number: 2644
2+
alpha:
3+
approver: "@ehashman"
Lines changed: 319 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,319 @@
1+
# KEP-2644: Honor Persistent Volume Reclaim Policy
2+
3+
<!-- toc -->
4+
- [Release Signoff Checklist](#release-signoff-checklist)
5+
- [Summary](#summary)
6+
- [Motivation](#motivation)
7+
- [Proposal](#proposal)
8+
- [Risks and Mitigations](#risks-and-mitigations)
9+
- [Design Details](#design-details)
10+
- [Test Plan](#test-plan)
11+
- [E2E tests](#e2e-tests)
12+
- [Graduation Criteria](#graduation-criteria)
13+
- [Alpha](#alpha)
14+
- [Alpha -&gt; Beta](#alpha---beta)
15+
- [Beta -&gt; GA](#beta---ga)
16+
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
17+
- [Version Skew Strategy](#version-skew-strategy)
18+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
19+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
20+
- [Scalability](#scalability)
21+
- [Implementation History](#implementation-history)
22+
- [Drawbacks](#drawbacks)
23+
- [Alternatives](#alternatives)
24+
<!-- /toc -->
25+
26+
## Release Signoff Checklist
27+
28+
Items marked with (R) are required *prior to targeting to a milestone / release*.
29+
30+
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
31+
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
32+
- [ ] (R) Design details are appropriately documented
33+
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
34+
- [ ] e2e Tests for all Beta API Operations (endpoints)
35+
- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
36+
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
37+
- [ ] (R) Graduation criteria is in place
38+
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
39+
- [ ] (R) Production readiness review completed
40+
- [ ] (R) Production readiness review approved
41+
- [ ] (R) "Implementation History" section is up-to-date for milestone
42+
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
43+
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
44+
45+
[kubernetes.io]: https://kubernetes.io/
46+
[kubernetes/enhancements]: https://git.k8s.io/enhancements
47+
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
48+
[kubernetes/website]: https://git.k8s.io/website
49+
50+
## Summary
51+
Reclaim policy associated with the Persistent Volume is currently ignored under certain circumstance. For a `Bound`
52+
PV-PVC pair the ordering of PV-PVC deletion determines whether the PV delete reclaim policy is honored. The PV honors
53+
the reclaim policy if the PVC is deleted prior to deleting the PV, however, if the PV is deleted prior to deleting the
54+
PVC then the Reclaim policy is not exercised. As a result of this behavior, the associated storage asset in the
55+
external infrastructure is not removed.
56+
57+
## Motivation
58+
Prevent volumes from being leaked by honoring the PV Reclaim policy after the `Bound` PVC is also deleted.
59+
60+
## Proposal
61+
To better understand the existing issue we will initially walk through existing behavior when a Bound PV is deleted prior
62+
to deleting the PVC.
63+
64+
65+
```
66+
kubectl delete pv <pv-name>
67+
```
68+
On deleting a `Bound` PV, a `deletionTimestamp` is added on to the PV object, this triggers an update which is consumed
69+
by the PV-PVC-controller. The PV-PVC-controller observes that the PV is still associated with a PVC, and the PVC hasn't
70+
been deleted, as a result of this, the PV-PVC-controller attempts to update the PV phase to `Bound` state. Assuming that
71+
the PV was already bound previously, this update eventually amounts to a no-op.
72+
73+
The PVC is deleted at a later point of time.
74+
```
75+
kubectl -n <namespace> delete pvc <pvc-name>
76+
```
77+
78+
1. A `deletionTimestamp` is added on the PVC object.
79+
2. The PVC-protection-controller picks the update, verifies if `deletionTimestamp` is present and there are no pods that
80+
are currently using the PVC.
81+
3. If there are no Pods using the PVC, the PVC finalizers are removed, eventually triggering a PVC delete event.
82+
4. The PV-PVC-controller processes delete PVC event, leading to removal of the PVC from it's in-memory cache followed
83+
by triggering an explicit sync on the PV.
84+
5. The PV-PVC-controller processes the triggered explicit sync, here, it observes that the PVC is no longer available
85+
and updates the PV phase to `Released` state.
86+
6. If the PV-protection-controller picks the update, it observes that there is a `deletionTimestamp` and the PV is not
87+
in a `Bound` phase, this causes the finalizers to be removed.
88+
7. This is followed by PV-PVC-controller initiating the reclaim volume workflows.
89+
8. The reclaim volume workflow observes the `persistentVolumeReclaimPolicy` as `Delete` and schedules a volume deletion.
90+
9. Under the event that (6) has occurred and when `deleteVolumeOperation` executes it attempts to retrieve the latest PV
91+
state from the API server, however, due to (6) the PV is removed from the API server, leading to an error state. This
92+
results in the plugin volume deletion not being exercised hence leaking the volume.
93+
```go
94+
func (ctrl *PersistentVolumeController) deleteVolumeOperation(volume *v1.PersistentVolume) (string, error) {
95+
klog.V(4).Infof("deleteVolumeOperation [%s] started", volume.Name)
96+
97+
// This method may have been waiting for a volume lock for some time.
98+
// Previous deleteVolumeOperation might just have saved an updated version, so
99+
// read current volume state now.
100+
newVolume, err := ctrl.kubeClient.CoreV1().PersistentVolumes().Get(context.TODO(), volume.Name, metav1.GetOptions{}) //<========== NotFound error thrown
101+
if err != nil {
102+
klog.V(3).Infof("error reading persistent volume %q: %v", volume.Name, err)
103+
return "", nil
104+
}
105+
// Remaining code below skipped for brevity...
106+
// Please refer pkg/controller/volume/persistentvolume/pv_controller.go in kubernetes/kubernetes for the full code.
107+
```
108+
10. Under the event that (6) has not occurred yet, during execution of `deleteVolumeOperation` it is observed that the
109+
PV has a pre-existing `deletionTimestamp`, this makes the method assume that delete is already being processed. This
110+
results in the plugin volume deletion not being exercised hence leaking the volume.
111+
```go
112+
func (ctrl *PersistentVolumeController) deleteVolumeOperation(volume *v1.PersistentVolume) (string, error) {
113+
klog.V(4).Infof("deleteVolumeOperation [%s] started", volume.Name)
114+
115+
// This method may have been waiting for a volume lock for some time.
116+
// Previous deleteVolumeOperation might just have saved an updated version, so
117+
// read current volume state now.
118+
newVolume, err := ctrl.kubeClient.CoreV1().PersistentVolumes().Get(context.TODO(), volume.Name, metav1.GetOptions{})
119+
if err != nil {
120+
klog.V(3).Infof("error reading persistent volume %q: %v", volume.Name, err)
121+
return "", nil
122+
}
123+
124+
if newVolume.GetDeletionTimestamp() != nil {//<==========================DeletionTimestamp set since the PV was deleted first.
125+
klog.V(3).Infof("Volume %q is already being deleted", volume.Name)
126+
return "", nil
127+
}
128+
// Remaining code below skipped for brevity...
129+
// Please refer pkg/controller/volume/persistentvolume/pv_controller.go in kubernetes/kubernetes for the full code.
130+
```
131+
11. Meanwhile, the external-provisioner checks if there is a `deletionTimestamp` on the PV, if so, it assumes that its
132+
in a transitory state and returns false for the `shouldDelete` check.
133+
```go
134+
// shouldDelete returns whether a volume should have its backing volume
135+
// deleted, i.e. whether a Delete is "desired"
136+
func (ctrl *ProvisionController) shouldDelete(ctx context.Context, volume *v1.PersistentVolume) bool {
137+
if deletionGuard, ok := ctrl.provisioner.(DeletionGuard); ok {
138+
if !deletionGuard.ShouldDelete(ctx, volume) {
139+
return false
140+
}
141+
}
142+
143+
if ctrl.addFinalizer && !ctrl.checkFinalizer(volume, finalizerPV) && volume.ObjectMeta.DeletionTimestamp != nil {
144+
return false
145+
} else if volume.ObjectMeta.DeletionTimestamp != nil { //<========== DeletionTimestamp is set on the PV since it's deleted first.
146+
return false
147+
}
148+
// Remaining code below skipped for brevity...
149+
// Please refer sig-storage-lib-external-provisioner/controller/controller.go for the full code.
150+
```
151+
152+
The main approach in fixing the issue involves using an existing finalizer already implemented in sig-storage-lib-external-provisioner. It is `external-provisioner.volume.kubernetes.io/finalizer` which is added on the Persistent Volume. Currently it is applied only during provisioning if the feature is enabled. The proposal is to not only add this finalizer to newly provisioned PVs, but also to extend the library to add the finalizer to existing PVs. Adding the finalizer prevents the PV from being removed from the API server. The finalizer will be removed only after the physical volume on the storage system is deleted.
153+
154+
When the PVC is deleted, the PV is moved into `Released` state and following checks are made:
155+
156+
1. Plugin:
157+
If the volume has the finalizer `external-provisioner.volume.kubernetes.io/finalizer`, then, `DeletionTimestamp` checks can be ignored.
158+
159+
3. CSI driver
160+
161+
If when processing the PV update it is observed that it has `external-provisioner.volume.kubernetes.io/finalizer` finalizer and
162+
`DeletionTimestamp` set, then the volume deletion request is forwarded to the driver, provided other pre-defined conditions are met.
163+
164+
### Risks and Mitigations
165+
166+
The primary risk is volume deletion to a user that expect the current behavior, i.e, they do not expect the volume to be
167+
deleted from the underlying storage infrastructure when for a bound PV-PVC pair, the PV is deleted followed by deleting
168+
the PVC. This can be mitigated by initially introducing the fix behind a feature gate, followed by explicitly deprecating
169+
the current behavior.
170+
171+
## Design Details
172+
173+
A feature gate named `HonorPVReclaimPolicy` will be introduced for both `kube-controller-manager` and `external-provisioner`.
174+
175+
An existing finalizer `external-provisioner.volume.kubernetes.io/finalizer` is already implemented in sig-storage-lib-external-provisioner. It is added on the Persistent Volume. Currently it is applied only during provisioning if the feature is enabled. The proposal is to not only add this finalizer to newly provisioned PVs, but also to extend the library to add the finalizer to existing PVs. The existing `AddFinalizer` config option will be used to apply the finalizer. Adding the finalizer prevents the PV from being removed from the API server. The finalizer will be removed only after the physical volume on the storage system is deleted.
176+
177+
When CSI Migration is enabled, external-provisioner adds `external-provisioner.volume.kubernetes.io/finalizer` to all the PVs it creates, including in-tree ones. When CSI Migraiton is disabled, however, these PVs will be deleted by in-tree volume plugin. Therefore, in-tree PV controller needs to be modified to remove the finalizer when the PV is being deleted when CSI Migration is disabled.
178+
179+
```go
180+
// PVDeletionProtectionFinalizer finalizer is added to the PV to prevent PV from being deleted before the physical volume
181+
PVDeletionProtectionFinalizer = "external-provisioner.volume.kubernetes.io/finalizer"
182+
```
183+
184+
In `deleteVolumeOperation` in kubernetes/pkg/controller/volume/persistentvolume/pv_controller.go, PV without `PVDeletionProtectionFinalizer` will be skipped as it is already processed and PV with `PVDeletionProtectionFinalizer` will proceed with volume deletion. This will allow external-provisioner to handle the deletion of the CSI volumes. The finalizer `PVDeletionProtectionFinalizer` will only be deleted after the underlying CSI volume is deleted.
185+
186+
```go
187+
// deleteVolumeOperation deletes a volume. This method is running in standalone
188+
// goroutine and already has all necessary locks.
189+
func (ctrl *PersistentVolumeController) deleteVolumeOperation(volume *v1.PersistentVolume) (string, error) {
190+
klog.V(4).Infof("deleteVolumeOperation [%s] started", volume.Name)
191+
newVolume, err := ctrl.kubeClient.CoreV1().PersistentVolumes().Get(context.TODO(), volume.Name, metav1.GetOptions{})
192+
if err != nil {
193+
klog.V(3).Infof("error reading persistent volume %q: %v", volume.Name, err)
194+
return "", nil
195+
}
196+
197+
// PV without PVDeletionProtectionFinalizer was already processed and its volume deleted.
198+
if !ctrl.hasPVDeletionProtectionFinalizer(volume, pvutil.PVDeletionProtectionFinalizer) {
199+
if newVolume.GetDeletionTimestamp() != nil {
200+
klog.V(3).Infof("Volume %q is already being deleted", volume.Name)
201+
return "", nil
202+
}
203+
}
204+
//.....
205+
//.....
206+
//.....
207+
klog.V(4).Infof("deleteVolumeOperation [%s]: success", volume.Name)
208+
return pluginName, err
209+
}
210+
// Remaining code below skipped for brevity...
211+
// Please refer pkg/controller/volume/persistentvolume/pv_controller.go in kubernetes/kubernetes for the full code.
212+
```
213+
214+
On the driver side, the library adds the finalizer only to newly provisioned PVs. The proposal is to extend the library to (optionally) add the finalizer to all PVs (that are handled by the external-provisioner), to have all PVs protected once the feature is enabled.
215+
216+
When the `shouldDelete` checks succeed, a delete volume request is initiated on the driver. This ensures that the volume is deleted on the backend.
217+
218+
Once the volume is deleted from the backend, the finalizer can be removed. This allows the pv to be removed from the api server.
219+
220+
Note: This feature should work with CSI Migration disabled or enabled.
221+
222+
### Test Plan
223+
224+
#### E2E tests
225+
An e2e test to exercise deletion of PV prior to deletion of PVC for a `Bound` PV-PVC pair.
226+
227+
### Graduation Criteria
228+
229+
#### Alpha
230+
* Feedback
231+
* Fix of the issue behind a feature flag.
232+
* Unit tests and e2e tests outlined in design proposal implemented.
233+
* Tests should be done with both CSI Migration disabled and enabled.
234+
235+
#### Alpha -> Beta
236+
* Allow the Alpha fix to soak for one release.
237+
238+
#### Beta -> GA
239+
* Deployed in production and have gone through at least one K8s upgrade.
240+
241+
### Upgrade / Downgrade Strategy
242+
* Upgrade from old Kubernetes(1.22) to new Kubernetes(1.23) with `HonorPVReclaimPolicy` flag enabled
243+
244+
In the above scenario, the upgrade will cause change in default behavior as described in the current KEP. Additionally,
245+
if there are PVs that have a valid associated PVC and deletion timestamp set, then a finalizer is added to the PV.
246+
247+
* Downgrade from new Kubernetes(1.23) to old Kubernetes(1.22).
248+
249+
In this case, there may be PVs with the deletion finalizer that the older Kubernetes does not remove. Such PVs will be in the API server forever unless if the user manually removes them.
250+
251+
### Version Skew Strategy
252+
The fix is part of `kube-controller-manager` and `external-provisioner`.
253+
254+
1. `kube-controller-manager` is upgraded, but `external-provisioner` is not:
255+
256+
In this case the drivers would still have the issue since the `external-provisioner` is not updated.
257+
258+
2. `external-provisioner` is upgraded but `kube-controller-manager` is not:
259+
260+
In this case the finalizer will be added and removed by the external-provisioner. PVs backed by the in-tree volume plugin are not protected.
261+
262+
In addition, PVs migrated to CSI have the finalizer. When the CSI migration is disabled, in-tree volume plugin / controller-manager does not remove the finalizer. The finalizer must be manually removed by the cluster admin.
263+
264+
## Production Readiness Review Questionnaire
265+
266+
### Feature Enablement and Rollback
267+
268+
###### How can this feature be enabled / disabled in a live cluster?
269+
270+
- [x] Feature gate (also fill in values in `kep.yaml`)
271+
- Feature gate name: `HonorPVReclaimPolicy`
272+
- Components depending on the feature gate: kube-controller-manager, external-provisioner
273+
274+
###### Does enabling the feature change any default behavior?
275+
276+
Enabling the feature will delete the volume from underlying storage when the PV is deleted followed deleting the PVC for
277+
a bound PV-PVC pair where the PV reclaim policy is delete.
278+
279+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
280+
Yes. Disabling the feature flag will continue previous behavior.
281+
282+
###### What happens if we reenable the feature if it was previously rolled back?
283+
Will pick the new behavior as described before.
284+
285+
###### Are there any tests for feature enablement/disablement?
286+
287+
There will be unit tests for the feature `HonorPVReclaimPolicy` enablement/disablement.
288+
289+
### Scalability
290+
291+
###### Will enabling / using this feature result in any new API calls?
292+
The new finalizer will be added to all existing PVs when the feature is enabled. This will be a one-time sync.
293+
294+
###### Will enabling / using this feature result in introducing new API types?
295+
No.
296+
297+
###### Will enabling / using this feature result in any new calls to the cloud provider?
298+
Yes, previously the delete volume call would not reach the provider under the described circumstances, with this change
299+
the volume delete will be invoked to remove the volume from the underlying storage.
300+
301+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
302+
When the feature is enabled, we will be adding a new finalizer to existing PVs so this would result in an increase of the PV size.
303+
304+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
305+
We have metric for delete volume operations. So potentially the time to delete a PV could be longer now since we want to make sure volume on the storage system is deleted first before deleting the PV.
306+
307+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
308+
No.
309+
310+
## Implementation History
311+
312+
1.23: Alpha
313+
314+
## Drawbacks
315+
None. The current behavior could be considered as a drawback, the KEP presents the fix to the drawback. The current
316+
behavior was reported in [Issue-546](https://github.com/kubernetes-csi/external-provisioner/issues/546) and [Issue-195](https://github.com/kubernetes-csi/external-provisioner/issues/195)
317+
318+
## Alternatives
319+
Other approaches to fix have not been considered and will be considered if presented during the KEP review.
Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
title: Honor Persistent Volume Reclaim Policy
2+
kep-number: 2644
3+
authors:
4+
- "@deepakkinni"
5+
owning-sig: sig-storage
6+
participating-sigs:
7+
status: implementable
8+
creation-date: 2021-05-05
9+
reviewers:
10+
- "@xing-yang"
11+
- "@jsafrane"
12+
approvers:
13+
- "@jsafrane"
14+
- "@xing-yang"
15+
prr-approvers:
16+
- "@ehashman"
17+
see-also:
18+
replaces:
19+
20+
stage: alpha
21+
22+
latest-milestone: "v1.23"
23+
24+
milestone:
25+
alpha: "v1.23"
26+
beta: "v1.24"
27+
stable: "v1.25"
28+
29+
feature-gates:
30+
- name: HonorPvReclaim
31+
components:
32+
- kube-controller-manager
33+
- external-provisioner
34+
disable-supported: true
35+
36+
metrics:

0 commit comments

Comments
 (0)