@@ -301,6 +301,84 @@ already attempted to run with the bad configuration.
StatefulSet will then begin to recreate the Pods using the reverted template.
+ ## PersistentVolumeClaim retention
+
+ {{< feature-state for_k8s_version="v1.23" state="alpha" >}}
+
+ The optional `.spec.persistentVolumeClaimRetentionPolicy` field controls if
+ and how PVCs are deleted during the lifecycle of a StatefulSet. You must enable the
+ `StatefulSetAutoDeletePVC` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+ to use this field. Once enabled, there are two policies you can configure for each
+ StatefulSet:
+
+ `whenDeleted`
+ : configures the volume retention behavior that applies when the StatefulSet is deleted
+
+ `whenScaled`
+ : configures the volume retention behavior that applies when the replica count of
+ the StatefulSet is reduced; for example, when scaling down the set.
+
+ For each policy that you can configure, you can set the value to either `Delete` or `Retain`.
+
+ `Delete`
+ : The PVCs created from the StatefulSet `volumeClaimTemplate` are deleted for each Pod
+ affected by the policy. With the `whenDeleted` policy, all PVCs from the
+ `volumeClaimTemplate` are deleted after their Pods have been deleted. With the
+ `whenScaled` policy, only PVCs corresponding to Pod replicas being scaled down are
+ deleted, after their Pods have been deleted.
+
+ `Retain` (default)
+ : PVCs from the `volumeClaimTemplate` are not affected when their Pod is
+ deleted. This is the behavior before this new feature.
+
+ Bear in mind that these policies **only** apply when Pods are being removed due to the
+ StatefulSet being deleted or scaled down. For example, if a Pod associated with a StatefulSet
+ fails due to node failure and the control plane creates a replacement Pod, the StatefulSet
+ retains the existing PVC. The existing volume is unaffected, and the cluster will attach it to
+ the node where the new Pod is about to launch.
+
+ The default for both policies is `Retain`, matching the StatefulSet behavior before this new feature.
+
+ Here is an example policy.
+
+ ```yaml
+ apiVersion: apps/v1
+ kind: StatefulSet
+ ...
+ spec:
+   persistentVolumeClaimRetentionPolicy:
+     whenDeleted: Retain
+     whenScaled: Delete
+   ...
+ ```
+
+ The StatefulSet {{<glossary_tooltip text="controller" term_id="controller">}} adds [owner
+ references](/docs/concepts/overview/working-with-objects/owners-dependents/#owner-references-in-object-specifications)
+ to its PVCs, which are then deleted by the {{<glossary_tooltip text="garbage collector"
+ term_id="garbage-collection">}} after the Pod is terminated. This enables the Pod to
+ cleanly unmount all volumes before the PVCs are deleted (and before the backing PV and
+ volume are deleted, depending on the retain policy). When you set the `whenDeleted`
+ policy to `Delete`, an owner reference to the StatefulSet instance is placed on all PVCs
+ associated with that StatefulSet.
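+
+ For illustration, the owner reference that the controller adds might look like the
+ following on a PVC (a sketch only; the StatefulSet name `web`, the claim name
+ `www-web-0`, and the UID placeholder are hypothetical):
+
+ ```yaml
+ apiVersion: v1
+ kind: PersistentVolumeClaim
+ metadata:
+   name: www-web-0
+   ownerReferences:
+   - apiVersion: apps/v1
+     kind: StatefulSet
+     name: web
+     uid: <uid of the owning StatefulSet>
+ ...
+ ```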
+
+ The `whenScaled` policy must delete PVCs only when a Pod is scaled down, and not when a
+ Pod is deleted for another reason. When reconciling, the StatefulSet controller compares
+ its desired replica count to the actual Pods present on the cluster. Any StatefulSet Pod
+ whose ordinal is greater than or equal to the replica count is condemned and marked for
+ deletion. If the `whenScaled` policy is `Delete`, the condemned Pods are first set as
+ owners to the associated StatefulSet template PVCs, before the Pod is deleted. This
+ causes the PVCs to be garbage collected only after the condemned Pods have terminated.
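+
+ As a sketch of the scale-down case, assume a StatefulSet named `web` with a
+ `volumeClaimTemplate` named `www` and `whenScaled: Delete` (all names here are
+ hypothetical):
+
+ ```shell
+ # Scale the set down from 3 to 2 replicas; the highest-ordinal Pod is condemned.
+ kubectl scale statefulset web --replicas=2
+ # After web-2 terminates, the garbage collector removes its claim www-web-2.
+ ```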
+
+ This means that if the controller crashes and restarts, no Pod will be deleted before its
+ owner reference has been updated appropriately for the policy. If a condemned Pod is
+ force-deleted while the controller is down, the owner reference may or may not have been
+ set up, depending on when the controller crashed. It may take several reconcile loops to
+ update the owner references, so some condemned Pods may have set up owner references and
+ others may not. For this reason we recommend waiting for the controller to come back up,
+ which will verify owner references before terminating Pods. If that is not possible, the
+ operator should verify the owner references on PVCs to ensure the expected objects are
+ deleted when Pods are force-deleted.
+
## {{% heading "whatsnext" %}}
* Learn about [Pods](/docs/concepts/workloads/pods).