@@ -301,6 +301,84 @@ already attempted to run with the bad configuration.
StatefulSet will then begin to recreate the Pods using the reverted template.
+ ## PersistentVolumeClaim retention
+
+ {{< feature-state for_k8s_version="v1.23" state="alpha" >}}
+
+ The optional `.spec.persistentVolumeClaimRetentionPolicy` field controls if
+ and how PVCs are deleted during the lifecycle of a StatefulSet. You must enable the
+ `StatefulSetAutoDeletePVC` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+ to use this field. Once enabled, there are two policies you can configure for each
+ StatefulSet:
+
+ `whenDeleted`
+ : configures the volume retention behavior that applies when the StatefulSet is deleted
+
+ `whenScaled`
+ : configures the volume retention behavior that applies when the replica count of
+ the StatefulSet is reduced; for example, when scaling down the set.
+
+ For each policy that you can configure, you can set the value to either `Delete` or `Retain`.
+
+ `Delete`
+ : The PVCs created from the StatefulSet `volumeClaimTemplate` are deleted for each Pod
+ affected by the policy. With the `whenDeleted` policy, all PVCs from the
+ `volumeClaimTemplate` are deleted after their Pods have been deleted. With the
+ `whenScaled` policy, only PVCs corresponding to Pod replicas being scaled down are
+ deleted, after their Pods have been deleted.
+
+ `Retain` (default)
+ : PVCs from the `volumeClaimTemplate` are not affected when their Pod is
+ deleted. This is the behavior before this new feature.
+
+ Bear in mind that these policies **only** apply when Pods are being removed due to the
+ StatefulSet being deleted or scaled down. For example, if a Pod associated with a StatefulSet
+ fails due to node failure and the control plane creates a replacement Pod, the StatefulSet
+ retains the existing PVC. The existing volume is unaffected, and the cluster will attach it to
+ the node where the new Pod is about to launch.
+
+ The default for both policies is `Retain`, matching the StatefulSet behavior before this new feature.
+
+ Here is an example policy.
+
+ ```yaml
+ apiVersion: apps/v1
+ kind: StatefulSet
+ ...
+ spec:
+   persistentVolumeClaimRetentionPolicy:
+     whenDeleted: Retain
+     whenScaled: Delete
+   ...
+ ```
+
+ The StatefulSet {{<glossary_tooltip text="controller" term_id="controller">}} adds [owner
+ references](/docs/concepts/overview/working-with-objects/owners-dependents/#owner-references-in-object-specifications)
+ to its PVCs, which are then deleted by the {{<glossary_tooltip text="garbage collector"
+ term_id="garbage-collection">}} after the Pod is terminated. This enables the Pod to
+ cleanly unmount all volumes before the PVCs are deleted (and before the backing PV and
+ volume are deleted, depending on the retain policy). When you set the `whenDeleted`
+ policy to `Delete`, an owner reference to the StatefulSet instance is placed on all PVCs
+ associated with that StatefulSet.
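+
+ For illustration, the owner reference that the controller adds might look like the
+ following on a PVC (a sketch only; the StatefulSet name `web`, the claim name
+ `www-web-0`, and the UID placeholder are hypothetical):
+
+ ```yaml
+ apiVersion: v1
+ kind: PersistentVolumeClaim
+ metadata:
+   name: www-web-0
+   ownerReferences:
+   - apiVersion: apps/v1
+     kind: StatefulSet
+     name: web
+     uid: <uid of the owning StatefulSet>
+ ...
+ ```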
+
+ The `whenScaled` policy must delete PVCs only when a Pod is scaled down, and not when a
+ Pod is deleted for another reason. When reconciling, the StatefulSet controller compares
+ its desired replica count to the actual Pods present on the cluster. Any StatefulSet Pod
+ whose ordinal is greater than or equal to the replica count is condemned and marked for
+ deletion. If the `whenScaled` policy is `Delete`, the condemned Pods are first set as
+ owners to the associated StatefulSet template PVCs, before the Pod is deleted. This
+ causes the PVCs to be garbage collected only after the condemned Pods have terminated.
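+
+ As a sketch of the scale-down case, assume a StatefulSet named `web` with a
+ `volumeClaimTemplate` named `www` and `whenScaled: Delete` (all names here are
+ hypothetical):
+
+ ```shell
+ # Scale the set down from 3 to 2 replicas; the highest-ordinal Pod is condemned.
+ kubectl scale statefulset web --replicas=2
+ # After web-2 terminates, the garbage collector removes its claim www-web-2.
+ ```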
+
+ This means that if the controller crashes and restarts, no Pod will be deleted before its
+ owner reference has been updated appropriately for the policy. If a condemned Pod is
+ force-deleted while the controller is down, the owner reference may or may not have been
+ set up, depending on when the controller crashed. It may take several reconcile loops to
+ update the owner references, so some condemned Pods may have set up owner references and
+ others may not. For this reason we recommend waiting for the controller to come back up,
+ which will verify owner references before terminating Pods. If that is not possible, the
+ operator should verify the owner references on PVCs to ensure the expected objects are
+ deleted when Pods are force-deleted.
+
## {{% heading "whatsnext" %}}
* Learn about [Pods](/docs/concepts/workloads/pods).