@@ -358,6 +358,100 @@ will be terminated once the job backoff limit has been reached. This can make de
from failed Jobs is not lost inadvertently.
{{< /note >}}

+ ### Pod failure policy {#pod-failure-policy}
+
+ {{< feature-state for_k8s_version="v1.26" state="beta" >}}
+
+ {{< note >}}
+ You can only configure a Pod failure policy for a Job if you have the
+ `JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+ enabled in your cluster. Additionally, it is recommended
+ to enable the `PodDisruptionConditions` feature gate in order to be able to detect and handle
+ Pod disruption conditions in the Pod failure policy (see also:
+ [Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)). Both feature gates are
+ available in Kubernetes {{< skew currentVersion >}}.
+ {{< /note >}}
+
+ A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
+ your cluster to handle Pod failures based on the container exit codes and the
+ Pod conditions.
+
+ In some situations, you may want to have better control when handling Pod
+ failures than the control provided by the [Pod backoff failure policy](#pod-backoff-failure-policy),
+ which is based on the Job's `.spec.backoffLimit`. These are some examples of use cases:
+ * To optimize costs of running workloads by avoiding unnecessary Pod restarts,
+   you can terminate a Job as soon as one of its Pods fails with an exit code
+   indicating a software bug.
+ * To guarantee that your Job finishes even if there are disruptions, you can
+   ignore Pod failures caused by disruptions (such as {{< glossary_tooltip text="preemption" term_id="preemption" >}},
+   {{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
+   or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction) so
+   that they don't count towards the `.spec.backoffLimit` limit of retries.
+
+ You can configure a Pod failure policy, in the `.spec.podFailurePolicy` field,
+ to meet the above use cases. This policy can handle Pod failures based on the
+ container exit codes and the Pod conditions.
+
+ Here is a manifest for a Job that defines a `podFailurePolicy`:
+
+ {{< codenew file="/controllers/job-pod-failure-policy-example.yaml" >}}
+
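+ A minimal sketch of such a manifest is shown below; the image, command, and
+ completion counts are illustrative assumptions and may differ from the
+ referenced file:
+
+ ```yaml
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+   name: job-pod-failure-policy-example
+ spec:
+   completions: 8
+   parallelism: 2
+   backoffLimit: 6
+   template:
+     spec:
+       restartPolicy: Never          # required when using a Pod failure policy
+       containers:
+       - name: main
+         image: docker.io/library/bash:5
+         command: ["bash", "-c", "echo 'started' && sleep 30 && exit 42"]  # simulates a software bug
+   podFailurePolicy:
+     rules:
+     - action: FailJob               # exit code 42 marks the entire Job as failed
+       onExitCodes:
+         containerName: main         # only evaluate exit codes of the "main" container
+         operator: In
+         values: [42]
+     - action: Ignore                # disruption-caused failures don't count towards backoffLimit
+       onPodConditions:
+       - type: DisruptionTarget
+ ```
+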
+ In the example above, the first rule of the Pod failure policy specifies that
+ the Job should be marked failed if the `main` container fails with exit
+ code 42. The following are the rules for the `main` container specifically:
+
+ - an exit code of 0 means that the container succeeded
+ - an exit code of 42 means that the **entire Job** failed
+ - any other exit code represents that the container failed, and hence the entire
+   Pod. The Pod will be re-created if the total number of restarts is
+   below `backoffLimit`. If the `backoffLimit` is reached, the **entire Job** fails.
+
+ {{< note >}}
+ Because the Pod template specifies `restartPolicy: Never`,
+ the kubelet does not restart the `main` container in that particular Pod.
+ {{< /note >}}
+
+ The second rule of the Pod failure policy, specifying the `Ignore` action for
+ failed Pods with condition `DisruptionTarget`, excludes Pod disruptions from
+ being counted towards the `.spec.backoffLimit` limit of retries.
+
+ {{< note >}}
+ If the Job failed, whether due to the Pod failure policy or the Pod backoff
+ failure policy, and the Job is running multiple Pods, Kubernetes terminates all
+ the Pods in that Job that are still Pending or Running.
+ {{< /note >}}
+
+ These are some requirements and semantics of the API:
+ - if you want to use a `.spec.podFailurePolicy` field for a Job, you must
+   also define that Job's pod template with `.spec.restartPolicy` set to `Never`.
+ - the Pod failure policy rules you specify under `spec.podFailurePolicy.rules`
+   are evaluated in order. Once a rule matches a Pod failure, the remaining rules
+   are ignored. When no rule matches the Pod failure, the default
+   handling applies.
+ - you may want to restrict a rule to a specific container by specifying its name
+   in `spec.podFailurePolicy.rules[*].containerName`. When not specified, the rule
+   applies to all containers. When specified, it should match one of the container
+   or `initContainer` names in the Pod template (see the sketch after this list).
+ - you may specify the action taken when a Pod failure policy is matched by
+   `spec.podFailurePolicy.rules[*].action`. Possible values are:
+   - `FailJob`: use to indicate that the Pod's Job should be marked as Failed and
+     all running Pods should be terminated.
+   - `Ignore`: use to indicate that the counter towards the `.spec.backoffLimit`
+     should not be incremented and a replacement Pod should be created.
+   - `Count`: use to indicate that the Pod should be handled in the default way.
+     The counter towards the `.spec.backoffLimit` should be incremented.
+
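+ For example, a rule that only ignores failures of one hypothetical sidecar
+ container could look like the sketch below, showing only the `podFailurePolicy`
+ stanza of the Job spec; the container name and exit code values are illustrative
+ assumptions, not values required by the API:
+
+ ```yaml
+ podFailurePolicy:
+   rules:
+   - action: Ignore                 # don't count these failures towards .spec.backoffLimit
+     onExitCodes:
+       containerName: log-shipper   # hypothetical sidecar; omit to match any container
+       operator: In                 # one of: In, NotIn
+       values: [143]                # e.g. the sidecar was stopped with SIGTERM
+ ```
+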
+ {{< note >}}
+ When you use a `podFailurePolicy`, the job controller only matches Pods in the
+ `Failed` phase. Pods with a deletion timestamp that are not in a terminal phase
+ (`Failed` or `Succeeded`) are considered still terminating. This implies that
+ terminating pods retain a [tracking finalizer](#job-tracking-with-finalizers)
+ until they reach a terminal phase.
+ Since Kubernetes 1.27, the kubelet transitions deleted pods to a terminal phase
+ (see: [Pod Phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase)). This
+ ensures that deleted pods have their finalizers removed by the Job controller.
+ {{< /note >}}
+
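+ For illustration, a terminating Pod that the Job controller does not yet count
+ as failed might look like the following trimmed sketch; the Pod name and
+ timestamp are illustrative assumptions:
+
+ ```yaml
+ apiVersion: v1
+ kind: Pod
+ metadata:
+   name: job-pod-failure-policy-example-0-abcde   # hypothetical Pod name
+   deletionTimestamp: "2023-03-01T12:00:00Z"      # the Pod is being deleted
+   finalizers:
+   - batch.kubernetes.io/job-tracking             # tracking finalizer, removed once the Pod reaches a terminal phase
+ status:
+   phase: Running                                 # not yet Failed or Succeeded, so still terminating
+ ```
+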
## Job termination and cleanup

When a Job completes, no more Pods are created, but the Pods are [usually](#pod-backoff-failure-policy) not deleted either.
@@ -725,100 +819,6 @@ The new Job itself will have a different uid from `a8f3d00d-c6d2-11e5-9f87-42010
`manualSelector: true` tells the system that you know what you are doing and to allow this
mismatch.

- ### Pod failure policy {#pod-failure-policy}
-
- {{< feature-state for_k8s_version="v1.26" state="beta" >}}
-
- {{< note >}}
- You can only configure a Pod failure policy for a Job if you have the
- `JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
- enabled in your cluster. Additionally, it is recommended
- to enable the `PodDisruptionConditions` feature gate in order to be able to detect and handle
- Pod disruption conditions in the Pod failure policy (see also:
- [Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)). Both feature gates are
- available in Kubernetes {{< skew currentVersion >}}.
- {{< /note >}}
-
- A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
- your cluster to handle Pod failures based on the container exit codes and the
- Pod conditions.
-
- In some situations, you may want to have a better control when handling Pod
- failures than the control provided by the [Pod backoff failure policy](#pod-backoff-failure-policy),
- which is based on the Job's `.spec.backoffLimit`. These are some examples of use cases:
- * To optimize costs of running workloads by avoiding unnecessary Pod restarts,
-   you can terminate a Job as soon as one of its Pods fails with an exit code
-   indicating a software bug.
- * To guarantee that your Job finishes even if there are disruptions, you can
-   ignore Pod failures caused by disruptions (such {{< glossary_tooltip text="preemption" term_id="preemption" >}},
-   {{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
-   or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction) so
-   that they don't count towards the `.spec.backoffLimit` limit of retries.
-
- You can configure a Pod failure policy, in the `.spec.podFailurePolicy` field,
- to meet the above use cases. This policy can handle Pod failures based on the
- container exit codes and the Pod conditions.
-
- Here is a manifest for a Job that defines a `podFailurePolicy`:
-
- {{< codenew file="/controllers/job-pod-failure-policy-example.yaml" >}}
-
- In the example above, the first rule of the Pod failure policy specifies that
- the Job should be marked failed if the `main` container fails with the 42 exit
- code. The following are the rules for the `main` container specifically:
-
- - an exit code of 0 means that the container succeeded
- - an exit code of 42 means that the **entire Job** failed
- - any other exit code represents that the container failed, and hence the entire
-   Pod. The Pod will be re-created if the total number of restarts is
-   below `backoffLimit`. If the `backoffLimit` is reached the **entire Job** failed.
-
- {{< note >}}
- Because the Pod template specifies a `restartPolicy: Never`,
- the kubelet does not restart the `main` container in that particular Pod.
- {{< /note >}}
-
- The second rule of the Pod failure policy, specifying the `Ignore` action for
- failed Pods with condition `DisruptionTarget` excludes Pod disruptions from
- being counted towards the `.spec.backoffLimit` limit of retries.
-
- {{< note >}}
- If the Job failed, either by the Pod failure policy or Pod backoff
- failure policy, and the Job is running multiple Pods, Kubernetes terminates all
- the Pods in that Job that are still Pending or Running.
- {{< /note >}}
-
- These are some requirements and semantics of the API:
- - if you want to use a `.spec.podFailurePolicy` field for a Job, you must
-   also define that Job's pod template with `.spec.restartPolicy` set to `Never`.
- - the Pod failure policy rules you specify under `spec.podFailurePolicy.rules`
-   are evaluated in order. Once a rule matches a Pod failure, the remaining rules
-   are ignored. When no rule matches the Pod failure, the default
-   handling applies.
- - you may want to restrict a rule to a specific container by specifying its name
-   in`spec.podFailurePolicy.rules[*].containerName`. When not specified the rule
-   applies to all containers. When specified, it should match one the container
-   or `initContainer` names in the Pod template.
- - you may specify the action taken when a Pod failure policy is matched by
-   `spec.podFailurePolicy.rules[*].action`. Possible values are:
-   - `FailJob`: use to indicate that the Pod's job should be marked as Failed and
-     all running Pods should be terminated.
-   - `Ignore`: use to indicate that the counter towards the `.spec.backoffLimit`
-     should not be incremented and a replacement Pod should be created.
-   - `Count`: use to indicate that the Pod should be handled in the default way.
-     The counter towards the `.spec.backoffLimit` should be incremented.
-
- {{< note >}}
- When you use a `podFailurePolicy`, the job controller only matches Pods in the
- `Failed` phase. Pods with a deletion timestamp that are not in a terminal phase
- (`Failed` or `Succeeded`) are considered still terminating. This implies that
- terminating pods retain a [tracking finalizer](#job-tracking-with-finalizers)
- until they reach a terminal phase.
- Since Kubernetes 1.27, Kubelet transitions deleted pods to a terminal phase
- (see: [Pod Phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase)). This
- ensures that deleted pods have their finalizers removed by the Job controller.
- {{< /note >}}
-

### Job tracking with finalizers
{{< feature-state for_k8s_version="v1.26" state="stable" >}}