Commit 1793e17

Author: Tamilselvan

moving 'Pod failure policy' under 'Handling Pod and container failures' (#40999)

1 parent 10b1aa8 commit 1793e17

1 file changed: content/en/docs/concepts/workloads/controllers/job.md (+94, -94 lines)

@@ -358,6 +358,100 @@ will be terminated once the job backoff limit has been reached. This can make de
from failed Jobs is not lost inadvertently.
{{< /note >}}

### Pod failure policy {#pod-failure-policy}

{{< feature-state for_k8s_version="v1.26" state="beta" >}}

{{< note >}}
You can only configure a Pod failure policy for a Job if you have the
`JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster. Additionally, it is recommended to enable the
`PodDisruptionConditions` feature gate so that the Pod failure policy can detect and handle
Pod disruption conditions (see also:
[Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)).
Both feature gates are available in Kubernetes {{< skew currentVersion >}}.
{{< /note >}}

A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
your cluster to handle Pod failures based on the container exit codes and the
Pod conditions.

In some situations, you may want better control over handling Pod
failures than the control provided by the [Pod backoff failure policy](#pod-backoff-failure-policy),
which is based on the Job's `.spec.backoffLimit`. These are some examples of use cases:
* To optimize costs of running workloads by avoiding unnecessary Pod restarts,
  you can terminate a Job as soon as one of its Pods fails with an exit code
  indicating a software bug.
* To guarantee that your Job finishes even if there are disruptions, you can
  ignore Pod failures caused by disruptions (such as {{< glossary_tooltip text="preemption" term_id="preemption" >}},
  {{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
  or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction) so
  that they don't count towards the `.spec.backoffLimit` limit of retries.

You can configure a Pod failure policy, in the `.spec.podFailurePolicy` field,
to meet the above use cases. This policy can handle Pod failures based on the
container exit codes and the Pod conditions.

Here is a manifest for a Job that defines a `podFailurePolicy`:

{{< codenew file="/controllers/job-pod-failure-policy-example.yaml" >}}

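The file referenced above is not reproduced on this page; the following is a minimal sketch of a Job with such a `podFailurePolicy`. The name, image, command, and exit code are illustrative assumptions, not necessarily the exact contents of that file:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-example   # illustrative name
spec:
  completions: 12
  parallelism: 3
  template:
    spec:
      restartPolicy: Never        # required when using a Pod failure policy
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]         # illustrative workload that fails with exit code 42
        args:
        - -c
        - echo "Hello world!" && sleep 5 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob             # exit code 42 in 'main' marks the whole Job as failed
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
    - action: Ignore              # Pod disruptions do not count towards backoffLimit
      onPodConditions:
      - type: DisruptionTarget
```

With a policy like this, a Pod exiting with code 42 fails the Job immediately, while a Pod evicted during, for example, a node drain is replaced without consuming the `backoffLimit` budget.
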
In the example above, the first rule of the Pod failure policy specifies that
the Job should be marked failed if the `main` container fails with exit code 42.
The following are the rules for the `main` container specifically:

- an exit code of 0 means that the container succeeded
- an exit code of 42 means that the **entire Job** failed
- any other exit code represents that the container failed, and hence the entire
  Pod. The Pod will be re-created if the total number of restarts is
  below `backoffLimit`. If the `backoffLimit` is reached, the **entire Job** fails.

{{< note >}}
Because the Pod template specifies a `restartPolicy: Never`,
the kubelet does not restart the `main` container in that particular Pod.
{{< /note >}}

The second rule of the Pod failure policy, specifying the `Ignore` action for
failed Pods with the condition `DisruptionTarget`, excludes Pod disruptions from
being counted towards the `.spec.backoffLimit` limit of retries.

{{< note >}}
If the Job fails, whether because of the Pod failure policy or the Pod backoff
failure policy, and the Job is running multiple Pods, Kubernetes terminates all
the Pods in that Job that are still Pending or Running.
{{< /note >}}

These are some requirements and semantics of the API (a fragment illustrating them is sketched after this list):
- if you want to use a `.spec.podFailurePolicy` field for a Job, you must
  also define that Job's pod template with `.spec.restartPolicy` set to `Never`.
- the Pod failure policy rules you specify under `spec.podFailurePolicy.rules`
  are evaluated in order. Once a rule matches a Pod failure, the remaining rules
  are ignored. When no rule matches the Pod failure, the default handling applies.
- you may want to restrict a rule to a specific container by specifying its name
  in `spec.podFailurePolicy.rules[*].containerName`. When not specified, the rule
  applies to all containers. When specified, it should match one of the container
  or `initContainer` names in the Pod template.
- you may specify the action taken when a Pod failure policy is matched by
  `spec.podFailurePolicy.rules[*].action`. Possible values are:
  - `FailJob`: use to indicate that the Pod's Job should be marked as Failed and
    all running Pods should be terminated.
  - `Ignore`: use to indicate that the counter towards the `.spec.backoffLimit`
    should not be incremented and a replacement Pod should be created.
  - `Count`: use to indicate that the Pod should be handled in the default way.
    The counter towards the `.spec.backoffLimit` should be incremented.

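As a concrete illustration of these semantics, here is a sketch of just the `podFailurePolicy` portion of a Job spec; the container name and exit codes are illustrative assumptions:

```yaml
# Illustrative fragment only: this block nests under the Job's `spec` field.
podFailurePolicy:
  rules:
  - action: FailJob             # evaluated first; exit code 42 in 'main' fails the whole Job
    onExitCodes:
      containerName: main       # rule restricted to a single named container
      operator: In
      values: [42]
  - action: Count               # example: treat exit code 139 as an ordinary, counted failure
    onExitCodes:
      operator: In              # no containerName, so this rule applies to all containers
      values: [139]
  - action: Ignore              # disruptions are retried without incrementing the counter
    onPodConditions:
    - type: DisruptionTarget
```

Rules are evaluated top to bottom; the first matching rule decides the action, and failures that match no rule fall back to the default `backoffLimit` counting.
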
{{< note >}}
When you use a `podFailurePolicy`, the job controller only matches Pods in the
`Failed` phase. Pods with a deletion timestamp that are not in a terminal phase
(`Failed` or `Succeeded`) are considered still terminating. This implies that
terminating pods retain a [tracking finalizer](#job-tracking-with-finalizers)
until they reach a terminal phase.
Since Kubernetes 1.27, the kubelet transitions deleted pods to a terminal phase
(see: [Pod Phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase)). This
ensures that deleted pods have their finalizers removed by the Job controller.
{{< /note >}}

## Job termination and cleanup

When a Job completes, no more Pods are created, but the Pods are [usually](#pod-backoff-failure-policy) not deleted either.
@@ -725,100 +819,6 @@ The new Job itself will have a different uid from `a8f3d00d-c6d2-11e5-9f87-42010
`manualSelector: true` tells the system that you know what you are doing and to allow this
mismatch.

### Job tracking with finalizers

{{< feature-state for_k8s_version="v1.26" state="stable" >}}
