@@ -358,6 +358,100 @@ will be terminated once the job backoff limit has been reached. This can make de
from failed Jobs is not lost inadvertently.
{{< /note >}}

+ ### Pod failure policy {#pod-failure-policy}
+
+ {{< feature-state for_k8s_version="v1.26" state="beta" >}}
+
+ {{< note >}}
+ You can only configure a Pod failure policy for a Job if you have the
+ `JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+ enabled in your cluster. Additionally, it is recommended
+ to enable the `PodDisruptionConditions` feature gate in order to be able to detect and handle
+ Pod disruption conditions in the Pod failure policy (see also:
+ [Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)). Both feature gates are
+ available in Kubernetes {{< skew currentVersion >}}.
+ {{< /note >}}
+
+ A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
+ your cluster to handle Pod failures based on the container exit codes and the
+ Pod conditions.
+
+ In some situations, you may want to have better control when handling Pod
+ failures than the control provided by the [Pod backoff failure policy](#pod-backoff-failure-policy),
+ which is based on the Job's `.spec.backoffLimit`. These are some examples of use cases:
+ * To optimize costs of running workloads by avoiding unnecessary Pod restarts,
+   you can terminate a Job as soon as one of its Pods fails with an exit code
+   indicating a software bug.
+ * To guarantee that your Job finishes even if there are disruptions, you can
+   ignore Pod failures caused by disruptions (such as {{< glossary_tooltip text="preemption" term_id="preemption" >}},
+   {{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
+   or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction) so
+   that they don't count towards the `.spec.backoffLimit` limit of retries.
+
+ You can configure a Pod failure policy, in the `.spec.podFailurePolicy` field,
+ to meet the above use cases. This policy can handle Pod failures based on the
+ container exit codes and the Pod conditions.
+
+ Here is a manifest for a Job that defines a `podFailurePolicy`:
+
+ {{< codenew file="/controllers/job-pod-failure-policy-example.yaml" >}}
+
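+ A minimal sketch of such a manifest is shown below; the image, command, and
+ completion counts are illustrative assumptions and may differ from the
+ referenced file:
+
+ ```yaml
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+   name: job-pod-failure-policy-example
+ spec:
+   completions: 8
+   parallelism: 2
+   backoffLimit: 6
+   template:
+     spec:
+       restartPolicy: Never          # required when using a Pod failure policy
+       containers:
+       - name: main
+         image: docker.io/library/bash:5
+         command: ["bash", "-c", "echo 'started' && sleep 30 && exit 42"]  # simulates a software bug
+   podFailurePolicy:
+     rules:
+     - action: FailJob               # exit code 42 marks the entire Job as failed
+       onExitCodes:
+         containerName: main         # only evaluate exit codes of the "main" container
+         operator: In
+         values: [42]
+     - action: Ignore                # disruption-caused failures don't count towards backoffLimit
+       onPodConditions:
+       - type: DisruptionTarget
+ ```
+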
+ In the example above, the first rule of the Pod failure policy specifies that
+ the Job should be marked failed if the `main` container fails with exit
+ code 42. The following are the rules for the `main` container specifically:
+
+ - an exit code of 0 means that the container succeeded
+ - an exit code of 42 means that the **entire Job** failed
+ - any other exit code represents that the container failed, and hence the entire
+   Pod. The Pod will be re-created if the total number of restarts is
+   below `backoffLimit`. If the `backoffLimit` is reached, the **entire Job** fails.
+
+ {{< note >}}
+ Because the Pod template specifies `restartPolicy: Never`,
+ the kubelet does not restart the `main` container in that particular Pod.
+ {{< /note >}}
+
+ The second rule of the Pod failure policy, specifying the `Ignore` action for
+ failed Pods with condition `DisruptionTarget`, excludes Pod disruptions from
+ being counted towards the `.spec.backoffLimit` limit of retries.
+
+ {{< note >}}
+ If the Job failed, whether due to the Pod failure policy or the Pod backoff
+ failure policy, and the Job is running multiple Pods, Kubernetes terminates all
+ the Pods in that Job that are still Pending or Running.
+ {{< /note >}}
+
+ These are some requirements and semantics of the API:
+ - if you want to use a `.spec.podFailurePolicy` field for a Job, you must
+   also define that Job's pod template with `.spec.restartPolicy` set to `Never`.
+ - the Pod failure policy rules you specify under `spec.podFailurePolicy.rules`
+   are evaluated in order. Once a rule matches a Pod failure, the remaining rules
+   are ignored. When no rule matches the Pod failure, the default
+   handling applies.
+ - you may want to restrict a rule to a specific container by specifying its name
+   in `spec.podFailurePolicy.rules[*].containerName`. When not specified, the rule
+   applies to all containers. When specified, it should match one of the container
+   or `initContainer` names in the Pod template (see the sketch after this list).
+ - you may specify the action taken when a Pod failure policy is matched by
+   `spec.podFailurePolicy.rules[*].action`. Possible values are:
+   - `FailJob`: use to indicate that the Pod's Job should be marked as Failed and
+     all running Pods should be terminated.
+   - `Ignore`: use to indicate that the counter towards the `.spec.backoffLimit`
+     should not be incremented and a replacement Pod should be created.
+   - `Count`: use to indicate that the Pod should be handled in the default way.
+     The counter towards the `.spec.backoffLimit` should be incremented.
+
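+ For example, a rule that only ignores failures of one hypothetical sidecar
+ container could look like the sketch below, showing only the `podFailurePolicy`
+ stanza of the Job spec; the container name and exit code values are illustrative
+ assumptions, not values required by the API:
+
+ ```yaml
+ podFailurePolicy:
+   rules:
+   - action: Ignore                 # don't count these failures towards .spec.backoffLimit
+     onExitCodes:
+       containerName: log-shipper   # hypothetical sidecar; omit to match any container
+       operator: In                 # one of: In, NotIn
+       values: [143]                # e.g. the sidecar was stopped with SIGTERM
+ ```
+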
+ {{< note >}}
+ When you use a `podFailurePolicy`, the job controller only matches Pods in the
+ `Failed` phase. Pods with a deletion timestamp that are not in a terminal phase
+ (`Failed` or `Succeeded`) are considered still terminating. This implies that
+ terminating pods retain a [tracking finalizer](#job-tracking-with-finalizers)
+ until they reach a terminal phase.
+ Since Kubernetes 1.27, the kubelet transitions deleted pods to a terminal phase
+ (see: [Pod Phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase)). This
+ ensures that deleted pods have their finalizers removed by the Job controller.
+ {{< /note >}}
+
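+ For illustration, a terminating Pod that the Job controller does not yet count
+ as failed might look like the following trimmed sketch; the Pod name and
+ timestamp are illustrative assumptions:
+
+ ```yaml
+ apiVersion: v1
+ kind: Pod
+ metadata:
+   name: job-pod-failure-policy-example-0-abcde   # hypothetical Pod name
+   deletionTimestamp: "2023-03-01T12:00:00Z"      # the Pod is being deleted
+   finalizers:
+   - batch.kubernetes.io/job-tracking             # tracking finalizer, removed once the Pod reaches a terminal phase
+ status:
+   phase: Running                                 # not yet Failed or Succeeded, so still terminating
+ ```
+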
## Job termination and cleanup

When a Job completes, no more Pods are created, but the Pods are [usually](#pod-backoff-failure-policy) not deleted either.
@@ -725,100 +819,6 @@ The new Job itself will have a different uid from `a8f3d00d-c6d2-11e5-9f87-42010
`manualSelector: true` tells the system that you know what you are doing and to allow this
mismatch.

- ### Pod failure policy {#pod-failure-policy}
-
- {{< feature-state for_k8s_version="v1.26" state="beta" >}}
-
- {{< note >}}
- You can only configure a Pod failure policy for a Job if you have the
- `JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
- enabled in your cluster. Additionally, it is recommended
- to enable the `PodDisruptionConditions` feature gate in order to be able to detect and handle
- Pod disruption conditions in the Pod failure policy (see also:
- [Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)). Both feature gates are
- available in Kubernetes {{< skew currentVersion >}}.
- {{< /note >}}
-
- A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
- your cluster to handle Pod failures based on the container exit codes and the
- Pod conditions.
-
- In some situations, you may want to have a better control when handling Pod
- failures than the control provided by the [Pod backoff failure policy](#pod-backoff-failure-policy),
- which is based on the Job's `.spec.backoffLimit`. These are some examples of use cases:
- * To optimize costs of running workloads by avoiding unnecessary Pod restarts,
-   you can terminate a Job as soon as one of its Pods fails with an exit code
-   indicating a software bug.
- * To guarantee that your Job finishes even if there are disruptions, you can
-   ignore Pod failures caused by disruptions (such {{< glossary_tooltip text="preemption" term_id="preemption" >}},
-   {{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
-   or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction) so
-   that they don't count towards the `.spec.backoffLimit` limit of retries.
-
- You can configure a Pod failure policy, in the `.spec.podFailurePolicy` field,
- to meet the above use cases. This policy can handle Pod failures based on the
- container exit codes and the Pod conditions.
-
- Here is a manifest for a Job that defines a `podFailurePolicy`:
-
- {{< codenew file="/controllers/job-pod-failure-policy-example.yaml" >}}
-
- In the example above, the first rule of the Pod failure policy specifies that
- the Job should be marked failed if the `main` container fails with the 42 exit
- code. The following are the rules for the `main` container specifically:
-
- - an exit code of 0 means that the container succeeded
- - an exit code of 42 means that the **entire Job** failed
- - any other exit code represents that the container failed, and hence the entire
-   Pod. The Pod will be re-created if the total number of restarts is
-   below `backoffLimit`. If the `backoffLimit` is reached the **entire Job** failed.
-
- {{< note >}}
- Because the Pod template specifies a `restartPolicy: Never`,
- the kubelet does not restart the `main` container in that particular Pod.
- {{< /note >}}
-
- The second rule of the Pod failure policy, specifying the `Ignore` action for
- failed Pods with condition `DisruptionTarget` excludes Pod disruptions from
- being counted towards the `.spec.backoffLimit` limit of retries.
-
- {{< note >}}
- If the Job failed, either by the Pod failure policy or Pod backoff
- failure policy, and the Job is running multiple Pods, Kubernetes terminates all
- the Pods in that Job that are still Pending or Running.
- {{< /note >}}
-
- These are some requirements and semantics of the API:
- - if you want to use a `.spec.podFailurePolicy` field for a Job, you must
-   also define that Job's pod template with `.spec.restartPolicy` set to `Never`.
- - the Pod failure policy rules you specify under `spec.podFailurePolicy.rules`
-   are evaluated in order. Once a rule matches a Pod failure, the remaining rules
-   are ignored. When no rule matches the Pod failure, the default
-   handling applies.
- - you may want to restrict a rule to a specific container by specifying its name
-   in`spec.podFailurePolicy.rules[*].containerName`. When not specified the rule
-   applies to all containers. When specified, it should match one the container
-   or `initContainer` names in the Pod template.
- - you may specify the action taken when a Pod failure policy is matched by
-   `spec.podFailurePolicy.rules[*].action`. Possible values are:
-   - `FailJob`: use to indicate that the Pod's job should be marked as Failed and
-     all running Pods should be terminated.
-   - `Ignore`: use to indicate that the counter towards the `.spec.backoffLimit`
-     should not be incremented and a replacement Pod should be created.
-   - `Count`: use to indicate that the Pod should be handled in the default way.
-     The counter towards the `.spec.backoffLimit` should be incremented.
-
- {{< note >}}
- When you use a `podFailurePolicy`, the job controller only matches Pods in the
- `Failed` phase. Pods with a deletion timestamp that are not in a terminal phase
- (`Failed` or `Succeeded`) are considered still terminating. This implies that
- terminating pods retain a [tracking finalizer](#job-tracking-with-finalizers)
- until they reach a terminal phase.
- Since Kubernetes 1.27, Kubelet transitions deleted pods to a terminal phase
- (see: [Pod Phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase)). This
- ensures that deleted pods have their finalizers removed by the Job controller.
- {{< /note >}}
-

### Job tracking with finalizers
{{< feature-state for_k8s_version="v1.26" state="stable" >}}