
Commit 3c7cbd9

Merge pull request #46808 from mimowo/managed-by-docs-update
Update the docs for JobManagedBy and JobPodReplacementPolicy related to pod termination
2 parents 45a47d1 + f2b9799

File tree: 2 files changed, +88 −4 lines changed

content/en/docs/concepts/workloads/controllers/job.md

Lines changed: 79 additions & 0 deletions
@@ -436,12 +436,22 @@ kubectl get -o yaml job job-backoff-limit-per-index-example
   succeeded: 5 # 1 succeeded pod for each of 5 succeeded indexes
   failed: 10 # 2 failed pods (1 retry) for each of 5 failed indexes
   conditions:
+  - message: Job has failed indexes
+    reason: FailedIndexes
+    status: "True"
+    type: FailureTarget
   - message: Job has failed indexes
     reason: FailedIndexes
     status: "True"
     type: Failed
 ```

+The Job controller adds the `FailureTarget` Job condition to trigger
+[Job termination and cleanup](#job-termination-and-cleanup). When all of the
+Job Pods are terminated, the Job controller adds the `Failed` condition
+with the same values for `reason` and `message` as the `FailureTarget` Job
+condition. For details, see [Termination of Job Pods](#termination-of-job-pods).
+
 Additionally, you may want to use the per-index backoff along with a
 [pod failure policy](#pod-failure-policy). When using
 per-index backoff, there is a new `FailIndex` action available which allows you to
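
The two-stage sequence that the added paragraph describes, `FailureTarget` first and `Failed` only after all Pods are gone, can be observed with `kubectl wait`. A minimal sketch, assuming the example Job from this page exists in the current namespace and picking an arbitrary timeout:

```shell
# Returns as soon as the Job controller marks the Job as doomed to fail,
# possibly while some of its Pods are still terminating.
kubectl wait --for=condition=FailureTarget \
  job/job-backoff-limit-per-index-example --timeout=120s

# Returns only once all of the Job's Pods are terminated.
kubectl wait --for=condition=Failed \
  job/job-backoff-limit-per-index-example --timeout=120s
```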
@@ -541,6 +551,11 @@ terminating Pods only once these Pods reach the terminal `Failed` phase. This be
 to `podReplacementPolicy: Failed`. For more information, see [Pod replacement policy](#pod-replacement-policy).
 {{< /note >}}

+When you use the `podFailurePolicy`, and the Job fails because a Pod
+matches the rule with the `FailJob` action, the Job controller triggers
+the Job termination process by adding the `FailureTarget` condition.
+For more details, see [Job termination and cleanup](#job-termination-and-cleanup).
+
 ## Success policy {#success-policy}

 {{< feature-state feature_gate_name="JobSuccessPolicy" >}}
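
A minimal sketch of a Job that fails through a `FailJob` rule, so that the `FailureTarget` condition described in the added paragraph appears; the Job name, image, and exit code are illustrative and not taken from this commit:

```shell
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: failjob-demo              # hypothetical name
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob             # one matching Pod failure fails the whole Job
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never        # required when using podFailurePolicy
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash", "-c", "exit 42"]   # simulates a non-retriable bug
EOF
```

Because exit code 42 matches the rule, the Job controller adds the `FailureTarget` condition right away instead of retrying up to `backoffLimit`.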
@@ -647,6 +662,70 @@ there is no automatic Job restart once the Job status is `type: Failed`.
 That is, the Job termination mechanisms activated with `.spec.activeDeadlineSeconds`
 and `.spec.backoffLimit` result in a permanent Job failure that requires manual intervention to resolve.

+### Terminal Job conditions
+
+A Job has two possible terminal states, each of which has a corresponding Job
+condition:
+* Succeeded: Job condition `Complete`
+* Failed: Job condition `Failed`
+
+Jobs fail for the following reasons:
+- The number of Pod failures exceeded the specified `.spec.backoffLimit` in the Job
+  specification. For details, see [Pod backoff failure policy](#pod-backoff-failure-policy).
+- The Job runtime exceeded the specified `.spec.activeDeadlineSeconds`.
+- An indexed Job that used `.spec.backoffLimitPerIndex` has failed indexes.
+  For details, see [Backoff limit per index](#backoff-limit-per-index).
+- The number of failed indexes in the Job exceeded the specified
+  `.spec.maxFailedIndexes`. For details, see [Backoff limit per index](#backoff-limit-per-index).
+- A failed Pod matches a rule in `.spec.podFailurePolicy` that has the `FailJob`
+  action. For details about how Pod failure policy rules might affect failure
+  evaluation, see [Pod failure policy](#pod-failure-policy).
+
+Jobs succeed for the following reasons:
+- The number of succeeded Pods reached the specified `.spec.completions`.
+- The criteria specified in `.spec.successPolicy` are met. For details, see
+  [Success policy](#success-policy).
+
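
To see which of the reasons above applied to a given Job, the conditions can be listed directly. A sketch with a hypothetical Job name `my-job`:

```shell
# Prints each condition's type and reason, for example
# "Failed  BackoffLimitExceeded", depending on which criteria above were met.
kubectl get job my-job \
  -o jsonpath=$'{range .status.conditions[*]}{.type}\t{.reason}\n{end}'
```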
+In Kubernetes v1.31 and later, the Job controller delays the addition of the
+terminal conditions, `Failed` or `Complete`, until all of the Job Pods are terminated.
+
+In Kubernetes v1.30 and earlier, the Job controller added the `Complete` or the
+`Failed` Job terminal conditions as soon as the Job termination process was
+triggered and all Pod finalizers were removed. However, some Pods would still
+be running or terminating at the moment that the terminal condition was added.
+
+In Kubernetes v1.31 and later, the controller only adds the Job terminal conditions
+_after_ all of the Pods are terminated. You can enable this behavior by using the
+`JobManagedBy` or the `JobPodReplacementPolicy` (enabled by default)
+[feature gates](/docs/reference/command-line-tools-reference/feature-gates/).
+
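
For clusters where this behavior is not yet active, a hedged sketch of enabling the alpha `JobManagedBy` gate via the standard `--feature-gates` flag; how the flag reaches the components depends on how the cluster was provisioned, and `JobPodReplacementPolicy` needs no flag since it is on by default:

```shell
# Other flags elided; set the gate on both components to be safe.
kube-apiserver          --feature-gates=JobManagedBy=true ...
kube-controller-manager --feature-gates=JobManagedBy=true ...
```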
+### Termination of Job pods
+
+The Job controller adds the `FailureTarget` condition or the `SuccessCriteriaMet`
+condition to the Job to trigger Pod termination after a Job meets either the
+success or failure criteria.
+
+Factors like `terminationGracePeriodSeconds` might increase the amount of time
+from the moment that the Job controller adds the `FailureTarget` condition or the
+`SuccessCriteriaMet` condition to the moment that all of the Job Pods terminate
+and the Job controller adds a [terminal condition](#terminal-job-conditions)
+(`Failed` or `Complete`).
+
+You can use the `FailureTarget` or the `SuccessCriteriaMet` condition to evaluate
+whether the Job has failed or succeeded without having to wait for the controller
+to add a terminal condition.
+
+For example, you might want to decide when to create a replacement Job
+that replaces a failed Job. If you replace the failed Job when the `FailureTarget`
+condition appears, your replacement Job runs sooner, but could result in Pods
+from the failed and the replacement Job running at the same time, using
+extra compute resources.
+
+Alternatively, if your cluster has limited resource capacity, you could choose to
+wait until the `Failed` condition appears on the Job, which would delay your
+replacement Job but would ensure that you conserve resources by waiting
+until all of the failed Pods are removed.
+
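
The trade-off in the last two added paragraphs, as a sketch; the Job name `my-job` and the manifest `my-job-replacement.yaml` are hypothetical:

```shell
# Fast handover: react as soon as the Job is doomed to fail. Pods of the
# failed Job may still be terminating while the replacement starts.
kubectl wait --for=condition=FailureTarget job/my-job --timeout=10m

# Conservative handover: use this instead to wait until every Pod of the
# failed Job is terminated, conserving cluster capacity.
# kubectl wait --for=condition=Failed job/my-job --timeout=10m

kubectl apply -f my-job-replacement.yaml
```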
 ## Clean up finished jobs automatically

 Finished Jobs are usually no longer needed in the system. Keeping them around in

content/en/docs/tasks/job/pod-failure-policy.md

Lines changed: 9 additions & 4 deletions
@@ -50,10 +50,15 @@ After around 30s the entire Job should be terminated. Inspect the status of the
 kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
 ```

-In the Job status, see a job `Failed` condition with the field `reason`
-equal `PodFailurePolicy`. Additionally, the `message` field contains a
-more detailed information about the Job termination, such as:
-`Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
+In the Job status, the following conditions appear:
+- `FailureTarget` condition: has a `reason` field set to `PodFailurePolicy` and
+  a `message` field with more information about the termination, like
+  `Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
+  The Job controller adds this condition as soon as the Job is considered a failure.
+  For details, see [Termination of Job Pods](/docs/concepts/workloads/controllers/job/#termination-of-job-pods).
+- `Failed` condition: same `reason` and `message` as the `FailureTarget`
+  condition. The Job controller adds this condition after all of the Job's Pods
+  are terminated.

 For comparison, if the Pod failure policy was disabled it would take 6 retries
 of the Pod, taking at least 2 minutes.
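
A small companion to the revised list: extracting both conditions from the Job used in this task, which shows `FailureTarget` carrying the same `reason` and `message` that `Failed` later repeats:

```shell
kubectl get jobs -l job-name=job-pod-failure-policy-failjob \
  -o jsonpath=$'{range .items[*].status.conditions[*]}{.type}: {.reason}: {.message}\n{end}'
```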
