Commit deb1be8

Merge pull request #45135 from tenzen-y/job-success-policy-doc
Add JobSuccessPolicy Doc
2 parents 3d33323 + 7465256 commit deb1be8

3 files changed, +95 -0 lines changed

content/en/docs/concepts/workloads/controllers/job.md

Lines changed: 56 additions & 0 deletions
@@ -550,6 +550,62 @@ terminating Pods only once these Pods reach the terminal `Failed` phase. This be
to `podReplacementPolicy: Failed`. For more information, see [Pod replacement policy](#pod-replacement-policy).
{{< /note >}}

## Success policy {#success-policy}

{{< feature-state feature_gate_name="JobSuccessPolicy" >}}

{{< note >}}
You can only configure a success policy for an Indexed Job if you have the
`JobSuccessPolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster.
{{< /note >}}

When creating an Indexed Job, you can define when a Job can be declared as succeeded using a `.spec.successPolicy`,
based on the pods that succeeded.

By default, a Job succeeds when the number of succeeded Pods equals `.spec.completions`.
These are some situations where you might want additional control for declaring a Job succeeded:

* When running simulations with different parameters,
  you might not need all the simulations to succeed for the overall Job to be successful.
* When following a leader-worker pattern, only the success of the leader determines the success or
  failure of a Job. Examples of this are frameworks such as MPI and PyTorch.

You can configure a success policy, in the `.spec.successPolicy` field,
to meet the above use cases. This policy can handle Job success based on the
succeeded pods. After the Job meets the success policy, the job controller terminates the lingering Pods.
A success policy is defined by rules. Each rule can take one of the following forms:

* When you specify the `succeededIndexes` only,
  once all indexes specified in the `succeededIndexes` succeed, the job controller marks the Job as succeeded.
  The `succeededIndexes` must be a list of intervals between 0 and `.spec.completions-1`.
* When you specify the `succeededCount` only,
  once the number of succeeded indexes reaches the `succeededCount`, the job controller marks the Job as succeeded.
* When you specify both `succeededIndexes` and `succeededCount`,
  once the number of succeeded indexes from the subset of indexes specified in the `succeededIndexes` reaches the `succeededCount`,
  the job controller marks the Job as succeeded.

Note that when you specify multiple rules in the `.spec.successPolicy.rules`,
the job controller evaluates the rules in order. Once the Job meets a rule, the job controller ignores remaining rules.
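
The three rule forms and the in-order evaluation described above can be sketched in Python. This is an illustrative model only, not the job controller's implementation: the helper names `rule_met` and `first_matching_rule` are hypothetical, and the sketch assumes that `succeededIndexes` has already been expanded into a set of integers.

```python
from typing import Optional, Sequence, Set


def rule_met(succeeded: Set[int],
             succeeded_indexes: Optional[Set[int]] = None,
             succeeded_count: Optional[int] = None) -> bool:
    """Return True if one success-policy rule is satisfied (hypothetical helper)."""
    if succeeded_indexes is None and succeeded_count is None:
        # Assumption of this sketch: a rule with neither field never matches.
        return False
    if succeeded_count is None:
        # Only succeededIndexes: every listed index must have succeeded.
        return succeeded_indexes <= succeeded
    # succeededCount, optionally restricted to the listed subset of indexes.
    if succeeded_indexes is None:
        candidates = succeeded
    else:
        candidates = succeeded & succeeded_indexes
    return len(candidates) >= succeeded_count


def first_matching_rule(rules: Sequence[dict], succeeded: Set[int]) -> Optional[int]:
    """Evaluate rules in order; return the position of the first rule met, or None."""
    for position, rule in enumerate(rules):
        if rule_met(succeeded,
                    rule.get("succeededIndexes"),
                    rule.get("succeededCount")):
            return position  # Remaining rules are ignored.
    return None
```

For example, with the rules `[{"succeededIndexes": {0, 2, 3}, "succeededCount": 1}, {"succeededCount": 5}]`, a Job whose only succeeded index is 2 meets the first rule, so the second rule is never consulted.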

Here is a manifest for a Job with `successPolicy`:

{{% code_sample file="/controllers/job-success-policy.yaml" %}}

In the example above, the success policy rule specifies that
the job controller should mark the Job as succeeded and terminate the lingering Pods
once any one of the indexes 0, 2, and 3 succeeds.
A Job that meets the success policy gets the `SuccessCriteriaMet` condition.
After the removal of the lingering Pods is issued, the Job gets the `Complete` condition.
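
One way to observe these two conditions is to inspect `.status.conditions` on the Job object, for example in the JSON printed by `kubectl get job <name> -o json`. The sketch below assumes that JSON has already been loaded into a Python dict; the function name `success_state` and the labels it returns are hypothetical and exist only for illustration.

```python
def success_state(job: dict) -> str:
    """Classify a Job dict by the success-related conditions that are True."""
    conditions = job.get("status", {}).get("conditions", [])
    # Collect the types of all conditions whose status is the string "True".
    true_types = {c.get("type") for c in conditions if c.get("status") == "True"}
    if "Complete" in true_types:
        return "complete"
    if "SuccessCriteriaMet" in true_types:
        # The policy is met, but the Complete condition has not been added yet.
        return "success-criteria-met"
    return "pending"
```

A Job that reports only `SuccessCriteriaMet` is classified as `"success-criteria-met"`; once `Complete` is also true, the result is `"complete"`.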

Note that `succeededIndexes` is a list of intervals separated by commas.
Each interval is either a single index or the first and last index of a series, separated by a hyphen.
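
To make the interval notation concrete, here is a minimal Python sketch that expands a string such as `0,2-3` into the set of indexes it denotes. The function name `parse_succeeded_indexes` is hypothetical, and the bounds check simply mirrors the 0 to `.spec.completions-1` range stated earlier; the real API validation may apply further rules that this sketch does not model.

```python
from typing import Set


def parse_succeeded_indexes(spec: str, completions: int) -> Set[int]:
    """Expand an interval string like "0,2-3" into a set of indexes."""
    indexes: Set[int] = set()
    for interval in spec.split(","):
        # A single number has no hyphen; a range is "first-last".
        first, hyphen, last = interval.partition("-")
        start = int(first)
        end = int(last) if hyphen else start
        if start > end or start < 0 or end > completions - 1:
            raise ValueError(f"invalid interval {interval!r}")
        indexes.update(range(start, end + 1))
    return indexes
```

With `completions` set to 10, the string `0,2-3` from the example manifest expands to the indexes 0, 2, and 3.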

{{< note >}}
When you specify both a success policy and some terminating policies such as `.spec.backoffLimit` and `.spec.podFailurePolicy`,
once the Job meets either policy, the job controller respects the terminating policy and ignores the success policy.
{{< /note >}}

## Job termination and cleanup

When a Job completes, no more Pods are created, but the Pods are [usually](#pod-backoff-failure-policy) not deleted either.
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
---
title: JobSuccessPolicy
content_type: feature_gate

_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.30"
---
Allow users to specify when a Job can be declared as succeeded based on the set of succeeded pods.
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed # Required for the success policy
  successPolicy:
    rules:
      - succeededIndexes: 0,2-3
        succeededCount: 1
  template:
    spec:
      containers:
      - name: main
        image: python
        command:          # Provided that at least one of the Pods with 0, 2, and 3 indexes has succeeded,
                          # the overall Job is a success.
          - python3
          - -c
          - |
            import os, sys
            if os.environ.get("JOB_COMPLETION_INDEX") == "2":
              sys.exit(0)
            else:
              sys.exit(1)
