Commit 92a0032

KEP-3998: Add JobSuccessPolicy Documentation
Signed-off-by: Yuki Iwai <[email protected]>
1 parent d665f92 commit 92a0032

3 files changed: 96 additions & 0 deletions

content/en/docs/concepts/workloads/controllers/job.md

Lines changed: 57 additions & 0 deletions
@@ -1050,6 +1050,63 @@ after the operation: the built-in Job controller and the external controller
indicated by the field value.
{{< /warning >}}

### Success policy {#success-policy}

{{< feature-state for_k8s_version="v1.30" state="alpha" >}}

{{< note >}}
You can only configure a success policy for an Indexed Job if you have the
`JobSuccessPolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster.
{{< /note >}}

When you run an Indexed Job, a success policy, defined in the `.spec.successPolicy` field,
lets you define when the Job can be declared as succeeded based on the Pods that have succeeded.

In some situations, you may want finer control over when a Job is declared succeeded
than `.spec.completions` provides.
Some example use cases:

* To reduce the cost of running workloads by avoiding unnecessary Pod runs,
  you can terminate a Job as soon as one of its Pods succeeds.
* In batch workloads such as MPI and PyTorch, you may want only the leader index
  to determine the success or failure of a Job.

To meet these use cases, configure a success policy in the `.spec.successPolicy` field.
The policy determines Job success based on the succeeded Pods.
After the Job meets the success policy, the Job controller terminates the lingering Pods.

When you specify only `.spec.successPolicy.rules[*].succeededIndexes`,
the Job is marked as succeeded once all indexes listed in `succeededIndexes` have succeeded.
The `succeededIndexes` value must be a list of indexes between 0 and `.spec.completions-1`
and must not contain duplicate indexes. The list is written as comma-separated entries,
where each entry is either a single index or an interval given by its first and last
element separated by a hyphen.
For example, to specify the indexes 1, 3, 4, 5 and 7, set `succeededIndexes` to `1,3-5,7`.
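
The interval syntax can be illustrated with a short Python sketch. The helper name `parse_succeeded_indexes` is hypothetical and is not part of any Kubernetes API. It only mirrors the constraints described above (indexes within 0 to `.spec.completions-1`, no duplicates) and is not the Job controller's actual validation code.

```python
def parse_succeeded_indexes(value: str, completions: int) -> list[int]:
    """Expand an interval string such as "1,3-5,7" into a list of indexes.

    Hypothetical helper for illustration only.
    """
    indexes: list[int] = []
    seen: set[int] = set()
    for part in value.split(","):
        # Each entry is either a single index ("7") or an interval ("3-5").
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start
        if start > end:
            raise ValueError(f"invalid interval: {part!r}")
        for index in range(start, end + 1):
            if not 0 <= index < completions:
                raise ValueError(f"index {index} is outside 0..{completions - 1}")
            if index in seen:
                raise ValueError(f"duplicate index: {index}")
            seen.add(index)
            indexes.append(index)
    return indexes


print(parse_succeeded_indexes("1,3-5,7", completions=10))  # [1, 3, 4, 5, 7]
```

In this sketch, an out-of-range index or a repeated index raises `ValueError` rather than being silently dropped.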

When you specify only `.spec.successPolicy.rules[*].succeededCount`,
the Job is marked as succeeded once the number of succeeded indexes reaches `succeededCount`.

When you specify both `succeededIndexes` and `succeededCount`,
the Job is marked as succeeded once the number of succeeded indexes
from the subset listed in `succeededIndexes` reaches `succeededCount`.

When you specify multiple rules in `.spec.successPolicy.rules`,
the rules are evaluated in order. Once the Job meets a rule, the remaining rules are ignored.
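
How the three rule forms and the in-order evaluation fit together can be sketched in Python as follows. This is a simplified model under stated assumptions, not the Job controller's implementation: `Rule`, `expand`, `rule_is_met`, and `first_matching_rule` are hypothetical names, and the sketch assumes each `succeededIndexes` string is already valid.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    # Hypothetical model of one entry in .spec.successPolicy.rules.
    succeeded_indexes: Optional[str] = None  # for example "0-2" or "1,3-5,7"
    succeeded_count: Optional[int] = None


def expand(value: str) -> set[int]:
    """Expand "1,3-5,7" into {1, 3, 4, 5, 7}; assumes valid input."""
    result: set[int] = set()
    for part in value.split(","):
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start
        result.update(range(start, end + 1))
    return result


def rule_is_met(rule: Rule, succeeded: set[int]) -> bool:
    if rule.succeeded_indexes is None:
        # Only succeededCount: count every succeeded index.
        return rule.succeeded_count is not None and len(succeeded) >= rule.succeeded_count
    wanted = expand(rule.succeeded_indexes)
    if rule.succeeded_count is None:
        # Only succeededIndexes: all listed indexes must have succeeded.
        return wanted <= succeeded
    # Both: count only succeeded indexes that are in the listed subset.
    return len(wanted & succeeded) >= rule.succeeded_count


def first_matching_rule(rules: list[Rule], succeeded: set[int]) -> Optional[int]:
    """Return the position of the first rule that is met, or None."""
    for position, rule in enumerate(rules):
        if rule_is_met(rule, succeeded):
            return position  # the remaining rules are ignored
    return None


rules = [
    Rule(succeeded_indexes="0-2", succeeded_count=1),
    Rule(succeeded_count=5),
]
print(first_matching_rule(rules, succeeded={1}))     # 0
print(first_matching_rule(rules, succeeded={4, 5}))  # None
```

With these rules, a single success at index 1 satisfies the first rule, so the second rule is never consulted. Indexes 4 and 5 alone satisfy neither rule.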

Here is a manifest for a Job with `successPolicy`:

{{% code_sample file="/controllers/job-success-policy-example.yaml" %}}

In the example above, the rule of the success policy specifies that
the Job is marked as succeeded, and its lingering Pods are terminated,
once any one of the indexes 0, 1, and 2 has succeeded.

{{< note >}}
When you specify both a success policy and terminating policies such as `.spec.backoffLimit` and `.spec.podFailurePolicy`,
and the Job meets both, the terminating policies take precedence and the success policy is ignored.
{{< /note >}}

## Alternatives

### Bare Pods
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
---
title: JobSuccessPolicy
content_type: feature_gate

_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.30"
---
Allow users to specify when a Job can be declared as succeeded based on the set of succeeded pods.
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed # Required for the feature
  successPolicy:
    rules:
      - succeededIndexes: 0-2
        succeededCount: 1
  template:
    spec:
      containers:
      - name: main
        image: python
        command:          # The Job succeeds because one index
                          # among indexes 0, 1, and 2 succeeds.
          - python3
          - -c
          - |
            import os, sys
            if os.environ.get("JOB_COMPLETION_INDEX") == "1":
              sys.exit(0)
            else:
              sys.exit(1)
