Merge pull request #45135 from tenzen-y/job-success-policy-doc

k8s-ci-robot · web-flow · commit deb1be8892be · 2024-03-25T20:09:19.000-07:00
Add JobSuccessPolicy Doc
diff --git a/content/en/docs/concepts/workloads/controllers/job.md b/content/en/docs/concepts/workloads/controllers/job.md
@@ -550,6 +550,62 @@ terminating Pods only once these Pods reach the terminal `Failed` phase. This be
 to `podReplacementPolicy: Failed`. For more information, see [Pod replacement policy](#pod-replacement-policy).
 {{< /note >}}
 
+## Success policy {#success-policy}
+
+{{< feature-state feature_gate_name="JobSuccessPolicy" >}}
+
+{{< note >}}
+You can only configure a success policy for an Indexed Job if you have the
+`JobSuccessPolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+enabled in your cluster.
+{{< /note >}}
+
+When creating an Indexed Job, you can use `.spec.successPolicy` to define when the Job can be declared as succeeded,
+based on the Pods that succeeded.
+
+By default, a Job succeeds when the number of succeeded Pods equals `.spec.completions`.
+These are some situations where you might want additional control for declaring a Job succeeded:
+
+* When running simulations with different parameters,
+  you might not need all the simulations to succeed for the overall Job to be successful.
+* When following a leader-worker pattern, only the success of the leader determines the success or
+  failure of a Job. Examples include frameworks such as MPI and PyTorch.
+
+You can configure a success policy, in the `.spec.successPolicy` field,
+to meet the above use cases. This policy determines Job success based on the
+succeeded Pods. After the Job meets the success policy, the job controller terminates the lingering Pods.
+A success policy is defined by rules. Each rule can take one of the following forms:
+
+* When you specify the `succeededIndexes` only,
+  once all indexes specified in the `succeededIndexes` succeed, the job controller marks the Job as succeeded.
+  The `succeededIndexes` must be a list of intervals between 0 and `.spec.completions-1`.
+* When you specify the `succeededCount` only,
+  once the number of succeeded indexes reaches the `succeededCount`, the job controller marks the Job as succeeded.
+* When you specify both `succeededIndexes` and `succeededCount`,
+  once the number of succeeded indexes from the subset of indexes specified in the `succeededIndexes` reaches the `succeededCount`,
+  the job controller marks the Job as succeeded.
+
+Note that when you specify multiple rules in `.spec.successPolicy.rules`,
+the job controller evaluates the rules in order. Once the Job meets a rule, the job controller ignores remaining rules.
+
+Here is a manifest for a Job with `successPolicy`:
+
+{{% code_sample file="/controllers/job-success-policy.yaml" %}}
+
+In the example above, the rule of the success policy specifies that
+the job controller should mark the Job as succeeded and terminate the lingering Pods
+if any one of the indexes 0, 2, and 3 succeeds.
+A Job that meets the success policy gets the `SuccessCriteriaMet` condition.
+After the removal of the lingering Pods is issued, the Job gets the `Complete` condition.
+
+Note that `succeededIndexes` is represented as a comma-separated list of indexes and intervals.
+An interval is represented by the first and last element of the series, separated by a hyphen.
+
+{{< note >}}
+When you specify both a success policy and some terminating policies such as `.spec.backoffLimit` and `.spec.podFailurePolicy`,
+once the Job meets either policy, the job controller respects the terminating policy and ignores the success policy.
+{{< /note >}}
+
 ## Job termination and cleanup
 
 When a Job completes, no more Pods are created, but the Pods are [usually](#pod-backoff-failure-policy) not deleted either.
diff --git a/content/en/docs/reference/command-line-tools-reference/feature-gates/job-success-policy.md b/content/en/docs/reference/command-line-tools-reference/feature-gates/job-success-policy.md
@@ -0,0 +1,14 @@
+---
+title: JobSuccessPolicy
+content_type: feature_gate
+
+_build:
+  list: never
+  render: false
+
+stages:
+  - stage: alpha
+    defaultValue: false
+    fromVersion: "1.30"
+---
+Allow users to specify when a Job can be declared as succeeded based on the set of succeeded pods.
diff --git a/content/en/examples/controllers/job-success-policy.yaml b/content/en/examples/controllers/job-success-policy.yaml
@@ -0,0 +1,25 @@
+apiVersion: batch/v1
+kind: Job
+spec:
+  parallelism: 10
+  completions: 10
+  completionMode: Indexed # Required for the success policy
+  successPolicy:
+    rules:
+      - succeededIndexes: 0,2-3
+        succeededCount: 1
+  template:
+    spec:
+      containers:
+      - name: main
+        image: python
+        command:          # Provided that at least one of the Pods with 0, 2, and 3 indexes has succeeded,
+                          # the overall Job is a success.
+          - python3
+          - -c
+          - |
+            import os, sys
+            if os.environ.get("JOB_COMPLETION_INDEX") == "2":
+              sys.exit(0)
+            else:
+              sys.exit(1)
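
The rule semantics described in the Success policy section of this change can be sketched in Python. This is an illustrative model only, not the job controller's implementation: the `Rule` class and the `parse_indexes`, `rule_met`, and `first_matching_rule` helpers are hypothetical names introduced here, and the sketch assumes `succeededIndexes` is a well-formed string such as `0,2-3`.

```python
from dataclasses import dataclass
from typing import Optional


def parse_indexes(spec: str) -> set[int]:
    """Expand a string such as "0,2-3" into the set {0, 2, 3}."""
    indexes: set[int] = set()
    for part in spec.split(","):
        # "2-3" is an interval; a bare "0" is a single index.
        first, _, last = part.partition("-")
        indexes.update(range(int(first), int(last or first) + 1))
    return indexes


@dataclass
class Rule:
    # Hypothetical stand-in for one entry of .spec.successPolicy.rules.
    succeeded_indexes: Optional[str] = None
    succeeded_count: Optional[int] = None


def rule_met(rule: Rule, succeeded: set[int]) -> bool:
    """Return True if the set of succeeded indexes satisfies the rule."""
    if rule.succeeded_indexes is not None:
        wanted = parse_indexes(rule.succeeded_indexes)
        relevant = succeeded & wanted  # only count indexes named by the rule
    else:
        wanted = None
        relevant = succeeded
    if rule.succeeded_count is not None:
        # succeededCount alone, or succeededCount within the listed subset.
        return len(relevant) >= rule.succeeded_count
    # succeededIndexes alone: every listed index must have succeeded.
    return wanted is not None and relevant == wanted


def first_matching_rule(rules: list[Rule], succeeded: set[int]) -> Optional[int]:
    """Evaluate rules in order; later rules are ignored once one is met."""
    for position, rule in enumerate(rules):
        if rule_met(rule, succeeded):
            return position
    return None
```

With the rule from `job-success-policy.yaml` (`succeededIndexes: 0,2-3`, `succeededCount: 1`), `rule_met(Rule("0,2-3", 1), {2})` is true as soon as index 2 succeeds, which matches the example container that exits with 0 only for index 2.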