
Commit 0d639b9

mimowotengqmsftim authored
Docs update as we promote JobBackoffLimitPerIndex to stable (#49811)
* Update docs as JobBackoffLimitPerIndex graduates to stable

* Add an example for PodFailurePolicy with FailIndex

* Review remarks

Co-authored-by: Qiming Teng <[email protected]>
Co-authored-by: Tim Bannister <[email protected]>

* Review remark - new section and aligning the sections

* Update content/en/docs/tasks/job/pod-failure-policy.md

Co-authored-by: Tim Bannister <[email protected]>

---------

Co-authored-by: Qiming Teng <[email protected]>
Co-authored-by: Tim Bannister <[email protected]>
1 parent 38f74cf commit 0d639b9

File tree

4 files changed (+155, −49 lines):

- content/en/docs/concepts/workloads/controllers/job.md
- content/en/docs/reference/command-line-tools-reference/feature-gates/JobBackoffLimitPerIndex.md
- content/en/docs/tasks/job/pod-failure-policy.md
- content/en/examples/controllers/job-backoff-limit-per-index-failindex.yaml

content/en/docs/concepts/workloads/controllers/job.md

Lines changed: 1 addition & 7 deletions
````diff
@@ -383,13 +383,7 @@ from failed Jobs is not lost inadvertently.
 
 ### Backoff limit per index {#backoff-limit-per-index}
 
-{{< feature-state for_k8s_version="v1.29" state="beta" >}}
-
-{{< note >}}
-You can only configure the backoff limit per index for an [Indexed](#completion-mode) Job, if you
-have the `JobBackoffLimitPerIndex` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
-enabled in your cluster.
-{{< /note >}}
+{{< feature-state feature_gate_name="JobBackoffLimitPerIndex" >}}
 
 When you run an [indexed](#completion-mode) Job, you can choose to handle retries
 for pod failures independently for each index. To do so, set the
````
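For orientation alongside this hunk (not part of the commit itself), a minimal sketch of an Indexed Job that uses the per-index backoff described above could look like the following. The Job name, image, and the particular limit values are illustrative assumptions, not values taken from the docs.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: per-index-backoff-sketch       # illustrative name
spec:
  completions: 3
  parallelism: 3
  completionMode: Indexed               # backoff limit per index applies only to Indexed Jobs
  backoffLimitPerIndex: 2               # each index may be retried up to 2 times
  maxFailedIndexes: 1                   # optional: fail the whole Job once more than 1 index has failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/busybox:1.36
        command: ["sh", "-c", "exit 0"]   # placeholder workload
```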

content/en/docs/reference/command-line-tools-reference/feature-gates/JobBackoffLimitPerIndex.md

Lines changed: 4 additions & 0 deletions
````diff
@@ -14,6 +14,10 @@ stages:
 - stage: beta
   defaultValue: true
   fromVersion: "1.29"
+  toVersion: "1.32"
+- stage: stable
+  defaultValue: true
+  fromVersion: "1.33"
 ---
 Allows specifying the maximal number of pod
 retries per index in Indexed jobs.
````
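Reading the hunk as a whole, the beta and stable entries of the `stages` list in this feature-gate description end up as below after the change (any earlier stages fall outside the hunk and are not shown):

```yaml
stages:
- stage: beta
  defaultValue: true
  fromVersion: "1.29"
  toVersion: "1.32"
- stage: stable
  defaultValue: true
  fromVersion: "1.33"
```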

content/en/docs/tasks/job/pod-failure-policy.md

Lines changed: 110 additions & 42 deletions
````diff
@@ -28,42 +28,50 @@ You should already be familiar with the basic use of [Job](/docs/concepts/worklo
 
 {{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
 
-## Using Pod failure policy to avoid unnecessary Pod retries
+## Usage scenarios
+
+Consider the following usage scenarios for Jobs that define a Pod failure policy :
+- [Avoiding unnecessary Pod retries](#pod-failure-policy-failjob)
+- [Ignoring Pod disruptions](#pod-failure-policy-ignore)
+- [Avoiding unnecessary Pod retries based on custom Pod Conditions](#pod-failure-policy-config-issue)
+- [Avoiding unnecessary Pod retries per index](#backoff-limit-per-index-failindex)
+
+### Using Pod failure policy to avoid unnecessary Pod retries {#pod-failure-policy-failjob}
 
 With the following example, you can learn how to use Pod failure policy to
 avoid unnecessary Pod restarts when a Pod failure indicates a non-retriable
 software bug.
 
-First, create a Job based on the config:
+1. Examine the following manifest:
 
-{{% code_sample file="/controllers/job-pod-failure-policy-failjob.yaml" %}}
+   {{% code_sample file="/controllers/job-pod-failure-policy-failjob.yaml" %}}
 
-by running:
+1. Apply the manifest:
 
-```sh
-kubectl create -f job-pod-failure-policy-failjob.yaml
-```
+   ```sh
+   kubectl create -f https://k8s.io/examples/controllers/job-pod-failure-policy-failjob.yaml
+   ```
 
-After around 30s the entire Job should be terminated. Inspect the status of the Job by running:
+1. After around 30 seconds the entire Job should be terminated. Inspect the status of the Job by running:
 
-```sh
-kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
-```
+   ```sh
+   kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
+   ```
 
-In the Job status, the following conditions display:
-- `FailureTarget` condition: has a `reason` field set to `PodFailurePolicy` and
-  a `message` field with more information about the termination, like
-  `Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
-  The Job controller adds this condition as soon as the Job is considered a failure.
-  For details, see [Termination of Job Pods](/docs/concepts/workloads/controllers/job/#termination-of-job-pods).
-- `Failed` condition: same `reason` and `message` as the `FailureTarget`
-  condition. The Job controller adds this condition after all of the Job's Pods
-  are terminated.
+   In the Job status, the following conditions display:
+   - `FailureTarget` condition: has a `reason` field set to `PodFailurePolicy` and
+     a `message` field with more information about the termination, like
+     `Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
+     The Job controller adds this condition as soon as the Job is considered a failure.
+     For details, see [Termination of Job Pods](/docs/concepts/workloads/controllers/job/#termination-of-job-pods).
+   - `Failed` condition: same `reason` and `message` as the `FailureTarget`
+     condition. The Job controller adds this condition after all of the Job's Pods
+     are terminated.
 
-For comparison, if the Pod failure policy was disabled it would take 6 retries
-of the Pod, taking at least 2 minutes.
+   For comparison, if the Pod failure policy was disabled it would take 6 retries
+   of the Pod, taking at least 2 minutes.
 
-### Clean up
+#### Clean up
 
 Delete the Job you created:
 
````
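As a reading aid for the conditions described in the hunk above (not part of the diff), the Job's `.status` could end up looking roughly like this sketch; the Pod name is taken from the example message, while the `failed` count and the overall shape are assumptions:

```yaml
status:
  failed: 1                      # assumed: a single failed Pod, since the FailJob rule stops further retries
  conditions:
  - type: FailureTarget          # added as soon as the Job is considered a failure
    status: "True"
    reason: PodFailurePolicy
    message: >-
      Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed
      with exit code 42 matching FailJob rule at index 0
  - type: Failed                 # added after all of the Job's Pods are terminated
    status: "True"
    reason: PodFailurePolicy
    message: >-
      Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed
      with exit code 42 matching FailJob rule at index 0
```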

````diff
@@ -73,7 +81,7 @@ kubectl delete jobs/job-pod-failure-policy-failjob
 
 The cluster automatically cleans up the Pods.
 
-## Using Pod failure policy to ignore Pod disruptions
+### Using Pod failure policy to ignore Pod disruptions {#pod-failure-policy-ignore}
 
 With the following example, you can learn how to use Pod failure policy to
 ignore Pod disruptions from incrementing the Pod retry counter towards the
````
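The example this hunk retitles hinges on a rule that ignores disruption-induced failures. A minimal sketch of the relevant part of the Job spec, assuming the shape used by the referenced `job-pod-failure-policy-ignore.yaml`, is:

```yaml
spec:
  backoffLimit: 0                # the example relies on disruptions not counting towards this limit
  podFailurePolicy:
    rules:
    - action: Ignore             # do not count the failure against .spec.backoffLimit
      onPodConditions:
      - type: DisruptionTarget   # matches Pods that fail because of a disruption, such as a node drain
```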
````diff
@@ -85,35 +93,35 @@ execution. In order to trigger a Pod disruption it is important to drain the
 node while the Pod is running on it (within 90s since the Pod is scheduled).
 {{< /caution >}}
 
-1. Create a Job based on the config:
+1. Examine the following manifest:
 
    {{% code_sample file="/controllers/job-pod-failure-policy-ignore.yaml" %}}
 
-   by running:
+1. Apply the manifest:
 
    ```sh
-   kubectl create -f job-pod-failure-policy-ignore.yaml
+   kubectl create -f https://k8s.io/examples/controllers/job-pod-failure-policy-ignore.yaml
   ```
 
-2. Run this command to check the `nodeName` the Pod is scheduled to:
+1. Run this command to check the `nodeName` the Pod is scheduled to:
 
    ```sh
   nodeName=$(kubectl get pods -l job-name=job-pod-failure-policy-ignore -o jsonpath='{.items[0].spec.nodeName}')
   ```
 
-3. Drain the node to evict the Pod before it completes (within 90s):
-
+1. Drain the node to evict the Pod before it completes (within 90s):
+
   ```sh
   kubectl drain nodes/$nodeName --ignore-daemonsets --grace-period=0
   ```
 
-4. Inspect the `.status.failed` to check the counter for the Job is not incremented:
+1. Inspect the `.status.failed` to check the counter for the Job is not incremented:
 
   ```sh
   kubectl get jobs -l job-name=job-pod-failure-policy-ignore -o yaml
   ```
 
-5. Uncordon the node:
+1. Uncordon the node:
 
   ```sh
   kubectl uncordon nodes/$nodeName
````
````diff
@@ -124,7 +132,7 @@ The Job resumes and succeeds.
 For comparison, if the Pod failure policy was disabled the Pod disruption would
 result in terminating the entire Job (as the `.spec.backoffLimit` is set to 0).
 
-### Cleaning up
+#### Cleaning up
 
 Delete the Job you created:
 
````
````diff
@@ -134,7 +142,7 @@ kubectl delete jobs/job-pod-failure-policy-ignore
 
 The cluster automatically cleans up the Pods.
 
-## Using Pod failure policy to avoid unnecessary Pod retries based on custom Pod Conditions
+### Using Pod failure policy to avoid unnecessary Pod retries based on custom Pod Conditions {#pod-failure-policy-config-issue}
 
 With the following example, you can learn how to use Pod failure policy to
 avoid unnecessary Pod restarts based on custom Pod Conditions.
````
````diff
@@ -145,19 +153,19 @@ deleted pods, in the `Pending` phase, to a terminal phase
 (see: [Pod Phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase)).
 {{< /note >}}
 
-1. First, create a Job based on the config:
+1. Examine the following manifest:
 
    {{% code_sample file="/controllers/job-pod-failure-policy-config-issue.yaml" %}}
 
-   by running:
+1. Apply the manifest:
 
    ```sh
-   kubectl create -f job-pod-failure-policy-config-issue.yaml
+   kubectl create -f https://k8s.io/examples/controllers/job-pod-failure-policy-config-issue.yaml
   ```
 
   Note that, the image is misconfigured, as it does not exist.
 
-2. Inspect the status of the job's Pods by running:
+1. Inspect the status of the job's Pods by running:
 
   ```sh
   kubectl get pods -l job-name=job-pod-failure-policy-config-issue -o yaml
````
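For context on the misconfigured image mentioned in this hunk (not output captured by the commit), a Pod that cannot pull its image typically reports a status along these lines; the image reference and message here are illustrative:

```yaml
status:
  phase: Pending
  containerStatuses:
  - name: main
    ready: false
    state:
      waiting:
        reason: ImagePullBackOff   # or ErrImagePull while the kubelet is still retrying
        message: Back-off pulling image "example.invalid/missing:latest"   # illustrative image reference
```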
````diff
@@ -181,7 +189,7 @@ deleted pods, in the `Pending` phase, to a terminal phase
    image could get pulled. However, in this case, the image does not exist so
    we indicate this fact by a custom condition.
 
-3. Add the custom condition. First prepare the patch by running:
+1. Add the custom condition. First prepare the patch by running:
 
   ```sh
   cat <<EOF > patch.yaml
````
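The hunk ends just as `patch.yaml` is being written, so the patch body itself lies outside the shown context. A status patch of the kind this step prepares might look like the sketch below; the condition `type`, `reason`, and `message` are assumptions for illustration, not the exact values from the task page:

```yaml
# Hypothetical patch.yaml adding a custom Pod condition (values illustrative).
status:
  conditions:
  - type: ConfigIssue
    status: "True"
    reason: MissingImage
    message: The specified container image could not be found
```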
````diff
@@ -210,13 +218,13 @@ deleted pods, in the `Pending` phase, to a terminal phase
    pod/job-pod-failure-policy-config-issue-k6pvp patched
    ```
 
-4. Delete the pod to transition it to `Failed` phase, by running the command:
+1. Delete the pod to transition it to `Failed` phase, by running the command:
 
   ```sh
   kubectl delete pods/$podName
   ```
 
-5. Inspect the status of the Job by running:
+1. Inspect the status of the Job by running:
 
   ```sh
   kubectl get jobs -l job-name=job-pod-failure-policy-config-issue -o yaml
````
````diff
@@ -232,7 +240,7 @@ In a production environment, the steps 3 and 4 should be automated by a
 user-provided controller.
 {{< /note >}}
 
-### Cleaning up
+#### Cleaning up
 
 Delete the Job you created:
 
````
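For reference (again outside the diff shown here), the config-issue example acts on the custom condition with a rule shaped roughly like this; the condition type mirrors the assumption in the sketch above:

```yaml
spec:
  podFailurePolicy:
    rules:
    - action: FailJob            # terminate the whole Job instead of retrying the Pod
      onPodConditions:
      - type: ConfigIssue        # assumed custom condition type, added by a user-provided controller
```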

````diff
@@ -242,6 +250,66 @@ kubectl delete jobs/job-pod-failure-policy-config-issue
 
 The cluster automatically cleans up the Pods.
 
+### Using Pod Failure Policy to avoid unnecessary Pod retries per index {#backoff-limit-per-index-failindex}
+
+To avoid unnecessary Pod restarts per index, you can use the _Pod failure policy_ and
+_backoff limit per index_ features. This section of the page shows how to use these features
+together.
+
+1. Examine the following manifest:
+
+   {{% code_sample file="/controllers/job-backoff-limit-per-index-failindex.yaml" %}}
+
+1. Apply the manifest:
+
+   ```sh
+   kubectl create -f https://k8s.io/examples/controllers/job-backoff-limit-per-index-failindex.yaml
+   ```
+
+1. After around 15 seconds, inspect the status of the Pods for the Job. You can do that by running:
+
+   ```shell
+   kubectl get pods -l job-name=job-backoff-limit-per-index-failindex -o yaml
+   ```
+
+   You will see output similar to this:
+
+   ```none
+   NAME                                            READY   STATUS      RESTARTS   AGE
+   job-backoff-limit-per-index-failindex-0-4g4cm   0/1     Error       0          4s
+   job-backoff-limit-per-index-failindex-0-fkdzq   0/1     Error       0          15s
+   job-backoff-limit-per-index-failindex-1-2bgdj   0/1     Error       0          15s
+   job-backoff-limit-per-index-failindex-2-vs6lt   0/1     Completed   0          11s
+   job-backoff-limit-per-index-failindex-3-s7s47   0/1     Completed   0          6s
+   ```
+
+   Note that the output shows the following:
+
+   * Two Pods have index 0, because of the backoff limit allowed for one retry
+     of the index.
+   * Only one Pod has index 1, because the exit code of the failed Pod matched
+     the Pod failure policy with the `FailIndex` action.
+
+1. Inspect the status of the Job by running:
+
+   ```sh
+   kubectl get jobs -l job-name=job-backoff-limit-per-index-failindex -o yaml
+   ```
+
+   In the Job status, see that the `failedIndexes` field shows "0,1", because
+   both indexes failed. Because the index 1 was not retried the number of failed
+   Pods, indicated by the status field "failed" equals 3.
+
+#### Cleaning up
+
+Delete the Job you created:
+
+```sh
+kubectl delete jobs/job-backoff-limit-per-index-failindex
+```
+
+The cluster automatically cleans up the Pods.
+
 ## Alternatives
 
 You could rely solely on the
````
content/en/examples/controllers/job-backoff-limit-per-index-failindex.yaml

Lines changed: 40 additions & 0 deletions
````diff
@@ -0,0 +1,40 @@
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: job-backoff-limit-per-index-failindex
+spec:
+  completions: 4
+  parallelism: 2
+  completionMode: Indexed
+  backoffLimitPerIndex: 1
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: main
+        image: docker.io/library/python:3
+        command:
+        # The script:
+        # - fails the Pod with index 0 with exit code 1, which results in one retry;
+        # - fails the Pod with index 1 with exit code 42 which results
+        #   in failing the index without retry.
+        # - succeeds Pods with any other index.
+        - python3
+        - -c
+        - |
+          import os, sys
+          index = int(os.environ.get("JOB_COMPLETION_INDEX"))
+          if index == 0:
+            sys.exit(1)
+          elif index == 1:
+            sys.exit(42)
+          else:
+            sys.exit(0)
+  backoffLimit: 6
+  podFailurePolicy:
+    rules:
+    - action: FailIndex
+      onExitCodes:
+        containerName: main
+        operator: In
+        values: [42]
````
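Based on the behaviour the new task section describes, running this manifest should leave the Job with a status roughly like the sketch below; this is a reading aid, not output recorded in the commit:

```yaml
status:
  failedIndexes: "0,1"       # index 0 exhausted its per-index backoff, index 1 matched the FailIndex rule
  completedIndexes: "2-3"    # assumed: indexes 2 and 3 complete successfully
  failed: 3                  # two failed Pods for index 0 plus one for index 1
  succeeded: 2
```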
