@@ -28,42 +28,50 @@ You should already be familiar with the basic use of [Job](/docs/concepts/worklo

{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}

- ## Using Pod failure policy to avoid unnecessary Pod retries
+ ## Usage scenarios
+
+ Consider the following usage scenarios for Jobs that define a Pod failure policy:
+ - [Avoiding unnecessary Pod retries](#pod-failure-policy-failjob)
+ - [Ignoring Pod disruptions](#pod-failure-policy-ignore)
+ - [Avoiding unnecessary Pod retries based on custom Pod Conditions](#pod-failure-policy-config-issue)
+ - [Avoiding unnecessary Pod retries per index](#backoff-limit-per-index-failindex)
+
+ ### Using Pod failure policy to avoid unnecessary Pod retries {#pod-failure-policy-failjob}

With the following example, you can learn how to use Pod failure policy to
avoid unnecessary Pod restarts when a Pod failure indicates a non-retriable
software bug.

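+ Before looking at the full sample, here is a minimal sketch of the kind of Job this
+ scenario relies on. The container name `main`, exit code `42`, and `backoffLimit: 6` are
+ inferred from the failure message and retry count mentioned later in this section; the
+ command is illustrative and other details may differ from the actual sample file:
+
+ ```yaml
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+   name: job-pod-failure-policy-failjob
+ spec:
+   backoffLimit: 6                  # without the policy, up to 6 Pod retries
+   podFailurePolicy:
+     rules:
+     - action: FailJob              # terminate the whole Job on a non-retriable failure
+       onExitCodes:
+         containerName: main
+         operator: In
+         values: [42]               # the exit code that signals the software bug
+   template:
+     spec:
+       restartPolicy: Never         # required when spec.podFailurePolicy is set
+       containers:
+       - name: main
+         image: docker.io/library/bash:5
+         command: ["bash", "-c", "sleep 30 && exit 42"]   # illustrative: always fails with 42
+ ```
+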
- First, create a Job based on the config:
+ 1. Examine the following manifest:

- {{% code_sample file="/controllers/job-pod-failure-policy-failjob.yaml" %}}
+    {{% code_sample file="/controllers/job-pod-failure-policy-failjob.yaml" %}}

- by running:
+ 1. Apply the manifest:

- ```sh
- kubectl create -f job-pod-failure-policy-failjob.yaml
- ```
+    ```sh
+    kubectl create -f https://k8s.io/examples/controllers/job-pod-failure-policy-failjob.yaml
+    ```

- After around 30s the entire Job should be terminated. Inspect the status of the Job by running:
+ 1. After around 30 seconds the entire Job should be terminated. Inspect the status of the Job by running:

- ```sh
- kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
- ```
+    ```sh
+    kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
+    ```

- In the Job status, the following conditions display:
- - `FailureTarget` condition: has a `reason` field set to `PodFailurePolicy` and
-   a `message` field with more information about the termination, like
-   `Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
-   The Job controller adds this condition as soon as the Job is considered a failure.
-   For details, see [Termination of Job Pods](/docs/concepts/workloads/controllers/job/#termination-of-job-pods).
- - `Failed` condition: same `reason` and `message` as the `FailureTarget`
-   condition. The Job controller adds this condition after all of the Job's Pods
-   are terminated.
+    In the Job status, the following conditions display (an abridged example appears after this list):
+    - `FailureTarget` condition: has a `reason` field set to `PodFailurePolicy` and
+      a `message` field with more information about the termination, like
+      `Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
+      The Job controller adds this condition as soon as the Job is considered a failure.
+      For details, see [Termination of Job Pods](/docs/concepts/workloads/controllers/job/#termination-of-job-pods).
+    - `Failed` condition: same `reason` and `message` as the `FailureTarget`
+      condition. The Job controller adds this condition after all of the Job's Pods
+      are terminated.

- For comparison, if the Pod failure policy was disabled it would take 6 retries
- of the Pod, taking at least 2 minutes.
+    For comparison, if the Pod failure policy was disabled it would take 6 retries
+    of the Pod, taking at least 2 minutes.

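+ Abridged, the `FailureTarget` and `Failed` conditions described above might look like
+ the following in the Job status. This is an illustrative sketch rather than verbatim
+ output; the Pod name, message wording, and any timestamp fields will differ in your cluster:
+
+ ```yaml
+ status:
+   conditions:
+   - type: FailureTarget      # added as soon as the Job is considered a failure
+     status: "True"
+     reason: PodFailurePolicy
+     message: Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0
+   - type: Failed             # added after all of the Job's Pods are terminated
+     status: "True"
+     reason: PodFailurePolicy
+     message: Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0
+ ```
+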
- ### Clean up
+ #### Clean up

Delete the Job you created:

@@ -73,7 +81,7 @@ kubectl delete jobs/job-pod-failure-policy-failjob

The cluster automatically cleans up the Pods.

- ## Using Pod failure policy to ignore Pod disruptions
+ ### Using Pod failure policy to ignore Pod disruptions {#pod-failure-policy-ignore}

With the following example, you can learn how to use Pod failure policy to
ignore Pod disruptions from incrementing the Pod retry counter towards the
@@ -85,35 +93,35 @@ execution. In order to trigger a Pod disruption it is important to drain the
node while the Pod is running on it (within 90s since the Pod is scheduled).
{{< /caution >}}

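+ Before the full sample, here is a minimal sketch of the kind of Job this scenario uses:
+ an `Ignore` rule that matches the `DisruptionTarget` Pod condition, combined with
+ `.spec.backoffLimit: 0` (as discussed later in this section). The command and other
+ details are illustrative and may differ from the actual sample file:
+
+ ```yaml
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+   name: job-pod-failure-policy-ignore
+ spec:
+   backoffLimit: 0                  # a single ordinary Pod failure would fail the Job
+   podFailurePolicy:
+     rules:
+     - action: Ignore               # disruptions do not count towards backoffLimit
+       onPodConditions:
+       - type: DisruptionTarget     # set on Pods terminated by a disruption, such as a drain
+   template:
+     spec:
+       restartPolicy: Never         # required when spec.podFailurePolicy is set
+       containers:
+       - name: main
+         image: docker.io/library/bash:5
+         command: ["bash", "-c", "sleep 90"]   # illustrative: runs for about 90 seconds
+ ```
+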
- 1. Create a Job based on the config:
+ 1. Examine the following manifest:

   {{% code_sample file="/controllers/job-pod-failure-policy-ignore.yaml" %}}

- by running:
+ 1. Apply the manifest:

   ```sh
-    kubectl create -f job-pod-failure-policy-ignore.yaml
+    kubectl create -f https://k8s.io/examples/controllers/job-pod-failure-policy-ignore.yaml
   ```

- 2. Run this command to check the `nodeName` the Pod is scheduled to:
+ 1. Run this command to check the `nodeName` the Pod is scheduled to:

   ```sh
   nodeName=$(kubectl get pods -l job-name=job-pod-failure-policy-ignore -o jsonpath='{.items[0].spec.nodeName}')
   ```

- 3. Drain the node to evict the Pod before it completes (within 90s):
-
+ 1. Drain the node to evict the Pod before it completes (within 90s):
+
   ```sh
   kubectl drain nodes/$nodeName --ignore-daemonsets --grace-period=0
   ```

- 4. Inspect the `.status.failed` to check the counter for the Job is not incremented:
+ 1. Inspect the `.status.failed` to check that the counter for the Job is not incremented:

   ```sh
   kubectl get jobs -l job-name=job-pod-failure-policy-ignore -o yaml
   ```

- 5. Uncordon the node:
+ 1. Uncordon the node:

   ```sh
   kubectl uncordon nodes/$nodeName
@@ -124,7 +132,7 @@ The Job resumes and succeeds.
For comparison, if the Pod failure policy was disabled the Pod disruption would
result in terminating the entire Job (as the `.spec.backoffLimit` is set to 0).

- ### Cleaning up
+ #### Cleaning up

Delete the Job you created:

@@ -134,7 +142,7 @@ kubectl delete jobs/job-pod-failure-policy-ignore

The cluster automatically cleans up the Pods.

- ## Using Pod failure policy to avoid unnecessary Pod retries based on custom Pod Conditions
+ ### Using Pod failure policy to avoid unnecessary Pod retries based on custom Pod Conditions {#pod-failure-policy-config-issue}

With the following example, you can learn how to use Pod failure policy to
avoid unnecessary Pod restarts based on custom Pod Conditions.
@@ -145,19 +153,19 @@ deleted pods, in the `Pending` phase, to a terminal phase
(see: [Pod Phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase)).
{{< /note >}}

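+ As a sketch of the approach, the Job for this scenario combines an intentionally broken
+ image with a `FailJob` rule that matches a custom Pod condition. The condition type name
+ `ConfigIssue` and the image reference below are assumptions for illustration; check the
+ sample manifest in the next step for the actual values:
+
+ ```yaml
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+   name: job-pod-failure-policy-config-issue
+ spec:
+   podFailurePolicy:
+     rules:
+     - action: FailJob              # terminate the Job when the custom condition is present
+       onPodConditions:
+       - type: ConfigIssue          # assumed custom condition type added by an external controller
+   template:
+     spec:
+       restartPolicy: Never         # required when spec.podFailurePolicy is set
+       containers:
+       - name: main
+         image: "non-existing-repo/non-existing-image:example"   # deliberately unresolvable image
+ ```
+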
- 1. First, create a Job based on the config:
+ 1. Examine the following manifest:

   {{% code_sample file="/controllers/job-pod-failure-policy-config-issue.yaml" %}}

- by running:
+ 1. Apply the manifest:

   ```sh
-    kubectl create -f job-pod-failure-policy-config-issue.yaml
+    kubectl create -f https://k8s.io/examples/controllers/job-pod-failure-policy-config-issue.yaml
   ```

   Note that the image is misconfigured, as it does not exist.

- 2. Inspect the status of the job's Pods by running:
+ 1. Inspect the status of the job's Pods by running:

   ```sh
   kubectl get pods -l job-name=job-pod-failure-policy-config-issue -o yaml
@@ -181,7 +189,7 @@ deleted pods, in the `Pending` phase, to a terminal phase
   image could get pulled. However, in this case, the image does not exist, so
   we indicate this fact by a custom condition.

- 3. Add the custom condition. First prepare the patch by running:
+ 1. Add the custom condition. First prepare the patch by running:

   ```sh
   cat <<EOF > patch.yaml
@@ -210,13 +218,13 @@ deleted pods, in the `Pending` phase, to a terminal phase
   pod/job-pod-failure-policy-config-issue-k6pvp patched
   ```

- 4. Delete the pod to transition it to `Failed` phase, by running the command:
+ 1. Delete the pod to transition it to the `Failed` phase, by running the command:

   ```sh
   kubectl delete pods/$podName
   ```

- 5. Inspect the status of the Job by running:
+ 1. Inspect the status of the Job by running:

   ```sh
   kubectl get jobs -l job-name=job-pod-failure-policy-config-issue -o yaml
@@ -232,7 +240,7 @@ In a production environment, the steps 3 and 4 should be automated by a
user-provided controller.
{{< /note >}}

- ### Cleaning up
+ #### Cleaning up

Delete the Job you created:

@@ -242,6 +250,66 @@ kubectl delete jobs/job-pod-failure-policy-config-issue

The cluster automatically cleans up the Pods.

+ ### Using Pod failure policy to avoid unnecessary Pod retries per index {#backoff-limit-per-index-failindex}
+
+ To avoid unnecessary Pod restarts per index, you can use the _Pod failure policy_ and
+ _backoff limit per index_ features. This section of the page shows how to use these features
+ together.
+
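+ As a sketch, the Job for this scenario is an Indexed Job with a per-index backoff limit
+ and a `FailIndex` rule. The exit code 42 and the per-index command below are assumptions
+ chosen to reproduce the behaviour described in the following steps; the actual sample
+ manifest may differ:
+
+ ```yaml
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+   name: job-backoff-limit-per-index-failindex
+ spec:
+   completions: 4
+   completionMode: Indexed          # required for backoffLimitPerIndex
+   backoffLimitPerIndex: 1          # each index may be retried once
+   podFailurePolicy:
+     rules:
+     - action: FailIndex            # fail the index immediately, without further retries
+       onExitCodes:
+         operator: In
+         values: [42]               # assumed exit code for the non-retriable failure
+   template:
+     spec:
+       restartPolicy: Never         # required when spec.podFailurePolicy is set
+       containers:
+       - name: main
+         image: docker.io/library/bash:5
+         command:                   # illustrative: index 0 fails with a retriable code,
+         - bash                     # index 1 fails with 42, indexes 2 and 3 succeed
+         - -c
+         - 'if [ "$JOB_COMPLETION_INDEX" = "0" ]; then exit 1; elif [ "$JOB_COMPLETION_INDEX" = "1" ]; then exit 42; else exit 0; fi'
+ ```
+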
+ 1. Examine the following manifest:
+
+    {{% code_sample file="/controllers/job-backoff-limit-per-index-failindex.yaml" %}}
+
+ 1. Apply the manifest:
+
+    ```sh
+    kubectl create -f https://k8s.io/examples/controllers/job-backoff-limit-per-index-failindex.yaml
+    ```
+
+ 1. After around 15 seconds, inspect the status of the Pods for the Job. You can do that by running:
+
+    ```shell
+    kubectl get pods -l job-name=job-backoff-limit-per-index-failindex
+    ```
+
+    You will see output similar to this:
+
+    ```none
+    NAME                                            READY   STATUS      RESTARTS   AGE
+    job-backoff-limit-per-index-failindex-0-4g4cm   0/1     Error       0          4s
+    job-backoff-limit-per-index-failindex-0-fkdzq   0/1     Error       0          15s
+    job-backoff-limit-per-index-failindex-1-2bgdj   0/1     Error       0          15s
+    job-backoff-limit-per-index-failindex-2-vs6lt   0/1     Completed   0          11s
+    job-backoff-limit-per-index-failindex-3-s7s47   0/1     Completed   0          6s
+    ```
+
+    Note that the output shows the following:
+
+    * Two Pods have index 0, because the backoff limit allowed one retry
+      of the index.
+    * Only one Pod has index 1, because the exit code of the failed Pod matched
+      the Pod failure policy with the `FailIndex` action.
+
+ 1. Inspect the status of the Job by running:
+
+    ```sh
+    kubectl get jobs -l job-name=job-backoff-limit-per-index-failindex -o yaml
+    ```
+
+    In the Job status, see that the `failedIndexes` field shows "0,1", because
+    both indexes failed. Because index 1 was not retried, the number of failed
+    Pods, indicated by the `failed` status field, equals 3. An abridged example
+    of this status appears after this list.
+
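+ Abridged, that part of the status might look like the following. The exact formatting,
+ and the `completedIndexes` and `succeeded` values inferred from the Pod listing above,
+ may differ in your cluster:
+
+ ```yaml
+ status:
+   failedIndexes: "0,1"       # index 0 exhausted its per-index backoff, index 1 matched the FailIndex rule
+   completedIndexes: "2,3"
+   failed: 3                  # two failed Pods for index 0 plus one for index 1
+   succeeded: 2
+ ```
+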
+ #### Cleaning up
+
+ Delete the Job you created:
+
+ ```sh
+ kubectl delete jobs/job-backoff-limit-per-index-failindex
+ ```
+
+ The cluster automatically cleans up the Pods.
+

## Alternatives

You could rely solely on the