---
layout: blog
title: "Kubernetes 1.28: Improved failure handling for Jobs"
date: 2023-08-21
slug: kubernetes-1-28-jobapi-update
---

**Authors:** Kevin Hannon (G-Research), Michał Woźniak (Google)

This blog discusses two new features in Kubernetes 1.28 that improve Jobs for batch
users: [Pod replacement policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy)
and [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index).

These features continue the effort started by the
[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
to improve the handling of Pod failures in a Job.

## Pod replacement policy {#pod-replacement-policy}

By default, when a Pod enters a terminating state (for example, due to preemption or
eviction), Kubernetes immediately creates a replacement Pod, so both Pods run
at the same time. In API terms, a Pod is considered terminating when it has a
`deletionTimestamp` and its phase is `Pending` or `Running`.
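
You can observe this state directly; a small sketch, assuming a Pod named
`my-pod` (a hypothetical name):

```shell
# Prints the deletion timestamp and phase; a terminating Pod shows both a
# timestamp and a phase of Pending or Running.
kubectl get pod my-pod -o jsonpath='{.metadata.deletionTimestamp}{"\t"}{.status.phase}{"\n"}'
```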

The scenario in which two Pods run at the same time for a given index is
problematic for some popular machine learning frameworks, such as
TensorFlow and [JAX](https://jax.readthedocs.io/en/latest/), which require at most
one Pod running at a time for a given index.
TensorFlow raises the following error if two Pods run for the same index:

```
/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4
```

See more details in the [issue](https://github.com/kubernetes/kubernetes/issues/115844).

Creating the replacement Pod before the previous one fully terminates can also
cause problems in clusters with scarce resources or tight budgets, such as:
* cluster resources can be difficult to obtain for Pods waiting to be scheduled,
  as Kubernetes might take a long time to find available nodes until the existing
  Pods are fully terminated.
* if the cluster autoscaler is enabled, the replacement Pods might produce undesired
  scale-ups.

### How can you use it? {#pod-replacement-policy-how-to-use}

This is an alpha feature, which you can enable by turning on the `JobPodReplacementPolicy`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) in
your cluster.
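
If you manage the control plane yourself, this typically means setting the gate
on the components involved; a minimal sketch (exactly where these flags go
depends on how your cluster is deployed):

```shell
kube-apiserver --feature-gates=JobPodReplacementPolicy=true ...
kube-controller-manager --feature-gates=JobPodReplacementPolicy=true ...
```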

Once the feature is enabled in your cluster, you can use it by creating a new Job that specifies a
`podReplacementPolicy` field, as shown here:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: new
  ...
spec:
  podReplacementPolicy: Failed
  ...
```

In that Job, the Pods would only be replaced once they reached the `Failed` phase,
and not while they are terminating.

Additionally, you can inspect the `.status.terminating` field of a Job. The value
of the field is the number of Pods owned by the Job that are currently terminating.

```shell
kubectl get jobs/myjob -o=jsonpath='{.status.terminating}'
```

```
3 # three Pods are terminating and have not yet reached the Failed phase
```

This can be particularly useful for external queueing controllers, such as
[Kueue](https://github.com/kubernetes-sigs/kueue), that track quota
from running Pods of a Job until the resources are reclaimed from
the currently terminating Job.

Note that `podReplacementPolicy: Failed` is the default when using a custom
[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy).
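
For illustration, here is a minimal sketch of a Job that sets a Pod failure
policy and therefore implicitly runs with `podReplacementPolicy: Failed`; the
Job name, container name, and exit code are hypothetical:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-pod-failure-policy
spec:
  # podReplacementPolicy defaults to Failed here, because podFailurePolicy is set
  podFailurePolicy:
    rules:
    - action: FailJob       # fail the whole Job on a non-retriable exit code
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never  # required when using podFailurePolicy
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "exit 0"]
```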

## Backoff limit per index {#backoff-limit-per-index}

By default, Pod failures for [Indexed Jobs](/docs/concepts/workloads/controllers/job/#completion-mode)
are counted towards the global limit of retries, represented by `.spec.backoffLimit`.
This means that if there is a consistently failing index, it is restarted
repeatedly until it exhausts the limit. Once the limit is reached, the entire
Job is marked as failed and some indexes may never even start.

This is problematic for use cases where you want to handle Pod failures for
every index independently: for example, if you use Indexed Jobs to run
integration tests where each index corresponds to a testing suite. In that case,
you may want to account for possible flaky tests by allowing 1 or 2 retries per
suite. There might be some buggy suites, making the corresponding
indexes fail consistently. In that case you may prefer to limit retries for
the buggy suites, yet allow other suites to complete.

The feature allows you to:
* complete execution of all indexes, despite some indexes failing.
* better utilize computational resources by avoiding unnecessary retries of consistently failing indexes.

### How can you use it? {#backoff-limit-per-index-how-to-use}

This is an alpha feature, which you can enable by turning on the
`JobBackoffLimitPerIndex`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
in your cluster.

Once the feature is enabled in your cluster, you can create an Indexed Job with the
`.spec.backoffLimitPerIndex` field specified.

#### Example

The following example demonstrates how to use this feature to make sure the
Job executes all indexes (provided there is no other reason for early Job
termination, such as reaching the `activeDeadlineSeconds` timeout, or being
manually deleted by the user), and that the number of failures is controlled per index.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-backoff-limit-per-index-execute-all
spec:
  completions: 8
  parallelism: 2
  completionMode: Indexed
  backoffLimitPerIndex: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: example # this example container returns an error, and fails,
                      # when it is run as the second or third index in any Job
                      # (even after a retry)
        image: python
        command:
        - python3
        - -c
        - |
          import os, sys, time
          id = int(os.environ.get("JOB_COMPLETION_INDEX"))
          if id == 1 or id == 2:
            sys.exit(1)
          time.sleep(1)
```

Now, inspect the Pods after the Job has finished:

```sh
kubectl get pods -l job-name=job-backoff-limit-per-index-execute-all
```

The output is similar to this:

```
NAME                                              READY   STATUS      RESTARTS   AGE
job-backoff-limit-per-index-execute-all-0-b26vc   0/1     Completed   0          49s
job-backoff-limit-per-index-execute-all-1-6j5gd   0/1     Error       0          49s
job-backoff-limit-per-index-execute-all-1-6wd82   0/1     Error       0          37s
job-backoff-limit-per-index-execute-all-2-c66hg   0/1     Error       0          32s
job-backoff-limit-per-index-execute-all-2-nf982   0/1     Error       0          43s
job-backoff-limit-per-index-execute-all-3-cxmhf   0/1     Completed   0          33s
job-backoff-limit-per-index-execute-all-4-9q6kq   0/1     Completed   0          28s
job-backoff-limit-per-index-execute-all-5-z9hqf   0/1     Completed   0          28s
job-backoff-limit-per-index-execute-all-6-tbkr8   0/1     Completed   0          23s
job-backoff-limit-per-index-execute-all-7-hxjsq   0/1     Completed   0          22s
```

Additionally, you can take a look at the status for that Job:

```sh
kubectl get jobs job-backoff-limit-per-index-execute-all -o yaml
```

The output ends with a `status` similar to:

```yaml
status:
  completedIndexes: 0,3-7
  failedIndexes: 1,2
  succeeded: 6
  failed: 4
  conditions:
  - message: Job has failed indexes
    reason: FailedIndexes
    status: "True"
    type: Failed
```

Here, indexes `1` and `2` were each retried once. After the second failure
in each of them, the specified `.spec.backoffLimitPerIndex` was exceeded, so
the retries were stopped. For comparison, if the per-index backoff were disabled,
the buggy indexes would have been retried until the global `backoffLimit` was exceeded,
and then the entire Job would have been marked failed before some of the higher
indexes had started.
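
If you only want to know which indexes failed, a jsonpath query can extract
them straight from the status; a small sketch reusing the example Job above:

```sh
# Prints "1,2" for the Job above
kubectl get job job-backoff-limit-per-index-execute-all -o jsonpath='{.status.failedIndexes}{"\n"}'
```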

## How can you learn more?

- Read the user-facing documentation for [Pod replacement policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy),
  [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index), and
  [Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
- Read the KEPs for [Pod Replacement Policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated),
  [Backoff limit per index](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs), and
  [Pod failure policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures).

## Getting Involved

These features were sponsored by [SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps). Batch use cases are actively
being improved for Kubernetes users in the
[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch).
Working groups are relatively short-lived initiatives focused on specific goals.
The goal of WG Batch is to improve the experience for batch workload users, offer support for
batch processing use cases, and enhance the
Job API for common use cases. If that interests you, please join the working
group either by subscribing to our
[mailing list](https://groups.google.com/a/kubernetes.io/g/wg-batch) or on
[Slack](https://kubernetes.slack.com/messages/wg-batch).

## Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this
done, from testing and filing bugs to reviewing code.

We would not have been able to achieve either of these features without Aldo
Culquicondor (Google) providing excellent domain knowledge and expertise
throughout the Kubernetes ecosystem.
