Commit 35565df

mimowo and Kevin Hannon authored and committed
backoff limit per index
Co-authored-by: Kevin Hannon <[email protected]>
1 parent 8aaded8 commit 35565df

1 file changed: 164 additions & 22 deletions

@@ -1,34 +1,43 @@
 ---
 layout: blog
-title: "Kubernetes 1.28: Updates to the Job API"
-date: 2023-07-27
+title: "Kubernetes 1.28: New Job features"
+date: 2023-08-15
 slug: kubernetes-1-28-jobapi-update
 ---
 
 **Authors:** Kevin Hannon (G-Research), Michał Woźniak (Google)
 
-This blog discusses two features to improve Jobs for batch users: PodRecreationPolicy and JobBackoffLimitPerIndex.
+This blog discusses two new features in Kubernetes 1.28 to improve Jobs for batch
+users: [PodReplacementPolicy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated)
+and [BackoffLimitPerIndex](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs).
 
-These are two features requested from users of the Job API to enhance a user's experience.
-
-## Pod Recreation Policy
+## Pod Replacement Policy
 
 ### What problem does this solve?
 
-Many common machine learning frameworks, such as Tensorflow and JAX, require unique pods per Index. Currently, if a pod enters a terminating state (due to preemption, eviction or other external factors), a replacement pod is created and immediately fail to start.
-
-Having a replacement Pod before the previous one fully terminates can also cause problems in clusters with scarce resources or with tight budgets. These resources can be difficult to obtain so pods can take a long time to find resources and they may only be able to find nodes once the existing pods have been terminated. If cluster autoscaler is enabled, the replacement Pods might produce undesired scale ups.
+By default, when a pod enters a terminating state (e.g. due to preemption or
+eviction), a replacement pod is created immediately, and both pods are running
+at the same time.
 
-On the other hand, if a replacement Pod is not immediately created, the Job status would show that the number of active pods doesn't match the desired parallelism. To provide better visibility, the job status can have a new field to track the number of Pods currently terminating.
+This is problematic for some popular machine learning frameworks, such as
+TensorFlow and [JAX](https://jax.readthedocs.io/en/latest/), which require at most one pod running at the same time,
+for a given index (see more details in the [issue](https://github.com/kubernetes/kubernetes/issues/115844)).
 
-This new field can also be used by queueing controllers, such as Kueue, to track the number of terminating pods to calculate quotas.
+Creating the replacement Pod before the previous one fully terminates can also
+cause problems in clusters with scarce resources or with tight budgets. These
+resources can be difficult to obtain, so pods can take a long time to find
+resources, and they may only be able to find nodes once the existing pods are
+fully terminated. Further, if cluster autoscaler is enabled, the replacement
+Pods might produce undesired scale ups.
 
 ### How can I use it
 
-This is an alpha feature, which means you have to enable the `JobPodReplacementPolicy`
-[feature gate](/docs/reference/command-line-tools-reference/feature-gates/),
-with the command line argument `--feature-gates=JobPodReplacementPolicy=true`
-to the kube-apiserver.
+This is an alpha feature, which you can enable by turning on the `JobPodReplacementPolicy`
+[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) in
+your cluster.
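For a cluster you manage yourself, enabling the gate means passing the standard `--feature-gates` flag; the old wording of this paragraph mentioned the kube-apiserver flag, and the Job controller typically needs the gate as well. A rough sketch, with the exact wiring depending on how your control plane is deployed:

```shell
# Illustrative only: turning on the alpha gate on a self-managed control plane.
# Where you set these flags (static pod manifests, kubeadm config, etc.)
# depends on how your cluster is deployed.
kube-apiserver          --feature-gates=JobPodReplacementPolicy=true
kube-controller-manager --feature-gates=JobPodReplacementPolicy=true
```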
+
+Once the feature is enabled, you can use it by creating a new Job that specifies
+the `podReplacementPolicy` field, as shown here:
 
 ```yaml
 kind: Job
@@ -40,26 +49,159 @@ spec:
 ...
 ```
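For illustration, a complete manifest using this field could look like the sketch below; the Job name, container, and command are hypothetical, and `Failed` is one of the two accepted values mentioned later in this diff (the other being `TerminatingOrFailed`):

```yaml
# Illustrative sketch only: a Job that creates a replacement Pod only once the
# previous Pod has fully terminated (reached the Failed phase), rather than
# while it is still terminating.
apiVersion: batch/v1
kind: Job
metadata:
  name: pod-replacement-demo       # hypothetical name
spec:
  podReplacementPolicy: Failed     # the other accepted value is TerminatingOrFailed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox             # any image; busybox is just a placeholder
        command: ["sh", "-c", "sleep 30"]
```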
 
-`podReplacementPolicy` can take either `Failed` or `TerminatingOrFailed`. In cases where `PodFailurePolicy` is set, you can only use `Failed`.
+Additionally, you can inspect the `.status.terminating` field of a Job. The value
+of the field is the number of Pods owned by the Job that are currently terminating.
+
+```shell
+kubectl get jobs/myjob -o yaml
+```
 
-This feature enables two components in the Job controller: Adds a `terminating` field to the status and adds a new API field called `podReplacementPolicy`.
+```yaml
+apiVersion: batch/v1
+kind: Job
+status:
+  terminating: 3 # three Pods are terminating and have not yet reached the Failed phase
+```
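If you only want the counter itself, a JSONPath query against the same Job (using the `myjob` name from the command above) works too:

```shell
# Print only the number of currently terminating Pods tracked in the Job status.
kubectl get job myjob -o jsonpath='{.status.terminating}'
```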
 
-The Job controller uses `parallelism` field in the Job API to determine the number of pods that it is expects to be active (not finished). If there is a mismatch of active pods and the pod has not finished, we would normally assume that the pod has failed and the Job controller would recreate the pod. In cases where `Failed` is specified, the Job controller will wait for the pod to be fully terminated (`DeletionTimeStamp != nil`).
+This can be particularly useful for external queueing controllers, such as
+[Kueue](https://github.com/kubernetes-sigs/kueue), that would calculate the
+quota and suspend the start of a new Job until the resources are reclaimed from
+the currently terminating Job.
 
 ### How can I learn more?
 
 - Read the KEP: [PodReplacementPolicy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated)
 
-## JobBackoffLimitPerIndex
+## Job Backoff Limit per Index
+
+### What problem does this solve?
+
+By default, pod failures for [Indexed Jobs](/docs/concepts/workloads/controllers/job/#completion-mode)
+are counted towards the global limit of retries, represented by `.spec.backoffLimit`.
+This means that if there is a consistently failing index, it is restarted
+repeatedly until it exhausts the limit. Once the limit is exceeded, the entire
+Job is marked failed and some indexes may never even be started.
+
+This is problematic for use cases where you want to handle pod failures for
+every index independently. For example, if you use Indexed Jobs for running
+integration tests where each index corresponds to a testing suite. In that case,
+you may want to account for possible flaky tests by allowing 1 or 2 retries per
+suite. Additionally, there might be some buggy suites, making the corresponding
+indexes fail consistently. In that case, you may prefer to stop retries for
+those indexes, while allowing the other suites to complete.
+
+The feature allows you to:
+* complete execution of all indexes, despite some indexes failing,
+* better utilize the computational resources by avoiding unnecessary retries of consistently failing indexes.
+
+### How to use it?
+
+This is an alpha feature, which you can enable by turning on the
+`JobBackoffLimitPerIndex`
+[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+in your cluster.
+
+Once the feature is enabled, you can create an Indexed Job with the
+`.spec.backoffLimitPerIndex` field specified.
+
+#### Example
+
+The following example demonstrates how to use this feature to make sure the
+Job executes all indexes, and the number of failures is controlled per index.
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: job-backoff-limit-per-index-execute-all
+spec:
+  completions: 8
+  parallelism: 2
+  completionMode: Indexed
+  backoffLimitPerIndex: 1
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: example
+        image: python
+        command:
+        - python3
+        - -c
+        - |
+          import os, sys, time
+          id = int(os.environ.get("JOB_COMPLETION_INDEX"))
+          if id == 1 or id == 2:
+            sys.exit(1)
+          time.sleep(1)
+```
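To try the example, save the manifest and create the Job; the file name below is arbitrary, and the Job is expected to finish as failed because indexes 1 and 2 keep failing:

```shell
# Create the Job from the manifest above (any file name works).
kubectl apply -f job-backoff-limit-per-index-execute-all.yaml

# Optionally wait until the Job reports the Failed condition shown below.
kubectl wait job/job-backoff-limit-per-index-execute-all \
  --for=condition=Failed --timeout=5m
```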
+
+Now, inspect the pods after the job is finished:
+
+```sh
+kubectl get pods -l job-name=job-backoff-limit-per-index-execute-all
+```
+
+Returns output similar to this:
+```
+NAME                                               READY   STATUS      RESTARTS   AGE
+job-backoff-limit-per-index-execute-all-0-b26vc    0/1     Completed   0          49s
+job-backoff-limit-per-index-execute-all-1-6j5gd    0/1     Error       0          49s
+job-backoff-limit-per-index-execute-all-1-6wd82    0/1     Error       0          37s
+job-backoff-limit-per-index-execute-all-2-c66hg    0/1     Error       0          32s
+job-backoff-limit-per-index-execute-all-2-nf982    0/1     Error       0          43s
+job-backoff-limit-per-index-execute-all-3-cxmhf    0/1     Completed   0          33s
+job-backoff-limit-per-index-execute-all-4-9q6kq    0/1     Completed   0          28s
+job-backoff-limit-per-index-execute-all-5-z9hqf    0/1     Completed   0          28s
+job-backoff-limit-per-index-execute-all-6-tbkr8    0/1     Completed   0          23s
+job-backoff-limit-per-index-execute-all-7-hxjsq    0/1     Completed   0          22s
+```
+
+Additionally, let's take a look at the job status:
+
+```sh
+kubectl get jobs job-backoff-limit-per-index-execute-all -o yaml
+```
+
+Returns output similar to this:
+
+```yaml
+status:
+  completedIndexes: 0,3-7
+  failedIndexes: 1,2
+  succeeded: 6
+  failed: 4
+  conditions:
+  - message: Job has failed indexes
+    reason: FailedIndexes
+    status: "True"
+    type: Failed
+```
+
+Here, indexes `1` and `2` were each retried once. After the second failure
+of each, the specified `.spec.backoffLimitPerIndex` was exceeded, so the
+retries were stopped. For comparison, if the per-index backoff were disabled,
+then the buggy indexes would be retried until the global `backoffLimit` was
+exceeded, and then the entire Job would be marked failed before some of the
+higher indexes were started.
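If you script around such Jobs, the per-index outcome can be read straight from the status fields shown above; a small sketch:

```shell
# Which indexes completed and which exhausted their per-index backoff?
kubectl get job job-backoff-limit-per-index-execute-all \
  -o jsonpath='completed: {.status.completedIndexes}{"\n"}failed: {.status.failedIndexes}{"\n"}'
```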
 
 ### Getting Involved
 
-These features were sponsored under the domain of SIG Apps. Batch is actively being improved for Kubernetes users in the batch working group.
-Working groups are relatively short-lived initatives focused on specific goals. In the case of Batch, the goal is to improve/support batch users and enhance the Job API for common use cases. If that interests you, please join the working group either by subscriping to our [mailing list](https://groups.google.com/a/kubernetes.io/g/wg-batch) or on [Slack](https://kubernetes.slack.com/messages/wg-batch).
+These features were sponsored under the domain of SIG Apps. Batch is actively
+being improved for Kubernetes users in the
+[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch).
+Working groups are relatively short-lived initiatives focused on specific goals.
+In the case of Batch, the goal is to improve/support batch users and enhance the
+Job API for common use cases. If that interests you, please join the working
+group either by subscribing to our
+[mailing list](https://groups.google.com/a/kubernetes.io/g/wg-batch) or on
+[Slack](https://kubernetes.slack.com/messages/wg-batch).
 
 ### Acknowledgments
 
 As with any Kubernetes feature, multiple people contributed to getting this
 done, from testing and filing bugs to reviewing code.
 
-We would not have been able to achieve either of these features without Aldo Culquicondor (Google) providing excellent domain knowledge and expertise throughout the Kubernetes ecosystem.
+We would not have been able to achieve either of these features without Aldo
+Culquicondor (Google) providing excellent domain knowledge and expertise
+throughout the Kubernetes ecosystem.
