and [BackoffLimitPerIndex](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs).

## Pod Replacement Policy

### What problem does this solve?

By default, when a pod enters a terminating state (e.g. due to preemption or
eviction), a replacement pod is created immediately, and both pods are running
at the same time.

This is problematic for some popular machine learning frameworks, such as
TensorFlow and [JAX](https://jax.readthedocs.io/en/latest/), which require at
most one pod running at the same time for a given index (see more details in
the [issue](https://github.com/kubernetes/kubernetes/issues/115844)).

Creating the replacement Pod before the previous one fully terminates can also
cause problems in clusters with scarce resources or with tight budgets. These
resources can be difficult to obtain, so pods can take a long time to find
resources, and they may only be able to find nodes once the existing pods are
fully terminated. Further, if cluster autoscaler is enabled, the replacement
Pods might produce undesired scale ups.
### How can I use it?

This is an alpha feature, which you can enable by turning on the
`JobPodReplacementPolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
in your cluster.

Once the feature is enabled, you can use it by creating a new Job that
specifies the `podReplacementPolicy` field, as shown here:

```yaml
kind: Job
spec:
  podReplacementPolicy: Failed
  ...
```
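`podReplacementPolicy` accepts one of two values: `TerminatingOrFailed`
(replacement Pods are created as soon as the previous Pod starts terminating,
matching the previous default behavior) or `Failed` (replacement Pods are
created only once the previous Pod is fully terminated). Note that if the Job
also sets a pod failure policy (`.spec.podFailurePolicy`), only `Failed` is
allowed. As a fuller illustration, here is a minimal, self-contained sketch of
such a Job; the name, container, and image are illustrative choices rather than
values from the original example:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job             # illustrative name
spec:
  podReplacementPolicy: Failed  # recreate a pod only after the old one fully terminates
  completions: 4
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker            # illustrative workload
        image: busybox:1.36
        command: ["sleep", "60"]
```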
Additionally, you can inspect the `.status.terminating` field of a Job. The
value of the field is the number of Pods owned by the Job that are currently
terminating.

```shell
kubectl get jobs/myjob -o yaml
```

```yaml
apiVersion: batch/v1
kind: Job
status:
  terminating: 3 # three Pods are terminating and have not yet reached the Failed phase
```
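If you only need the terminating count itself, a JSONPath query (standard
`kubectl` output formatting) extracts the field directly; `myjob` is the same
example Job name used above:

```shell
kubectl get job/myjob -o jsonpath='{.status.terminating}'
```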
This can be particularly useful for external queueing controllers, such as
[Kueue](https://github.com/kubernetes-sigs/kueue), that would calculate the
quota and suspend the start of a new Job until the resources are reclaimed from
the currently terminating Job.
### How can I learn more?

- Read the KEP: [PodReplacementPolicy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated)
## Job Backoff Limit per Index

### What problem does this solve?

By default, pod failures for [Indexed Jobs](/docs/concepts/workloads/controllers/job/#completion-mode)
are counted towards the global limit of retries, represented by `.spec.backoffLimit`.
This means that if there is a consistently failing index, it is restarted
repeatedly until it exhausts the limit. Once the limit is exceeded, the entire
Job is marked failed and some indexes may never even be started.

This is problematic for use cases where you want to handle pod failures for
every index independently. For example, if you use Indexed Jobs for running
integration tests where each index corresponds to a testing suite, you may want
to account for possible flaky tests by allowing 1 or 2 retries per suite.
Additionally, there might be some buggy suites, making the corresponding
indexes fail consistently. In that case you may prefer to stop retrying those
indexes, while allowing the other suites to complete.

The feature allows you to:
* complete execution of all indexes, despite some indexes failing,
* better utilize the computational resources by avoiding unnecessary retries of consistently failing indexes.
### How to use it?

This is an alpha feature, which you can enable by turning on the
`JobBackoffLimitPerIndex` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
in your cluster.
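Once the feature is enabled, you create an Indexed Job with the
`.spec.backoffLimitPerIndex` field specified. The example manifest from the
original walkthrough is elided in this excerpt, so the following is a minimal
sketch reconstructed to be consistent with the Job name and status shown below;
the container, image, and the failure logic for indexes 1 and 2 are
illustrative assumptions (`JOB_COMPLETION_INDEX` is an environment variable
that the control plane sets automatically for Indexed Jobs):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-backoff-limit-per-index-fail-index
spec:
  completions: 8             # indexes 0-7, matching the status output below
  parallelism: 8
  completionMode: Indexed    # per-index backoff requires an Indexed Job
  backoffLimitPerIndex: 1    # each index may be retried at most once
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main           # illustrative; indexes 1 and 2 always fail
        image: busybox:1.36
        command:
        - sh
        - -c
        - |
          if [ "$JOB_COMPLETION_INDEX" = "1" ] || [ "$JOB_COMPLETION_INDEX" = "2" ]; then
            exit 1
          fi
```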
Additionally, let's take a look at the job status:

```sh
kubectl get jobs job-backoff-limit-per-index-fail-index -o yaml
```

Returns output similar to this:

```yaml
status:
  completedIndexes: 0,3-7
  failedIndexes: 1,2
  succeeded: 6
  failed: 4
  conditions:
  - message: Job has failed indexes
    reason: FailedIndexes
    status: "True"
    type: Failed
```
Here, indexes `1` and `2` were both retried once. After the second failure in
each of them, the specified `.spec.backoffLimitPerIndex` was exceeded, so the
retries were stopped (note that `failed: 4` counts two attempts for each of the
two failing indexes, while `succeeded: 6` matches the six completed indexes).
For comparison, if the per-index backoff was disabled, then the buggy indexes
would have been retried until the global `backoffLimit` was exceeded, and then
the entire Job would have been marked failed before some of the higher indexes
were started.
### Getting Involved

These features were sponsored under the domain of SIG Apps. Batch is actively
being improved for Kubernetes users in the
[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch).
Working groups are relatively short-lived initiatives focused on specific goals.
In the case of Batch, the goal is to improve/support batch users and enhance the
Job API for common use cases. If that interests you, please join the working
group either by subscribing to our
[mailing list](https://groups.google.com/a/kubernetes.io/g/wg-batch) or on
[Slack](https://kubernetes.slack.com/messages/wg-batch).

As with any Kubernetes feature, multiple people contributed to getting this
done, from testing and filing bugs to reviewing code.

We would not have been able to achieve either of these features without Aldo
Culquicondor (Google) providing excellent domain knowledge and expertise
throughout the Kubernetes ecosystem.