
Commit 4dc0ff7

committed
add pod recreation policy to the blog post
1 parent ed64c34 commit 4dc0ff7

1 file changed: 61 additions, 2 deletions

```diff
@@ -1,6 +1,65 @@
 ---
 layout: blog
 title: "Kubernetes 1.28: Updates to the Job API"
-date: 2023-08-20T10:00:00-08:00
-slug: kubernetes-1-28-podreplacementpolicy-backoffconfigs
+date: 2023-07-27
+slug: kubernetes-1-28-jobapi-update
 ---
```

**Authors:** Kevin Hannon (G-Research), Michał Woźniak (Google)

This blog discusses two features that improve Jobs for batch users: Pod Replacement Policy and JobBackoffLimitPerIndex.

Both features were requested by users of the Job API to improve the batch experience.

## Pod Replacement Policy

### What problem does this solve?

Many common machine learning frameworks, such as TensorFlow and JAX, require a unique pod per index. Currently, if a pod enters a terminating state (due to preemption, eviction, or other external factors), a replacement pod is created immediately and may fail to start until the original pod has fully terminated.

Having a replacement Pod before the previous one fully terminates can also cause problems in clusters with scarce resources or tight budgets. Scarce resources can be difficult to obtain, so replacement pods may wait a long time for resources and may only find a node once the existing pods have terminated. If the cluster autoscaler is enabled, the replacement Pods might also produce undesired scale-ups.

On the other hand, if a replacement Pod is not created immediately, the Job status would show that the number of active pods does not match the desired parallelism. To provide better visibility, the Job status gains a new field that tracks the number of Pods that are currently terminating.

This new field can also be used by queueing controllers, such as Kueue, to count terminating pods when calculating quotas.
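
With the feature enabled, a Job's status might look roughly like the following (an illustrative sketch based on the description above; the counts are hypothetical and the exact status layout is defined by the KEP):

```yaml
status:
  active: 2       # pods currently running
  terminating: 1  # pods with a deletionTimestamp that have not yet fully terminated
  failed: 0
```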
25+
26+
### How can I use it
27+
28+
This is an alpha feature, which means you have to enable the `JobPodReplacementPolicy`
29+
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/),
30+
with the command line argument `--feature-gates=JobPodReplacementPolicy=true`
31+
to the kube-apiserver.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: new
  ...
spec:
  podReplacementPolicy: Failed
  ...
```

`podReplacementPolicy` accepts either `Failed` or `TerminatingOrFailed`. When the Job's `podFailurePolicy` is set, you can only use `Failed`.
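
As an example, a Job that sets a pod failure policy together with the required `Failed` replacement policy could look like this (an illustrative sketch; the job name, image, and exit code are hypothetical):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job             # hypothetical name
spec:
  podReplacementPolicy: Failed  # the only value allowed alongside podFailurePolicy
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        operator: In
        values: [42]            # hypothetical exit code
  template:
    spec:
      restartPolicy: Never      # required when using podFailurePolicy
      containers:
      - name: main
        image: registry.example/app:latest  # hypothetical image
```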

This feature adds two things to the Job API: a new `terminating` field in the Job status and a new spec field called `podReplacementPolicy`.

The Job controller uses the `parallelism` field in the Job API to determine the number of pods that it expects to be active (not finished). If there are fewer active pods than desired, the Job controller would normally create a replacement pod right away. When `Failed` is specified, the Job controller instead waits until a terminating pod (one whose `deletionTimestamp` is set) has fully terminated before creating its replacement.
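
The replacement decision can be sketched in Go as follows (a simplified illustration of the semantics described above, not the actual Job controller code):

```go
package main

import "fmt"

// Possible values of podReplacementPolicy (simplified sketch).
const (
	TerminatingOrFailed = "TerminatingOrFailed"
	Failed              = "Failed"
)

// needsReplacement reports whether the Job controller would create a
// replacement pod. terminating means the pod's deletionTimestamp is set
// but it has not yet fully terminated; failed means the pod has failed.
func needsReplacement(policy string, terminating, failed bool) bool {
	if policy == Failed {
		// Only replace once the pod has fully terminated (and failed).
		return failed && !terminating
	}
	// TerminatingOrFailed: replace as soon as the pod starts
	// terminating or fails.
	return terminating || failed
}

func main() {
	fmt.Println(needsReplacement(TerminatingOrFailed, true, false)) // true: replaced while still terminating
	fmt.Println(needsReplacement(Failed, true, false))              // false: wait for full termination
	fmt.Println(needsReplacement(Failed, false, true))              // true: fully terminated and failed
}
```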

### How can I learn more?

- Read the KEP: [PodReplacementPolicy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated)

## JobBackoffLimitPerIndex

### Getting Involved

These features were sponsored by SIG Apps. Batch use cases are actively being improved for Kubernetes users in the batch working group.

Working groups are relatively short-lived initiatives focused on specific goals. In the case of Batch, the goal is to improve and support batch users and to enhance the Job API for common use cases. If that interests you, please join the working group either by subscribing to our [mailing list](https://groups.google.com/a/kubernetes.io/g/wg-batch) or on [Slack](https://kubernetes.slack.com/messages/wg-batch).

### Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.

We would not have been able to achieve either of these features without Aldo Culquicondor (Google) providing excellent domain knowledge and expertise throughout the Kubernetes ecosystem.
