---
layout: blog
title: "Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA"
date: 2025-0X-XX
draft: true
slug: kubernetes-v1-34-pod-replacement-policy-for-jobs-goes-ga
author: >
  [Dejan Zele Pejchev](https://github.com/dejanzele) (G-Research)
---

In Kubernetes v1.34, the _Pod replacement policy_ feature has reached general availability (GA).
This blog post describes the Pod replacement policy feature and how to use it in your Jobs.

## About Pod Replacement Policy

By default, the Job controller immediately recreates Pods as soon as they fail or begin terminating (when they have a deletion timestamp).

As a result, while some Pods are terminating, the total number of running Pods for a Job can temporarily exceed the specified parallelism.
For Indexed Jobs, this can even mean multiple Pods running for the same index at the same time.

This behavior works fine for many workloads, but it can cause problems in certain cases.

For example, popular machine learning frameworks like TensorFlow and
[JAX](https://jax.readthedocs.io/en/latest/) expect exactly one Pod per worker index.
If two Pods run at the same time, you might encounter errors such as:
```
/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4
```

Additionally, starting replacement Pods before the old ones fully terminate can lead to:
- Scheduling delays by kube-scheduler, because the terminating Pods still occupy their nodes.
- Unnecessary cluster scale-ups to accommodate the replacement Pods.
- Temporary bypassing of quota checks by workload orchestrators like [Kueue](https://kueue.sigs.k8s.io/).

With Pod replacement policy, Kubernetes gives you control over when the control plane
replaces terminating Pods, helping you avoid these issues.

## How Pod Replacement Policy works

This enhancement means that Jobs in Kubernetes have an optional field `.spec.podReplacementPolicy`.
You can choose one of two policies:
- `TerminatingOrFailed` (default): Replaces Pods as soon as they start terminating.
- `Failed`: Replaces Pods only after they fully terminate and transition to the `Failed` phase.

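In a Job manifest, the policy is a single field under `spec`. Here is a minimal sketch showing only that field (the rest of the Job spec is omitted):
```yaml
spec:
  podReplacementPolicy: Failed   # or TerminatingOrFailed
```
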
Setting the policy to `Failed` ensures that a new Pod is only created after the previous one has completely terminated.

For Jobs with a Pod Failure Policy, the default `podReplacementPolicy` is `Failed`, and no other value is allowed.
See [Pod Failure Policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy) to learn more about Pod Failure Policies for Jobs.

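For illustration, here is a sketch of a Job that defines a Pod Failure Policy; the Job name, container name, and exit code are placeholders, and its `podReplacementPolicy` can only be `Failed`:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-pod-failure-policy   # placeholder name
spec:
  completions: 2
  parallelism: 2
  podReplacementPolicy: Failed         # the only value allowed together with podFailurePolicy
  podFailurePolicy:
    rules:
    - action: FailJob                  # fail the whole Job if the container exits with this code
      onExitCodes:
        containerName: worker
        operator: In
        values: [42]                   # placeholder exit code
  template:
    spec:
      restartPolicy: Never             # podFailurePolicy requires restartPolicy: Never
      containers:
      - name: worker
        image: your-image
```
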
You can check how many Pods are currently terminating by inspecting the Job’s `.status.terminating` field:

```shell
kubectl get job myjob -o=jsonpath='{.status.terminating}'
```

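As a quick sketch (reusing the `myjob` name from above), you can also print the active and terminating counts side by side:
```shell
kubectl get job myjob -o=jsonpath='{.status.active} active, {.status.terminating} terminating{"\n"}'
```
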
## Example

Here’s a Job example that executes a task two times (`spec.completions: 2`) in parallel (`spec.parallelism: 2`) and
replaces Pods only after they fully terminate (`spec.podReplacementPolicy: Failed`):
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  completions: 2
  parallelism: 2
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: your-image
```

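Assuming you save the manifest above as `example-job.yaml` (the filename is arbitrary), you can create the Job with:
```shell
# example-job.yaml is whatever filename you saved the manifest as
kubectl apply -f example-job.yaml
```
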
If a Pod receives a SIGTERM signal (deletion, eviction, preemption...), it begins terminating.
When the container handles termination gracefully, cleanup may take some time.

When the Job starts, you will see two Pods running:
```shell
kubectl get pods

NAME                READY   STATUS    RESTARTS   AGE
example-job-qr8kf   1/1     Running   0          2s
example-job-stvb4   1/1     Running   0          2s
```

Let's delete one of the Pods (`example-job-qr8kf`).

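You can do this with `kubectl`, using the Pod name from the listing above:
```shell
kubectl delete pod example-job-qr8kf
```
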
With the `TerminatingOrFailed` policy, as soon as one Pod (`example-job-qr8kf`) starts terminating, the Job controller immediately creates a new Pod (`example-job-b59zk`) to replace it.
```shell
kubectl get pods

NAME                READY   STATUS        RESTARTS   AGE
example-job-b59zk   1/1     Running       0          1s
example-job-qr8kf   1/1     Terminating   0          17s
example-job-stvb4   1/1     Running       0          17s
```

With the `Failed` policy, the new Pod (`example-job-b59zk`) is not created while the old Pod (`example-job-qr8kf`) is terminating.
```shell
kubectl get pods

NAME                READY   STATUS        RESTARTS   AGE
example-job-qr8kf   1/1     Terminating   0          17s
example-job-stvb4   1/1     Running       0          17s
```

When the terminating Pod has fully transitioned to the `Failed` phase, a new Pod is created:
```shell
kubectl get pods

NAME                READY   STATUS    RESTARTS   AGE
example-job-b59zk   1/1     Running   0          1s
example-job-stvb4   1/1     Running   0          25s
```

## How can you learn more?

- Read the user-facing documentation for [Pod Replacement Policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy),
  [Backoff Limit per Index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index), and
  [Pod Failure Policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy).
- Read the KEPs for [Pod Replacement Policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated),
  [Backoff Limit per Index](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs), and
  [Pod Failure Policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures).

## Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this
done, from testing and filing bugs to reviewing code.

As this feature moves to stable after 2 years, we would like to thank the following people:
* [Kevin Hannon](https://github.com/kannon92) - for writing the KEP and the initial implementation.
* [Michał Woźniak](https://github.com/mimowo) - for guidance, mentorship, and reviews.
* [Aldo Culquicondor](https://github.com/alculquicondor) - for guidance, mentorship, and reviews.
* [Maciej Szulik](https://github.com/soltysh) - for guidance, mentorship, and reviews.
* [Dejan Zele Pejchev](https://github.com/dejanzele) - for taking over the feature and promoting it from Alpha through Beta to GA.

## Get involved

This work was sponsored by the Kubernetes
[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch)
in close collaboration with the
[SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps) community.

If you are interested in working on new features in this space, we recommend
subscribing to our [Slack](https://kubernetes.slack.com/messages/wg-batch)
channel and attending the regular community meetings.
