---
layout: blog
title: "Kubernetes v1.34: Finer-Grained Control Over Container Restarts"
date: 2025-0X-XX
draft: false
slug: kubernetes-v1-34-per-container-restart-policy
author: >
  [Yuan Wang](https://github.com/yuanwang04)
---

Kubernetes 1.34 introduces a new alpha feature that gives you more granular control over container restarts within a Pod. This feature, named **Container Restart Policy and Rules**, allows you to specify a restart policy for each container individually, overriding the Pod's global restart policy. It also allows you to restart individual containers conditionally, based on their exit codes. The feature is available behind the alpha feature gate `ContainerRestartRules`.

This has been a long-requested feature. Let's dive into how it works and how you can use it.

## The problem with a single restart policy

Before this feature, the `restartPolicy` was set at the Pod level. This meant that all containers in a Pod shared the same restart policy (`Always`, `OnFailure`, or `Never`). While this works for many use cases, it can be limiting in others.

For example, consider a Pod with a main application container and an init container that performs some initial setup. You might want the main container to be restarted whenever it fails, but the init container to run only once and never be retried. With a single Pod-level restart policy, this wasn't possible.
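
For comparison, here is a minimal sketch of how such a Pod had to be written without per-container policies. The Pod name, container names and commands are hypothetical placeholders. The only restart setting is the Pod-level `restartPolicy`, and it governs the init container and the main container alike. With `OnFailure` (or `Always`), a failing init container is retried until it succeeds. With `Never`, the init container runs once, but the main container is never restarted either.

```yaml
# Hypothetical Pod that relies only on the Pod-level restart policy.
apiVersion: v1
kind: Pod
metadata:
  name: single-policy-example # hypothetical name
spec:
  # This one policy applies to every container below.
  # OnFailure/Always: a failing init container is retried until it succeeds.
  # Never: the init container runs once, but the main container is never restarted.
  restartPolicy: OnFailure
  initContainers:
  - name: setup # hypothetical init container
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Setting up" && sleep 5']
  containers:
  - name: main # hypothetical main container
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 1800']
```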

## Introducing per-container restart policies

With the `ContainerRestartRules` feature gate enabled, you can specify a `restartPolicy` for each container in your Pod's spec. You can also define `restartPolicyRules` to control restarts based on exit codes. This gives you the fine-grained control you need to handle complex scenarios.

## Use cases

Let's look at some real-life use cases where per-container restart policies can be beneficial.

### In-place restarts for training jobs

In ML research, it's common to orchestrate a large number of long-running AI/ML training workloads, and in these scenarios workload failures are unavoidable. When a workload fails with a retriable exit code, you want the container to restart quickly without the Pod being rescheduled, because rescheduling consumes a significant amount of time and resources. Restarting the failed container in place is critical for making good use of compute resources. However, the container should only restart in place if it failed due to a retriable error; otherwise, the container and Pod should terminate and possibly be rescheduled.

This can now be achieved with container-level `restartPolicyRules`. The workload can exit with different codes to represent retriable and non-retriable errors. With `restartPolicyRules`, the workload can be restarted in place quickly, but only when the error is retriable.
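
As a sketch, assume a training container that signals retriable failures, such as a transient interruption, with the hypothetical exit codes 42 and 43. Any other non-zero exit code marks a non-retriable error. A single rule can list both retriable codes, and the Pod-level `restartPolicy: Never` lets the Pod fail on any other error. The image, entrypoint and exit codes below are placeholders chosen for illustration, not part of the feature.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker # hypothetical name
spec:
  restartPolicy: Never # Non-retriable failures end the Pod
  containers:
  - name: trainer
    image: example.com/ml/trainer:latest # hypothetical image
    command: ['python', 'train.py'] # hypothetical entrypoint that exits 42 or 43 on retriable errors
    restartPolicy: Never # Must be specified when restartPolicyRules is set
    restartPolicyRules:
    - action: Restart # Restart the container in place instead of failing the Pod
      exitCodes:
        operator: In
        values: [42, 43] # hypothetical retriable exit codes
```

Keeping the retriable codes in a single `In` list means that only the training code needs to decide which failures are worth retrying. The Pod spec only records that decision.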

### Try-once init containers

Init containers are often used to perform initialization work for the main container, such as setting up environments and credentials. Sometimes, you want the main container to always be restarted, but you don't want to retry initialization if it fails.

With a container-level `restartPolicy`, this is now possible. The init container can run only once, and its failure is treated as a Pod failure. If the initialization succeeds, the main container can always be restarted.

### Pods with multiple containers

For Pods that run multiple containers, you might have different restart requirements for each container. Some containers might have a clear definition of success and should only be restarted on failure. Others might need to be always restarted.

This is now possible with a container-level `restartPolicy`, allowing individual containers to have different restart policies.
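
A container-level policy can also be less restrictive than the Pod's. The sketch below sets `restartPolicy: Never` on the Pod and lets only one container opt in to restarts on failure. The other container sets no policy of its own, so it falls back to the Pod-level `Never`. The Pod name, container names and commands are hypothetical. Check the container restart policy documentation for the exact validation rules in your Kubernetes version.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mixed-restart-policies # hypothetical name
spec:
  restartPolicy: Never # Used by containers that do not set their own policy
  containers:
  - name: run-once-task # hypothetical: falls back to the Pod-level Never
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Running once" && sleep 10']
  - name: retry-on-failure # hypothetical: overrides the Pod-level policy
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Failing" && sleep 10 && exit 1']
    restartPolicy: OnFailure
```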

## How to use it

To use this new feature, you need to enable the `ContainerRestartRules` feature gate on the control plane and on the worker nodes of a cluster running Kubernetes 1.34 or later. Once the gate is enabled, you can specify the `restartPolicy` and `restartPolicyRules` fields in your container definitions.
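
On the worker nodes, one way to enable the gate is through the kubelet configuration file, sketched below. How you pass configuration to the control plane depends on how your cluster is deployed. For kube-apiserver, the equivalent is the standard `--feature-gates=ContainerRestartRules=true` command-line flag.

```yaml
# Kubelet configuration file, passed to the kubelet with --config.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ContainerRestartRules: true # Alpha gate; disabled by default
```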

Here are some examples:

### Example 1: Restarting on specific exit codes

In this example, the container should restart if and only if it fails with a retriable error, represented by exit code 42.

To achieve this, the container has `restartPolicy: Never` and a restart policy rule that tells Kubernetes to restart the container in place if it exits with code 42.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restart-on-exit-codes
  annotations:
    kubernetes.io/description: "This Pod restarts the container only when it exits with code 42."
spec:
  restartPolicy: Never
  containers:
  - name: restart-on-exit-codes
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 60 && exit 0']
    restartPolicy: Never # Container restart policy must be specified if rules are specified
    restartPolicyRules: # Only restart the container if it exits with code 42
    - action: Restart
      exitCodes:
        operator: In
        values: [42]
```
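
As written, the command above exits with code 0, so the container runs once and the Pod completes. To see the rule take effect, you can use a hypothetical variant whose command exits with code 42 on every run. You would then expect the container to be restarted in place, with its restart count (the `RESTARTS` column of `kubectl get pod`) increasing while the Pod stays on the same node.

```yaml
# Hypothetical variant of Example 1 that triggers the restart rule on every run.
apiVersion: v1
kind: Pod
metadata:
  name: restart-on-exit-codes-demo # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: exit-42
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 10 && exit 42'] # Always matches the rule below
    restartPolicy: Never
    restartPolicyRules:
    - action: Restart
      exitCodes:
        operator: In
        values: [42]
```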

### Example 2: A try-once init container

In this example, the main container should always be restarted once initialization succeeds. However, initialization should only be attempted once.

To achieve this, the Pod has an `Always` restart policy, while the `init-once` init container has a container-level `Never` policy, so it is tried only once. If it fails, the Pod fails. This allows the Pod to fail if initialization fails, but also to keep running once initialization succeeds.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fail-pod-if-init-fails
  annotations:
    kubernetes.io/description: "This Pod has an init container that runs only once. After initialization succeeds, the main container will always be restarted."
spec:
  restartPolicy: Always
  initContainers:
  - name: init-once # This init container will only try once. If it fails, the Pod will fail.
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Failing initialization" && sleep 10 && exit 1']
    restartPolicy: Never
  containers:
  - name: main-container # This container will always be restarted once initialization succeeds.
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 1800 && exit 0']
```

### Example 3: Containers with different restart policies

In this example, there are two containers with different restart requirements. One should always be restarted, while the other should only be restarted on failure.

This is achieved by using a different container-level `restartPolicy` on each of the two containers.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: on-failure-pod
  annotations:
    kubernetes.io/description: "This Pod has two containers with different restart policies."
spec:
  # No Pod-level restartPolicy is set, so it defaults to Always.
  containers:
  - name: restart-on-failure
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Not restarting after success" && sleep 10 && exit 0']
    restartPolicy: OnFailure
  - name: restart-always
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Always restarting" && sleep 1800 && exit 0']
    restartPolicy: Always
```

## Learn more

- Read the documentation for [container restart policy](/docs/concepts/workloads/pod-lifecycle/#container-restart-rules).
- Read the KEP for the [Container Restart Rules](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/5307-container-restart-policy).

## Roadmap

More actions and signals to restart Pods and containers are coming! Notably, there are plans to add support for restarting the entire Pod. Planning and discussions on these features are in progress. Feel free to share feedback or requests with the SIG Node community!

## Your feedback is welcome!

This is an alpha feature, and the Kubernetes project would love to hear your feedback. Please try it out. This feature is driven by [SIG Node](https://github.com/Kubernetes/community/blob/master/sig-node/README.md). If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please reach out to the SIG Node community!

You can reach SIG Node by several means:
- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)
