Commit 65ce348

Merge pull request #46896 from mimowo/pod-failure-policy-blog
blogpost: Pod Failure Policy for Jobs Goes GA
2 parents 70e2555 + 9e7950a commit 65ce348

1 file changed: 189 additions, 0 deletions
@@ -0,0 +1,189 @@
---
layout: blog
title: "Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA"
date: 2024-08-19
slug: kubernetes-1-31-pod-failure-policy-for-jobs-goes-ga
author: >
  [Michał Woźniak](https://github.com/mimowo) (Google),
  [Shannon Kularathna](https://github.com/shannonxtreme) (Google)
---

This post describes _Pod failure policy_, which graduates to stable in Kubernetes
1.31, and how to use it in your Jobs.

## About Pod failure policy

When you run workloads on Kubernetes, Pods might fail for a variety of reasons.
Ideally, workloads like Jobs should be able to ignore transient, retriable
failures and continue running to completion.

To allow for these transient failures, Kubernetes Jobs include the `backoffLimit`
field, which lets you specify a number of Pod failures that you're willing to tolerate
during Job execution. However, if you set a large value for the `backoffLimit` field
and rely solely on this field, you might notice unnecessary increases in operating
costs as Pods restart excessively until the `backoffLimit` is met.
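
For reference, a Job that relies only on `backoffLimit` might look like the
following minimal sketch. The Job name, image, command, and the limit of 6 are
illustrative assumptions and do not come from this post.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-limit-example        # hypothetical name
spec:
  # Number of retries before the Job is marked as failed (illustrative value).
  backoffLimit: 6
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5   # illustrative image
        command: ["bash", "-c", "echo working && sleep 5"]
```

With only this field set, every Pod failure counts the same way, whatever
caused it.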

This becomes particularly problematic when running large-scale Jobs with
thousands of long-running Pods across thousands of nodes.

The Pod failure policy extends the backoff limit mechanism to help you reduce
costs in the following ways:

- Gives you control to fail the Job as soon as a non-retriable Pod failure occurs.
- Allows you to ignore retriable errors without increasing the `backoffLimit` field.

For example, you can use a Pod failure policy to run your workload on more affordable spot machines
by ignoring Pod failures caused by
[graceful node shutdown](/docs/concepts/cluster-administration/node-shutdown/#graceful-node-shutdown).

The policy allows you to distinguish between retriable and non-retriable Pod
failures based on container exit codes or Pod conditions in a failed Pod.

## How it works

You specify a Pod failure policy as a list of rules in the Job specification.

For each rule, you define _match requirements_ based on one of the following properties:

- Container exit codes: the `onExitCodes` property.
- Pod conditions: the `onPodConditions` property.

Additionally, for each rule, you specify one of the following actions to take
when a Pod matches the rule:

- `Ignore`: Do not count the failure towards the `backoffLimit` or `backoffLimitPerIndex`.
- `FailJob`: Fail the entire Job and terminate all running Pods.
- `FailIndex`: Fail the index corresponding to the failed Pod.
  This action works with the [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index) feature.
- `Count`: Count the failure towards the `backoffLimit` or `backoffLimitPerIndex`.
  This is the default behavior.
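
To illustrate the `FailIndex` action, here is a hedged sketch of an Indexed Job
that combines it with `backoffLimitPerIndex`. The Job name, image, script, and
numeric values are assumptions made for illustration, and the manifest
presumes that the Backoff limit per index feature is available in your cluster.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fail-index-example           # hypothetical name
spec:
  completions: 4
  parallelism: 2
  completionMode: Indexed            # backoffLimitPerIndex applies to Indexed Jobs
  backoffLimitPerIndex: 1            # retries allowed per index (illustrative value)
  podFailurePolicy:
    rules:
    # Exit code 42 is treated as non-retriable for that index only.
    - action: FailIndex
      onExitCodes:
        operator: In
        values: [ 42 ]
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5   # illustrative image
        command: ["bash", "-c"]
        args:
        - |
          # Illustrative script: index 0 fails with the non-retriable code.
          if [ "$JOB_COMPLETION_INDEX" = "0" ]; then exit 42; fi
          echo "index $JOB_COMPLETION_INDEX done"
```

In this sketch, index 0 should be marked as failed without being retried, while
the other indexes keep running.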

When Pod failures occur in a running Job, Kubernetes matches the
failed Pod status against the list of Pod failure policy rules, in the specified
order, and takes the action of the first rule that matches.

Note that when specifying the Pod failure policy, you must also set
`restartPolicy: Never` in the Job's Pod template. This prevents race conditions between
the kubelet and the Job controller when counting Pod failures.
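
The following sketch shows where the policy and `restartPolicy: Never` sit in a
complete Job manifest, and how rule order comes into play. The Job name, image,
command, and numeric values are illustrative assumptions.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pod-failure-policy-example   # hypothetical name
spec:
  completions: 12
  parallelism: 3
  backoffLimit: 6
  podFailurePolicy:
    rules:
    # Rules are evaluated in order; the first rule that matches wins.
    - action: FailJob
      onExitCodes:
        containerName: main          # optional: restrict the check to one container
        operator: In
        values: [ 42 ]
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never           # required when podFailurePolicy is set
      containers:
      - name: main
        image: docker.io/library/bash:5   # illustrative image
        command: ["bash", "-c", "echo 'Hello world!' && sleep 5 && exit 42"]
```

Because `FailJob` is listed first, a failed Pod whose `main` container exited
with code 42 fails the Job even if that Pod also carries the `DisruptionTarget`
condition.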

### Kubernetes-initiated Pod disruptions

To allow matching Pod failure policy rules against failures caused by
disruptions initiated by Kubernetes, this feature introduces the `DisruptionTarget`
Pod condition.

Kubernetes adds this condition to any Pod, regardless of whether it's managed by
a Job controller, that fails because of a retriable
[disruption scenario](/docs/concepts/workloads/pods/disruptions/#pod-disruption-conditions).
The `DisruptionTarget` condition contains one of the following reasons, each of which
corresponds to a disruption scenario:

- `PreemptionByKubeScheduler`: [Preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption)
  by `kube-scheduler` to accommodate a new Pod that has a higher priority.
- `DeletionByTaintManager`: the Pod is due to be deleted by
  `kube-controller-manager` because of a `NoExecute` [taint](/docs/concepts/scheduling-eviction/taint-and-toleration/)
  that the Pod doesn't tolerate.
- `EvictionByEvictionAPI`: the Pod is due to be deleted by an
  [API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/).
- `DeletionByPodGC`: the Pod is bound to a node that no longer exists, and is due to
  be deleted by [Pod garbage collection](/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection).
- `TerminationByKubelet`: the Pod was terminated by
  [graceful node shutdown](/docs/concepts/cluster-administration/node-shutdown/#graceful-node-shutdown),
  [node pressure eviction](/docs/concepts/scheduling-eviction/node-pressure-eviction/),
  or preemption for [system critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/).

In all other disruption scenarios, like eviction due to exceeding
[Pod container limits](/docs/concepts/configuration/manage-resources-containers/),
Pods don't receive the `DisruptionTarget` condition because the disruptions were
likely caused by the Pod and would reoccur on retry.
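
As a rough illustration, the status of a Pod that was preempted by the scheduler
might contain a fragment like the one below. The `message` value is a
placeholder rather than real output, and the other fields of the status are
omitted.

```yaml
# Fragment of a failed Pod's status (other fields omitted).
status:
  phase: Failed
  conditions:
  - type: DisruptionTarget
    status: "True"
    reason: PreemptionByKubeScheduler
    message: "<details set by the component that disrupted the Pod>"   # placeholder
```

An `onPodConditions` rule matches on the condition `type` (and, optionally, its
`status`), so a rule with `type: DisruptionTarget` would apply to this Pod
regardless of which of the listed reasons is set.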

### Example

The Pod failure policy snippet below demonstrates an example use:

```yaml
podFailurePolicy:
  rules:
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
  - action: FailJob
    onPodConditions:
    - type: ConfigIssue
  - action: FailJob
    onExitCodes:
      operator: In
      values: [ 42 ]
```

In this example, the Pod failure policy does the following:

- Ignores any failed Pods that have the built-in `DisruptionTarget`
  condition. These Pods don't count towards Job backoff limits.
- Fails the Job if any failed Pod has the custom, user-supplied
  `ConfigIssue` condition, added by either a custom controller or a webhook.
- Fails the Job if any container exited with the exit code 42.
- Counts all other Pod failures towards the default `backoffLimit` (or
  `backoffLimitPerIndex` if used).

## Learn more

- For a hands-on guide to using Pod failure policy, see
  [Handling retriable and non-retriable pod failures with Pod failure policy](/docs/tasks/job/pod-failure-policy/)
- Read the documentation for
  [Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy) and
  [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index)
- Read the documentation for
  [Pod disruption conditions](/docs/concepts/workloads/pods/disruptions/#pod-disruption-conditions)
- Read the KEP for [Pod failure policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures)

## Related work

Based on the concepts introduced by Pod failure policy, the following additional work is in progress:

- JobSet integration: [Configurable Failure Policy API](https://github.com/kubernetes-sigs/jobset/issues/262)
- [Pod failure policy extension to add more granular failure reasons](https://github.com/kubernetes/enhancements/issues/4443)
- Support for Pod failure policy via JobSet in [Kubeflow Training v2](https://github.com/kubeflow/training-operator/pull/2171)
- Proposal: [Disrupted Pods should be removed from endpoints](https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8)

## Get involved

This work was sponsored by the
[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch)
in close collaboration with the
[SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps),
[SIG Node](https://github.com/kubernetes/community/tree/master/sig-node),
and [SIG Scheduling](https://github.com/kubernetes/community/tree/master/sig-scheduling)
communities.

If you are interested in working on new features in this space, we recommend
subscribing to our [Slack](https://kubernetes.slack.com/messages/wg-batch)
channel and attending the regular community meetings.

## Acknowledgments

I would love to thank everyone who was involved in this project over the years -
it's been a journey and a joint community effort! The list below is
my best-effort attempt to remember and recognize people who made an impact.
Thank you!

- [Aldo Culquicondor](https://github.com/alculquicondor/) for guidance and reviews throughout the process
- [Jordan Liggitt](https://github.com/liggitt) for KEP and API reviews
- [David Eads](https://github.com/deads2k) for API reviews
- [Maciej Szulik](https://github.com/soltysh) for KEP reviews from SIG Apps PoV
- [Clayton Coleman](https://github.com/smarterclayton) for guidance and SIG Node reviews
- [Sergey Kanzhelev](https://github.com/SergeyKanzhelev) for KEP reviews from SIG Node PoV
- [Dawn Chen](https://github.com/dchen1107) for KEP reviews from SIG Node PoV
- [Daniel Smith](https://github.com/lavalamp) for reviews from SIG API machinery PoV
- [Antoine Pelisse](https://github.com/apelisse) for reviews from SIG API machinery PoV
- [John Belamaric](https://github.com/johnbelamaric) for PRR reviews
- [Filip Křepinský](https://github.com/atiratree) for thorough reviews from SIG Apps PoV and bug-fixing
- [David Porter](https://github.com/bobbypage) for thorough reviews from SIG Node PoV
- [Jensen Lo](https://github.com/jensentanlo) for early requirements discussions, testing and reporting issues
- [Daniel Vega-Myhre](https://github.com/danielvegamyhre) for advancing JobSet integration and reporting issues
- [Abdullah Gharaibeh](https://github.com/ahg-g) for early design discussions and guidance
- [Antonio Ojea](https://github.com/aojea) for test reviews
- [Yuki Iwai](https://github.com/tenzen-y) for reviews and aligning implementation of the closely related Job features
- [Kevin Hannon](https://github.com/kannon92) for reviews and aligning implementation of the closely related Job features
- [Tim Bannister](https://github.com/sftim) for docs reviews
- [Shannon Kularathna](https://github.com/shannonxtreme) for docs reviews
- [Paola Cortés](https://github.com/cortespao) for docs reviews
