|
| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: "Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA" |
| 4 | +date: 2024-08-19 |
| 5 | +slug: kubernetes-1-31-pod-failure-policy-for-jobs-goes-ga |
| 6 | +author: > |
| 7 | + [Michał Woźniak](https://github.com/mimowo) (Google), |
| 8 | + [Shannon Kularathna](https://github.com/shannonxtreme) (Google) |
| 9 | +--- |
| 10 | + |
| 11 | +This post describes _Pod failure policy_, which graduates to stable in Kubernetes |
| 12 | +1.31, and how to use it in your Jobs. |
| 13 | + |
| 14 | +## About Pod failure policy |
| 15 | + |
| 16 | +When you run workloads on Kubernetes, Pods might fail for a variety of reasons. |
| 17 | +Ideally, workloads like Jobs should be able to ignore transient, retriable |
| 18 | +failures and continue running to completion. |
| 19 | + |
| 20 | +To allow for these transient failures, Kubernetes Jobs include the `backoffLimit` |
| 21 | +field, which lets you specify a number of Pod failures that you're willing to tolerate |
| 22 | +during Job execution. However, if you set a large value for the `backoffLimit` field |
| 23 | +and rely solely on this field, you might notice unnecessary increases in operating |
| 24 | +costs as Pods restart excessively until the `backoffLimit` is reached. |
| 25 | + |
| 26 | +This becomes particularly problematic when running large-scale Jobs with |
| 27 | +thousands of long-running Pods across thousands of nodes. |
| 28 | + |
| 29 | +The Pod failure policy extends the backoff limit mechanism to help you reduce |
| 30 | +costs in the following ways: |
| 31 | + |
| 32 | +- Gives you control to fail the Job as soon as a non-retriable Pod failure occurs. |
| 33 | +- Allows you to ignore retriable errors without increasing the `backoffLimit` field. |
| 34 | + |
| 35 | +For example, you can use a Pod failure policy to run your workload on more affordable spot machines |
| 36 | +by ignoring Pod failures caused by |
| 37 | +[graceful node shutdown](/docs/concepts/cluster-administration/node-shutdown/#graceful-node-shutdown). |
| 38 | + |
| 39 | +The policy allows you to distinguish between retriable and non-retriable Pod |
| 40 | +failures based on container exit codes or Pod conditions in a failed Pod. |
| 41 | + |
| 42 | +## How it works |
| 43 | + |
| 44 | +You specify a Pod failure policy in the Job specification, represented as a list |
| 45 | +of rules. |
| 46 | + |
| 47 | +For each rule you define _match requirements_ based on one of the following properties: |
| 48 | + |
| 49 | +- Container exit codes: the `onExitCodes` property. |
| 50 | +- Pod conditions: the `onPodConditions` property. |
| 51 | + |
| 52 | +Additionally, for each rule, you specify one of the following actions to take |
| 53 | +when a Pod matches the rule: |
| 54 | +- `Ignore`: Do not count the failure towards the `backoffLimit` or `backoffLimitPerIndex`. |
| 55 | +- `FailJob`: Fail the entire Job and terminate all running Pods. |
| 56 | +- `FailIndex`: Fail the index corresponding to the failed Pod. |
| 57 | + This action works with the [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index) feature. |
| 58 | +- `Count`: Count the failure towards the `backoffLimit` or `backoffLimitPerIndex`. |
| 59 | + This is the default behavior. |
| 60 | + |
| 61 | +When Pod failures occur in a running Job, Kubernetes matches the |
| 62 | +failed Pod status against the list of Pod failure policy rules, in the specified |
| 63 | +order, and takes the corresponding actions for the first matched rule. |
| 64 | + |
| 65 | +Note that when you specify a Pod failure policy, you must also set |
| 66 | +`restartPolicy: Never` in the Job's Pod template. This prevents race conditions between |
| 67 | +the kubelet and the Job controller when counting Pod failures. |
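
To make the placement of these fields concrete, here is a minimal sketch of a complete Job manifest. The Job name, container name, image, and command are hypothetical and chosen only for illustration. The `podFailurePolicy`, `backoffLimit`, and `restartPolicy` fields are the ones described above.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pod-failure-policy-sketch    # hypothetical name
spec:
  completions: 4
  parallelism: 2
  backoffLimit: 6                    # applies only to failures that are counted
  podFailurePolicy:
    rules:                           # evaluated in order; the first match wins
    - action: FailJob
      onExitCodes:
        containerName: main          # optional; limits the rule to this container
        operator: In
        values: [42]
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never           # required when podFailurePolicy is set
      containers:
      - name: main                   # hypothetical container
        image: docker.io/library/bash:5
        # Simulates a non-retriable failure so that the FailJob rule matches.
        command: ["bash", "-c", "echo 'simulated failure' && exit 42"]
```

In this sketch, a failure that matches neither rule falls through to the default `Count` behavior and is counted against `backoffLimit`.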
| 68 | + |
| 69 | +### Kubernetes-initiated Pod disruptions |
| 70 | + |
| 71 | +To allow matching Pod failure policy rules against failures caused by |
| 72 | +disruptions initiated by Kubernetes, this feature introduces the `DisruptionTarget` |
| 73 | +Pod condition. |
| 74 | + |
| 75 | +Kubernetes adds this condition to any Pod, regardless of whether it's managed by |
| 76 | +a Job controller, that fails because of a retriable |
| 77 | +[disruption scenario](/docs/concepts/workloads/pods/disruptions/#pod-disruption-conditions). |
| 78 | +The `DisruptionTarget` condition contains one of the following reasons, each of which |
| 79 | +corresponds to a disruption scenario: |
| 80 | + |
| 81 | +- `PreemptionByKubeScheduler`: [Preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption) |
| 82 | + by `kube-scheduler` to accommodate a new Pod that has a higher priority. |
| 83 | +- `DeletionByTaintManager`: The Pod is due to be deleted by |
| 84 | + `kube-controller-manager` due to a `NoExecute` [taint](/docs/concepts/scheduling-eviction/taint-and-toleration/) |
| 85 | + that the Pod doesn't tolerate. |
| 86 | +- `EvictionByEvictionAPI`: The Pod is due to be deleted by an |
| 87 | + [API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/). |
| 88 | +- `DeletionByPodGC`: The Pod is bound to a node that no longer exists, and is due to |
| 89 | + be deleted by [Pod garbage collection](/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection). |
| 90 | +- `TerminationByKubelet`: The Pod was terminated by |
| 91 | + [graceful node shutdown](/docs/concepts/cluster-administration/node-shutdown/#graceful-node-shutdown), |
| 92 | + [node pressure eviction](/docs/concepts/scheduling-eviction/node-pressure-eviction/), |
| 93 | + or preemption for [system critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/). |
| 94 | + |
| 95 | +In all other disruption scenarios, like eviction due to exceeding |
| 96 | +[Pod container limits](/docs/concepts/configuration/manage-resources-containers/), |
| 97 | +Pods don't receive the `DisruptionTarget` condition because the disruptions were |
| 98 | +likely caused by the Pod and would reoccur on retry. |
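
As a rough illustration, the status of a Pod that failed because of scheduler preemption might contain an entry like the one below. This is an abbreviated sketch, not a complete Pod object, and the `message` value is a placeholder rather than the exact text Kubernetes writes.

```yaml
# Illustrative excerpt of a failed Pod's status (other fields omitted).
status:
  phase: Failed
  conditions:
  - type: DisruptionTarget
    status: "True"
    reason: PreemptionByKubeScheduler
    message: "<details set by the component that disrupted the Pod>"  # placeholder
```

An `onPodConditions` pattern in a Pod failure policy rule matches on the condition's `type` and `status`, so a single rule for `DisruptionTarget` covers all of the reasons listed above.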
| 99 | + |
| 100 | +### Example |
| 101 | + |
| 102 | +The Pod failure policy snippet below demonstrates an example use: |
| 103 | + |
```yaml
podFailurePolicy:
  rules:
    # Retriable: disruptions initiated by Kubernetes are not counted.
    - action: Ignore
      onPodConditions:
        - type: DisruptionTarget
    # Non-retriable: a custom condition marks a configuration problem.
    - action: FailJob
      onPodConditions:
        - type: ConfigIssue
    # Non-retriable: the application signals a fatal error with exit code 42.
    - action: FailJob
      onExitCodes:
        operator: In
        values: [ 42 ]
```
| 118 | +
|
| 119 | +In this example, the Pod failure policy does the following: |
| 120 | +
|
| 121 | +- Ignores any failed Pods that have the built-in `DisruptionTarget` |
| 122 | + condition. These Pods don't count towards Job backoff limits. |
| 123 | +- Fails the Job if any failed Pods have the custom user-supplied |
| 124 | + `ConfigIssue` condition, which is added by either a custom controller or a webhook. |
| 125 | +- Fails the Job if any containers exited with the exit code 42. |
| 126 | +- Counts all other Pod failures towards the default `backoffLimit` (or |
| 127 | + `backoffLimitPerIndex` if used). |
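
The `FailIndex` action is not used in the example above. The following sketch shows how it might be combined with `backoffLimitPerIndex` in an Indexed Job, assuming the Backoff limit per index feature is available in your cluster. The Job name, container name, image, command, and numeric values are hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fail-index-sketch            # hypothetical name
spec:
  completionMode: Indexed            # backoffLimitPerIndex requires an Indexed Job
  completions: 10
  parallelism: 5
  backoffLimitPerIndex: 2            # counted failures tolerated per index
  maxFailedIndexes: 3                # fail the whole Job once more than 3 indexes have failed
  podFailurePolicy:
    rules:
    - action: Ignore                 # disruptions do not count against any index
      onPodConditions:
      - type: DisruptionTarget
    - action: FailIndex              # fail only the index of the failed Pod, without retries
      onExitCodes:
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker                 # hypothetical container
        image: docker.io/library/bash:5
        command: ["bash", "-c", "exit 0"]  # stand-in for the real workload
```

With this configuration, a Pod that exits with code 42 fails only its own index, while the remaining indexes continue to run.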
| 128 | + |
| 129 | +## Learn more |
| 130 | + |
| 131 | +- For a hands-on guide to using Pod failure policy, see |
| 132 | + [Handling retriable and non-retriable pod failures with Pod failure policy](/docs/tasks/job/pod-failure-policy/) |
| 133 | +- Read the documentation for |
| 134 | + [Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy) and |
| 135 | + [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index) |
| 136 | +- Read the documentation for |
| 137 | + [Pod disruption conditions](/docs/concepts/workloads/pods/disruptions/#pod-disruption-conditions) |
| 138 | +- Read the KEP for [Pod failure policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures) |
| 139 | + |
| 140 | +## Related work |
| 141 | + |
| 142 | +Based on the concepts introduced by Pod failure policy, the following additional work is in progress: |
| 143 | +- JobSet integration: [Configurable Failure Policy API](https://github.com/kubernetes-sigs/jobset/issues/262) |
| 144 | +- [Pod failure policy extension to add more granular failure reasons](https://github.com/kubernetes/enhancements/issues/4443) |
| 145 | +- Support for Pod failure policy via JobSet in [Kubeflow Training v2](https://github.com/kubeflow/training-operator/pull/2171) |
| 146 | +- Proposal: [Disrupted Pods should be removed from endpoints](https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8) |
| 147 | + |
| 148 | +## Get involved |
| 149 | + |
| 150 | +This work was sponsored by the |
| 151 | +[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch) |
| 152 | +in close collaboration with the |
| 153 | +[SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps), |
| 154 | +[SIG Node](https://github.com/kubernetes/community/tree/master/sig-node), |
| 155 | +and [SIG Scheduling](https://github.com/kubernetes/community/tree/master/sig-scheduling) |
| 156 | +communities. |
| 157 | + |
| 158 | +If you are interested in working on new features in this space, we recommend |
| 159 | +subscribing to our [Slack](https://kubernetes.slack.com/messages/wg-batch) |
| 160 | +channel and attending the regular community meetings. |
| 161 | + |
| 162 | +## Acknowledgments |
| 163 | + |
| 164 | +I would love to thank everyone who was involved in this project over the years - |
| 165 | +it's been a journey and a joint community effort! The list below is |
| 166 | +my best-effort attempt to remember and recognize people who made an impact. |
| 167 | +Thank you! |
| 168 | + |
| 169 | +- [Aldo Culquicondor](https://github.com/alculquicondor/) for guidance and reviews throughout the process |
| 170 | +- [Jordan Liggitt](https://github.com/liggitt) for KEP and API reviews |
| 171 | +- [David Eads](https://github.com/deads2k) for API reviews |
| 172 | +- [Maciej Szulik](https://github.com/soltysh) for KEP reviews from SIG Apps PoV |
| 173 | +- [Clayton Coleman](https://github.com/smarterclayton) for guidance and SIG Node reviews |
| 174 | +- [Sergey Kanzhelev](https://github.com/SergeyKanzhelev) for KEP reviews from SIG Node PoV |
| 175 | +- [Dawn Chen](https://github.com/dchen1107) for KEP reviews from SIG Node PoV |
| 176 | +- [Daniel Smith](https://github.com/lavalamp) for reviews from SIG API machinery PoV |
| 177 | +- [Antoine Pelisse](https://github.com/apelisse) for reviews from SIG API machinery PoV |
| 178 | +- [John Belamaric](https://github.com/johnbelamaric) for PRR reviews |
| 179 | +- [Filip Křepinský](https://github.com/atiratree) for thorough reviews from SIG Apps PoV and bug-fixing |
| 180 | +- [David Porter](https://github.com/bobbypage) for thorough reviews from SIG Node PoV |
| 181 | +- [Jensen Lo](https://github.com/jensentanlo) for early requirements discussions, testing and reporting issues |
| 182 | +- [Daniel Vega-Myhre](https://github.com/danielvegamyhre) for advancing JobSet integration and reporting issues |
| 183 | +- [Abdullah Gharaibeh](https://github.com/ahg-g) for early design discussions and guidance |
| 184 | +- [Antonio Ojea](https://github.com/aojea) for test reviews |
| 185 | +- [Yuki Iwai](https://github.com/tenzen-y) for reviews and aligning implementation of the closely related Job features |
| 186 | +- [Kevin Hannon](https://github.com/kannon92) for reviews and aligning implementation of the closely related Job features |
| 187 | +- [Tim Bannister](https://github.com/sftim) for docs reviews |
| 188 | +- [Shannon Kularathna](https://github.com/shannonxtreme) for docs reviews |
| 189 | +- [Paola Cortés](https://github.com/cortespao) for docs reviews |