---
layout: blog
title: "Kubernetes 1.28: Improved failure handling for Jobs"
date: 2023-08-21
slug: kubernetes-1-28-jobapi-update
---

**Authors:** Kevin Hannon (G-Research), Michał Woźniak (Google)

This blog discusses two new features in Kubernetes 1.28 to improve Jobs for batch
users: [Pod replacement policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy)
and [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index).

These features continue the effort started by the
[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
to improve the handling of Pod failures in a Job.

## Pod replacement policy {#pod-replacement-policy}

By default, when a Pod enters a terminating state (e.g. due to preemption or
eviction), Kubernetes immediately creates a replacement Pod, so both Pods are
running at the same time. In API terms, a Pod is considered terminating when it
has a `deletionTimestamp` and its phase is `Pending` or `Running`.
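
For illustration, a trimmed view of a terminating Pod could look like the sketch
below; the Pod name and timestamp are made up, and only the two fields called out
above are relevant:

```yaml
# Trimmed Pod object while it is terminating (name and timestamp are hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: myjob-0-abc12                        # hypothetical Pod created by a Job
  deletionTimestamp: "2023-08-21T10:00:00Z"  # set once deletion has been requested
status:
  phase: Running                             # still Pending or Running, so the Pod counts as terminating
```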

Having two Pods running at the same time is problematic for
some popular machine learning frameworks, such as
TensorFlow and [JAX](https://jax.readthedocs.io/en/latest/), which require at most one Pod running at a time
for a given index.
TensorFlow reports the following error if two Pods are running for a given index:

```
 /job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4
```

See more details in the [issue](https://github.com/kubernetes/kubernetes/issues/115844).


Creating the replacement Pod before the previous one fully terminates can also
cause problems in clusters with scarce resources or with tight budgets, such as:
* cluster resources can be difficult to obtain for Pods pending to be scheduled,
  as Kubernetes might take a long time to find available nodes until the existing
  Pods are fully terminated.
* if the cluster autoscaler is enabled, the replacement Pods might produce undesired
  scale-ups.

### How can you use it? {#pod-replacement-policy-how-to-use}

This is an alpha feature, which you can enable by turning on the `JobPodReplacementPolicy`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) in
your cluster.
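
How you turn on a feature gate depends on how your cluster is deployed. As a
sketch only, on a kubeadm-managed cluster you could pass the gate to the API
server and the controller manager (which runs the Job controller) through a
`ClusterConfiguration` fragment like the one below; adapt it to your own setup:

```yaml
# Sketch for a kubeadm-managed cluster; other setups configure feature gates differently.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: "JobPodReplacementPolicy=true"
controllerManager:
  extraArgs:
    feature-gates: "JobPodReplacementPolicy=true"
```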

Once the feature is enabled in your cluster, you can use it by creating a new Job that specifies a
`podReplacementPolicy` field as shown here:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: new
  ...
spec:
  podReplacementPolicy: Failed
  ...
```

In that Job, the Pods are replaced only once they reach the `Failed` phase,
and not while they are terminating.
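
For completeness, a minimal runnable manifest could look like the sketch below;
the name, image, and command are placeholders, and only `podReplacementPolicy: Failed`
is specific to this feature:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-replacement-policy   # placeholder name
spec:
  podReplacementPolicy: Failed        # replacements are created only for Pods in the Failed phase
  completions: 4
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox                # placeholder image
        command: ["sh", "-c", "sleep 30"]
```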

Additionally, you can inspect the `.status.terminating` field of a Job. The value
of the field is the number of Pods owned by the Job that are currently terminating.

```shell
kubectl get jobs/myjob -o=jsonpath='{.status.terminating}'
```

```
3 # three Pods are terminating and have not yet reached the Failed phase
```

This can be particularly useful for external queueing controllers, such as
[Kueue](https://github.com/kubernetes-sigs/kueue), which track quota
from running Pods of a Job until the resources are reclaimed from
the currently terminating Job.

Note that `podReplacementPolicy: Failed` is the default when using a custom
[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy).
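
As a sketch of what that combination looks like (the failure rule and exit code
below are illustrative, not taken from the original example):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-pod-failure-policy   # illustrative name
spec:
  backoffLimit: 6
  # Because podFailurePolicy is set, podReplacementPolicy defaults to Failed,
  # so replacement Pods are created only once a Pod has fully failed.
  podFailurePolicy:
    rules:
    - action: FailJob                 # illustrative rule: fail the whole Job on exit code 42
      onExitCodes:
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox                # placeholder image
        command: ["sh", "-c", "exit 0"]
```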

## Backoff limit per index {#backoff-limit-per-index}

By default, Pod failures for [Indexed Jobs](/docs/concepts/workloads/controllers/job/#completion-mode)
are counted towards the global limit of retries, represented by `.spec.backoffLimit`.
This means that if there is a consistently failing index, it is restarted
repeatedly until it exhausts the limit. Once the limit is reached, the entire
Job is marked failed and some indexes may never even be started.

This is problematic for use cases where you want to handle Pod failures for
every index independently. For example, you might use Indexed Jobs for running
integration tests where each index corresponds to a testing suite. In that case,
you may want to account for possible flaky tests by allowing 1 or 2 retries per
suite. There might be some buggy suites, making the corresponding
indexes fail consistently. In that case you may prefer to limit retries for
the buggy suites, while allowing the other suites to complete.

The feature allows you to:
* complete execution of all indexes, despite some indexes failing.
* better utilize the computational resources by avoiding unnecessary retries of consistently failing indexes.

### How can you use it? {#backoff-limit-per-index-how-to-use}

This is an alpha feature, which you can enable by turning on the
`JobBackoffLimitPerIndex`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
in your cluster.

Once the feature is enabled in your cluster, you can create an Indexed Job with the
`.spec.backoffLimitPerIndex` field specified.

#### Example

The following example demonstrates how to use this feature to make sure the
Job executes all indexes (provided there is no other reason for the early Job
termination, such as reaching the `activeDeadlineSeconds` timeout, or being
manually deleted by the user), and the number of failures is controlled per index.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-backoff-limit-per-index-execute-all
spec:
  completions: 8
  parallelism: 2
  completionMode: Indexed
  backoffLimitPerIndex: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: example # this example container returns an error, and fails,
                      # when it is run as the second or third index in any Job
                      # (even after a retry)
        image: python
        command:
        - python3
        - -c
        - |
          import os, sys, time
          id = int(os.environ.get("JOB_COMPLETION_INDEX"))
          if id == 1 or id == 2:
            sys.exit(1)
          time.sleep(1)
```
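
Assuming you save the manifest above as `job-backoff-limit-per-index-execute-all.yaml`
(the filename is arbitrary), you can create the Job with:

```sh
kubectl create -f job-backoff-limit-per-index-execute-all.yaml
```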

Now, inspect the Pods after the Job is finished:

```sh
kubectl get pods -l job-name=job-backoff-limit-per-index-execute-all
```

The output is similar to:

```
NAME                                              READY   STATUS      RESTARTS   AGE
job-backoff-limit-per-index-execute-all-0-b26vc   0/1     Completed   0          49s
job-backoff-limit-per-index-execute-all-1-6j5gd   0/1     Error       0          49s
job-backoff-limit-per-index-execute-all-1-6wd82   0/1     Error       0          37s
job-backoff-limit-per-index-execute-all-2-c66hg   0/1     Error       0          32s
job-backoff-limit-per-index-execute-all-2-nf982   0/1     Error       0          43s
job-backoff-limit-per-index-execute-all-3-cxmhf   0/1     Completed   0          33s
job-backoff-limit-per-index-execute-all-4-9q6kq   0/1     Completed   0          28s
job-backoff-limit-per-index-execute-all-5-z9hqf   0/1     Completed   0          28s
job-backoff-limit-per-index-execute-all-6-tbkr8   0/1     Completed   0          23s
job-backoff-limit-per-index-execute-all-7-hxjsq   0/1     Completed   0          22s
```

Additionally, you can take a look at the status for that Job:

```sh
kubectl get jobs job-backoff-limit-per-index-execute-all -o yaml
```

The output ends with a `status` similar to:

```yaml
status:
  completedIndexes: 0,3-7
  failedIndexes: 1,2
  succeeded: 6
  failed: 4
  conditions:
  - message: Job has failed indexes
    reason: FailedIndexes
    status: "True"
    type: Failed
```

Here, indexes `1` and `2` were both retried once. After the second failure of
each, the specified `.spec.backoffLimitPerIndex` was exceeded, so the retries
were stopped. For comparison, if the per-index backoff were disabled, the buggy
indexes would be retried until the global `backoffLimit` was exceeded, and then
the entire Job would be marked failed before some of the higher indexes were
started.
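
If you only care about the per-index outcome, you can also query the relevant
status fields directly (this assumes the example Job above):

```sh
kubectl get job job-backoff-limit-per-index-execute-all -o jsonpath='{.status.failedIndexes}{"\n"}{.status.completedIndexes}{"\n"}'
```

For the run shown above, this would print `1,2` followed by `0,3-7`.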

## How can you learn more?

- Read the user-facing documentation for [Pod replacement policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy),
[Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index), and
[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
- Read the KEPs for [Pod Replacement Policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated),
[Backoff limit per index](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs), and
[Pod failure policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures).

## Getting Involved

These features were sponsored by [SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps). Batch use cases are actively
being improved for Kubernetes users in the
[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch).
Working groups are relatively short-lived initiatives focused on specific goals.
The goal of WG Batch is to improve the experience for batch workload users, offer support for
batch processing use cases, and enhance the
Job API for common use cases. If that interests you, please join the working
group either by subscribing to our
[mailing list](https://groups.google.com/a/kubernetes.io/g/wg-batch) or on
[Slack](https://kubernetes.slack.com/messages/wg-batch).

## Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this
done, from testing and filing bugs to reviewing code.

We would not have been able to achieve either of these features without Aldo
Culquicondor (Google) providing excellent domain knowledge and expertise
throughout the Kubernetes ecosystem.