---
layout: blog
title: "Kubernetes 1.28:Job 失效处理的改进"
date: 2023-08-21
slug: kubernetes-1-28-jobapi-update
---

<!--
layout: blog
title: "Kubernetes 1.28: Improved failure handling for Jobs"
date: 2023-08-21
slug: kubernetes-1-28-jobapi-update
-->

<!--
**Authors:** Kevin Hannon (G-Research), Michał Woźniak (Google)
-->
**作者:** Kevin Hannon (G-Research), Michał Woźniak (Google)

**译者:** Xin Li (Daocloud)

<!--
This blog discusses two new features in Kubernetes 1.28 to improve Jobs for batch
users: [Pod replacement policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy)
and [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index).
-->
本博客讨论 Kubernetes 1.28 中的两个新特性,用于为批处理用户改进 Job:
[Pod 更换策略](/zh-cn/docs/concepts/workloads/controllers/job/#pod-replacement-policy)
和[逐索引的回退限制](/zh-cn/docs/concepts/workloads/controllers/job/#backoff-limit-per-index)。

<!--
These features continue the effort started by the
[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
to improve the handling of Pod failures in a Job.
-->
这些特性延续了以 [Pod 失效策略](/zh-cn/docs/concepts/workloads/controllers/job/#pod-failure-policy)
为开端的工作,旨在改进对 Job 中 Pod 失效情况的处理。

<!--
## Pod replacement policy {#pod-replacement-policy}

By default, when a pod enters a terminating state (e.g. due to preemption or
eviction), Kubernetes immediately creates a replacement Pod. Therefore, both Pods are running
at the same time. In API terms, a pod is considered terminating when it has a
`deletionTimestamp` and it has a phase `Pending` or `Running`.
-->
## Pod 更换策略 {#pod-replacement-policy}

默认情况下,当 Pod 进入终止(Terminating)状态(例如由于抢占或驱逐机制)时,Kubernetes
会立即创建一个替换的 Pod,因此这时会有两个 Pod 同时运行。就 API 而言,当 Pod 具有
`deletionTimestamp` 字段并且处于 `Pending` 或 `Running` 阶段时,会被视为正在终止。
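
例如,你可以用如下命令同时查看某个 Pod 的 `deletionTimestamp` 和所处阶段,
从而判断它是否正处于终止过程中(其中的 Pod 名称 `my-pod` 仅为示意):

```shell
# 输出 deletionTimestamp(非空则说明 Pod 正在被删除)以及当前的 phase
kubectl get pod my-pod -o jsonpath='{.metadata.deletionTimestamp}{" "}{.status.phase}{"\n"}'
```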

<!--
The scenario when two Pods are running at a given time is problematic for
some popular machine learning frameworks, such as
TensorFlow and [JAX](https://jax.readthedocs.io/en/latest/), which require at most one Pod running at the same time,
for a given index.
Tensorflow gives the following error if two pods are running for a given index.
-->
对于一些流行的机器学习框架来说,在给定时间同时运行两个 Pod 是有问题的,
例如 TensorFlow 和 [JAX](https://jax.readthedocs.io/en/latest/),
这些框架要求对于给定的索引,同一时间最多只能有一个 Pod 在运行。
如果针对同一个索引有两个 Pod 在运行,TensorFlow 会报告以下错误:

```
 /job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4
```

<!--
See more details in the ([issue](https://github.com/kubernetes/kubernetes/issues/115844)).

Creating the replacement Pod before the previous one fully terminates can also
cause problems in clusters with scarce resources or with tight budgets, such as:
* cluster resources can be difficult to obtain for Pods pending to be scheduled,
  as Kubernetes might take a long time to find available nodes until the existing
  Pods are fully terminated.
* if cluster autoscaler is enabled, the replacement Pods might produce undesired
  scale ups.
-->
可参考[问题报告](https://github.com/kubernetes/kubernetes/issues/115844)进一步了解细节。

在前一个 Pod 完全终止之前创建替换的 Pod,也可能给资源或预算紧张的集群带来问题,例如:

* 待调度的 Pod 可能难以获得集群资源,因为在现有 Pod 完全终止之前,
  Kubernetes 可能需要很长时间才能找到可用节点。
* 如果启用了集群自动扩缩器(Cluster Autoscaler),替换 Pod 可能会触发不必要的集群扩容。

<!--
### How can you use it? {#pod-replacement-policy-how-to-use}

This is an alpha feature, which you can enable by turning on `JobPodReplacementPolicy`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) in
your cluster.

Once the feature is enabled in your cluster, you can use it by creating a new Job that specifies a
`podReplacementPolicy` field as shown here:
-->
### 如何使用? {#pod-replacement-policy-how-to-use}

这是一项 Alpha 级别特性,你可以通过在集群中启用 `JobPodReplacementPolicy`
[特性门控](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/)
来启用该特性。

在集群中启用此特性后,你可以创建一个新的 Job 并指定 `podReplacementPolicy` 字段,如下所示:

```yaml
kind: Job
metadata:
  name: new
  ...
spec:
  podReplacementPolicy: Failed
  ...
```

<!--
In that Job, the Pods would only be replaced once they reached the `Failed` phase,
and not when they are terminating.

Additionally, you can inspect the `.status.terminating` field of a Job. The value
of the field is the number of Pods owned by the Job that are currently terminating.
-->
在此 Job 中,Pod 仅在达到 `Failed` 阶段时才会被替换,而不是在它们处于终止过程中(Terminating)时被替换。

此外,你可以查看 Job 的 `.status.terminating` 字段。该字段的值是该 Job
所拥有的、当前正处于终止过程中的 Pod 的数量。

```shell
kubectl get jobs/myjob -o=jsonpath='{.status.terminating}'
```

```
3 # three Pods are terminating and have not yet reached the Failed phase
```

<!--
This can be particularly useful for external queueing controllers, such as
[Kueue](https://github.com/kubernetes-sigs/kueue), that tracks quota
from running Pods of a Job until the resources are reclaimed from
the currently terminating Job.

Note that the `podReplacementPolicy: Failed` is the default when using a custom
[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy).
-->
这一特性对于外部排队控制器(例如 [Kueue](https://github.com/kubernetes-sigs/kueue))特别有用:
这类控制器会跟踪 Job 中正在运行的 Pod 所占用的配额,直到当前处于终止过程中的 Job 的资源被回收为止。
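
下面是一个非常简化的轮询脚本草案(其中的 Job 名称 `myjob` 仅为示意),
演示外部控制器或脚本可以如何等待该字段归零后再复用对应的资源:

```shell
# 极简示意:等待 myjob 不再有处于终止过程中的 Pod
while true; do
  t=$(kubectl get job myjob -o jsonpath='{.status.terminating}')
  [ "${t:-0}" = "0" ] && break
  sleep 5
done
```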

请注意,使用自定义 [Pod 失效策略](/zh-cn/docs/concepts/workloads/controllers/job/#pod-failure-policy)时,
`podReplacementPolicy: Failed` 是默认值。
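
下面给出一个带有 Pod 失效策略的 Job 的最小示例草案(其中的名称、镜像和退出码均为假设),
并显式写出了默认的 `podReplacementPolicy: Failed`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-pod-failure-policy   # 名称仅为示意
spec:
  podReplacementPolicy: Failed        # 使用 podFailurePolicy 时的默认值,这里显式写出
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob                 # 容器以退出码 42 失败时,直接判定整个 Job 失败
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never            # 使用 podFailurePolicy 时必须设置为 Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash", "-c", "exit 42"]
```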

<!--
## Backoff limit per index {#backoff-limit-per-index}

By default, Pod failures for [Indexed Jobs](/docs/concepts/workloads/controllers/job/#completion-mode)
are counted towards the global limit of retries, represented by `.spec.backoffLimit`.
This means, that if there is a consistently failing index, it is restarted
repeatedly until it exhausts the limit. Once the limit is reached the entire
Job is marked failed and some indexes may never be even started.
-->
## 逐索引的回退限制 {#backoff-limit-per-index}

默认情况下,[带索引的 Job(Indexed Job)](/zh-cn/docs/concepts/workloads/controllers/job/#completion-mode)的
Pod 失败会被统计并计入由 `.spec.backoffLimit` 字段所设置的全局重试次数限制。
这意味着,如果某个索引对应的 Pod 持续失败,它会被反复重启,直到耗尽这一全局限制。
一旦达到限制值,整个 Job 将被标记为失败,而对应某些索引的 Pod 甚至可能从未被启动。

<!--
This is problematic for use cases where you want to handle Pod failures for
every index independently. For example, if you use Indexed Jobs for running
integration tests where each index corresponds to a testing suite. In that case,
you may want to account for possible flake tests allowing for 1 or 2 retries per
suite. There might be some buggy suites, making the corresponding
indexes fail consistently. In that case you may prefer to limit retries for
the buggy suites, yet allowing other suites to complete.
-->
对于希望针对每个索引独立处理 Pod 失败的场景而言,这是有问题的。
例如,你可能使用带索引的 Job(Indexed Job)来运行集成测试,其中每个索引值对应一个测试套件。
在这种情况下,你可能需要考虑可能出现的不稳定测试(Flake Test),允许每个套件重试 1 到 2 次。
同时也可能存在一些有缺陷的套件,导致对应索引的 Pod 始终失败。此时,
你或许更希望限制有问题的套件的重试次数,同时允许其他套件完成运行。

<!--
The feature allows you to:
* complete execution of all indexes, despite some indexes failing.
* better utilize the computational resources by avoiding unnecessary retries of consistently failing indexes.
-->
此特性允许你:
* 即使某些索引对应的 Pod 失败,也能完成所有索引的执行。
* 通过避免对持续失败的索引进行不必要的重试,更好地利用计算资源。

<!--
### How can you use it? {#backoff-limit-per-index-how-to-use}

This is an alpha feature, which you can enable by turning on the
`JobBackoffLimitPerIndex`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
in your cluster.

Once the feature is enabled in your cluster, you can create an Indexed Job with the
`.spec.backoffLimitPerIndex` field specified.
-->
### 可以如何使用它? {#backoff-limit-per-index-how-to-use}

这是一个 Alpha 特性,你可以通过在集群中启用 `JobBackoffLimitPerIndex`
[特性门控](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/)来启用此特性。

在集群中启用该特性后,你可以在创建带索引的 Job(Indexed Job)时指定 `.spec.backoffLimitPerIndex` 字段。
| 206 | + |
| 207 | +<!-- |
| 208 | +#### Example |
| 209 | +
|
| 210 | +The following example demonstrates how to use this feature to make sure the |
| 211 | +Job executes all indexes (provided there is no other reason for the early Job |
| 212 | +termination, such as reaching the `activeDeadlineSeconds` timeout, or being |
| 213 | +manually deleted by the user), and the number of failures is controlled per index. |
| 214 | +--> |
#### 示例

下面的示例演示如何使用此功能来确保 Job 执行所有索引值的 Pod(前提是没有其他原因导致 Job 提前终止,
例如达到 `activeDeadlineSeconds` 超时,或者被用户手动删除),以及按索引控制失败次数。

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-backoff-limit-per-index-execute-all
spec:
  completions: 8
  parallelism: 2
  completionMode: Indexed
  backoffLimitPerIndex: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: example # 当此示例容器作为任何 Job 中的第二个或第三个索引运行时(即使在重试之后),它会返回错误并失败
        image: python
        command:
        - python3
        - -c
        - |
          import os, sys, time
          id = int(os.environ.get("JOB_COMPLETION_INDEX"))
          if id == 1 or id == 2:
            sys.exit(1)
          time.sleep(1)
```

<!--
Now, inspect the Pods after the job is finished:
-->
现在,在 Job 完成后检查 Pod:

```sh
kubectl get pods -l job-name=job-backoff-limit-per-index-execute-all
```

<!--
Returns output similar to this:
-->
返回的输出类似于:

```
NAME                                              READY   STATUS      RESTARTS   AGE
job-backoff-limit-per-index-execute-all-0-b26vc   0/1     Completed   0          49s
job-backoff-limit-per-index-execute-all-1-6j5gd   0/1     Error       0          49s
job-backoff-limit-per-index-execute-all-1-6wd82   0/1     Error       0          37s
job-backoff-limit-per-index-execute-all-2-c66hg   0/1     Error       0          32s
job-backoff-limit-per-index-execute-all-2-nf982   0/1     Error       0          43s
job-backoff-limit-per-index-execute-all-3-cxmhf   0/1     Completed   0          33s
job-backoff-limit-per-index-execute-all-4-9q6kq   0/1     Completed   0          28s
job-backoff-limit-per-index-execute-all-5-z9hqf   0/1     Completed   0          28s
job-backoff-limit-per-index-execute-all-6-tbkr8   0/1     Completed   0          23s
job-backoff-limit-per-index-execute-all-7-hxjsq   0/1     Completed   0          22s
```

<!--
Additionally, you can take a look at the status for that Job:
-->
此外,你可以查看该 Job 的状态:

```sh
kubectl get jobs job-backoff-limit-per-index-execute-all -o yaml
```

<!--
The output ends with a `status` similar to:
-->
输出内容以 `status` 结尾,类似于:

```yaml
status:
  completedIndexes: 0,3-7
  failedIndexes: 1,2
  succeeded: 6
  failed: 4
  conditions:
  - message: Job has failed indexes
    reason: FailedIndexes
    status: "True"
    type: Failed
```
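
如果只关心哪些索引完成、哪些索引失败,也可以用 jsonpath 直接提取这两个字段(以下命令仅为示意):

```sh
kubectl get jobs job-backoff-limit-per-index-execute-all \
  -o jsonpath='completed: {.status.completedIndexes}{"\n"}failed: {.status.failedIndexes}{"\n"}'
```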

<!--
Here, indexes `1` and `2` were both retried once. After the second failure,
in each of them, the specified `.spec.backoffLimitPerIndex` was exceeded, so
the retries were stopped. For comparison, if the per-index backoff was disabled,
then the buggy indexes would retry until the global `backoffLimit` was exceeded,
and then the entire Job would be marked failed, before some of the higher
indexes are started.
-->
这里,索引为 `1` 和 `2` 的 Pod 都被重试了一次。它们都在第二次失败后超出了指定的
`.spec.backoffLimitPerIndex`,因此不再重试。相比之下,如果禁用了逐索引的回退限制,
这些有问题的索引会被一直重试,直到超出全局 `backoffLimit` 为止;届时整个
Job 会被标记为失败,而某些索引值较大的 Pod 可能还没有机会启动。
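
作为对照,下面是一个大致等价、但只依赖全局 `backoffLimit` 的 Job 配置草案(字段取值仅作示意):
在这种配置下,索引 1 和 2 的反复失败会共同计入全局限制,一旦超出,整个 Job 即告失败。

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-global-backoff-only   # 名称仅为示意
spec:
  completions: 8
  parallelism: 2
  completionMode: Indexed
  backoffLimit: 6                 # 全局重试限制:所有索引的失败共同计入
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: example
        image: python
        command:
        - python3
        - -c
        - |
          import os, sys
          if int(os.environ["JOB_COMPLETION_INDEX"]) in (1, 2):
            sys.exit(1)
```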

<!--
## How can you learn more?

- Read the user-facing documentation for [Pod replacement policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy),
[Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index), and
[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
- Read the KEPs for [Pod Replacement Policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated),
[Backoff limit per index](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs), and
[Pod failure policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures).
-->
## 如何进一步了解 {#how-can-you-learn-more}

- 阅读面向用户的 [Pod 更换策略](/zh-cn/docs/concepts/workloads/controllers/job/#pod-replacement-policy)、
  [逐索引的回退限制](/zh-cn/docs/concepts/workloads/controllers/job/#backoff-limit-per-index)和
  [Pod 失效策略](/zh-cn/docs/concepts/workloads/controllers/job/#pod-failure-policy)文档
- 阅读 [Pod 更换策略](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated)、
  [逐索引的回退限制](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs)和
  [Pod 失效策略](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures)的 KEP。

<!--
## Getting Involved

These features were sponsored by [SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps). Batch use cases are actively
being improved for Kubernetes users in the
[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch).
Working groups are relatively short-lived initiatives focused on specific goals.
The goal of the WG Batch is to improve experience for batch workload users, offer support for
batch processing use cases, and enhance the
Job API for common use cases. If that interests you, please join the working
group either by subscribing to our
[mailing list](https://groups.google.com/a/kubernetes.io/g/wg-batch) or on
[Slack](https://kubernetes.slack.com/messages/wg-batch).
-->
## 参与其中 {#getting-involved}

这些特性由 [SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps) 赞助。
[批处理工作组](https://github.com/kubernetes/community/tree/master/wg-batch)正在为
Kubernetes 用户积极改进批处理场景。
工作组是专注于特定目标、存续时间相对较短的组织形式。WG Batch 的目标是改善批处理工作负载的用户体验、
提供对批处理场景的支持,并针对常见使用场景增强 Job API。
如果你对此感兴趣,请通过订阅我们的[邮件列表](https://groups.google.com/a/kubernetes.io/g/wg-batch)或通过
[Slack](https://kubernetes.slack.com/messages/wg-batch) 加入工作组。

<!--
## Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this
done, from testing and filing bugs to reviewing code.

We would not have been able to achieve either of these features without Aldo
Culquicondor (Google) providing excellent domain knowledge and expertise
throughout the Kubernetes ecosystem.
-->
## 致谢 {#acknowledgments}

与其他 Kubernetes 特性一样,从测试、报告缺陷到审查代码,很多人为这些特性做出了贡献。

如果没有 Aldo Culquicondor(Google)在整个 Kubernetes 生态系统中提供的出色领域知识和专业经验,
我们无法实现这两个特性。