---
layout: blog
title: "Kubernetes 1.28:Job 失效处理的改进"
date: 2023-08-21
slug: kubernetes-1-28-jobapi-update
---

<!--
layout: blog
title: "Kubernetes 1.28: Improved failure handling for Jobs"
date: 2023-08-21
slug: kubernetes-1-28-jobapi-update
-->

<!--
**Authors:** Kevin Hannon (G-Research), Michał Woźniak (Google)
-->
**作者:** Kevin Hannon (G-Research), Michał Woźniak (Google)

**译者:** Xin Li (Daocloud)

<!--
This blog discusses two new features in Kubernetes 1.28 to improve Jobs for batch
users: [Pod replacement policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy)
and [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index).
-->
本博客讨论 Kubernetes 1.28 中的两个新特性,用于为批处理用户改进 Job:
[Pod 更换策略](/zh-cn/docs/concepts/workloads/controllers/job/#pod-replacement-policy)
和[逐索引的回退限制](/zh-cn/docs/concepts/workloads/controllers/job/#backoff-limit-per-index)。

<!--
These features continue the effort started by the
[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
to improve the handling of Pod failures in a Job.
-->
这些特性延续了以 [Pod 失效策略](/zh-cn/docs/concepts/workloads/controllers/job/#pod-failure-policy)
为开端的工作,旨在改进对 Job 中 Pod 失效情况的处理。

<!--
## Pod replacement policy {#pod-replacement-policy}

By default, when a pod enters a terminating state (e.g. due to preemption or
eviction), Kubernetes immediately creates a replacement Pod. Therefore, both Pods are running
at the same time. In API terms, a pod is considered terminating when it has a
`deletionTimestamp` and it has a phase `Pending` or `Running`.
-->
## Pod 更换策略 {#pod-replacement-policy}

默认情况下,当 Pod 进入终止(Terminating)状态(例如由于抢占或驱逐机制)时,Kubernetes
会立即创建一个替换的 Pod,因此这时会有两个 Pod 同时运行。就 API 而言,当 Pod 具有
`deletionTimestamp` 字段并且处于 `Pending` 或 `Running` 阶段时,会被视为正在终止。
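
例如,你可以用如下命令同时查看某个 Pod 的 `deletionTimestamp` 和所处阶段,
从而判断它是否正处于终止过程中(其中的 Pod 名称 `my-pod` 仅为示意):

```shell
# 输出 deletionTimestamp(非空则说明 Pod 正在被删除)以及当前的 phase
kubectl get pod my-pod -o jsonpath='{.metadata.deletionTimestamp}{" "}{.status.phase}{"\n"}'
```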

<!--
The scenario when two Pods are running at a given time is problematic for
some popular machine learning frameworks, such as
TensorFlow and [JAX](https://jax.readthedocs.io/en/latest/), which require at most one Pod running at the same time,
for a given index.
Tensorflow gives the following error if two pods are running for a given index.
-->
对于一些流行的机器学习框架来说,在给定时间同时运行两个 Pod 是有问题的,
例如 TensorFlow 和 [JAX](https://jax.readthedocs.io/en/latest/),
这些框架要求对于给定的索引,同一时间最多只能有一个 Pod 在运行。
如果针对同一个索引有两个 Pod 在运行,TensorFlow 会报告以下错误:

```
 /job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4
```

<!--
See more details in the ([issue](https://github.com/kubernetes/kubernetes/issues/115844)).

Creating the replacement Pod before the previous one fully terminates can also
cause problems in clusters with scarce resources or with tight budgets, such as:
* cluster resources can be difficult to obtain for Pods pending to be scheduled,
  as Kubernetes might take a long time to find available nodes until the existing
  Pods are fully terminated.
* if cluster autoscaler is enabled, the replacement Pods might produce undesired
  scale ups.
-->
可参考[问题报告](https://github.com/kubernetes/kubernetes/issues/115844)进一步了解细节。

在前一个 Pod 完全终止之前创建替换的 Pod,也可能给资源或预算紧张的集群带来问题,例如:

* 待调度的 Pod 可能难以获得集群资源,因为在现有 Pod 完全终止之前,
  Kubernetes 可能需要很长时间才能找到可用节点。
* 如果启用了集群自动扩缩器(Cluster Autoscaler),替换 Pod 可能会触发不必要的集群扩容。

<!--
### How can you use it? {#pod-replacement-policy-how-to-use}

This is an alpha feature, which you can enable by turning on `JobPodReplacementPolicy`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) in
your cluster.

Once the feature is enabled in your cluster, you can use it by creating a new Job that specifies a
`podReplacementPolicy` field as shown here:
-->
### 如何使用? {#pod-replacement-policy-how-to-use}

这是一项 Alpha 级别特性,你可以通过在集群中启用 `JobPodReplacementPolicy`
[特性门控](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/)
来启用该特性。

在集群中启用此特性后,你可以创建一个新的 Job 并指定 `podReplacementPolicy` 字段,如下所示:

```yaml
kind: Job
metadata:
  name: new
  ...
spec:
  podReplacementPolicy: Failed
  ...
```

<!--
In that Job, the Pods would only be replaced once they reached the `Failed` phase,
and not when they are terminating.

Additionally, you can inspect the `.status.terminating` field of a Job. The value
of the field is the number of Pods owned by the Job that are currently terminating.
-->
在此 Job 中,Pod 仅在达到 `Failed` 阶段时才会被替换,而不是在它们处于终止过程中(Terminating)时被替换。

此外,你可以查看 Job 的 `.status.terminating` 字段。该字段的值是该 Job
所拥有的、当前正处于终止过程中的 Pod 的数量。

```shell
kubectl get jobs/myjob -o=jsonpath='{.status.terminating}'
```

```
3 # three Pods are terminating and have not yet reached the Failed phase
```

<!--
This can be particularly useful for external queueing controllers, such as
[Kueue](https://github.com/kubernetes-sigs/kueue), that tracks quota
from running Pods of a Job until the resources are reclaimed from
the currently terminating Job.

Note that the `podReplacementPolicy: Failed` is the default when using a custom
[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy).
-->
这一特性对于外部排队控制器(例如 [Kueue](https://github.com/kubernetes-sigs/kueue))特别有用:
这类控制器会跟踪 Job 中正在运行的 Pod 所占用的配额,直到当前处于终止过程中的 Job 的资源被回收为止。
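
下面是一个非常简化的轮询脚本草案(其中的 Job 名称 `myjob` 仅为示意),
演示外部控制器或脚本可以如何等待该字段归零后再复用对应的资源:

```shell
# 极简示意:等待 myjob 不再有处于终止过程中的 Pod
while true; do
  t=$(kubectl get job myjob -o jsonpath='{.status.terminating}')
  [ "${t:-0}" = "0" ] && break
  sleep 5
done
```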

请注意,使用自定义 [Pod 失效策略](/zh-cn/docs/concepts/workloads/controllers/job/#pod-failure-policy)时,
`podReplacementPolicy: Failed` 是默认值。
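
下面给出一个带有 Pod 失效策略的 Job 的最小示例草案(其中的名称、镜像和退出码均为假设),
并显式写出了默认的 `podReplacementPolicy: Failed`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-pod-failure-policy   # 名称仅为示意
spec:
  podReplacementPolicy: Failed        # 使用 podFailurePolicy 时的默认值,这里显式写出
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob                 # 容器以退出码 42 失败时,直接判定整个 Job 失败
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never            # 使用 podFailurePolicy 时必须设置为 Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash", "-c", "exit 42"]
```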

<!--
## Backoff limit per index {#backoff-limit-per-index}

By default, Pod failures for [Indexed Jobs](/docs/concepts/workloads/controllers/job/#completion-mode)
are counted towards the global limit of retries, represented by `.spec.backoffLimit`.
This means, that if there is a consistently failing index, it is restarted
repeatedly until it exhausts the limit. Once the limit is reached the entire
Job is marked failed and some indexes may never be even started.
-->
## 逐索引的回退限制 {#backoff-limit-per-index}

默认情况下,[带索引的 Job(Indexed Job)](/zh-cn/docs/concepts/workloads/controllers/job/#completion-mode)的
Pod 失败会被统计并计入由 `.spec.backoffLimit` 字段所设置的全局重试次数限制。
这意味着,如果某个索引对应的 Pod 持续失败,它会被反复重启,直到耗尽这一全局限制。
一旦达到限制值,整个 Job 将被标记为失败,而对应某些索引的 Pod 甚至可能从未被启动。

<!--
This is problematic for use cases where you want to handle Pod failures for
every index independently. For example, if you use Indexed Jobs for running
integration tests where each index corresponds to a testing suite. In that case,
you may want to account for possible flake tests allowing for 1 or 2 retries per
suite. There might be some buggy suites, making the corresponding
indexes fail consistently. In that case you may prefer to limit retries for
the buggy suites, yet allowing other suites to complete.
-->
对于希望针对每个索引独立处理 Pod 失败的场景而言,这是有问题的。
例如,你可能使用带索引的 Job(Indexed Job)来运行集成测试,其中每个索引值对应一个测试套件。
在这种情况下,你可能需要考虑可能出现的不稳定测试(Flake Test),允许每个套件重试 1 到 2 次。
同时也可能存在一些有缺陷的套件,导致对应索引的 Pod 始终失败。此时,
你或许更希望限制有问题的套件的重试次数,同时允许其他套件完成运行。

<!--
The feature allows you to:
* complete execution of all indexes, despite some indexes failing.
* better utilize the computational resources by avoiding unnecessary retries of consistently failing indexes.
-->
此特性允许你:
* 即使某些索引对应的 Pod 失败,也能完成所有索引的执行。
* 通过避免对持续失败的索引进行不必要的重试,更好地利用计算资源。

<!--
### How can you use it? {#backoff-limit-per-index-how-to-use}

This is an alpha feature, which you can enable by turning on the
`JobBackoffLimitPerIndex`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
in your cluster.

Once the feature is enabled in your cluster, you can create an Indexed Job with the
`.spec.backoffLimitPerIndex` field specified.
-->
### 可以如何使用它? {#backoff-limit-per-index-how-to-use}

这是一个 Alpha 特性,你可以通过在集群中启用 `JobBackoffLimitPerIndex`
[特性门控](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/)来启用此特性。

在集群中启用该特性后,你可以在创建带索引的 Job(Indexed Job)时指定 `.spec.backoffLimitPerIndex` 字段。
| 206 | + |
| 207 | +<!-- |
| 208 | +#### Example |
| 209 | +
|
| 210 | +The following example demonstrates how to use this feature to make sure the |
| 211 | +Job executes all indexes (provided there is no other reason for the early Job |
| 212 | +termination, such as reaching the `activeDeadlineSeconds` timeout, or being |
| 213 | +manually deleted by the user), and the number of failures is controlled per index. |
| 214 | +--> |
#### 示例

下面的示例演示如何使用此功能来确保 Job 执行所有索引值的 Pod(前提是没有其他原因导致 Job 提前终止,
例如达到 `activeDeadlineSeconds` 超时,或者被用户手动删除),以及按索引控制失败次数。

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-backoff-limit-per-index-execute-all
spec:
  completions: 8
  parallelism: 2
  completionMode: Indexed
  backoffLimitPerIndex: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: example # 当此示例容器作为任何 Job 中的第二个或第三个索引运行时(即使在重试之后),它会返回错误并失败
        image: python
        command:
        - python3
        - -c
        - |
          import os, sys, time
          id = int(os.environ.get("JOB_COMPLETION_INDEX"))
          if id == 1 or id == 2:
            sys.exit(1)
          time.sleep(1)
```

<!--
Now, inspect the Pods after the job is finished:
-->
现在,在 Job 完成后检查 Pod:

```sh
kubectl get pods -l job-name=job-backoff-limit-per-index-execute-all
```

<!--
Returns output similar to this:
-->
返回的输出类似于:

```
NAME                                              READY   STATUS      RESTARTS   AGE
job-backoff-limit-per-index-execute-all-0-b26vc   0/1     Completed   0          49s
job-backoff-limit-per-index-execute-all-1-6j5gd   0/1     Error       0          49s
job-backoff-limit-per-index-execute-all-1-6wd82   0/1     Error       0          37s
job-backoff-limit-per-index-execute-all-2-c66hg   0/1     Error       0          32s
job-backoff-limit-per-index-execute-all-2-nf982   0/1     Error       0          43s
job-backoff-limit-per-index-execute-all-3-cxmhf   0/1     Completed   0          33s
job-backoff-limit-per-index-execute-all-4-9q6kq   0/1     Completed   0          28s
job-backoff-limit-per-index-execute-all-5-z9hqf   0/1     Completed   0          28s
job-backoff-limit-per-index-execute-all-6-tbkr8   0/1     Completed   0          23s
job-backoff-limit-per-index-execute-all-7-hxjsq   0/1     Completed   0          22s
```

<!--
Additionally, you can take a look at the status for that Job:
-->
此外,你可以查看该 Job 的状态:

```sh
kubectl get jobs job-backoff-limit-per-index-execute-all -o yaml
```

<!--
The output ends with a `status` similar to:
-->
输出内容以 `status` 结尾,类似于:

```yaml
status:
  completedIndexes: 0,3-7
  failedIndexes: 1,2
  succeeded: 6
  failed: 4
  conditions:
  - message: Job has failed indexes
    reason: FailedIndexes
    status: "True"
    type: Failed
```
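
如果只关心哪些索引完成、哪些索引失败,也可以用 jsonpath 直接提取这两个字段(以下命令仅为示意):

```sh
kubectl get jobs job-backoff-limit-per-index-execute-all \
  -o jsonpath='completed: {.status.completedIndexes}{"\n"}failed: {.status.failedIndexes}{"\n"}'
```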

<!--
Here, indexes `1` and `2` were both retried once. After the second failure,
in each of them, the specified `.spec.backoffLimitPerIndex` was exceeded, so
the retries were stopped. For comparison, if the per-index backoff was disabled,
then the buggy indexes would retry until the global `backoffLimit` was exceeded,
and then the entire Job would be marked failed, before some of the higher
indexes are started.
-->
这里,索引为 `1` 和 `2` 的 Pod 都被重试了一次。它们都在第二次失败后超出了指定的
`.spec.backoffLimitPerIndex`,因此不再重试。相比之下,如果禁用了逐索引的回退限制,
这些有问题的索引会被一直重试,直到超出全局 `backoffLimit` 为止;届时整个
Job 会被标记为失败,而某些索引值较大的 Pod 可能还没有机会启动。
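
作为对照,下面是一个大致等价、但只依赖全局 `backoffLimit` 的 Job 配置草案(字段取值仅作示意):
在这种配置下,索引 1 和 2 的反复失败会共同计入全局限制,一旦超出,整个 Job 即告失败。

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-global-backoff-only   # 名称仅为示意
spec:
  completions: 8
  parallelism: 2
  completionMode: Indexed
  backoffLimit: 6                 # 全局重试限制:所有索引的失败共同计入
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: example
        image: python
        command:
        - python3
        - -c
        - |
          import os, sys
          if int(os.environ["JOB_COMPLETION_INDEX"]) in (1, 2):
            sys.exit(1)
```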

<!--
## How can you learn more?

- Read the user-facing documentation for [Pod replacement policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy),
[Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index), and
[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
- Read the KEPs for [Pod Replacement Policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated),
[Backoff limit per index](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs), and
[Pod failure policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures).
-->
## 如何进一步了解 {#how-can-you-learn-more}

- 阅读面向用户的 [Pod 更换策略](/zh-cn/docs/concepts/workloads/controllers/job/#pod-replacement-policy)、
  [逐索引的回退限制](/zh-cn/docs/concepts/workloads/controllers/job/#backoff-limit-per-index)和
  [Pod 失效策略](/zh-cn/docs/concepts/workloads/controllers/job/#pod-failure-policy)文档
- 阅读 [Pod 更换策略](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated)、
  [逐索引的回退限制](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs)和
  [Pod 失效策略](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures)的 KEP。

<!--
## Getting Involved

These features were sponsored by [SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps). Batch use cases are actively
being improved for Kubernetes users in the
[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch).
Working groups are relatively short-lived initiatives focused on specific goals.
The goal of the WG Batch is to improve experience for batch workload users, offer support for
batch processing use cases, and enhance the
Job API for common use cases. If that interests you, please join the working
group either by subscribing to our
[mailing list](https://groups.google.com/a/kubernetes.io/g/wg-batch) or on
[Slack](https://kubernetes.slack.com/messages/wg-batch).
-->
## 参与其中 {#getting-involved}

这些特性由 [SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps) 赞助。
[批处理工作组](https://github.com/kubernetes/community/tree/master/wg-batch)正在为
Kubernetes 用户积极改进批处理场景。
工作组是专注于特定目标、存续时间相对较短的组织形式。WG Batch 的目标是改善批处理工作负载的用户体验、
提供对批处理场景的支持,并针对常见使用场景增强 Job API。
如果你对此感兴趣,请通过订阅我们的[邮件列表](https://groups.google.com/a/kubernetes.io/g/wg-batch)或通过
[Slack](https://kubernetes.slack.com/messages/wg-batch) 加入工作组。

<!--
## Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this
done, from testing and filing bugs to reviewing code.

We would not have been able to achieve either of these features without Aldo
Culquicondor (Google) providing excellent domain knowledge and expertise
throughout the Kubernetes ecosystem.
-->
## 致谢 {#acknowledgments}

与其他 Kubernetes 特性一样,从测试、报告缺陷到审查代码,很多人为这些特性做出了贡献。

如果没有 Aldo Culquicondor(Google)在整个 Kubernetes 生态系统中提供的出色领域知识和专业经验,
我们无法实现这两个特性。