Commit 6838c7e

[zh-cn] sync controllers/job.md
Signed-off-by: xin.li <[email protected]>
1 parent 9764869 commit 6838c7e

File tree

  • content/zh-cn/docs/concepts/workloads/controllers

1 file changed: +167 -37 lines changed

content/zh-cn/docs/concepts/workloads/controllers/job.md

Lines changed: 167 additions & 37 deletions
@@ -573,9 +573,7 @@ multiple pods running at once. Therefore, your pods must also be tolerant of con
 For this reason, your Pods must also be tolerant of concurrency.

 <!--
-When the [feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
-`PodDisruptionConditions` and `JobPodFailurePolicy` are both enabled,
-and the `.spec.podFailurePolicy` field is set, the Job controller does not consider a terminating
+If you specify the `.spec.podFailurePolicy` field, the Job controller does not consider a terminating
 Pod (a pod that has a `.metadata.deletionTimestamp` field set) as a failure until that Pod is
 terminal (its `.status.phase` is `Failed` or `Succeeded`). However, the Job controller
 creates a replacement Pod as soon as the termination becomes apparent. Once the
@@ -586,8 +584,7 @@ If either of these requirements is not satisfied, the Job controller counts
 a terminating Pod as an immediate failure, even if that Pod later terminates
 with `phase: "Succeeded"`.
 -->
-When the [feature gates](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/)
-`PodDisruptionConditions` and `JobPodFailurePolicy` are both enabled and the `.spec.podFailurePolicy` field is set,
+If you specify the `.spec.podFailurePolicy` field,
 the Job controller does not consider a terminating Pod (a Pod that has the `.metadata.deletionTimestamp` field set) as a failed Pod
 until that Pod is fully terminated (its `.status.phase` is `Failed` or `Succeeded`).
 However, as soon as the termination becomes apparent, the Job controller creates a replacement Pod. Once the Pod terminates, the Job controller counts this just-terminated
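
To picture the in-between state this hunk describes, a minimal sketch (the Pod name and timestamp are hypothetical, not from this commit): a terminating Pod has `.metadata.deletionTimestamp` set while `.status.phase` is still non-terminal, so a Job with `.spec.podFailurePolicy` set does not yet count it as failed, even though a replacement Pod may already exist.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myjob-xk2v7                          # hypothetical Pod of a Job
  deletionTimestamp: "2024-01-01T00:00:00Z"  # deletion requested: terminating
status:
  phase: Running   # not yet Failed or Succeeded, so not yet counted as a failure
```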
@@ -741,45 +738,43 @@ kubectl get -o yaml job job-backoff-limit-per-index-example
   succeeded: 5 # 1 succeeded Pod for each of the 5 succeeded indexes
   failed: 10   # 2 failed Pods (1 retry) for each of the 5 failed indexes
   conditions:
+  - message: Job has failed indexes
+    reason: FailedIndexes
+    status: "True"
+    type: FailureTarget
   - message: Job has failed indexes
     reason: FailedIndexes
     status: "True"
     type: Failed
 ```

+<!--
+The Job controller adds the `FailureTarget` Job condition to trigger
+[Job termination and cleanup](#job-termination-and-cleanup). When all of the
+Job Pods are terminated, the Job controller adds the `Failed` condition
+with the same values for `reason` and `message` as the `FailureTarget` Job
+condition. For details, see [Termination of Job Pods](#termination-of-job-pods).
+-->
+The Job controller adds the `FailureTarget` Job condition to trigger [Job termination and cleanup](#job-termination-and-cleanup).
+When all of the Job Pods are terminated, the Job controller adds the `Failed` condition
+with the same values for `reason` and `message` as the `FailureTarget` Job condition.
+For details, see [Termination of Job Pods](#termination-of-job-pods).
+
 <!--
 Additionally, you may want to use the per-index backoff along with a
 [pod failure policy](#pod-failure-policy). When using
 per-index backoff, there is a new `FailIndex` action available which allows you to
 avoid unnecessary retries within an index.
 -->
-Additionally, you may want to use the per-index backoff along with a [Pod failure policy (失败策略)](#pod-failure-policy).
+Additionally, you may want to use the per-index backoff along with a [Pod failure policy (失效策略)](#pod-failure-policy).
 When using per-index backoff, a new `FailIndex` action is available, which lets you avoid unnecessary retries within an index.

 <!--
 ### Pod failure policy {#pod-failure-policy}
 -->
 ### Pod failure policy {#pod-failure-policy}

-{{< feature-state for_k8s_version="v1.26" state="beta" >}}
-
-{{< note >}}
-<!--
-You can only configure a Pod failure policy for a Job if you have the
-`JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
-enabled in your cluster. Additionally, it is recommended
-to enable the `PodDisruptionConditions` feature gate in order to be able to detect and handle
-Pod disruption conditions in the Pod failure policy (see also:
-[Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)).
-Both feature gates are available in Kubernetes {{< skew currentVersion >}}.
--->
-You can only configure a Pod failure policy for a Job if you have the
-`JobPodFailurePolicy` [feature gate](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/)
-enabled in your cluster.
-Additionally, it is recommended to enable the `PodDisruptionConditions` feature gate in order to detect and handle
-Pod disruption conditions in the Pod failure policy
-(see also: [Pod disruption conditions](/zh-cn/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)).
-Both feature gates are available in Kubernetes {{< skew currentVersion >}}.
-{{< /note >}}
+{{< feature-state feature_gate_name="JobPodFailurePolicy" >}}

 <!--
 A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
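
As a sketch of the `FailIndex` action discussed above (the Job name, image, and exit code are illustrative assumptions, not taken from this commit), an Indexed Job can combine a per-index backoff limit with a Pod failure policy so that a particular exit code fails the index immediately instead of consuming its retries:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: per-index-failindex-example   # hypothetical name
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed
  backoffLimitPerIndex: 1             # ordinary failures retry once per index
  podFailurePolicy:
    rules:
    - action: FailIndex               # exit code 42 fails the index immediately,
      onExitCodes:                    # skipping the remaining per-index retries
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never            # required when using podFailurePolicy
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash", "-c", "exit 42"]
```

An index failed this way counts toward `spec.maxFailedIndexes` just like an index that exhausts its per-index backoff.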
@@ -948,11 +943,22 @@ Starting with Kubernetes v1.28, when Pod failure policy is used, the Job control
 terminating Pods only once these Pods reach the terminal `Failed` phase. This behavior is similar
 to `podReplacementPolicy: Failed`. For more information, see [Pod replacement policy](#pod-replacement-policy).
 -->
-Starting with Kubernetes v1.28, when a Pod failure policy (失败策略) is used, the Job controller recreates terminating Pods
+Starting with Kubernetes v1.28, when a Pod failure policy (失效策略) is used, the Job controller recreates terminating Pods
 only once these Pods reach the terminal `Failed` phase. This behavior is similar to `podReplacementPolicy: Failed`.
 For details, see [Pod replacement policy](#pod-replacement-policy).
 {{< /note >}}

+<!--
+When you use the `podFailurePolicy`, and the Job fails due to the pod
+matching the rule with the `FailJob` action, then the Job controller triggers
+the Job termination process by adding the `FailureTarget` condition.
+For more details, see [Job termination and cleanup](#job-termination-and-cleanup).
+-->
+When you use the `podFailurePolicy` and a Pod fails because it matches a rule with the
+`FailJob` action, the Job controller triggers the Job termination process by adding the
+`FailureTarget` condition.
+For more details, see [Job termination and cleanup](#job-termination-and-cleanup).
+
 <!--
 ## Success policy {#success-policy}
 -->
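
A sketch of the `FailJob` interaction described above (the Job name, image, and exit code are illustrative assumptions): when a Pod matches the first rule, the controller adds `FailureTarget` and begins terminating the Job, while the second rule keeps disruptions from counting against `backoffLimit`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: failjob-example          # hypothetical name
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob            # a match adds FailureTarget, then the Job fails
      onExitCodes:
        containerName: main      # optional: restrict the rule to one container
        operator: In
        values: [42]
    - action: Ignore             # disruptions do not count against backoffLimit
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never       # required when using podFailurePolicy
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash", "-c", "exit 42"]
```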
@@ -1036,15 +1042,15 @@ Here is a manifest for a Job with `successPolicy`:
 In the example above, both `succeededIndexes` and `succeededCount` have been specified.
 Therefore, the job controller will mark the Job as succeeded and terminate the lingering Pods
 when either of the specified indexes, 0, 2, or 3, succeed.
-The Job that meets the success policy gets the `SuccessCriteriaMet` condition.
+The Job that meets the success policy gets the `SuccessCriteriaMet` condition with a `SuccessPolicy` reason.
 After the removal of the lingering Pods is issued, the Job gets the `Complete` condition.

 Note that the `succeededIndexes` is represented as intervals separated by a hyphen.
 The numbers are listed in intervals represented by the first and last element of the series, separated by a hyphen.
 -->
 In the example above, both `succeededIndexes` and `succeededCount` have been specified.
 Therefore, when any of the specified indexes 0, 2, or 3 succeeds, the Job controller marks the Job as succeeded and terminates the lingering Pods.
-A Job that meets the success policy gets the `SuccessCriteriaMet` condition.
+A Job that meets the success policy gets the `SuccessCriteriaMet` condition, with `SuccessPolicy` as the condition's reason.
 After the lingering Pods are removed, the Job gets the `Complete` condition.

 Note that `succeededIndexes` is represented as sequences of numbers separated by hyphens.
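
For reference, a minimal sketch of a `spec.successPolicy` matching the prose above, where any one of indexes 0, 2, or 3 succeeding completes the Job (the Job name and image are hypothetical; the field values follow the example's indexes):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: success-policy-example   # hypothetical name
spec:
  completions: 5
  parallelism: 5
  completionMode: Indexed        # successPolicy requires an Indexed Job
  successPolicy:
    rules:
    - succeededIndexes: "0,2-3"  # intervals: single index 0, plus range 2-3
      succeededCount: 1          # any one of these indexes succeeding is enough
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash", "-c", "exit 0"]
```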
@@ -1152,6 +1158,132 @@ and `.spec.backoffLimit` result in a permanent Job failure that requires manual
 In other words, the Job termination mechanisms triggered by `.spec.activeDeadlineSeconds` and `.spec.backoffLimit`
 both result in permanent Job failure, and such states require manual intervention to resolve.

+<!--
+### Terminal Job conditions
+
+A Job has two possible terminal states, each of which has a corresponding Job
+condition:
+* Succeeded: Job condition `Complete`
+* Failed: Job condition `Failed`
+-->
+### Terminal Job conditions {#terminal-job-conditions}
+
+A Job has two possible terminal states, each of which has a corresponding Job condition:
+
+* Succeeded: Job condition `Complete`
+* Failed: Job condition `Failed`
+
+<!--
+Jobs fail for the following reasons:
+- The number of Pod failures exceeded the specified `.spec.backoffLimit` in the Job
+  specification. For details, see [Pod backoff failure policy](#pod-backoff-failure-policy).
+- The Job runtime exceeded the specified `.spec.activeDeadlineSeconds`
+- An indexed Job that used `.spec.backoffLimitPerIndex` has failed indexes.
+  For details, see [Backoff limit per index](#backoff-limit-per-index).
+- The number of failed indexes in the Job exceeded the specified
+  `spec.maxFailedIndexes`. For details, see [Backoff limit per index](#backoff-limit-per-index)
+- A failed Pod matches a rule in `.spec.podFailurePolicy` that has the `FailJob`
+  action. For details about how Pod failure policy rules might affect failure
+  evaluation, see [Pod failure policy](#pod-failure-policy).
+-->
+A Job fails for the following reasons:
+
+- The number of Pod failures exceeded the `.spec.backoffLimit` specified in the Job spec.
+  For details, see [Pod backoff failure policy](#pod-backoff-failure-policy).
+- The Job runtime exceeded the specified `.spec.activeDeadlineSeconds`.
+- An Indexed Job that uses `.spec.backoffLimitPerIndex` has failed indexes.
+  For details, see [Backoff limit per index](#backoff-limit-per-index).
+- The number of failed indexes in the Job exceeded the specified `spec.maxFailedIndexes`.
+  For details, see [Backoff limit per index](#backoff-limit-per-index).
+- A failed Pod matches a rule defined in `.spec.podFailurePolicy` whose action is `FailJob`.
+  For details about how Pod failure policy rules might affect failure evaluation, see [Pod failure policy](#pod-failure-policy).
+
+<!--
+Jobs succeed for the following reasons:
+- The number of succeeded Pods reached the specified `.spec.completions`
+- The criteria specified in `.spec.successPolicy` are met. For details, see
+  [Success policy](#success-policy).
+-->
+A Job succeeds for the following reasons:
+
+- The number of succeeded Pods reached the specified `.spec.completions`.
+- The criteria specified in `.spec.successPolicy` are met. For details, see [Success policy](#success-policy).
+
+<!--
+In Kubernetes v1.31 and later the Job controller delays the addition of the
+terminal conditions,`Failed` or `Complete`, until all of the Job Pods are terminated.
+
+In Kubernetes v1.30 and earlier, the Job controller added the `Complete` or the
+`Failed` Job terminal conditions as soon as the Job termination process was
+triggered and all Pod finalizers were removed. However, some Pods would still
+be running or terminating at the moment that the terminal condition was added.
+-->
+In Kubernetes v1.31 and later, the Job controller delays the addition of the terminal conditions, `Failed` or
+`Complete`, until all of the Job Pods are terminated.
+
+In Kubernetes v1.30 and earlier, the Job controller added the `Complete` or `Failed`
+terminal condition as soon as the Job termination process was triggered and all Pod finalizers were removed.
+However, some Pods would still be running or terminating at the moment that the terminal condition was added.
+
+<!--
+In Kubernetes v1.31 and later, the controller only adds the Job terminal conditions
+_after_ all of the Pods are terminated. You can enable this behavior by using the
+`JobManagedBy` or the `JobPodReplacementPolicy` (enabled by default)
+[feature gates](/docs/reference/command-line-tools-reference/feature-gates/).
+-->
+In Kubernetes v1.31 and later, the controller only adds the Job terminal conditions after all of the Pods are terminated.
+You can enable this behavior by using the `JobManagedBy` or the `JobPodReplacementPolicy` (enabled by default)
+[feature gates](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/).
+
+<!--
+### Termination of Job pods
+
+The Job controller adds the `FailureTarget` condition or the `SuccessCriteriaMet`
+condition to the Job to trigger Pod termination after a Job meets either the
+success or failure criteria.
+-->
+### Termination of Job Pods
+
+The Job controller adds the `FailureTarget` condition or the `SuccessCriteriaMet` condition to the
+Job to trigger Pod termination after the Job meets either the success or failure criteria.
+
+<!--
+Factors like `terminationGracePeriodSeconds` might increase the amount of time
+from the moment that the Job controller adds the `FailureTarget` condition or the
+`SuccessCriteriaMet` condition to the moment that all of the Job Pods terminate
+and the Job controller adds a [terminal condition](#terminal-job-conditions)
+(`Failed` or `Complete`).
+
+You can use the `FailureTarget` or the `SuccessCriteriaMet` condition to evaluate
+whether the Job has failed or succeeded without having to wait for the controller
+to add a terminal condition.
+-->
+Factors like `terminationGracePeriodSeconds` might increase the amount of time from the moment the
+Job controller adds the `FailureTarget` condition or the `SuccessCriteriaMet` condition to the moment all of the
+Job Pods terminate and the Job controller adds a [terminal condition](#terminal-job-conditions) (`Failed` or `Complete`).
+
+You can use the `FailureTarget` or the `SuccessCriteriaMet`
+condition to evaluate whether the Job has failed or succeeded without having to wait for the controller to add a terminal condition.
+
+<!--
+For example, you might want to decide when to create a replacement Job
+that replaces a failed Job. If you replace the failed Job when the `FailureTarget`
+condition appears, your replacement Job runs sooner, but could result in Pods
+from the failed and the replacement Job running at the same time, using
+extra compute resources.
+
+Alternatively, if your cluster has limited resource capacity, you could choose to
+wait until the `Failed` condition appears on the Job, which would delay your
+replacement Job but would ensure that you conserve resources by waiting
+until all of the failed Pods are removed.
+-->
+For example, you might want to decide when to create a Job that replaces a failed Job.
+If you replace the failed Job when the `FailureTarget` condition appears, the replacement Job starts sooner,
+but may result in Pods from the failed Job and the replacement Job running at the same time, consuming extra compute resources.
+
+Alternatively, if your cluster has limited resource capacity, you can choose to wait until the `Failed` condition appears on the Job,
+which delays the start of the replacement Job but conserves resources by waiting until all of the failed Pods are removed.
+
 <!--
 ## Clean up finished jobs automatically

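
To make the trade-off above concrete, a hedged sketch of the intermediate state (the condition values reuse the `FailedIndexes` example from earlier in this diff; real output depends on the Job). A watcher that replaces Jobs eagerly can block on `kubectl wait --for=condition=FailureTarget job/<name>`, while a resource-conscious one waits for `--for=condition=Failed`:

```yaml
status:
  conditions:
  - type: FailureTarget      # failure criteria met; Pods may still be terminating
    status: "True"
    reason: FailedIndexes
    message: Job has failed indexes
  # once all Pods terminate, a Failed condition with the same reason and
  # message is appended (the v1.31+ behavior described above)
```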
@@ -1734,22 +1866,20 @@ observe that pods from a Job are stuck with the tracking finalizer.
 -->
 ### Elastic Indexed Jobs {#elastic-indexed-jobs}

-{{< feature-state for_k8s_version="v1.27" state="beta" >}}
+{{< feature-state feature_gate_name="ElasticIndexedJob" >}}

 <!--
 You can scale Indexed Jobs up or down by mutating both `.spec.parallelism`
 and `.spec.completions` together such that `.spec.parallelism == .spec.completions`.
-When the `ElasticIndexedJob` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
-on the [API server](/docs/reference/command-line-tools-reference/kube-apiserver/)
-is disabled, `.spec.completions` is immutable.
+When scaling down, Kubernetes removes the Pods with higher indexes.

 Use cases for elastic Indexed Jobs include batch workloads which require
 scaling an indexed Job, such as MPI, Horovod, Ray, and PyTorch training jobs.
 -->
 You can scale Indexed Jobs up or down by mutating both `.spec.parallelism` and `.spec.completions` together,
 such that `.spec.parallelism == .spec.completions`.
-When the `ElasticIndexedJob` feature gate on the
-[API server](/zh-cn/docs/reference/command-line-tools-reference/kube-apiserver/) is disabled, `.spec.completions` is immutable.
+When scaling down, Kubernetes removes the Pods with higher indexes.
+
 Use cases for elastic Indexed Jobs include batch workloads that require scaling an Indexed Job,
 such as MPI, Horovod, Ray, and PyTorch training jobs.

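
As a sketch of the scaling rule above (the counts and Job name are hypothetical), scaling an elastic Indexed Job means mutating both fields in a single update, for example `kubectl patch job example-training --type=merge -p '{"spec":{"parallelism":5,"completions":5}}'`, keeping the two values equal:

```yaml
# Elastic Indexed Job scaled from 3 to 5 workers (hypothetical values).
# Both fields change together; scaling back down removes the Pods
# with the highest indexes first, as noted above.
spec:
  completionMode: Indexed
  parallelism: 5   # was 3
  completions: 5   # must remain equal to .spec.parallelism
```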
@@ -1795,11 +1925,11 @@ See [Pod failure policy](#pod-failure-policy) to learn more about Pod failure po
 -->
 You can choose to create replacement Pods only when the terminating Pod is fully terminated (has `status.phase: Failed`).
 To do this, set `.spec.podReplacementPolicy: Failed`.
-The default replacement policy depends on whether the Job has `podFailurePolicy` set. For a Job without a Pod failure policy (失败策略) defined,
+The default replacement policy depends on whether the Job has `podFailurePolicy` set. For a Job without a Pod failure policy (失效策略) defined,
 omitting the `podReplacementPolicy` field is equivalent to choosing the `TerminatingOrFailed` replacement policy:
 the control plane creates a replacement Pod immediately upon Pod deletion (as soon as the control plane sees that a Pod of this Job has `deletionTimestamp` set).
-For a Job with a Pod failure policy (失败策略) set, the default `podReplacementPolicy` is `Failed`, and no other value is permitted.
-See [Pod failure policy (失败策略)](#pod-failure-policy) to learn more about Pod failure policies for Jobs.
+For a Job with a Pod failure policy (失效策略) set, the default `podReplacementPolicy` is `Failed`, and no other value is permitted.
+See [Pod failure policy (失效策略)](#pod-failure-policy) to learn more about Pod failure policies for Jobs.

 ```yaml
 kind: Job
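A hedged sketch of how the replacement policy shows up at runtime (the counts are hypothetical): while Pods are terminating under `podReplacementPolicy: Failed`, they are reported in the Job's `.status.terminating` field, which you can read with `kubectl get job <name> -o jsonpath='{.status.terminating}'`:

```yaml
status:
  active: 2        # hypothetical counts
  terminating: 1   # a replacement is created only after this Pod reaches Failed
```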