@@ -573,9 +573,7 @@ multiple pods running at once. Therefore, your pods must also be tolerant of con
为此,你的 Pod 也必须能够处理并发性问题。

<!--
- When the [feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
- `PodDisruptionConditions` and `JobPodFailurePolicy` are both enabled,
- and the `.spec.podFailurePolicy` field is set, the Job controller does not consider a terminating
+ If you specify the `.spec.podFailurePolicy` field, the Job controller does not consider a terminating
Pod (a pod that has a `.metadata.deletionTimestamp` field set) as a failure until that Pod is
terminal (its `.status.phase` is `Failed` or `Succeeded`). However, the Job controller
creates a replacement Pod as soon as the termination becomes apparent. Once the
@@ -586,8 +584,7 @@ If either of these requirements is not satisfied, the Job controller counts
a terminating Pod as an immediate failure, even if that Pod later terminates
with `phase: "Succeeded"`.
-->
- 当[特性门控](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/)
- `PodDisruptionConditions` 和 `JobPodFailurePolicy` 都被启用且 `.spec.podFailurePolicy` 字段被设置时,
+ 当你指定了 `.spec.podFailurePolicy` 字段,
Job 控制器不会将终止过程中的 Pod(已设置 `.metadata.deletionTimestamp` 字段的 Pod)视为失效 Pod,
直到该 Pod 完全终止(其 `.status.phase` 为 `Failed` 或 `Succeeded`)。
但只要终止变得显而易见,Job 控制器就会创建一个替代的 Pod。一旦 Pod 终止,Job 控制器将把这个刚终止的
@@ -741,45 +738,43 @@ kubectl get -o yaml job job-backoff-limit-per-index-example
  succeeded: 5    # 每 5 个成功的索引有 1 个成功的 Pod
  failed: 10      # 每 5 个失败的索引有 2 个失败的 Pod(1 次重试)
  conditions:
+ - message: Job has failed indexes
+   reason: FailedIndexes
+   status: "True"
+   type: FailureTarget
  - message: Job has failed indexes
    reason: FailedIndexes
    status: "True"
    type: Failed
```

+ <!--
+ The Job controller adds the `FailureTarget` Job condition to trigger
+ [Job termination and cleanup](#job-termination-and-cleanup). When all of the
+ Job Pods are terminated, the Job controller adds the `Failed` condition
+ with the same values for `reason` and `message` as the `FailureTarget` Job
+ condition. For details, see [Termination of Job Pods](#termination-of-job-pods).
+ -->
+ Job 控制器添加 `FailureTarget` Job 状况来触发 [Job 终止和清理](#job-termination-and-cleanup)。
+ 当所有 Job Pod 都终止时,Job 控制器会添加 `Failed` 状况,
+ 其 `reason` 和 `message` 的值与 `FailureTarget` Job 状况相同。
+ 有关详细信息,请参阅 [Job Pod 的终止](#termination-of-job-pods)。
+

<!--
Additionally, you may want to use the per-index backoff along with a
[pod failure policy](#pod-failure-policy). When using
per-index backoff, there is a new `FailIndex` action available which allows you to
avoid unnecessary retries within an index.
-->
- 此外,你可能想要结合使用逐索引回退与 [Pod 失败策略](#pod-failure-policy)。
+ 此外,你可能想要结合使用逐索引回退与 [Pod 失效策略](#pod-failure-policy)。
在使用逐索引回退时,有一个新的 `FailIndex` 操作可用,它让你避免就某个索引进行不必要的重试。
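
<!--
For example, the following sketch (with illustrative values) combines
`backoffLimitPerIndex` with a `FailIndex` rule, so that an index is marked
failed immediately, without further retries, when a container exits with code 42:
-->
例如,下面的示例(取值仅作演示)将 `backoffLimitPerIndex` 与 `FailIndex`
规则结合使用:当容器以退出码 42 退出时,对应索引会被立即判定为失败,不再重试:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-failindex-example   # 假设的名称,仅作演示
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed
  backoffLimitPerIndex: 1       # 每个索引最多允许 1 次失败
  template:
    spec:
      restartPolicy: Never      # 使用 podFailurePolicy 时必须为 Never
      containers:
      - name: example
        image: docker.io/library/bash:5
        command: ["bash", "-c", "exit 42"]
  podFailurePolicy:
    rules:
    - action: FailIndex         # 匹配时立即判定该索引失败,不再重试
      onExitCodes:
        operator: In
        values: [42]
```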

<!--
### Pod failure policy {#pod-failure-policy}
-->
### Pod 失效策略 {#pod-failure-policy}

- {{< feature-state for_k8s_version="v1.26" state="beta" >}}
765
-
766
- {{< note >}}
767
- <!--
768
- You can only configure a Pod failure policy for a Job if you have the
769
- ` JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
770
- enabled in your cluster. Additionally, it is recommended
771
- to enable the `PodDisruptionConditions` feature gate in order to be able to detect and handle
772
- Pod disruption conditions in the Pod failure policy (see also :
773
- [Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)).
774
- Both feature gates are available in Kubernetes {{< skew currentVersion >}}.
775
- -->
776
- 只有你在集群中启用了
777
- ` JobPodFailurePolicy` [特性门控](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/),
778
- 你才能为某个 Job 配置 Pod 失效策略。
779
- 此外,建议启用 `PodDisruptionConditions` 特性门控以便在 Pod 失效策略中检测和处理 Pod 干扰状况
780
- (参考:[Pod 干扰状况](/zh-cn/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions))。
781
- 这两个特性门控都是在 Kubernetes {{< skew currentVersion >}} 中提供的。
782
- {{< /note >}}
777
+ {{< feature-state feature_gate_name="JobPodFailurePolicy" >}}
783
778
784
779
<!--
A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
@@ -948,11 +943,22 @@ Starting with Kubernetes v1.28, when Pod failure policy is used, the Job control
terminating Pods only once these Pods reach the terminal `Failed` phase. This behavior is similar
to `podReplacementPolicy: Failed`. For more information, see [Pod replacement policy](#pod-replacement-policy).
-->
- 自 Kubernetes v1.28 开始,当使用 Pod 失败策略时,Job 控制器仅在这些 Pod 达到终止的
+ 自 Kubernetes v1.28 开始,当使用 Pod 失效策略时,Job 控制器仅在这些 Pod 达到终止的
`Failed` 阶段时才会重新创建终止中的 Pod。这种行为类似于 `podReplacementPolicy: Failed`。
细节参阅 [Pod 替换策略](#pod-replacement-policy)。
{{< /note >}}

+ <!--
+ When you use the `podFailurePolicy`, and the Job fails due to the pod
+ matching the rule with the `FailJob` action, then the Job controller triggers
+ the Job termination process by adding the `FailureTarget` condition.
+ For more details, see [Job termination and cleanup](#job-termination-and-cleanup).
+ -->
+ 当你使用了 `podFailurePolicy`,并且 Pod 因为与 `FailJob`
+ 操作的规则匹配而失败时,Job 控制器会通过添加
+ `FailureTarget` 状况来触发 Job 终止流程。
+ 更多详情,请参阅 [Job 的终止和清理](#job-termination-and-cleanup)。
+
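<!--
For example, the following sketch (a fragment of a Job `spec`, with illustrative
values) uses the `FailJob` action to fail the whole Job as soon as the main
container exits with code 1, skipping the remaining backoff retries:
-->
例如,下面的示例(Job `spec` 的一个片段,取值仅作演示)使用 `FailJob`
操作:只要主容器以退出码 1 退出,整个 Job 立即失败,不再进行回退重试:

```yaml
spec:
  backoffLimit: 6               # 命中 FailJob 规则时不再执行这些重试
  template:
    spec:
      restartPolicy: Never      # 使用 podFailurePolicy 时必须为 Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash", "-c", "exit 1"]
  podFailurePolicy:
    rules:
    - action: FailJob           # 匹配后 Job 控制器添加 FailureTarget 状况并终止 Job
      onExitCodes:
        containerName: main     # 可选;指定时仅匹配该容器的退出码
        operator: In
        values: [1]
```
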
<!--
## Success policy {#success-policy}
-->
@@ -1036,15 +1042,15 @@ Here is a manifest for a Job with `successPolicy`:
In the example above, both `succeededIndexes` and `succeededCount` have been specified.
Therefore, the job controller will mark the Job as succeeded and terminate the lingering Pods
when either of the specified indexes, 0, 2, or 3, succeed.
- The Job that meets the success policy gets the `SuccessCriteriaMet` condition.
+ The Job that meets the success policy gets the `SuccessCriteriaMet` condition with a `SuccessPolicy` reason.
After the removal of the lingering Pods is issued, the Job gets the `Complete` condition.

Note that the `succeededIndexes` is represented as intervals separated by a hyphen.
The numbers are listed as intervals, where each interval gives the first and last element of the series, separated by a hyphen.
-->
在上面的例子中,`succeededIndexes` 和 `succeededCount` 都已被指定。
因此,当指定的索引 0、2 或 3 中的任意一个成功时,Job 控制器将 Job 标记为成功并终止剩余的 Pod。
- 符合成功策略的 Job 会被标记 `SuccessCriteriaMet` 状况。
+ 符合成功策略的 Job 会被标记 `SuccessCriteriaMet` 状况,且状况的原因为 `SuccessPolicy`。
在剩余的 Pod 被移除后,Job 会被标记 `Complete` 状况。

请注意,`succeededIndexes` 表示为以连字符分隔的数字序列。
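
<!--
For example, in the following sketch (illustrative values), the value
`"0-2,4,6-7"` covers indexes 0, 1, 2, 4, 6, and 7:
-->
例如,在下面的示例中(取值仅作演示),`"0-2,4,6-7"` 覆盖索引 0、1、2、4、6 和 7:

```yaml
successPolicy:
  rules:
  - succeededIndexes: "0-2,4,6-7"   # 两个区间(0-2、6-7)加上单个索引 4
    succeededCount: 4               # 上述索引中有 4 个成功即满足成功策略
```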
@@ -1152,6 +1158,132 @@ and `.spec.backoffLimit` result in a permanent Job failure that requires manual
换言之,由 `.spec.activeDeadlineSeconds` 和 `.spec.backoffLimit` 所触发的 Job
终结机制都会导致 Job 永久性的失败,而这类状态都需要手工干预才能解决。

+ <!--
+ ### Terminal Job conditions
+
+ A Job has two possible terminal states, each of which has a corresponding Job
+ condition:
+ * Succeeded: Job condition `Complete`
+ * Failed: Job condition `Failed`
+ -->
+ ### Job 终止状况 {#terminal-job-conditions}
+
+ 一个 Job 有两种可能的终止状况,每种状况都有相应的 Job 状况:
+
+ * Succeeded:Job `Complete` 状况
+ * Failed:Job `Failed` 状况
+
+ <!--
+ Jobs fail for the following reasons:
+ - The number of Pod failures exceeded the specified `.spec.backoffLimit` in the Job
+   specification. For details, see [Pod backoff failure policy](#pod-backoff-failure-policy).
+ - The Job runtime exceeded the specified `.spec.activeDeadlineSeconds`
+ - An indexed Job that used `.spec.backoffLimitPerIndex` has failed indexes.
+   For details, see [Backoff limit per index](#backoff-limit-per-index).
+ - The number of failed indexes in the Job exceeded the specified
+   `spec.maxFailedIndexes`. For details, see [Backoff limit per index](#backoff-limit-per-index)
+ - A failed Pod matches a rule in `.spec.podFailurePolicy` that has the `FailJob`
+   action. For details about how Pod failure policy rules might affect failure
+   evaluation, see [Pod failure policy](#pod-failure-policy).
+ -->
+ Job 失败的原因如下:
+
+ - Pod 失败数量超出了 Job 规约中指定的 `.spec.backoffLimit`,
+   详情请参见 [Pod 回退失效策略](#pod-backoff-failure-policy)。
+ - Job 运行时间超过了指定的 `.spec.activeDeadlineSeconds`。
+ - 使用 `.spec.backoffLimitPerIndex` 的索引 Job 出现索引失败。
+   有关详细信息,请参阅[逐索引的回退限制](#backoff-limit-per-index)。
+ - Job 中失败的索引数量超出了指定的 `spec.maxFailedIndexes` 值,
+   详情见[逐索引的回退限制](#backoff-limit-per-index)。
+ - 失败的 Pod 匹配了 `.spec.podFailurePolicy` 中定义的一条规则,该规则的动作为 `FailJob`。
+   有关 Pod 失效策略规则如何影响故障评估的详细信息,请参阅 [Pod 失效策略](#pod-failure-policy)。
+
+ <!--
+ Jobs succeed for the following reasons:
+ - The number of succeeded Pods reached the specified `.spec.completions`
+ - The criteria specified in `.spec.successPolicy` are met. For details, see
+   [Success policy](#success-policy).
+ -->
+ Job 成功的原因如下:
+
+ - 成功的 Pod 的数量达到了指定的 `.spec.completions` 数量。
+ - `.spec.successPolicy` 中指定的标准已满足。详情请参见[成功策略](#success-policy)。
+
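<!--
For example, the status of a Job that succeeded by reaching `.spec.completions`
might look like the following sketch (field values are illustrative):
-->
例如,一个因达到 `.spec.completions` 而成功的 Job,其状态可能类似于下面的示例(取值仅作示意):

```yaml
status:
  succeeded: 5
  conditions:
  - type: Complete              # 终止状况:Job 成功
    status: "True"
    reason: CompletionsReached
    message: Reached expected number of succeeded pods
```
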
+ <!--
+ In Kubernetes v1.31 and later the Job controller delays the addition of the
+ terminal conditions, `Failed` or `Complete`, until all of the Job Pods are terminated.
+
+ In Kubernetes v1.30 and earlier, the Job controller added the `Complete` or the
+ `Failed` Job terminal conditions as soon as the Job termination process was
+ triggered and all Pod finalizers were removed. However, some Pods would still
+ be running or terminating at the moment that the terminal condition was added.
+ -->
+ 在 Kubernetes v1.31 及更高版本中,Job 控制器会延迟添加终止状况 `Failed` 或
+ `Complete`,直到所有 Job Pod 都终止。
+
+ 在 Kubernetes v1.30 及更早版本中,一旦触发 Job 终止过程并删除所有
+ Pod 终结器,Job 控制器就会给 Job 添加 `Complete` 或 `Failed` 终止状况。
+ 然而,在添加终止状况时,一些 Pod 仍会运行或处于终止过程中。
+
+ <!--
+ In Kubernetes v1.31 and later, the controller only adds the Job terminal conditions
+ _after_ all of the Pods are terminated. You can enable this behavior by using the
+ `JobManagedBy` or the `JobPodReplacementPolicy` (enabled by default)
+ [feature gates](/docs/reference/command-line-tools-reference/feature-gates/).
+ -->
+ 在 Kubernetes v1.31 及更高版本中,控制器仅在所有 Pod 终止后才添加 Job 终止状况。
+ 你可以使用 `JobManagedBy` 或 `JobPodReplacementPolicy`(默认启用)
+ [特性门控](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/)来启用此行为。
+
+ <!--
+ ### Termination of Job pods
+
+ The Job controller adds the `FailureTarget` condition or the `SuccessCriteriaMet`
+ condition to the Job to trigger Pod termination after a Job meets either the
+ success or failure criteria.
+ -->
+ ### Job Pod 的终止 {#termination-of-job-pods}
+
+ Job 控制器将 `FailureTarget` 状况或 `SuccessCriteriaMet` 状况添加到
+ Job,以便在 Job 满足成功或失败标准后触发 Pod 终止。
+
+ <!--
+ Factors like `terminationGracePeriodSeconds` might increase the amount of time
+ from the moment that the Job controller adds the `FailureTarget` condition or the
+ `SuccessCriteriaMet` condition to the moment that all of the Job Pods terminate
+ and the Job controller adds a [terminal condition](#terminal-job-conditions)
+ (`Failed` or `Complete`).
+
+ You can use the `FailureTarget` or the `SuccessCriteriaMet` condition to evaluate
+ whether the Job has failed or succeeded without having to wait for the controller
+ to add a terminal condition.
+ -->
+ 诸如 `terminationGracePeriodSeconds` 之类的因素可能会增加从
+ Job 控制器添加 `FailureTarget` 状况或 `SuccessCriteriaMet` 状况到所有
+ Job Pod 终止并且 Job 控制器添加[终止状况](#terminal-job-conditions)(`Failed` 或 `Complete`)的这段时间量。
+
+ 你可以使用 `FailureTarget` 或 `SuccessCriteriaMet`
+ 状况来评估 Job 是否失败或成功,而无需等待控制器添加终止状况。
+
+ <!--
+ For example, you might want to decide when to create a replacement Job
+ that replaces a failed Job. If you replace the failed Job when the `FailureTarget`
+ condition appears, your replacement Job runs sooner, but could result in Pods
+ from the failed and the replacement Job running at the same time, using
+ extra compute resources.
+
+ Alternatively, if your cluster has limited resource capacity, you could choose to
+ wait until the `Failed` condition appears on the Job, which would delay your
+ replacement Job but would ensure that you conserve resources by waiting
+ until all of the failed Pods are removed.
+ -->
+ 例如,你可能想要决定何时创建 Job 来替代某个已失败 Job。
+ 如果在出现 `FailureTarget` 状况时替换失败的 Job,则替换 Job 启动得会更早,
+ 但可能会导致失败的 Job 和替换 Job 的 Pod 同时处于运行状态,进而额外耗用计算资源。
+
+ 或者,如果你的集群资源容量有限,你可以选择等到 Job 上出现 `Failed` 状况后再执行替换操作。
+ 这样做会延迟替换 Job 的启动,不过通过等待所有失败的 Pod 都被删除,可以节省资源。
+
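<!--
For example, a controller or script could watch for the `FailureTarget`
condition in the Job status to start a replacement Job early; a failing Job
might report a status like the following sketch (values are illustrative):
-->
例如,某个控制器或脚本可以监视 Job 状态中的 `FailureTarget` 状况,以便及早启动替代
Job;一个正在失败的 Job 的状态可能类似于下面的示例(取值仅作示意):

```yaml
status:
  failed: 7
  conditions:
  - type: FailureTarget         # 终止流程已触发;此时即可决定创建替代 Job
    status: "True"
    reason: BackoffLimitExceeded
    message: Job has reached the specified backoff limit
```
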
<!--
## Clean up finished jobs automatically
@@ -1734,22 +1866,20 @@ observe that pods from a Job are stuck with the tracking finalizer.
-->
### 弹性索引 Job {#elastic-indexed-jobs}

- {{< feature-state for_k8s_version="v1.27" state="beta " >}}
1869
+ {{< feature-state feature_gate_name="ElasticIndexedJob " >}}
1738
1870
1739
1871
<!--
You can scale Indexed Jobs up or down by mutating both `.spec.parallelism`
and `.spec.completions` together such that `.spec.parallelism == .spec.completions`.
- When the `ElasticIndexedJob` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
- on the [API server](/docs/reference/command-line-tools-reference/kube-apiserver/)
- is disabled, `.spec.completions` is immutable.
+ When scaling down, Kubernetes removes the Pods with higher indexes.

Use cases for elastic Indexed Jobs include batch workloads which require
scaling an indexed Job, such as MPI, Horovod, Ray, and PyTorch training jobs.
-->
你可以通过同时改变 `.spec.parallelism` 和 `.spec.completions` 来扩大或缩小带索引 Job,
从而满足 `.spec.parallelism == .spec.completions`。
- 当 [API 服务器](/zh-cn/docs/reference/command-line-tools-reference/kube-apiserver/)
- 上的 `ElasticIndexedJob` 特性门控被禁用时,`.spec.completions` 是不可变的。
+ 缩减规模时,Kubernetes 会删除具有更高索引的 Pod。
+

弹性索引 Job 的使用场景包括需要扩展索引 Job 的批处理工作负载,例如 MPI、Horovod、Ray
和 PyTorch 训练作业。
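
<!--
For example, the following sketch scales a hypothetical Indexed Job named
`training` from 3 to 5 completions by patching both fields together:
-->
例如,下面的示例通过同时修改这两个字段,将一个假设名为 `training` 的带索引
Job 从 3 个完成数扩展到 5 个:

```yaml
# 可通过如下命令应用此补丁(示意):
# kubectl patch job training --type=merge --patch-file=scale-up.yaml
spec:
  parallelism: 5
  completions: 5    # 必须与 parallelism 一起修改并保持相等
```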
@@ -1795,11 +1925,11 @@ See [Pod failure policy](#pod-failure-policy) to learn more about Pod failure po
-->
你可以选择仅在终止过程中的 Pod 完全终止(具有 `status.phase: Failed`)时才创建替换 Pod。
为此,可以设置 `.spec.podReplacementPolicy: Failed`。
- 默认的替换策略取决于 Job 是否设置了 `podFailurePolicy`。对于没有定义 Pod 失败策略的 Job,
+ 默认的替换策略取决于 Job 是否设置了 `podFailurePolicy`。对于没有定义 Pod 失效策略的 Job,
省略 `podReplacementPolicy` 字段相当于选择 `TerminatingOrFailed` 替换策略:
控制平面在 Pod 删除时立即创建替换 Pod(只要控制平面发现该 Job 的某个 Pod 被设置了 `deletionTimestamp`)。
- 对于设置了 Pod 失败策略的 Job,默认的 `podReplacementPolicy` 是 `Failed`,不允许其他值。
- 请参阅 [Pod 失败策略](#pod-failure-policy) 以了解更多关于 Job 的 Pod 失败策略的信息。
+ 对于设置了 Pod 失效策略的 Job,默认的 `podReplacementPolicy` 是 `Failed`,不允许其他值。
+ 请参阅 [Pod 失效策略](#pod-failure-policy) 以了解更多关于 Job 的 Pod 失效策略的信息。

```yaml
kind: Job