@@ -94,22 +94,22 @@ Check on the status of the Job with `kubectl`:
{{< tabs name="Check status of Job" >}}
{{< tab name="kubectl describe job pi" codelang="bash" >}}
- Name:             pi
- Namespace:        default
- Selector:         controller-uid=0cd26dd5-88a2-4a5f-a203-ea19a1d5d578
- Labels:           controller-uid=0cd26dd5-88a2-4a5f-a203-ea19a1d5d578
-                   job-name=pi
- Annotations:      batch.kubernetes.io/job-tracking:
- Parallelism:      1
- Completions:      1
- Completion Mode:  NonIndexed
- Start Time:       Fri, 28 Oct 2022 13:05:18 +0530
- Completed At:     Fri, 28 Oct 2022 13:05:21 +0530
- Duration:         3s
- Pods Statuses:    0 Active / 1 Succeeded / 0 Failed
+ Name:             pi
+ Namespace:        default
+ Selector:         batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
+ Labels:           batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
+                   batch.kubernetes.io/job-name=pi
+                   ...
+ Annotations:      batch.kubernetes.io/job-tracking: ""
+ Parallelism:      1
+ Completions:      1
+ Start Time:       Mon, 02 Dec 2019 15:20:11 +0200
+ Completed At:     Mon, 02 Dec 2019 15:21:16 +0200
+ Duration:         65s
+ Pods Statuses:    0 Running / 1 Succeeded / 0 Failed
Pod Template:
-   Labels:  controller-uid=0cd26dd5-88a2-4a5f-a203-ea19a1d5d578
-            job-name=pi
+   Labels:  batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
+            batch.kubernetes.io/job-name=pi
Containers:
   pi:
    Image:      perl:5.34.0
@@ -133,15 +133,13 @@ Events:
apiVersion: batch/v1
kind: Job
metadata:
-   annotations:
-     batch.kubernetes.io/job-tracking: ""
-     kubectl.kubernetes.io/last-applied-configuration: |
-       {"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"name":"pi","namespace":"default"},"spec":{"backoffLimit":4,"template":{"spec":{"containers":[{"command":["perl","-Mbignum=bpi","-wle","print bpi(2000)"],"image":"perl:5.34.0","name":"pi"}],"restartPolicy":"Never"}}}}
+   annotations: batch.kubernetes.io/job-tracking: ""
+   ...
  creationTimestamp: "2022-11-10T17:53:53Z"
  generation: 1
  labels:
-     controller-uid: 204fb678-040b-497f-9266-35ffa8716d14
-     job-name: pi
+     batch.kubernetes.io/controller-uid: 863452e6-270d-420e-9b94-53a54146c223
+     batch.kubernetes.io/job-name: pi
  name: pi
  namespace: default
  resourceVersion: "4751"
@@ -153,14 +151,14 @@ spec:
  parallelism: 1
  selector:
    matchLabels:
-       controller-uid: 204fb678-040b-497f-9266-35ffa8716d14
+       batch.kubernetes.io/controller-uid: 863452e6-270d-420e-9b94-53a54146c223
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
-         controller-uid: 204fb678-040b-497f-9266-35ffa8716d14
-         job-name: pi
+         batch.kubernetes.io/controller-uid: 863452e6-270d-420e-9b94-53a54146c223
+         batch.kubernetes.io/job-name: pi
    spec:
      containers:
      - command:
@@ -197,7 +195,7 @@ To list all the Pods that belong to a Job in a machine readable form, you can us
要以机器可读的方式列举隶属于某 Job 的全部 Pod,你可以使用类似下面这条命令:

```shell
- pods=$(kubectl get pods --selector=job-name=pi --output=jsonpath='{.items[*].metadata.name}')
+ pods=$(kubectl get pods --selector=batch.kubernetes.io/job-name=pi --output=jsonpath='{.items[*].metadata.name}')
echo $pods
```
@@ -225,6 +223,15 @@ View the standard output of one of the pods:
kubectl logs $pods
```

+ <!--
+ Another way to view the logs of a Job:
+ -->
+ 另外一种查看 Job 日志的方法:
+
+ ```shell
+ kubectl logs jobs/pi
+ ```
+
<!--
The output is similar to this:
-->
@@ -262,6 +269,15 @@ Job 的名字必须是合法的 [DNS 子域名](/zh-cn/docs/concepts/overview/wo
Job 配置还需要一个 [`.spec` 节](https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status)。

+ <!--
+ ### Job Labels
+ -->
+ ### Job 标签
+
+ <!--
+ Job labels will have `batch.kubernetes.io/` prefix for `job-name` and `controller-uid`.
+ -->
+ Job 标签将为 `job-name` 和 `controller-uid` 加上 `batch.kubernetes.io/` 前缀。
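+
+ <!--
+ For example (a quick sketch reusing the `pi` Job from earlier on this page), you can
+ select that Job's Pods through the prefixed label:
+ -->
+ 例如(沿用本页前文 `pi` Job 的示意),你可以通过带前缀的标签来选择该 Job 的 Pod:
+
+ ```shell
+ kubectl get pods --selector=batch.kubernetes.io/job-name=pi
+ ```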

<!--
### Pod Template
@@ -1058,7 +1074,7 @@ Job 被恢复执行时,Pod 创建操作立即被重启执行。
-->
### 可变调度指令 {#mutable-scheduling-directives}

- {{< feature-state for_k8s_version="v1.23" state="beta" >}}
+ {{< feature-state for_k8s_version="v1.27" state="stable" >}}

{{< note >}}
<!--
@@ -1102,9 +1118,10 @@ been unsuspended before.
<!--
The fields in a Job's pod template that can be updated are node affinity, node selector,
- tolerations, labels and annotations.
+ tolerations, labels, annotations and [scheduling gates](/docs/concepts/scheduling-eviction/pod-scheduling-readiness/).
-->
- Job 的 Pod 模板中可以更新的字段是节点亲和性、节点选择器、容忍、标签和注解。
+ Job 的 Pod 模板中可以更新的字段是节点亲和性、节点选择器、容忍、标签、注解和
+ [调度门控](/zh-cn/docs/concepts/scheduling-eviction/pod-scheduling-readiness/)。

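+ <!--
+ As a sketch (assuming a suspended Job with the hypothetical name `my-job`), updating
+ one of these fields, such as the node selector, could look like:
+ -->
+ 作为示意(假设有一个处于挂起状态、名为 `my-job` 的 Job),更新其中某个字段(例如节点选择器)的操作大致如下:
+
+ ```shell
+ kubectl patch job my-job --patch '{"spec":{"template":{"spec":{"nodeSelector":{"disktype":"ssd"}}}}}'
+ ```
+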
<!--
### Specifying your own Pod selector
@@ -1181,20 +1198,21 @@ metadata:
spec:
  selector:
    matchLabels:
-       controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
+       batch.kubernetes.io/controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
  ...
```

<!--
Then you create a new Job with name `new` and you explicitly specify the same selector.
- Since the existing Pods have label `controller-uid=a8f3d00d-c6d2-11e5-9f87-42010af00002`,
+ Since the existing Pods have label `batch.kubernetes.io/controller-uid=a8f3d00d-c6d2-11e5-9f87-42010af00002`,
they are controlled by Job `new` as well.

You need to specify `manualSelector: true` in the new Job since you are not using
the selector that the system normally generates for you automatically.
-->
接下来你会创建名为 `new` 的新 Job,并显式地为其设置相同的选择算符。
- 由于现有 Pod 都具有标签 `controller-uid=a8f3d00d-c6d2-11e5-9f87-42010af00002`,
+ 由于现有 Pod 都具有标签
+ `batch.kubernetes.io/controller-uid=a8f3d00d-c6d2-11e5-9f87-42010af00002`,
它们也会被名为 `new` 的 Job 所控制。

你需要在新 Job 中设置 `manualSelector: true`,
@@ -1209,7 +1227,7 @@ spec:
  manualSelector: true
  selector:
    matchLabels:
-       controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
+       batch.kubernetes.io/controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
  ...
```
@@ -1223,14 +1241,14 @@ mismatch.
是在告诉系统你知道自己在干什么并要求系统允许这种不匹配的存在。

<!--
- ### Pod failure policy  {#pod-failure-policy}
+ ### Pod failure policy {#pod-failure-policy}
-->
### Pod 失效策略 {#pod-failure-policy}

{{< feature-state for_k8s_version="v1.26" state="beta" >}}

{{< note >}}
- <!--
+ <!--
You can only configure a Pod failure policy for a Job if you have the
`JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster. Additionally, it is recommended
@@ -1247,23 +1265,23 @@ available in Kubernetes {{< skew currentVersion >}}.
这两个特性门控都是在 Kubernetes {{< skew currentVersion >}} 中提供的。
{{< /note >}}

- <!--
+ <!--
A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
your cluster to handle Pod failures based on the container exit codes and the
- Pod conditions.
+ Pod conditions.
-->
Pod 失效策略使用 `.spec.podFailurePolicy` 字段来定义,
它能让你的集群根据容器的退出码和 Pod 状况来处理 Pod 失效事件。

- <!--
+ <!--
In some situations, you may want to have a better control when handling Pod
failures than the control provided by the [Pod backoff failure policy](#pod-backoff-failure-policy),
- which is based on the Job's `.spec.backoffLimit`. These are some examples of use cases:
+ which is based on the Job's `.spec.backoffLimit`. These are some examples of use cases:
-->
在某些情况下,你可能希望更好地控制 Pod 失效的处理方式,
而不是仅限于 [Pod 回退失效策略](#pod-backoff-failure-policy)所提供的控制能力,
后者是基于 Job 的 `.spec.backoffLimit` 实现的。以下是一些使用场景:
- <!--
+ <!--
* To optimize costs of running workloads by avoiding unnecessary Pod restarts,
  you can terminate a Job as soon as one of its Pods fails with an exit code
  indicating a software bug.
@@ -1281,30 +1299,30 @@ which is based on the Job's `.spec.backoffLimit`. These are some examples of use
或基于{{< glossary_tooltip text="污点" term_id="taint" >}}的驱逐),
这样这些失效就不会被计入 `.spec.backoffLimit` 的重试限制中。

- <!--
+ <!--
You can configure a Pod failure policy, in the `.spec.podFailurePolicy` field,
to meet the above use cases. This policy can handle Pod failures based on the
- container exit codes and the Pod conditions.
+ container exit codes and the Pod conditions.
-->
你可以在 `.spec.podFailurePolicy` 字段中配置 Pod 失效策略,以满足上述使用场景。
该策略可以根据容器退出码和 Pod 状况来处理 Pod 失效。

- <!--
- Here is a manifest for a Job that defines a `podFailurePolicy`:
+ <!--
+ Here is a manifest for a Job that defines a `podFailurePolicy`:
-->
下面是一个定义了 `podFailurePolicy` 的 Job 的清单:

- {{< codenew file="controllers/job-pod-failure-policy-example.yaml" >}}
+ {{< codenew file="/controllers/job-pod-failure-policy-example.yaml" >}}

- <!--
+ <!--
In the example above, the first rule of the Pod failure policy specifies that
the Job should be marked failed if the `main` container fails with the 42 exit
- code. The following are the rules for the `main` container specifically:
+ code. The following are the rules for the `main` container specifically:
-->
在上面的示例中,Pod 失效策略的第一条规则规定如果 `main` 容器失败并且退出码为 42,
Job 将被标记为失败。以下是 `main` 容器的具体规则:

- <!--
+ <!--
- an exit code of 0 means that the container succeeded
- an exit code of 42 means that the **entire Job** failed
- any other exit code represents that the container failed, and hence the entire
@@ -1318,34 +1336,34 @@ Job 将被标记为失败。以下是 `main` 容器的具体规则:
如果等于 `backoffLimit` 所设置的次数,则代表 **整个 Job** 失效。

{{< note >}}
- <!--
+ <!--
Because the Pod template specifies a `restartPolicy: Never`,
- the kubelet does not restart the `main` container in that particular Pod.
+ the kubelet does not restart the `main` container in that particular Pod.
-->
因为 Pod 模板中指定了 `restartPolicy: Never`,
所以 kubelet 将不会重启 Pod 中的 `main` 容器。
{{< /note >}}

- <!--
+ <!--
The second rule of the Pod failure policy, specifying the `Ignore` action for
failed Pods with condition `DisruptionTarget` excludes Pod disruptions from
- being counted towards the `.spec.backoffLimit` limit of retries.
+ being counted towards the `.spec.backoffLimit` limit of retries.
-->
Pod 失效策略的第二条规则,
指定对于状况为 `DisruptionTarget` 的失效 Pod 采取 `Ignore` 操作,
统计 `.spec.backoffLimit` 重试次数限制时不考虑 Pod 因干扰而发生的异常。

{{< note >}}
- <!--
+ <!--
If the Job failed, either by the Pod failure policy or Pod backoff
failure policy, and the Job is running multiple Pods, Kubernetes terminates all
- the Pods in that Job that are still Pending or Running.
+ the Pods in that Job that are still Pending or Running.
-->
如果根据 Pod 失效策略或 Pod 回退失效策略判定 Pod 已经失效,
并且 Job 正在运行多个 Pod,Kubernetes 将终止该 Job 中仍处于 Pending 或 Running 的所有 Pod。
{{< /note >}}

- <!--
+ <!--
These are some requirements and semantics of the API:
- if you want to use a `.spec.podFailurePolicy` field for a Job, you must
  also define that Job's pod template with `.spec.restartPolicy` set to `Never`.
@@ -1382,6 +1400,26 @@ These are some requirements and semantics of the API:
- `Ignore`:表示 `.spec.backoffLimit` 的计数器不应该增加,应该创建一个替换的 Pod。
- `Count`:表示 Pod 应该以默认方式处理。`.spec.backoffLimit` 的计数器应该增加。

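+ <!--
+ As a rough sketch (mirroring the shape of the example manifest referenced above),
+ a `podFailurePolicy` that combines these actions looks like:
+ -->
+ 作为示意(与上文引用的示例清单的结构一致),一个组合使用这些操作的 `podFailurePolicy` 大致如下:
+
+ ```yaml
+ podFailurePolicy:
+   rules:
+   - action: FailJob
+     onExitCodes:
+       containerName: main
+       operator: In
+       values: [42]
+   - action: Ignore
+     onPodConditions:
+     - type: DisruptionTarget
+ ```
+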
+ {{< note >}}
+ <!--
+ When you use a `podFailurePolicy`, the job controller only matches Pods in the
+ `Failed` phase. Pods with a deletion timestamp that are not in a terminal phase
+ (`Failed` or `Succeeded`) are considered still terminating. This implies that
+ terminating pods retain a [tracking finalizer](#job-tracking-with-finalizers)
+ until they reach a terminal phase.
+ Since Kubernetes 1.27, Kubelet transitions deleted pods to a terminal phase
+ (see: [Pod Phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase)). This
+ ensures that deleted pods have their finalizers removed by the Job controller.
+ -->
+ 当你使用 `podFailurePolicy` 时,Job 控制器只匹配处于 `Failed` 阶段的 Pod。
+ 具有删除时间戳但不处于终止阶段(`Failed` 或 `Succeeded`)的 Pod 被视为仍在终止中。
+ 这意味着终止中的 Pod 会保留一个[跟踪 Finalizer](#job-tracking-with-finalizers),
+ 直到到达终止阶段。
+ 从 Kubernetes 1.27 开始,kubelet 将已删除的 Pod 转换到终止阶段
+ (参阅 [Pod 阶段](/zh-cn/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase))。
+ 这确保已删除的 Pod 的 Finalizer 会被 Job 控制器移除。
+ {{< /note >}}
+
1423
<!--
1386
1424
### Job tracking with finalizers
1387
1425
-->
@@ -1435,6 +1473,30 @@ are tracked using Pod finalizers.
你**不**应该给 Job 手动添加或删除该注解。
取而代之的是你可以重新创建 Job 以确保使用 Pod Finalizer 跟踪这些 Job。

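+ <!--
+ For example (a sketch reusing the `pi` Job), you can inspect the finalizers that the
+ Job controller has placed on the Job's Pods:
+ -->
+ 例如(沿用 `pi` Job 的示意),你可以查看 Job 控制器在该 Job 的 Pod 上设置的 Finalizer:
+
+ ```shell
+ kubectl get pods -l batch.kubernetes.io/job-name=pi -o jsonpath='{.items[*].metadata.finalizers}'
+ ```
+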
+ <!--
+ ### Elastic Indexed Jobs
+ -->
+ ### 弹性索引 Job {#elastic-indexed-job}
+
+ {{< feature-state for_k8s_version="v1.27" state="beta" >}}
+
+ <!--
+ You can scale Indexed Jobs up or down by mutating both `.spec.parallelism`
+ and `.spec.completions` together such that `.spec.parallelism == .spec.completions`.
+ When the `ElasticIndexedJob` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+ on the [API server](/docs/reference/command-line-tools-reference/kube-apiserver/)
+ is disabled, `.spec.completions` is immutable.
+
+ Use cases for elastic Indexed Jobs include batch workloads which require
+ scaling an indexed Job, such as MPI, Horovod, Ray, and PyTorch training jobs.
+ -->
+ 你可以通过同时改变 `.spec.parallelism` 和 `.spec.completions` 来扩大或缩小带索引 Job,
+ 从而满足 `.spec.parallelism == .spec.completions`。
+ 当 [API 服务器](/zh-cn/docs/reference/command-line-tools-reference/kube-apiserver/)
+ 上的 `ElasticIndexedJob` 特性门控被禁用时,`.spec.completions` 是不可变的。
+
+ 弹性索引 Job 的使用场景包括需要扩展索引 Job 的批处理工作负载,例如 MPI、Horovod、Ray
+ 和 PyTorch 训练作业。
+
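+ <!--
+ As a sketch (assuming an Indexed Job with the hypothetical name `my-indexed-job` and
+ the feature enabled), scaling it to 5 could look like:
+ -->
+ 作为示意(假设有一个名为 `my-indexed-job` 的带索引 Job,且该特性已启用),将其扩展到 5 大致如下:
+
+ ```shell
+ kubectl patch job my-indexed-job --type=merge -p '{"spec":{"parallelism":5,"completions":5}}'
+ ```
+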
<!--
## Alternatives