Commit 6e70ef7
Merge pull request #36602 from windsonsea/podfpol
[zh] Sync1.25 /tasks/job/pod-failure-policy.md
2 parents 5a2ef57 + cb3fc67

File tree: 3 files changed, +297 −0 lines changed
Lines changed: 249 additions & 0 deletions
@@ -0,0 +1,249 @@
---
title: Handling retriable and non-retriable pod failures with Pod failure policy
content_type: task
min-kubernetes-server-version: v1.25
weight: 60
---
<!--
title: Handling retriable and non-retriable pod failures with Pod failure policy
content_type: task
min-kubernetes-server-version: v1.25
weight: 60
-->

{{< feature-state for_k8s_version="v1.25" state="alpha" >}}

<!-- overview -->

<!--
This document shows you how to use the
[Pod failure policy](/docs/concepts/workloads/controllers/job#pod-failure-policy),
in combination with the default
[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
to improve the control over the handling of container- or Pod-level failure
within a {{<glossary_tooltip text="Job" term_id="job">}}.
-->
This document shows you how to use the
[Pod failure policy](/zh-cn/docs/concepts/workloads/controllers/job#pod-failure-policy),
in combination with the default
[Pod backoff failure policy](/zh-cn/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
to improve the control over the handling of container- or Pod-level failures
within a {{<glossary_tooltip text="Job" term_id="job">}}.

<!--
The definition of Pod failure policy may help you to:
* better utilize the computational resources by avoiding unnecessary Pod retries.
* avoid Job failures due to Pod disruptions (such as {{<glossary_tooltip text="preemption" term_id="preemption" >}},
  {{<glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}},
  or {{<glossary_tooltip text="taint" term_id="taint" >}}-based eviction).
-->
Defining a Pod failure policy may help you to:
* better utilize computational resources by avoiding unnecessary Pod retries.
* avoid Job failures due to Pod disruptions (such as {{<glossary_tooltip text="preemption" term_id="preemption" >}},
  {{<glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}},
  or {{<glossary_tooltip text="taint" term_id="taint" >}}-based eviction).

## {{% heading "prerequisites" %}}

<!--
You should already be familiar with the basic use of [Job](/docs/concepts/workloads/controllers/job/).
-->
You should already be familiar with the basic use of [Job](/zh-cn/docs/concepts/workloads/controllers/job/).

{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}

<!-- steps -->

{{< note >}}
<!--
As the features are in Alpha, prepare the Kubernetes cluster with the two
[feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
enabled: `JobPodFailurePolicy` and `PodDisruptionConditions`.
-->
Because these features are still in Alpha, prepare your Kubernetes cluster with the two
[feature gates](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/)
enabled: `JobPodFailurePolicy` and `PodDisruptionConditions`.
{{< /note >}}
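How you enable these gates depends on how your cluster is deployed. As a minimal sketch, assuming a
control plane whose components are started with explicit flags, the gates could be passed like this
(the placement of the flags and the trailing `...` placeholders are assumptions about your setup):

```sh
# Sketch only: enable both gates on the control plane components that consume them.
# Other components (for example the kubelet and scheduler) may also consult
# PodDisruptionConditions depending on the feature path you exercise.
kube-apiserver --feature-gates=JobPodFailurePolicy=true,PodDisruptionConditions=true ...
kube-controller-manager --feature-gates=JobPodFailurePolicy=true,PodDisruptionConditions=true ...
```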
<!--
## Using Pod failure policy to avoid unnecessary Pod retries

With the following example, you can learn how to use Pod failure policy to
avoid unnecessary Pod restarts when a Pod failure indicates a non-retriable
software bug.

First, create a Job based on the config:
-->
## Using Pod failure policy to avoid unnecessary Pod retries {#using-pod-failure-policy-to-avoid-unecessary-pod-retries}

With the following example, you can learn how to use a Pod failure policy to
avoid unnecessary Pod restarts when a Pod failure indicates a non-retriable
software bug.

First, create a Job based on the config:

{{< codenew file="/controllers/job-pod-failure-policy-failjob.yaml" >}}

<!--
by running:
-->
by running:

```sh
kubectl create -f job-pod-failure-policy-failjob.yaml
```
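Optionally, you can watch the Job's Pods while the policy takes effect. The command below is a small
convenience sketch that relies on the `job-name` label the Job controller adds to the Pods it creates:

```sh
# Watch the Job's Pods until the Job is terminated; press Ctrl+C to stop watching.
kubectl get pods -l job-name=job-pod-failure-policy-failjob --watch
```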
<!--
After around 30s the entire Job should be terminated. Inspect the status of the Job by running:
-->
After around 30s the entire Job should be terminated. Inspect the status of the Job by running:

```sh
kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
```

<!--
In the Job status, see a job `Failed` condition with the field `reason`
equal `PodFailurePolicy`. Additionally, the `message` field contains
more detailed information about the Job termination, such as:
`Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.

For comparison, if the Pod failure policy was disabled it would take 6 retries
of the Pod, taking at least 2 minutes.
-->
In the Job status, you can see a Job `Failed` condition with the `reason` field
equal to `PodFailurePolicy`. Additionally, the `message` field contains
more detailed information about the Job termination, such as:
`Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.

For comparison, if the Pod failure policy was disabled, the Pod would be retried
6 times, taking at least 2 minutes.
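If you only need the failure reason rather than the full YAML, a jsonpath query is a handy shortcut
(a sketch; it assumes the Job lives in your current namespace):

```sh
# Print just the reason of the Job's Failed condition.
kubectl get job job-pod-failure-policy-failjob \
  -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}'
```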
<!--
### Clean up

Delete the Job you created:
-->
### Clean up

Delete the Job you created:

```sh
kubectl delete jobs/job-pod-failure-policy-failjob
```

<!--
The cluster automatically cleans up the Pods.

## Using Pod failure policy to ignore Pod disruptions

With the following example, you can learn how to use Pod failure policy to
ignore Pod disruptions from incrementing the Pod retry counter towards the
`.spec.backoffLimit` limit.
-->
The cluster automatically cleans up the Pods.

## Using Pod failure policy to ignore Pod disruptions {#using-pod-failure-policy-to-ignore-pod-disruptions}

With the following example, you can learn how to use a Pod failure policy to
prevent Pod disruptions from incrementing the Pod retry counter towards the
`.spec.backoffLimit` limit.

{{< caution >}}
<!--
Timing is important for this example, so you may want to read the steps before
execution. In order to trigger a Pod disruption it is important to drain the
node while the Pod is running on it (within 90s since the Pod is scheduled).
-->
Timing is important for this example, so you may want to read the steps before
executing them. In order to trigger a Pod disruption, it is important to drain the
node while the Pod is running on it (within 90s of the Pod being scheduled).
{{< /caution >}}
<!--
1. Create a Job based on the config:
-->
1. Create a Job based on the config:

   {{< codenew file="/controllers/job-pod-failure-policy-ignore.yaml" >}}

   <!--
   by running:
   -->
   by running:

   ```sh
   kubectl create -f job-pod-failure-policy-ignore.yaml
   ```

<!--
2. Run this command to check the `nodeName` the Pod is scheduled to:
-->
2. Run this command to check which `nodeName` the Pod is scheduled to:

   ```sh
   nodeName=$(kubectl get pods -l job-name=job-pod-failure-policy-ignore -o jsonpath='{.items[0].spec.nodeName}')
   ```

<!--
3. Drain the node to evict the Pod before it completes (within 90s):
-->
3. Drain the node to evict the Pod before it completes (within 90s):

   ```sh
   kubectl drain nodes/$nodeName --ignore-daemonsets --grace-period=0
   ```

<!--
4. Inspect the `.status.failed` to check the counter for the Job is not incremented:
-->
4. Inspect the `.status.failed` field to check that the counter for the Job is not
   incremented (see also the quick check after this list):

   ```sh
   kubectl get jobs -l job-name=job-pod-failure-policy-ignore -o yaml
   ```

<!--
5. Uncordon the node:
-->
5. Uncordon the node:

   ```sh
   kubectl uncordon nodes/$nodeName
   ```
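As referenced in step 4, a shorter way to verify the counter is to print the `.status.failed` field
directly (a convenience sketch; empty output means no failed Pods have been counted):

```sh
# Print only the Job's failed-Pod counter.
kubectl get job job-pod-failure-policy-ignore -o jsonpath='{.status.failed}'
```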
<!--
The Job resumes and succeeds.

For comparison, if the Pod failure policy was disabled the Pod disruption would
result in terminating the entire Job (as the `.spec.backoffLimit` is set to 0).
-->
The Job resumes and succeeds.

For comparison, if the Pod failure policy was disabled, the Pod disruption would
result in terminating the entire Job (as the `.spec.backoffLimit` is set to 0).

<!--
### Cleaning up

Delete the Job you created:
-->
### Cleaning up

Delete the Job you created:

```sh
kubectl delete jobs/job-pod-failure-policy-ignore
```

<!--
The cluster automatically cleans up the Pods.
-->
The cluster automatically cleans up the Pods.

<!--
## Alternatives

You could rely solely on the
[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
by specifying the Job's `.spec.backoffLimit` field. However, in many situations
it is problematic to find a balance between setting a low value for `.spec.backoffLimit`
to avoid unnecessary Pod retries, yet high enough to make sure the Job would
not be terminated by Pod disruptions.
-->
## Alternatives {#alternatives}

You could rely solely on the
[Pod backoff failure policy](/zh-cn/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy)
by specifying the Job's `.spec.backoffLimit` field. However, in many situations
it is problematic to find a balance between setting a low value for `.spec.backoffLimit`
to avoid unnecessary Pod retries, yet a value high enough to make sure the Job would
not be terminated by Pod disruptions.
Lines changed: 25 additions & 0 deletions (job-pod-failure-policy-failjob.yaml)
@@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-failjob
spec:
  completions: 8
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]
        args:
        - -c
        - echo "Hello world! I'm going to exit with 42 to simulate a software bug." && sleep 30 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
Lines changed: 23 additions & 0 deletions (job-pod-failure-policy-ignore.yaml)
@@ -0,0 +1,23 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-ignore
spec:
  completions: 4
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]
        args:
        - -c
        - echo "Hello world! I'm going to exit with 0 (success)." && sleep 90 && exit 0
  backoffLimit: 0
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
