---
title: Handling retriable and non-retriable pod failures with Pod failure policy
content_type: task
min-kubernetes-server-version: v1.25
weight: 60
---

{{< feature-state for_k8s_version="v1.25" state="alpha" >}}

<!-- overview -->

This document shows you how to use the
[Pod failure policy](/docs/concepts/workloads/controllers/job#pod-failure-policy),
in combination with the default
[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
to improve the control over the handling of container- or Pod-level failure
within a {{<glossary_tooltip text="Job" term_id="job">}}.

Defining a Pod failure policy may help you to:
* better utilize the computational resources by avoiding unnecessary Pod retries.
* avoid Job failures due to Pod disruptions (such as {{<glossary_tooltip text="preemption" term_id="preemption" >}},
  {{<glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}},
  or {{<glossary_tooltip text="taint" term_id="taint" >}}-based eviction).

## {{% heading "prerequisites" %}}

You should already be familiar with the basic use of [Job](/docs/concepts/workloads/controllers/job/).

{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}

<!-- steps -->

{{< note >}}
As the features are in Alpha, prepare the Kubernetes cluster with the two
[feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
enabled: `JobPodFailurePolicy` and `PodDisruptionConditions`.
{{< /note >}}

## Using Pod failure policy to avoid unnecessary Pod retries

With the following example, you can learn how to use a Pod failure policy to
avoid unnecessary Pod restarts when a Pod failure indicates a non-retriable
software bug.

First, create a Job based on the config:

{{< codenew file="/controllers/job-pod-failure-policy-failjob.yaml" >}}

by running:

```sh
kubectl create -f job-pod-failure-policy-failjob.yaml
```

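The key part of the manifest above is its `.spec.podFailurePolicy` stanza. As an
illustration only (not a verbatim copy of the embedded file), such a rule might
look as follows; the container name `main`, the `FailJob` action, and the exit
code `42` are taken from the Job status message quoted later on this page:

```yaml
podFailurePolicy:
  rules:
  - action: FailJob        # fail the whole Job instead of retrying the Pod
    onExitCodes:
      containerName: main  # the container named in the status message
      operator: In
      values: [42]         # the non-retriable exit code
```
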
After around 30s the entire Job should be terminated. Inspect the status of the Job by running:

```sh
kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
```

In the Job status, you can see a Job condition of type `Failed` with the `reason`
field equal to `PodFailurePolicy`. Additionally, the `message` field contains more
detailed information about the Job termination, such as:
`Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
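
Illustratively, the relevant entry in `.status.conditions` might look like the
abridged sketch below; the `reason` and `message` values are the ones described
above, and other condition fields (such as timestamps) are omitted:

```yaml
status:
  conditions:
  - type: Failed
    status: "True"
    reason: PodFailurePolicy
    message: Container main for pod default/job-pod-failure-policy-failjob-8ckj8
      failed with exit code 42 matching FailJob rule at index 0
```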

For comparison, if the Pod failure policy was disabled, it would take 6 retries
of the Pod, taking at least 2 minutes.

### Clean up

Delete the Job you created:

```sh
kubectl delete jobs/job-pod-failure-policy-failjob
```

The cluster automatically cleans up the Pods.

## Using Pod failure policy to ignore Pod disruptions

With the following example, you can learn how to use a Pod failure policy to
prevent Pod disruptions from incrementing the Pod retry counter towards the
`.spec.backoffLimit` limit.

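The mechanism that makes this work is an `Ignore` rule keyed on the `DisruptionTarget`
Pod condition, which the `PodDisruptionConditions` feature gate adds to Pods
terminated due to a disruption. As an illustration only (the actual manifest is
embedded in step 1 below), such a rule might look like:

```yaml
podFailurePolicy:
  rules:
  - action: Ignore           # do not count matching failures toward backoffLimit
    onPodConditions:
    - type: DisruptionTarget # present on Pods terminated due to a disruption
```
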
{{< caution >}}
Timing is important for this example, so you may want to read the steps before
executing them. In order to trigger a Pod disruption, it is important to drain the
node while the Pod is running on it (within 90s of the Pod being scheduled).
{{< /caution >}}

1. Create a Job based on the config:

   {{< codenew file="/controllers/job-pod-failure-policy-ignore.yaml" >}}

   by running:

   ```sh
   kubectl create -f job-pod-failure-policy-ignore.yaml
   ```

2. Run this command to check which `nodeName` the Pod is scheduled to:

   ```sh
   nodeName=$(kubectl get pods -l job-name=job-pod-failure-policy-ignore -o jsonpath='{.items[0].spec.nodeName}')
   ```

3. Drain the node to evict the Pod before it completes (within 90s):

   ```sh
   kubectl drain nodes/$nodeName --ignore-daemonsets --grace-period=0
   ```

4. Inspect the `.status.failed` field to check that the counter for the Job is not
   incremented (see the abridged status sketch after these steps):

   ```sh
   kubectl get jobs -l job-name=job-pod-failure-policy-ignore -o yaml
   ```

5. Uncordon the node:

   ```sh
   kubectl uncordon nodes/$nodeName
   ```
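
As a point of reference for step 4, the Job status after the drain might resemble
the purely illustrative sketch below; the exact fields vary with timing, but the
key point is that no `failed` counter shows up, because the disruption matched the
`Ignore` rule:

```yaml
status:
  active: 1  # a replacement Pod is running
  # no "failed" field: the evicted Pod was not counted toward .spec.backoffLimit
```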

The Job resumes and succeeds.

For comparison, if the Pod failure policy was disabled, the Pod disruption would
result in terminating the entire Job (as `.spec.backoffLimit` is set to 0).
| 215 | + |
| 216 | +<!-- |
| 217 | +### Cleaning up |
| 218 | +
|
| 219 | +Delete the Job you created: |
| 220 | +--> |
| 221 | +### 清理 |
| 222 | + |
| 223 | +删除你创建的 Job: |
| 224 | + |
| 225 | +```sh |
| 226 | +kubectl delete jobs/job-pod-failure-policy-ignore |
| 227 | +``` |
| 228 | + |
| 229 | +<!-- |
| 230 | +The cluster automatically cleans up the Pods. |
| 231 | +--> |
| 232 | +集群自动清理 Pod。 |
| 233 | + |
| 234 | +<!-- |
| 235 | +## Alternatives |
| 236 | +
|
| 237 | +You could rely solely on the |
| 238 | +[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy), |
| 239 | +by specifying the Job's `.spec.backoffLimit` field. However, in many situations |
| 240 | +it is problematic to find a balance between setting the a low value for `.spec.backoffLimit` |
| 241 | + to avoid unnecessary Pod retries, yet high enough to make sure the Job would |
| 242 | +not be terminated by Pod disruptions. |
| 243 | +--> |
| 244 | +## 替代方案 {#alternatives} |
| 245 | + |
| 246 | +通过指定 Job 的 `.spec.backoffLimit` 字段,你可以完全依赖 |
| 247 | +[Pod 回退失效策略](/zh-cn/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy)。 |
| 248 | +然而在许多情况下,难题在于如何找到一个平衡,为 `.spec.backoffLimit` 设置一个较小的值以避免不必要的 Pod 重试, |
| 249 | +或者设置一个足够大的值以确保 Job 不会因 Pod 干扰而终止。 |