
Commit 605826c

[zh-cn] resync /tasks/debug/debug-cluster/
1 parent 4b435b4 commit 605826c

File tree

1 file changed: +76 / -70 lines
  • content/zh-cn/docs/tasks/debug/debug-cluster


content/zh-cn/docs/tasks/debug/debug-cluster/_index.md

Lines changed: 76 additions & 70 deletions
@@ -64,7 +64,7 @@ Sometimes when debugging it can be useful to look at the status of a node -- for
 -->
 ### 示例:调试关闭/无法访问的节点 {#example-debugging-a-down-unreachable-node}
 
-有时在调试时查看节点的状态很有用——例如,因为你注意到在节点上运行的 Pod 的奇怪行为,
+有时在调试时查看节点的状态很有用 —— 例如,因为你注意到在节点上运行的 Pod 的奇怪行为,
 或者找出为什么 Pod 不会调度到节点上。与 Pod 一样,你可以使用 `kubectl describe node`
 `kubectl get node -o yaml` 来检索有关节点的详细信息。
 例如,如果节点关闭(与网络断开连接,或者 kubelet 进程挂起并且不会重新启动等),
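The hunk above documents inspecting node state with `kubectl describe node` and `kubectl get node -o yaml`. A minimal sketch of those two commands; the node name `kube-worker-1` is a placeholder, and the `command -v` guard plus `|| true` keep the sketch runnable on machines without kubectl or a reachable cluster:

```shell
# Placeholder node name; pick a real one from `kubectl get nodes`.
NODE="kube-worker-1"

if command -v kubectl >/dev/null 2>&1; then
  # Human-readable summary: conditions, capacity, allocated resources, events.
  kubectl describe node "$NODE" || true   # fails harmlessly if the node does not exist
  # Full Node object as YAML, handy for scripting or close inspection.
  kubectl get node "$NODE" -o yaml || true
else
  echo "kubectl not found; commands shown for illustration only"
fi
```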
@@ -260,28 +260,30 @@ of the relevant log files. On systemd-based systems, you may need to use `journ
 <!--
 ### Control Plane nodes
 
-* `/var/log/kube-apiserver.log` - API Server, responsible for serving the API
-* `/var/log/kube-scheduler.log` - Scheduler, responsible for making scheduling decisions
-* `/var/log/kube-controller-manager.log` - a component that runs most Kubernetes built-in {{<glossary_tooltip text="controllers" term_id="controller">}}, with the notable exception of scheduling (the kube-scheduler handles scheduling).
+* `/var/log/kube-apiserver.log` - API Server, responsible for serving the API
+* `/var/log/kube-scheduler.log` - Scheduler, responsible for making scheduling decisions
+* `/var/log/kube-controller-manager.log` - a component that runs most Kubernetes built-in
+  {{<glossary_tooltip text="controllers" term_id="controller">}}, with the notable exception of scheduling
+  (the kube-scheduler handles scheduling).
 -->
 ### 控制平面节点 {#control-plane-nodes}
 
-* `/var/log/kube-apiserver.log` —— API 服务器 API
-* `/var/log/kube-scheduler.log` —— 调度器,负责制定调度决策
-* `/var/log/kube-controller-manager.log` —— 运行大多数 Kubernetes
-  内置{{<glossary_tooltip text="控制器" term_id="controller">}}的组件,除了调度(kube-scheduler 处理调度)。
+* `/var/log/kube-apiserver.log` —— API 服务器,负责提供 API 服务
+* `/var/log/kube-scheduler.log` —— 调度器,负责制定调度决策
+* `/var/log/kube-controller-manager.log` —— 运行大多数 Kubernetes
+  内置{{<glossary_tooltip text="控制器" term_id="controller">}}的组件,除了调度(kube-scheduler 处理调度)。
 
 <!--
 ### Worker Nodes
 
-* `/var/log/kubelet.log` - logs from the kubelet, responsible for running containers on the node
-* `/var/log/kube-proxy.log` - logs from `kube-proxy`, which is responsible for directing traffic to Service endpoints
+* `/var/log/kubelet.log` - logs from the kubelet, responsible for running containers on the node
+* `/var/log/kube-proxy.log` - logs from `kube-proxy`, which is responsible for directing traffic to Service endpoints
 -->
 
 ### 工作节点 {#worker-nodes}
 
-* `/var/log/kubelet.log` —— 来自 `kubelet` 的日志,负责在节点运行容器
-* `/var/log/kube-proxy.log` —— 来自 `kube-proxy` 的日志,负责将流量转发到服务端点
+* `/var/log/kubelet.log` —— 来自 `kubelet` 的日志,负责在节点运行容器
+* `/var/log/kube-proxy.log` —— 来自 `kube-proxy` 的日志,负责将流量转发到服务端点
 
 <!--
 ## Cluster failure modes
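The hunk header above notes that on systemd-based systems you may need `journalctl` instead of the flat log files listed. A hedged sketch covering both styles; the unit name `kubelet` and file path `/var/log/kubelet.log` match common installs, and the guards keep the sketch runnable on machines where neither log source exists:

```shell
UNIT="kubelet"                    # systemd unit name on typical installs
LOGFILE="/var/log/kubelet.log"    # flat-file location on non-systemd setups

if command -v journalctl >/dev/null 2>&1; then
  # Last 100 kubelet log lines from the systemd journal.
  journalctl -u "$UNIT" --no-pager -n 100 || true
elif [ -f "$LOGFILE" ]; then
  tail -n 100 "$LOGFILE"
else
  echo "no kubelet logs found on this machine"
fi
```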
@@ -295,32 +297,32 @@ This is an incomplete list of things that could go wrong, and how to adjust your
 <!--
 ### Contributing causes
 
-- VM(s) shutdown
-- Network partition within cluster, or between cluster and users
-- Crashes in Kubernetes software
-- Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
-- Operator error, for example misconfigured Kubernetes software or application software
+- VM(s) shutdown
+- Network partition within cluster, or between cluster and users
+- Crashes in Kubernetes software
+- Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
+- Operator error, for example misconfigured Kubernetes software or application software
 -->
-### 造成原因 {#contributing-causes}
+### 故障原因 {#contributing-causes}
 
-- 虚拟机关闭
-- 集群内或集群与用户之间的网络分区
-- Kubernetes 软件崩溃
-- 持久存储(例如 GCE PD 或 AWS EBS 卷)的数据丢失或不可用
-- 操作员错误,例如配置错误的 Kubernetes 软件或应用程序软件
+- 虚拟机关闭
+- 集群内或集群与用户之间的网络分区
+- Kubernetes 软件崩溃
+- 持久存储(例如 GCE PD 或 AWS EBS 卷)的数据丢失或不可用
+- 操作员错误,例如配置错误的 Kubernetes 软件或应用程序软件
 
 <!--
 ### Specific scenarios
 
-- API server VM shutdown or apiserver crashing
-  - Results
-    - unable to stop, update, or start new pods, services, replication controller
-    - existing pods and services should continue to work normally, unless they depend on the Kubernetes API
-- API server backing storage lost
-  - Results
-    - the kube-apiserver component fails to start successfully and become healthy
-    - kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
-    - manual recovery or recreation of apiserver state necessary before apiserver is restarted
+- API server VM shutdown or apiserver crashing
+  - Results
+    - unable to stop, update, or start new pods, services, replication controller
+    - existing pods and services should continue to work normally, unless they depend on the Kubernetes API
+- API server backing storage lost
+  - Results
+    - the kube-apiserver component fails to start successfully and become healthy
+    - kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
+    - manual recovery or recreation of apiserver state necessary before apiserver is restarted
 -->
 ### 具体情况 {#specific-scenarios}
 
@@ -334,16 +336,17 @@ This is an incomplete list of things that could go wrong, and how to adjust your
     - kubelet 将不能访问 API 服务器,但是能够继续运行之前的 Pod 和提供相同的服务代理
     - 在 API 服务器重启之前,需要手动恢复或者重建 API 服务器的状态
 <!--
-- Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
-  - currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
-  - in future, these will be replicated as well and may not be co-located
-  - they do not have their own persistent state
-- Individual node (VM or physical machine) shuts down
-  - Results
-    - pods on that Node stop running
-- Network partition
-  - Results
-    - partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
+- Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
+  - currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
+  - in future, these will be replicated as well and may not be co-located
+  - they do not have their own persistent state
+- Individual node (VM or physical machine) shuts down
+  - Results
+    - pods on that Node stop running
+- Network partition
+  - Results
+    - partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down.
+      (Assuming the master VM ends up in partition A.)
 -->
 - Kubernetes 服务组件(节点控制器、副本控制器管理器、调度器等)所在的 VM 关机或者崩溃
   - 当前,这些控制器是和 API 服务器在一起运行的,它们不可用的现象是与 API 服务器类似的
@@ -357,18 +360,18 @@ This is an incomplete list of things that could go wrong, and how to adjust your
     - 分区 A 认为分区 B 中所有的节点都已宕机;分区 B 认为 API 服务器宕机
       (假定主控节点所在的 VM 位于分区 A 内)。
 <!--
-- Kubelet software fault
-  - Results
-    - crashing kubelet cannot start new pods on the node
-    - kubelet might delete the pods or not
-    - node marked unhealthy
-    - replication controllers start new pods elsewhere
-- Cluster operator error
-  - Results
-    - loss of pods, services, etc
-    - lost of apiserver backing store
-    - users unable to read API
-    - etc.
+- Kubelet software fault
+  - Results
+    - crashing kubelet cannot start new pods on the node
+    - kubelet might delete the pods or not
+    - node marked unhealthy
+    - replication controllers start new pods elsewhere
+- Cluster operator error
+  - Results
+    - loss of pods, services, etc
+    - lost of apiserver backing store
+    - users unable to read API
+    - etc.
 -->
 - kubelet 软件故障
   - 结果
@@ -380,11 +383,11 @@ This is an incomplete list of things that could go wrong, and how to adjust your
   - 结果
     - 丢失 Pod 或服务等等
     - 丢失 API 服务器的后端存储
-    - 用户无法读取API
+    - 用户无法读取 API
     - 等等
 
 <!--
-### Mitigations:
+### Mitigations
 
 - Action: Use IaaS provider's automatic VM restarting feature for IaaS VMs
   - Mitigates: Apiserver VM shutdown or apiserver crashing
@@ -409,7 +412,7 @@ This is an incomplete list of things that could go wrong, and how to adjust your
   - 缓解:API 服务器后端存储的丢失
 
 - 措施:使用[高可用性](/zh-cn/docs/setup/production-environment/tools/kubeadm/high-availability/)的配置
-  - 缓解:主控节点 VM 关机或者主控节点组件(调度器、API 服务器、控制器管理器)崩馈
+  - 缓解:主控节点 VM 关机或者主控节点组件(调度器、API 服务器、控制器管理器)崩溃
     - 将容许一个或多个节点或组件同时出现故障
   - 缓解:API 服务器后端存储(例如 etcd 的数据目录)丢失
     - 假定你使用了高可用的 etcd 配置
@@ -428,7 +431,7 @@ This is an incomplete list of things that could go wrong, and how to adjust your
   - Mitigates: Node shutdown
   - Mitigates: Kubelet software fault
 -->
-- 措施:定期对 API 服务器的 PDs/EBS 卷执行快照操作
+- 措施:定期对 API 服务器的 PD 或 EBS 卷执行快照操作
   - 缓解:API 服务器后端存储丢失
   - 缓解:一些操作错误的场景
   - 缓解:一些 Kubernetes 软件本身故障的场景
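The mitigation above snapshots the API server's backing disks; a common concrete form of the same idea is an etcd snapshot. A sketch assuming `etcdctl` with the v3 API and kubeadm's default certificate paths; the endpoint and output path are placeholders, and the guard keeps the sketch runnable where etcdctl is absent:

```shell
ENDPOINT="https://127.0.0.1:2379"   # placeholder etcd endpoint
OUT="/tmp/etcd-backup.db"           # placeholder snapshot destination

if command -v etcdctl >/dev/null 2>&1; then
  # kubeadm's default etcd client certificate locations; adjust for your setup.
  ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINT" \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    snapshot save "$OUT" || true   # fails harmlessly without a reachable etcd
else
  echo "etcdctl not found; snapshot command shown for illustration only"
fi
```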
@@ -444,16 +447,19 @@ This is an incomplete list of things that could go wrong, and how to adjust your
 ## {{% heading "whatsnext" %}}
 
 <!--
-* Learn about the metrics available in the [Resource Metrics Pipeline](resource-metrics-pipeline)
-* Discover additional tools for [monitoring resource usage](resource-usage-monitoring)
-* Use Node Problem Detector to [monitor node health](monitor-node-health)
-* Use `crictl` to [debug Kubernetes nodes](crictl)
-* Get more information about [Kubernetes auditing](audit)
-* Use `telepresence` to [develop and debug services locally](local-debugging)
+* Learn about the metrics available in the
+  [Resource Metrics Pipeline](/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/)
+* Discover additional tools for
+  [monitoring resource usage](/docs/tasks/debug/debug-cluster/resource-usage-monitoring/)
+* Use Node Problem Detector to
+  [monitor node health](/docs/tasks/debug/debug-cluster/monitor-node-health/)
+* Use `crictl` to [debug Kubernetes nodes](/docs/tasks/debug/debug-cluster/crictl/)
+* Get more information about [Kubernetes auditing](/docs/tasks/debug/debug-cluster/audit/)
+* Use `telepresence` to [develop and debug services locally](/docs/tasks/debug/debug-cluster/local-debugging/)
 -->
-* 了解[资源指标管道](resource-metrics-pipeline)中可用的指标
-* 发现用于[监控资源使用](resource-usage-monitoring)的其他工具
-* 使用节点问题检测器[监控节点健康](monitor-node-health)
-* 使用 `crictl` 来[调试 Kubernetes 节点](crictl)
-* 获取更多关于 [Kubernetes 审计](audit)的信息
-* 使用 `telepresence` [本地开发和调试服务](local-debugging)
+* 了解[资源指标管道](/zh-cn/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/)中可用的指标
+* 发现用于[监控资源使用](/zh-cn/docs/tasks/debug/debug-cluster/resource-usage-monitoring/)的其他工具
+* 使用节点问题检测器[监控节点健康](/zh-cn/docs/tasks/debug/debug-cluster/monitor-node-health/)
+* 使用 `crictl` 来[调试 Kubernetes 节点](/zh-cn/docs/tasks/debug/debug-cluster/crictl/)
+* 获取更多关于 [Kubernetes 审计](/zh-cn/docs/tasks/debug/debug-cluster/audit/)
+* 使用 `telepresence` [本地开发和调试服务](/zh-cn/docs/tasks/debug/debug-cluster/local-debugging/)
