Commit 0c0e2ed

[zh] Sync debug-cluster

1 parent 02ac2d3 commit 0c0e2ed

1 file changed: 19 additions, 17 deletions

content/zh/docs/tasks/debug/debug-cluster/_index.md
Original file line numberDiff line numberDiff line change
@@ -34,7 +34,7 @@ The first thing to debug in your cluster is if your nodes are all registered cor
 
 Run the following command:
 -->
-## 列举集群节点
+## 列举集群节点 {#listing-your-cluster}
 
 调试的第一步是查看所有的节点是否都已正确注册。
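The section this hunk anchors tells readers to list the cluster's nodes and verify that each one is registered and `Ready`. As an illustration of that check, the sketch below filters not-ready nodes out of sample `kubectl get nodes` output; the node names and versions are hypothetical stand-ins, not output from a real cluster:

```shell
# Sample `kubectl get nodes` output; the node names and versions below
# are made up for illustration.
sample='NAME     STATUS     ROLES    AGE   VERSION
node-1   Ready      <none>   10d   v1.25.0
node-2   NotReady   <none>   10d   v1.25.0'

# Keep the name of every node whose STATUS column is not "Ready".
unready="$(printf '%s\n' "$sample" | awk 'NR > 1 && $2 != "Ready" { print $1 }')"
echo "$unready"   # prints: node-2
```

Against a live cluster the same filter would read from `kubectl get nodes` directly instead of the `sample` variable.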
@@ -62,7 +62,7 @@ kubectl cluster-info dump
 
 Sometimes when debugging it can be useful to look at the status of a node -- for example, because you've noticed strange behavior of a Pod that's running on the node, or to find out why a Pod won't schedule onto the node. As with Pods, you can use `kubectl describe node` and `kubectl get node -o yaml` to retrieve detailed information about nodes. For example, here's what you'll see if a node is down (disconnected from the network, or kubelet dies and won't restart, etc.). Notice the events that show the node is NotReady, and also notice that the pods are no longer running (they are evicted after five minutes of NotReady status).
 -->
-### 示例:调试关闭/无法访问的节点
+### 示例:调试关闭/无法访问的节点 {#example-debugging-a-down-unreachable-node}
 
 有时在调试时查看节点的状态很有用——例如,因为你注意到在节点上运行的 Pod 的奇怪行为,
 或者找出为什么 Pod 不会调度到节点上。与 Pod 一样,你可以使用 `kubectl describe node`
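The paragraph changed in this hunk points at `kubectl describe node` for a down node: once the kubelet stops reporting, the node's `Conditions` flip to `Unknown`. A minimal sketch of spotting that, run against a hypothetical excerpt of the `Conditions` table (not output captured from a real cluster):

```shell
# Hypothetical excerpt of the Conditions table that `kubectl describe node`
# shows for an unreachable node (the kubelet has stopped posting status).
conditions='MemoryPressure   Unknown
DiskPressure     Unknown
Ready            Unknown'

# Count conditions stuck in Unknown; a non-zero count means the node has
# stopped reporting and its Pods will eventually be evicted.
unknown_count="$(printf '%s\n' "$conditions" | grep -c 'Unknown$')"
echo "$unknown_count"   # prints: 3
```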
@@ -247,10 +247,12 @@ status:
 ```
 
 <!--
+## Looking at logs
+
 For now, digging deeper into the cluster requires logging into the relevant machines. Here are the locations
 of the relevant log files. On systemd-based systems, you may need to use `journalctl` instead of examining log files.
 -->
-## 查看日志
+## 查看日志 {#looking-at-logs}
 
 目前,深入挖掘集群需要登录相关机器。以下是相关日志文件的位置。
 在基于 systemd 的系统上,你可能需要使用 `journalctl` 而不是检查日志文件。
@@ -262,7 +264,7 @@ of the relevant log files. On systemd-based systems, you may need to use `journ
 * `/var/log/kube-scheduler.log` - Scheduler, responsible for making scheduling decisions
 * `/var/log/kube-controller-manager.log` - a component that runs most Kubernetes built-in {{<glossary_tooltip text="controllers" term_id="controller">}}, with the notable exception of scheduling (the kube-scheduler handles scheduling).
 -->
-### 控制平面节点
+### 控制平面节点 {#control-plane-nodes}
 
 * `/var/log/kube-apiserver.log` —— API 服务器,负责提供 API 服务
 * `/var/log/kube-scheduler.log` —— 调度器,负责制定调度决策
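The log paths added in this hunk only exist when the control-plane components write straight to files; the same passage says that on systemd-based hosts you may need `journalctl` instead. A small sketch of that check, using exactly the paths from the docs (which branch prints depends entirely on the host):

```shell
# Probe the control-plane log paths listed above. On systemd-based
# systems these files typically do not exist, and the logs live in the
# journal (or the container runtime's logs) instead.
report="$(
  for f in /var/log/kube-apiserver.log \
           /var/log/kube-scheduler.log \
           /var/log/kube-controller-manager.log; do
    if [ -f "$f" ]; then
      echo "present: $f"
    else
      echo "absent (check journalctl): $f"
    fi
  done
)"
printf '%s\n' "$report"
```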
@@ -276,17 +278,17 @@ of the relevant log files. On systemd-based systems, you may need to use `journ
 * `/var/log/kube-proxy.log` - logs from `kube-proxy`, which is responsible for directing traffic to Service endpoints
 -->
 
-### 工作节点
+### 工作节点 {#worker-nodes}
 
 * `/var/log/kubelet.log` —— 来自 `kubelet` 的日志,负责在节点运行容器
-* `/var/log/kube-proxy.log` —— 来自 `kube-proxy` 的日志, 负责将流量转发到服务端点
+* `/var/log/kube-proxy.log` —— 来自 `kube-proxy` 的日志,负责将流量转发到服务端点
 
 <!--
 ## Cluster failure modes
 
 This is an incomplete list of things that could go wrong, and how to adjust your cluster setup to mitigate the problems.
 -->
-## 集群故障模式
+## 集群故障模式 {#cluster-failure-modes}
 
 这是可能出错的事情的不完整列表,以及如何调整集群设置以缓解问题。

@@ -299,7 +301,7 @@ This is an incomplete list of things that could go wrong, and how to adjust your
 - Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
 - Operator error, for example misconfigured Kubernetes software or application software
 -->
-### 贡献原因
+### 造成原因 {#contributing-causes}
 
 - 虚拟机关闭
 - 集群内或集群与用户之间的网络分区
@@ -308,27 +310,27 @@ This is an incomplete list of things that could go wrong, and how to adjust your
 - 操作员错误,例如配置错误的 Kubernetes 软件或应用程序软件
 
 <!--
-### Specific scenarios:
+### Specific scenarios
 
-- Apiserver VM shutdown or apiserver crashing
+- API server VM shutdown or apiserver crashing
   - Results
     - unable to stop, update, or start new pods, services, replication controller
     - existing pods and services should continue to work normally, unless they depend on the Kubernetes API
-- Apiserver backing storage lost
+- API server backing storage lost
   - Results
-    - apiserver should fail to come up
+    - the kube-apiserver component fails to start successfully and become healthy
    - kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
    - manual recovery or recreation of apiserver state necessary before apiserver is restarted
 -->
-### 具体情况
+### 具体情况 {#specific-scenarios}
 
 - API 服务器所在的 VM 关机或者 API 服务器崩溃
   - 结果
     - 不能停止、更新或者启动新的 Pod、服务或副本控制器
     - 现有的 Pod 和服务在不依赖 Kubernetes API 的情况下应该能继续正常工作
 - API 服务器的后端存储丢失
   - 结果
-    - API 服务器应该不能启动
+    - kube-apiserver 组件未能成功启动并变健康
     - kubelet 将不能访问 API 服务器,但是能够继续运行之前的 Pod 和提供相同的服务代理
     - 在 API 服务器重启之前,需要手动恢复或者重建 API 服务器的状态
 <!--
@@ -353,7 +355,7 @@ This is an incomplete list of things that could go wrong, and how to adjust your
 - 网络分裂
   - 结果
     - 分区 A 认为分区 B 中所有的节点都已宕机;分区 B 认为 API 服务器宕机
-      (假定主控节点所在的 VM 位于分区 A 内)
+      (假定主控节点所在的 VM 位于分区 A 内)
 <!--
 - Kubelet software fault
   - Results
@@ -397,7 +399,7 @@ This is an incomplete list of things that could go wrong, and how to adjust your
 - Mitigates: API server backing storage (i.e., etcd's data directory) lost
 - Assumes HA (highly-available) etcd configuration
 -->
-### 缓解措施
+### 缓解措施 {#mitigations}
 
 - 措施:对于 IaaS 上的 VM,使用 IaaS 的自动 VM 重启功能
 - 缓解:API 服务器 VM 关机或 API 服务器崩溃
@@ -449,7 +451,7 @@ This is an incomplete list of things that could go wrong, and how to adjust your
 * Get more information about [Kubernetes auditing](audit)
 * Use `telepresence` to [develop and debug services locally](local-debugging)
 -->
-* 了解 [资源指标管道](resource-metrics-pipeline) 中可用的指标
+* 了解[资源指标管道](resource-metrics-pipeline)中可用的指标
 * 发现用于[监控资源使用](resource-usage-monitoring)的其他工具
 * 使用节点问题检测器[监控节点健康](monitor-node-health)
 * 使用 `crictl` 来[调试 Kubernetes 节点](crictl)
