@@ -64,7 +64,7 @@ Sometimes when debugging it can be useful to look at the status of a node -- for
-->
### 示例:调试关闭/无法访问的节点 {#example-debugging-a-down-unreachable-node}

- 有时在调试时查看节点的状态很有用—— 例如,因为你注意到在节点上运行的 Pod 的奇怪行为,
+ 有时在调试时查看节点的状态很有用 —— 例如,因为你注意到在节点上运行的 Pod 的奇怪行为,
或者找出为什么 Pod 不会调度到节点上。与 Pod 一样,你可以使用 `kubectl describe node`
和 `kubectl get node -o yaml` 来检索有关节点的详细信息。
例如,如果节点关闭(与网络断开连接,或者 kubelet 进程挂起并且不会重新启动等),
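As a quick sketch of the inspection described above: the YAML below is a hand-written sample of the `.status.conditions` stanza that `kubectl get node <node-name> -o yaml` typically returns for an unreachable node, not output captured from a real cluster.

```shell
# Hand-written sample of a down node's .status.conditions stanza;
# on a real cluster you would inspect the live output of
# `kubectl get node <node-name> -o yaml` instead.
cat > /tmp/node-status.yaml <<'EOF'
status:
  conditions:
  - type: Ready
    status: "Unknown"
    reason: NodeStatusUnknown
    message: Kubelet stopped posting node status.
EOF

# An unreachable node typically reports Ready=Unknown with
# "Kubelet stopped posting node status." as the message.
grep -A 3 'type: Ready' /tmp/node-status.yaml
```

`kubectl describe node <node-name>` shows the same conditions in a human-readable table, together with capacity and recent events.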
@@ -260,28 +260,30 @@ of the relevant log files. On systemd-based systems, you may need to use `journ
<!--
### Control Plane nodes

- * `/var/log/kube-apiserver.log` - API Server, responsible for serving the API
- * `/var/log/kube-scheduler.log` - Scheduler, responsible for making scheduling decisions
- * `/var/log/kube-controller-manager.log` - a component that runs most Kubernetes built-in {{<glossary_tooltip text="controllers" term_id="controller">}}, with the notable exception of scheduling (the kube-scheduler handles scheduling).
+ * `/var/log/kube-apiserver.log` - API Server, responsible for serving the API
+ * `/var/log/kube-scheduler.log` - Scheduler, responsible for making scheduling decisions
+ * `/var/log/kube-controller-manager.log` - a component that runs most Kubernetes built-in
+   {{<glossary_tooltip text="controllers" term_id="controller">}}, with the notable exception of scheduling
+   (the kube-scheduler handles scheduling).

-->
### 控制平面节点 {#control-plane-nodes}

- * `/var/log/kube-apiserver.log` —— API 服务器 API
- * `/var/log/kube-scheduler.log` —— 调度器,负责制定调度决策
- * `/var/log/kube-controller-manager.log` —— 运行大多数 Kubernetes
-   内置{{<glossary_tooltip text="控制器" term_id="controller">}}的组件,除了调度(kube-scheduler 处理调度)。
+ * `/var/log/kube-apiserver.log` —— API 服务器,负责提供 API 服务
+ * `/var/log/kube-scheduler.log` —— 调度器,负责制定调度决策
+ * `/var/log/kube-controller-manager.log` —— 运行大多数 Kubernetes
+   内置{{<glossary_tooltip text="控制器" term_id="controller">}}的组件,除了调度(kube-scheduler 处理调度)。

<!--
### Worker Nodes

- * `/var/log/kubelet.log` - logs from the kubelet, responsible for running containers on the node
- * `/var/log/kube-proxy.log` - logs from `kube-proxy`, which is responsible for directing traffic to Service endpoints
+ * `/var/log/kubelet.log` - logs from the kubelet, responsible for running containers on the node
+ * `/var/log/kube-proxy.log` - logs from `kube-proxy`, which is responsible for directing traffic to Service endpoints

-->

### 工作节点 {#worker-nodes}

- * `/var/log/kubelet.log` —— 来自 `kubelet` 的日志,负责在节点运行容器
- * `/var/log/kube-proxy.log` —— 来自 `kube-proxy` 的日志,负责将流量转发到服务端点
+ * `/var/log/kubelet.log` —— 来自 `kubelet` 的日志,负责在节点运行容器
+ * `/var/log/kube-proxy.log` —— 来自 `kube-proxy` 的日志,负责将流量转发到服务端点
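Whether these flat log files exist at all depends on how the cluster was set up; as the hunk header above notes, systemd-based systems keep component logs in the journal instead (e.g. `journalctl -u kubelet`). A small sketch, using only the paths listed above, to see which form a given machine uses:

```shell
# Check which of the legacy log files listed above exist on this machine.
# On systemd-based systems they are usually absent, and the journal
# (journalctl -u <unit>) is the place to look instead.
for f in /var/log/kube-apiserver.log \
         /var/log/kube-scheduler.log \
         /var/log/kube-controller-manager.log \
         /var/log/kubelet.log \
         /var/log/kube-proxy.log; do
  if [ -e "$f" ]; then
    echo "present: $f"
  else
    echo "absent:  $f"
  fi
done
```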
<!--
## Cluster failure modes
@@ -295,32 +297,32 @@ This is an incomplete list of things that could go wrong, and how to adjust your
<!--
### Contributing causes

- - VM(s) shutdown
- - Network partition within cluster, or between cluster and users
- - Crashes in Kubernetes software
- - Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
- - Operator error, for example misconfigured Kubernetes software or application software
+ - VM(s) shutdown
+ - Network partition within cluster, or between cluster and users
+ - Crashes in Kubernetes software
+ - Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
+ - Operator error, for example misconfigured Kubernetes software or application software

-->
- ### 造成原因 {#contributing-causes}
+ ### 故障原因 {#contributing-causes}

- - 虚拟机关闭
- - 集群内或集群与用户之间的网络分区
- - Kubernetes 软件崩溃
- - 持久存储(例如 GCE PD 或 AWS EBS 卷)的数据丢失或不可用
- - 操作员错误,例如配置错误的 Kubernetes 软件或应用程序软件
+ - 虚拟机关闭
+ - 集群内或集群与用户之间的网络分区
+ - Kubernetes 软件崩溃
+ - 持久存储(例如 GCE PD 或 AWS EBS 卷)的数据丢失或不可用
+ - 操作员错误,例如配置错误的 Kubernetes 软件或应用程序软件

<!--
### Specific scenarios

- - API server VM shutdown or apiserver crashing
-   - Results
-     - unable to stop, update, or start new pods, services, replication controller
-     - existing pods and services should continue to work normally, unless they depend on the Kubernetes API
- - API server backing storage lost
-   - Results
-     - the kube-apiserver component fails to start successfully and become healthy
-     - kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
-     - manual recovery or recreation of apiserver state necessary before apiserver is restarted
+ - API server VM shutdown or apiserver crashing
+   - Results
+     - unable to stop, update, or start new pods, services, replication controller
+     - existing pods and services should continue to work normally, unless they depend on the Kubernetes API
+ - API server backing storage lost
+   - Results
+     - the kube-apiserver component fails to start successfully and become healthy
+     - kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
+     - manual recovery or recreation of apiserver state necessary before apiserver is restarted

-->
### 具体情况 {#specific-scenarios}
@@ -334,16 +336,17 @@ This is an incomplete list of things that could go wrong, and how to adjust your
    - kubelet 将不能访问 API 服务器,但是能够继续运行之前的 Pod 和提供相同的服务代理
    - 在 API 服务器重启之前,需要手动恢复或者重建 API 服务器的状态
<!--
- - Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
-   - currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
-   - in future, these will be replicated as well and may not be co-located
-   - they do not have their own persistent state
- - Individual node (VM or physical machine) shuts down
-   - Results
-     - pods on that Node stop running
- - Network partition
-   - Results
-     - partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
+ - Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
+   - currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
+   - in future, these will be replicated as well and may not be co-located
+   - they do not have their own persistent state
+ - Individual node (VM or physical machine) shuts down
+   - Results
+     - pods on that Node stop running
+ - Network partition
+   - Results
+     - partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down.
+       (Assuming the master VM ends up in partition A.)

-->
- Kubernetes 服务组件(节点控制器、副本控制器管理器、调度器等)所在的 VM 关机或者崩溃
  - 当前,这些控制器是和 API 服务器在一起运行的,它们不可用的现象是与 API 服务器类似的
@@ -357,18 +360,18 @@ This is an incomplete list of things that could go wrong, and how to adjust your
    - 分区 A 认为分区 B 中所有的节点都已宕机;分区 B 认为 API 服务器宕机
      (假定主控节点所在的 VM 位于分区 A 内)。
<!--
- - Kubelet software fault
-   - Results
-     - crashing kubelet cannot start new pods on the node
-     - kubelet might delete the pods or not
-     - node marked unhealthy
-     - replication controllers start new pods elsewhere
- - Cluster operator error
-   - Results
-     - loss of pods, services, etc
-     - lost of apiserver backing store
-     - users unable to read API
-     - etc.
+ - Kubelet software fault
+   - Results
+     - crashing kubelet cannot start new pods on the node
+     - kubelet might delete the pods or not
+     - node marked unhealthy
+     - replication controllers start new pods elsewhere
+ - Cluster operator error
+   - Results
+     - loss of pods, services, etc
+     - loss of apiserver backing store
+     - users unable to read API
+     - etc.

-->
- kubelet 软件故障
  - 结果
@@ -380,11 +383,11 @@ This is an incomplete list of things that could go wrong, and how to adjust your
  - 结果
    - 丢失 Pod 或服务等等
    - 丢失 API 服务器的后端存储
-    - 用户无法读取API
+    - 用户无法读取 API
    - 等等
<!--
- ### Mitigations:
+ ### Mitigations

- Action: Use IaaS provider's automatic VM restarting feature for IaaS VMs
  - Mitigates: Apiserver VM shutdown or apiserver crashing
@@ -409,7 +412,7 @@ This is an incomplete list of things that could go wrong, and how to adjust your
  - 缓解:API 服务器后端存储的丢失

- 措施:使用[高可用性](/zh-cn/docs/setup/production-environment/tools/kubeadm/high-availability/)的配置
-  - 缓解:主控节点 VM 关机或者主控节点组件(调度器、API 服务器、控制器管理器)崩馈
+  - 缓解:主控节点 VM 关机或者主控节点组件(调度器、API 服务器、控制器管理器)崩溃
  - 将容许一个或多个节点或组件同时出现故障
  - 缓解:API 服务器后端存储(例如 etcd 的数据目录)丢失
  - 假定你使用了高可用的 etcd 配置
@@ -428,7 +431,7 @@ This is an incomplete list of things that could go wrong, and how to adjust your
  - Mitigates: Node shutdown
  - Mitigates: Kubelet software fault
-->
- - 措施:定期对 API 服务器的 PDs/ EBS 卷执行快照操作
+ - 措施:定期对 API 服务器的 PD 或 EBS 卷执行快照操作
  - 缓解:API 服务器后端存储丢失
  - 缓解:一些操作错误的场景
  - 缓解:一些 Kubernetes 软件本身故障的场景
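A minimal sketch of the snapshot mitigation for etcd-backed API server storage. The endpoint and backup path below are placeholders, and the `etcdctl snapshot save` command is only printed rather than executed, since it needs a reachable etcd member; on a real control-plane node you would run it, typically from cron or a systemd timer, with the TLS flags your etcd requires.

```shell
# Placeholder endpoint and backup location; adjust for your cluster,
# and add the --cacert/--cert/--key flags your etcd requires.
ENDPOINT="https://127.0.0.1:2379"
BACKUP="/var/backups/etcd-$(date +%F).db"

# Only echoed here, because running it needs a live etcd member:
echo "would run: etcdctl --endpoints=$ENDPOINT snapshot save $BACKUP"
```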
@@ -444,16 +447,19 @@ This is an incomplete list of things that could go wrong, and how to adjust your
## {{% heading "whatsnext" %}}

<!--
- * Learn about the metrics available in the [Resource Metrics Pipeline](resource-metrics-pipeline)
- * Discover additional tools for [monitoring resource usage](resource-usage-monitoring)
- * Use Node Problem Detector to [monitor node health](monitor-node-health)
- * Use `crictl` to [debug Kubernetes nodes](crictl)
- * Get more information about [Kubernetes auditing](audit)
- * Use `telepresence` to [develop and debug services locally](local-debugging)
+ * Learn about the metrics available in the
+   [Resource Metrics Pipeline](/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/)
+ * Discover additional tools for
+   [monitoring resource usage](/docs/tasks/debug/debug-cluster/resource-usage-monitoring/)
+ * Use Node Problem Detector to
+   [monitor node health](/docs/tasks/debug/debug-cluster/monitor-node-health/)
+ * Use `crictl` to [debug Kubernetes nodes](/docs/tasks/debug/debug-cluster/crictl/)
+ * Get more information about [Kubernetes auditing](/docs/tasks/debug/debug-cluster/audit/)
+ * Use `telepresence` to [develop and debug services locally](/docs/tasks/debug/debug-cluster/local-debugging/)

-->
- * 了解[资源指标管道](resource-metrics-pipeline)中可用的指标
- * 发现用于[监控资源使用](resource-usage-monitoring)的其他工具
- * 使用节点问题检测器[监控节点健康](monitor-node-health)
- * 使用 `crictl` 来[调试 Kubernetes 节点](crictl)
- * 获取更多关于 [Kubernetes 审计](audit)的信息
- * 使用 `telepresence` [本地开发和调试服务](local-debugging)
+ * 了解[资源指标管道](/zh-cn/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/)中可用的指标
+ * 发现用于[监控资源使用](/zh-cn/docs/tasks/debug/debug-cluster/resource-usage-monitoring/)的其他工具
+ * 使用节点问题检测器[监控节点健康](/zh-cn/docs/tasks/debug/debug-cluster/monitor-node-health/)
+ * 使用 `crictl` 来[调试 Kubernetes 节点](/zh-cn/docs/tasks/debug/debug-cluster/crictl/)
+ * 获取更多关于 [Kubernetes 审计](/zh-cn/docs/tasks/debug/debug-cluster/audit/)的信息
+ * 使用 `telepresence` [本地开发和调试服务](/zh-cn/docs/tasks/debug/debug-cluster/local-debugging/)