Commit 929c83e

resync NPD
zhuzhenghao committed
1 parent 3906c45 commit 929c83e

1 file changed

+126
-101
lines changed

content/zh-cn/docs/tasks/debug/debug-cluster/monitor-node-health.md

Lines changed: 126 additions & 101 deletions
@@ -13,23 +13,23 @@ weight: 20
1313
-->
1414

1515
<!-- overview -->
16-
<!--
16+
<!--
1717
*Node Problem Detector* is a daemon for monitoring and reporting about a node's health.
1818
You can run Node Problem Detector as a `DaemonSet` or as a standalone daemon.
1919
Node Problem Detector collects information about node problems from various daemons
20-
and reports these conditions to the API server as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
21-
and [Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core).
20+
and reports these conditions to the API server as Node [Condition](/docs/concepts/architecture/nodes/#condition)s
21+
or as [Event](/docs/reference/kubernetes-api/cluster-resources/event-v1)s.
2222
2323
To learn how to install and use Node Problem Detector, see
2424
[Node Problem Detector project documentation](https://github.com/kubernetes/node-problem-detector).
2525
-->
2626

2727
*节点问题检测器(Node Problem Detector)* 是一个守护程序,用于监视和报告节点的健康状况。
2828
你可以将节点问题检测器以 `DaemonSet` 或独立守护程序运行。
29-
节点问题检测器从各种守护进程收集节点问题,并以
30-
[NodeCondition](/zh-cn/docs/concepts/architecture/nodes/#condition)
31-
[Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core)
32-
的形式报告给 API 服务器。
29+
节点问题检测器从各种守护进程收集节点问题,并以节点
30+
[Condition](/zh-cn/docs/concepts/architecture/nodes/#condition) 或
31+
[Event](/zh-cn/docs/reference/kubernetes-api/cluster-resources/event-v1)
32+
的形式报告给 API 服务器。
3333

3434
要了解如何安装和使用节点问题检测器,请参阅
3535
[节点问题检测器项目文档](https://github.com/kubernetes/node-problem-detector)。
@@ -40,46 +40,41 @@ To learn how to install and use Node Problem Detector, see
4040

4141
<!-- steps -->
4242

43-
<!--
44-
## Limitations
45-
46-
* Node Problem Detector only supports file based kernel log.
47-
Log tools such as `journald` are not supported.
43+
<!--
44+
## Limitations
4845
4946
* Node Problem Detector uses the kernel log format for reporting kernel issues.
5047
To learn how to extend the kernel log format, see [Add support for another log format](#support-other-log-format).
5148
-->
5249
## 局限性 {#limitations}
5350

54-
* 节点问题检测器只支持基于文件类型的内核日志。
55-
它不支持像 journald 这样的命令行日志工具。
5651
* 节点问题检测器使用内核日志格式来报告内核问题。
5752
要了解如何扩展内核日志格式,请参阅[添加对另一个日志格式的支持](#support-other-log-format)
5853

59-
<!--
54+
<!--
6055
## Enabling Node Problem Detector
6156
6257
Some cloud providers enable Node Problem Detector as an {{< glossary_tooltip text="Addon" term_id="addons" >}}.
63-
You can also enable Node Problem Detector with `kubectl` or by creating an Addon pod.
58+
You can also enable Node Problem Detector with `kubectl` or by creating an Addon DaemonSet.
6459
-->
6560
## 启用节点问题检测器
6661

6762
一些云供应商将节点问题检测器以{{< glossary_tooltip text="插件" term_id="addons" >}}形式启用。
68-
你还可以使用 `kubectl` 或创建插件 Pod 来启用节点问题探测器。
63+
你还可以使用 `kubectl` 或创建插件 DaemonSet 来启用节点问题检测器。
6964

70-
<!--
71-
## Using kubectl to enable Node Problem Detector {#using-kubectl}
65+
<!--
66+
### Using kubectl to enable Node Problem Detector {#using-kubectl}
7267
7368
`kubectl` provides the most flexible management of Node Problem Detector.
7469
You can overwrite the default configuration to fit it into your environment or
7570
to detect customized node problems. For example:
7671
-->
77-
## 使用 kubectl 启用节点问题检测器 {#using-kubectl}
72+
### 使用 kubectl 启用节点问题检测器 {#using-kubectl}
7873

7974
`kubectl` 提供了节点问题检测器最灵活的管理。
8075
你可以覆盖默认配置使其适合你的环境或检测自定义节点问题。例如:
8176

82-
<!--
77+
<!--
8378
1. Create a Node Problem Detector configuration similar to `node-problem-detector.yaml`:
8479
8580
{{< codenew file="debug/node-problem-detector.yaml" >}}
@@ -107,7 +102,7 @@ to detect customized node problems. For example:
107102
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
108103
```
109104

110-
<!--
105+
<!--
111106
### Using an Addon pod to enable Node Problem Detector {#using-addon-pod}
112107
113108
If you are using a custom cluster bootstrap solution and don't need
@@ -117,33 +112,33 @@ further automate the deployment.
117112
Create `node-problem-detector.yaml`, and save the configuration in the Addon pod's
118113
directory `/etc/kubernetes/addons/node-problem-detector` on a control plane node.
119114
-->
120-
### 使用插件 pod 启用节点问题检测器 {#using-addon-pod}
115+
### 使用插件 Pod 启用节点问题检测器 {#using-addon-pod}
121116

122117
如果你使用的是自定义集群引导解决方案,不需要覆盖默认配置,
123118
可以利用插件 Pod 进一步自动化部署。
124119

125120
创建 `node-problem-detector.yaml`,并在控制平面节点上保存配置到插件 Pod 的目录
126121
`/etc/kubernetes/addons/node-problem-detector`
127122

128-
<!--
129-
## Overwrite the Configuration
123+
<!--
124+
## Overwrite the configuration
130125
131-
The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
126+
The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.8.12/config)
132127
is embedded when building the Docker image of Node Problem Detector.
133128
-->
134129
## 覆盖配置文件
135130

136131
构建节点问题检测器的 docker 镜像时,会嵌入
137-
[默认配置](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
132+
[默认配置](https://github.com/kubernetes/node-problem-detector/tree/v0.8.12/config)
138133

139-
<!--
134+
<!--
140135
However, you can use a [`ConfigMap`](/docs/tasks/configure-pod-container/configure-pod-configmap/)
141136
to overwrite the configuration:
142137
-->
143138
不过,你可以像下面这样使用 [`ConfigMap`](/zh-cn/docs/tasks/configure-pod-container/configure-pod-configmap/)
144139
将其覆盖:
145140

146-
<!--
141+
<!--
147142
1. Change the configuration files in `config/`
148143
1. Create the `ConfigMap` `node-problem-detector-config`:
149144
@@ -165,24 +160,24 @@ to overwrite the configuration:
165160
-->
166161
1. 更改 `config/` 中的配置文件
167162
1. 创建 `ConfigMap` `node-problem-detector-config`:
168-
163+
169164
```shell
170165
kubectl create configmap node-problem-detector-config --from-file=config/
171166
```
172167

173168
1. 更改 `node-problem-detector.yaml` 以使用 ConfigMap:
174-
169+
175170
{{< codenew file="debug/node-problem-detector-configmap.yaml" >}}
176171

177172
1. 使用新的配置文件重新创建节点问题检测器:
178173

179-
```shell
174+
```shell
180175
# 如果你正在运行节点问题检测器,请先删除,然后再重新创建
181176
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
182177
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
183178
```
184179

185-
<!--
180+
<!--
186181
{{< note >}}
187182
This approach only applies to a Node Problem Detector started with `kubectl`.
188183
{{< /note >}}
@@ -197,99 +192,129 @@ The Addon manager does not support `ConfigMap`.
197192
如果节点问题检测器作为集群插件运行,则不支持覆盖配置。
198193
插件管理器不支持 `ConfigMap`
199194

200-
<!--
201-
## Kernel Monitor
195+
<!--
196+
## Problem Daemons
202197
203-
*Kernel Monitor* is a system log monitor daemon supported in the Node Problem Detector.
204-
Kernel monitor watches the kernel log and detects known kernel issues following predefined rules.
198+
A problem daemon is a sub-daemon of the Node Problem Detector. It monitors specific kinds of node
199+
problems and reports them to the Node Problem Detector.
200+
There are several types of supported problem daemons.
205201
-->
206-
## 内核监视器
207202

208-
*内核监视器(Kernel Monitor)* 是节点问题检测器中支持的系统日志监视器守护进程。
209-
内核监视器观察内核日志并根据预定义规则检测已知的内核问题。
203+
## 问题守护程序
210204

211-
<!--
212-
The Kernel Monitor matches kernel issues according to a set of predefined rule list in
213-
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json). The rule list is extensible. You can expand the rule list by overwriting the
214-
configuration.
205+
问题守护程序是节点问题检测器的子守护程序。
206+
它监视特定类型的节点问题并报告给节点问题检测器。
207+
支持下面几种类型的问题守护程序。
208+
209+
<!--
210+
- A `SystemLogMonitor` type of daemon monitors the system logs and reports problems and metrics
211+
according to predefined rules. You can customize the configurations for different log sources
212+
such as [filelog](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-filelog.json),
213+
[kmsg](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor.json),
214+
[kernel](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-counter.json),
215+
[abrt](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/abrt-adaptor.json),
216+
and [systemd](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/systemd-monitor-counter.json).
217+
-->
218+
- `SystemLogMonitor` 类型的守护程序根据预定义的规则监视系统日志并报告问题和指标。
219+
你可以针对不同的日志源自定义配置如
220+
[filelog](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-filelog.json)
221+
[kmsg](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor.json)
222+
[kernel](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-counter.json)
223+
[abrt](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/abrt-adaptor.json)
224+
[systemd](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/systemd-monitor-counter.json)
225+
226+
<!--
227+
- A `SystemStatsMonitor` type of daemon collects various health-related system stats as metrics.
228+
You can customize its behavior by updating its
229+
[configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/system-stats-monitor.json).
215230
-->
216-
内核监视器根据 [`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json)
217-
中的一组预定义规则列表匹配内核问题。
218-
规则列表是可扩展的,你始终可以通过覆盖配置来扩展它。
219231

220-
<!--
221-
### Add new NodeConditions
232+
- `SystemStatsMonitor` 类型的守护程序收集各种与健康相关的系统统计数据作为指标。
233+
你可以通过更新其[配置文件](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/system-stats-monitor.json)来自定义其行为。
222234

223-
To support a new `NodeCondition`, create a condition definition within the `conditions` field in
224-
`config/kernel-monitor.json`, for example:
225-
```
235+
<!--
236+
- A `CustomPluginMonitor` type of daemon invokes and checks various node problems by running
237+
user-defined scripts. You can use different custom plugin monitors to monitor different
238+
problems and customize the daemon behavior by updating the
239+
[configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/custom-plugin-monitor.json).
226240
-->
227-
### 添加新的 NodeCondition
228241

229-
要支持新的 `NodeCondition`,请在 `config/kernel-monitor.json` 中的
230-
`conditions` 字段中创建一个条件定义:
242+
- `CustomPluginMonitor` 类型的守护程序通过运行用户定义的脚本来调用和检查各种节点问题。
243+
你可以使用不同的自定义插件监视器来监视不同的问题,并通过更新
244+
[配置文件](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/custom-plugin-monitor.json)
245+
来定制守护程序行为。
231246

232-
```json
233-
{
234-
"type": "NodeConditionType",
235-
"reason": "CamelCaseDefaultNodeConditionReason",
236-
"message": "arbitrary default node condition message"
237-
}
238-
```
247+
<!--
248+
- A `HealthChecker` type of daemon checks the health of the kubelet and container runtime on a node.
249+
-->
250+
- `HealthChecker` 类型的守护程序检查节点上的 kubelet 和容器运行时的健康状况。
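
The rule-based matching that a `SystemLogMonitor` performs can be pictured with a small Python sketch. Everything here is illustrative: the `Rule` class, its field names, and the two sample patterns are assumptions modelled loosely on the JSON rule files linked in the list above, not the Node Problem Detector's own implementation (which is written in Go).

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    """One log-matching rule; field names are hypothetical stand-ins for the JSON keys."""
    rule_type: str                   # "temporary" or "permanent"
    reason: str                      # short CamelCase reason
    pattern: str                     # regular expression applied to each log line
    condition: Optional[str] = None  # only meaningful for permanent rules


# Sample rules for illustration only; real rule sets live in the NPD config files.
RULES = [
    Rule(rule_type="temporary", reason="OOMKilling",
         pattern=r"Killed process \d+ \((.+)\)"),
    Rule(rule_type="permanent", reason="FilesystemIsReadOnly",
         condition="ReadonlyFilesystem",
         pattern=r"Remounting filesystem read-only"),
]


def match_line(line: str, rules: List[Rule]) -> List[Rule]:
    """Return every rule whose pattern is found somewhere in the log line."""
    return [rule for rule in rules if re.search(rule.pattern, line)]
```

Calling `match_line` on each new log line yields the rules that fired; an empty list means the line matched nothing and can be ignored.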
239251

240-
<!--
241-
### Detect new problems
252+
<!--
253+
### Adding support for other log format {#support-other-log-format}
242254
243-
To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
244-
with a new rule definition:
255+
The system log monitor currently supports file-based logs, journald, and kmsg.
256+
Additional sources can be added by implementing a new
257+
[log watcher](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/pkg/systemlogmonitor/logwatchers/types/log_watcher.go).
245258
-->
246-
### 检测新的问题
247259

248-
你可以使用新的规则描述来扩展 `config/kernel-monitor.json` 中的 `rules` 字段以检测新问题:
260+
### 增加对其他日志格式的支持 {#support-other-log-format}
249261

250-
```json
251-
{
252-
"type": "temporary/permanent",
253-
"condition": "NodeConditionOfPermanentIssue",
254-
"reason": "CamelCaseShortReason",
255-
"message": "regexp matching the issue in the kernel log"
256-
}
257-
```
262+
系统日志监视器目前支持基于文件的日志、journald 和 kmsg。
263+
可以通过实现一个新的
264+
[log watcher](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/pkg/systemlogmonitor/logwatchers/types/log_watcher.go)
265+
来添加额外的日志源。
258266

259-
<!--
260-
### Configure path for the kernel log device {#kernel-log-device-path}
267+
<!--
268+
### Adding custom plugin monitors
261269
262-
Check your kernel log path location in your operating system (OS) distribution.
263-
The Linux kernel [log device](https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg) is usually presented as `/dev/kmsg`. However, the log path location varies by OS distribution.
264-
The `log` field in `config/kernel-monitor.json` represents the log path inside the container.
265-
You can configure the `log` field to match the device path as seen by the Node Problem Detector.
270+
You can extend the Node Problem Detector to execute any monitor scripts written in any language by
271+
developing a custom plugin. The monitor scripts must conform to the plugin protocol in exit code
272+
and standard output. For more information, please refer to the
273+
[plugin interface proposal](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#).
266274
-->
267-
### 配置内核日志设备的路径 {#kernel-log-device-path}
268275

269-
检查你的操作系统(OS)发行版本中的内核日志路径位置。
270-
Linux 内核[日志设备](https://www.kernel.org/doc/documentation/abi/testing/dev-kmsg)
271-
通常呈现为 `/dev/kmsg`
272-
但是,日志路径位置因 OS 发行版本而异。
273-
`config/kernel-monitor.json` 中的 `log` 字段表示容器内的日志路径。
274-
你可以配置 `log` 字段以匹配节点问题检测器所示的设备路径。
276+
### 添加自定义插件监视器
275277

276-
<!--
277-
### Add support for another log format {#support-other-log-format}
278+
你可以通过开发自定义插件来扩展节点问题检测器,以执行以任何语言编写的任何监控脚本。
279+
监控脚本必须符合退出码和标准输出的插件协议。
280+
有关更多信息,请参阅
281+
[插件接口提案](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#)。
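
As a rough illustration of a monitor script that communicates through its exit code and standard output, the Python sketch below checks disk usage. It assumes the convention that exit code 0 means OK, 1 means a problem was detected, and 2 means the state is unknown; check the plugin interface proposal before relying on it. The 90% threshold and the function names `evaluate` and `main` are hypothetical.

```python
import shutil
import sys

# Exit codes assumed from the plugin convention: 0 = OK, 1 = problem, 2 = unknown.
OK, NON_OK, UNKNOWN = 0, 1, 2


def evaluate(used_fraction: float, threshold: float = 0.9):
    """Map a disk usage fraction to an (exit_code, one-line message) pair."""
    if not 0.0 <= used_fraction <= 1.0:
        return UNKNOWN, "disk usage could not be determined"
    if used_fraction >= threshold:
        return NON_OK, f"disk usage {used_fraction:.0%} is at or above {threshold:.0%}"
    return OK, f"disk usage {used_fraction:.0%} is below {threshold:.0%}"


def main(path: str = "/") -> int:
    """Check one filesystem, print the message to stdout, and return the exit code."""
    try:
        usage = shutil.disk_usage(path)
        code, message = evaluate(usage.used / usage.total)
    except (OSError, ZeroDivisionError):
        code, message = UNKNOWN, f"cannot read disk usage for {path}"
    print(message)
    return code


# When saved as an executable script, end the file with: sys.exit(main())
```

Keeping the decision logic in `evaluate` separate from the I/O in `main` makes the thresholds easy to test without touching a real filesystem.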
282+
283+
<!--
284+
## Exporter
285+
286+
An exporter reports the node problems and/or metrics to certain backends.
287+
The following exporters are supported:
278288
279-
Kernel monitor uses the
280-
[`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go) plugin to translate the internal data structure of the kernel log.
281-
You can implement a new translator for a new log format.
289+
- **Kubernetes exporter**: this exporter reports node problems to the Kubernetes API server.
290+
Temporary problems are reported as Events and permanent problems are reported as Node Conditions.
291+
292+
- **Prometheus exporter**: this exporter reports node problems and metrics locally as Prometheus
293+
(or OpenMetrics) metrics. You can specify the IP address and port for the exporter using command
294+
line arguments.
295+
296+
- **Stackdriver exporter**: this exporter reports node problems and metrics to the Stackdriver
297+
Monitoring API. The exporting behavior can be customized using a
298+
[configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/exporter/stackdriver-exporter.json).
282299
-->
283-
### 添加对其它日志格式的支持 {#support-other-log-format}
284300

285-
内核监视器使用
286-
[`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator.go)
287-
插件转换内核日志的内部数据结构。
288-
你可以为新的日志格式实现新的转换器。
301+
## 导出器
302+
303+
导出器(Exporter)向特定后端报告节点问题和/或指标。
304+
支持下列导出器:
305+
306+
- **Kubernetes exporter**: 此导出器向 Kubernetes API 服务器报告节点问题。
307+
临时问题报告为事件,永久性问题报告为节点状况。
308+
309+
- **Prometheus exporter**: 此导出器在本地将节点问题和指标报告为 Prometheus(或 OpenMetrics)指标。
310+
你可以使用命令行参数指定导出器的 IP 地址和端口。
311+
312+
- **Stackdriver exporter**: 此导出器向 Stackdriver Monitoring API 报告节点问题和指标。
313+
可以使用[配置文件](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/exporter/stackdriver-exporter.json)自定义导出行为。
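
The split described for the Kubernetes exporter (temporary problems become Events, permanent problems become Node Conditions) can be sketched in a few lines of Python. This is a toy model under stated assumptions: `NodeReport` and `route` are hypothetical names, and no Kubernetes API call is made.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeReport:
    """What a Kubernetes-style exporter would send for one node (illustrative only)."""
    events: List[str] = field(default_factory=list)            # one entry per temporary problem
    conditions: Dict[str, str] = field(default_factory=dict)   # condition type -> latest reason


def route(report: NodeReport, problem_type: str, reason: str, condition: str = "") -> None:
    """File a problem as an Event or a Condition depending on its type."""
    if problem_type == "temporary":
        report.events.append(reason)
    elif problem_type == "permanent":
        if not condition:
            raise ValueError("a permanent problem needs a condition type")
        report.conditions[condition] = reason
    else:
        raise ValueError(f"unknown problem type: {problem_type!r}")
```

Events accumulate as a list because each occurrence is a separate record, while a Condition is keyed by its type so that a later report overwrites the earlier state.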
289314

290315
<!-- discussion -->
291316

292-
<!--
317+
<!--
293318
## Recommendations and restrictions
294319
295320
It is recommended to run the Node Problem Detector in your cluster to monitor node health.
@@ -299,7 +324,7 @@ Usually this is fine, because:
299324
* The kernel log grows relatively slowly.
300325
* A resource limit is set for the Node Problem Detector.
301326
* Even under high load, the resource usage is acceptable. For more information, see the Node Problem Detector
302-
[benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).
327+
[benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).
303328
-->
304329
## 建议和限制
305330
