@@ -13,23 +13,23 @@ weight: 20
-->

<!-- overview -->
- <!--
+ <!--
*Node Problem Detector* is a daemon for monitoring and reporting about a node's health.
You can run Node Problem Detector as a `DaemonSet` or as a standalone daemon.
Node Problem Detector collects information about node problems from various daemons
- and reports these conditions to the API server as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
- and [Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core).
+ and reports these conditions to the API server as Node [Condition](/docs/concepts/architecture/nodes/#condition)s
+ or as [Event](/docs/reference/kubernetes-api/cluster-resources/event-v1)s.

To learn how to install and use Node Problem Detector, see
[Node Problem Detector project documentation](https://github.com/kubernetes/node-problem-detector).
-->

*节点问题检测器(Node Problem Detector)* 是一个守护程序,用于监视和报告节点的健康状况。
你可以将节点问题探测器以 `DaemonSet` 或独立守护程序运行。
- 节点问题检测器从各种守护进程收集节点问题,并以
- [NodeCondition](/zh-cn/docs/concepts/architecture/nodes/#condition) 和
- [Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core)
- 的形式报告给 API 服务器。
+ 节点问题检测器从各种守护进程收集节点问题,并以节点
+ [Condition](/zh-cn/docs/concepts/architecture/nodes/#condition) 和
+ [Event](/zh-cn/docs/reference/kubernetes-api/cluster-resources/event-v1)
+ 的形式报告给 API 服务器。

要了解如何安装和使用节点问题检测器,请参阅
[节点问题探测器项目文档](https://github.com/kubernetes/node-problem-detector)。
@@ -40,46 +40,41 @@ To learn how to install and use Node Problem Detector, see
<!-- steps -->

- <!--
- ## Limitations
-
- * Node Problem Detector only supports file based kernel log.
- Log tools such as `journald` are not supported.
+ <!--
+ ## Limitations

* Node Problem Detector uses the kernel log format for reporting kernel issues.
To learn how to extend the kernel log format, see [Add support for another log format](#support-other-log-format).
-->
## 局限性 {#limitations}

- * 节点问题检测器只支持基于文件类型的内核日志。
- 它不支持像 journald 这样的命令行日志工具。
* 节点问题检测器使用内核日志格式来报告内核问题。
要了解如何扩展内核日志格式,请参阅[添加对另一个日志格式的支持](#support-other-log-format)。

- <!--
+ <!--
## Enabling Node Problem Detector

Some cloud providers enable Node Problem Detector as an {{< glossary_tooltip text="Addon" term_id="addons" >}}.
- You can also enable Node Problem Detector with `kubectl` or by creating an Addon pod.
+ You can also enable Node Problem Detector with `kubectl` or by creating an Addon DaemonSet.
-->
## 启用节点问题检测器

一些云供应商将节点问题检测器以{{< glossary_tooltip text="插件" term_id="addons" >}}形式启用。
- 你还可以使用 `kubectl` 或创建插件 Pod 来启用节点问题探测器。
+ 你还可以使用 `kubectl` 或创建插件 DaemonSet 来启用节点问题探测器。

- <!--
- ## Using kubectl to enable Node Problem Detector {#using-kubectl}
+ <!--
+ ### Using kubectl to enable Node Problem Detector {#using-kubectl}

`kubectl` provides the most flexible management of Node Problem Detector.
You can overwrite the default configuration to fit it into your environment or
to detect customized node problems. For example:
-->
- ## 使用 kubectl 启用节点问题检测器 {#using-kubectl}
+ ### 使用 kubectl 启用节点问题检测器 {#using-kubectl}

`kubectl` 提供了节点问题探测器最灵活的管理。
你可以覆盖默认配置使其适合你的环境或检测自定义节点问题。例如:

- <!--
+ <!--
1. Create a Node Problem Detector configuration similar to `node-problem-detector.yaml`:

{{< codenew file="debug/node-problem-detector.yaml" >}}
@@ -107,7 +102,7 @@ to detect customized node problems. For example:
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
```

- <!--
+ <!--
### Using an Addon pod to enable Node Problem Detector {#using-addon-pod}

If you are using a custom cluster bootstrap solution and don't need
@@ -117,33 +112,33 @@ further automate the deployment.
Create `node-problem-detector.yaml`, and save the configuration in the Addon pod's
directory `/etc/kubernetes/addons/node-problem-detector` on a control plane node.
-->
- ### 使用插件 pod 启用节点问题检测器 {#using-addon-pod}
+ ### 使用插件 Pod 启用节点问题检测器 {#using-addon-pod}

如果你使用的是自定义集群引导解决方案,不需要覆盖默认配置,
可以利用插件 Pod 进一步自动化部署。

创建 `node-problem-detector.yaml`,并在控制平面节点上保存配置到插件 Pod 的目录
`/etc/kubernetes/addons/node-problem-detector`。

- <!--
- ## Overwrite the Configuration
+ <!--
+ ## Overwrite the configuration

- The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
+ The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.8.12/config)
is embedded when building the Docker image of Node Problem Detector.
-->
## 覆盖配置文件

构建节点问题检测器的 docker 镜像时,会嵌入
- [默认配置](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)。
+ [默认配置](https://github.com/kubernetes/node-problem-detector/tree/v0.8.12/config)。

- <!--
+ <!--
However, you can use a [`ConfigMap`](/docs/tasks/configure-pod-container/configure-pod-configmap/)
to overwrite the configuration:
-->
不过,你可以像下面这样使用 [`ConfigMap`](/zh-cn/docs/tasks/configure-pod-container/configure-pod-configmap/)
将其覆盖:

- <!--
+ <!--
1. Change the configuration files in `config/`
1. Create the `ConfigMap` `node-problem-detector-config`:
@@ -165,24 +160,24 @@ to overwrite the configuration:
-->
1. 更改 `config/` 中的配置文件
1. 创建 `ConfigMap` `node-problem-detector-config`:
-
+
```shell
kubectl create configmap node-problem-detector-config --from-file=config/
```

1. 更改 `node-problem-detector.yaml` 以使用 ConfigMap:
-
+
{{< codenew file="debug/node-problem-detector-configmap.yaml" >}}

1. 使用新的配置文件重新创建节点问题检测器:

- ```shell
+ ```shell
# 如果你正在运行节点问题检测器,请先删除,然后再重新创建
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
```

- <!--
+ <!--
{{< note >}}
This approach only applies to a Node Problem Detector started with `kubectl`.
{{< /note >}}
@@ -197,99 +192,129 @@ The Addon manager does not support `ConfigMap`.
如果节点问题检测器作为集群插件运行,则不支持覆盖配置。
插件管理器不支持 `ConfigMap`。

- <!--
- ## Kernel Monitor
+ <!--
+ ## Problem Daemons

- *Kernel Monitor* is a system log monitor daemon supported in the Node Problem Detector.
- Kernel monitor watches the kernel log and detects known kernel issues following predefined rules.
+ A problem daemon is a sub-daemon of the Node Problem Detector. It monitors specific kinds of node
+ problems and reports them to the Node Problem Detector.
+ There are several types of supported problem daemons.
-->
- ## 内核监视器

- *内核监视器(Kernel Monitor)* 是节点问题检测器中支持的系统日志监视器守护进程。
- 内核监视器观察内核日志并根据预定义规则检测已知的内核问题。
+ ## 问题守护程序

- <!--
- The Kernel Monitor matches kernel issues according to a set of predefined rule list in
- [`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json). The rule list is extensible. You can expand the rule list by overwriting the
- configuration.
+ 问题守护程序是节点问题检测器的子守护程序。
+ 它监视特定类型的节点问题并报告给节点问题检测器。
+ 支持下面几种类型的问题守护程序。
+
+ <!--
+ - A `SystemLogMonitor` type of daemon monitors the system logs and reports problems and metrics
+ according to predefined rules. You can customize the configurations for different log sources
+ such as [filelog](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-filelog.json),
+ [kmsg](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor.json),
+ [kernel](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-counter.json),
+ [abrt](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/abrt-adaptor.json),
+ and [systemd](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/systemd-monitor-counter.json).
+ -->
+ - `SystemLogMonitor` 类型的守护程序根据预定义的规则监视系统日志并报告问题和指标。
+ 你可以针对不同的日志源自定义配置如
+ [filelog](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-filelog.json)、
+ [kmsg](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor.json)、
+ [kernel](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-counter.json)、
+ [abrt](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/abrt-adaptor.json)
+ 和 [systemd](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/systemd-monitor-counter.json)。
+
+ <!--
+ - A `SystemStatsMonitor` type of daemon collects various health-related system stats as metrics.
+ You can customize its behavior by updating its
+ [configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/system-stats-monitor.json).
-->
- 内核监视器根据 [`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json)
- 中的一组预定义规则列表匹配内核问题。
- 规则列表是可扩展的,你始终可以通过覆盖配置来扩展它。

- <!--
- ### Add new NodeConditions
+ - `SystemStatsMonitor` 类型的守护程序收集各种与健康相关的系统统计数据作为指标。
+ 你可以通过更新其 [配置文件](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/system-stats-monitor.json) 来自定义其行为。

- To support a new `NodeCondition`, create a condition definition within the `conditions` field in
- `config/kernel-monitor.json`, for example:
- ```
+ <!--
+ - A `CustomPluginMonitor` type of daemon invokes and checks various node problems by running
+ user-defined scripts. You can use different custom plugin monitors to monitor different
+ problems and customize the daemon behavior by updating the
+ [configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/custom-plugin-monitor.json).
-->
- ### 添加新的 NodeCondition

- 要支持新的 `NodeCondition`,请在 `config/kernel-monitor.json` 中的
- `conditions` 字段中创建一个条件定义:
+ - `CustomPluginMonitor` 类型的守护程序通过运行用户定义的脚本来调用和检查各种节点问题。
+ 你可以使用不同的自定义插件监视器来监视不同的问题,并通过更新
+ [配置文件](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/custom-plugin-monitor.json)
+ 来定制守护程序行为。

- ```json
- {
-   "type": "NodeConditionType",
-   "reason": "CamelCaseDefaultNodeConditionReason",
-   "message": "arbitrary default node condition message"
- }
- ```
+ <!--
+ - A `HealthChecker` type of daemon checks the health of the kubelet and container runtime on a node.
+ -->
+ - `HealthChecker` 类型的守护程序检查节点上的 kubelet 和容器运行时的健康状况。

- <!--
- ### Detect new problems
+ <!--
+ ### Adding support for other log format {#support-other-log-format}

- To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
- with a new rule definition:
+ The system log monitor currently supports file-based logs, journald, and kmsg.
+ Additional sources can be added by implementing a new
+ [log watcher](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/pkg/systemlogmonitor/logwatchers/types/log_watcher.go).
-->
- ### 检测新的问题

- 你可以使用新的规则描述来扩展 `config/kernel-monitor.json` 中的 `rules` 字段以检测新问题:
+ ### 增加对其他日志格式的支持 {#support-other-log-format}

- ```json
- {
-   "type": "temporary/permanent",
-   "condition": "NodeConditionOfPermanentIssue",
-   "reason": "CamelCaseShortReason",
-   "message": "regexp matching the issue in the kernel log"
- }
- ```
+ 系统日志监视器目前支持基于文件的日志、journald 和 kmsg。
+ 可以通过实现一个新的
+ [log watcher](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/pkg/systemlogmonitor/logwatchers/types/log_watcher.go)
+ 来添加额外的日志源。

- <!--
- ### Configure path for the kernel log device {#kernel-log-device-path}
+ <!--
+ ### Adding custom plugin monitors

- Check your kernel log path location in your operating system (OS) distribution.
- The Linux kernel [log device](https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg) is usually presented as `/dev/kmsg`. However, the log path location varies by OS distribution.
- The `log` field in `config/kernel-monitor.json` represents the log path inside the container.
- You can configure the `log` field to match the device path as seen by the Node Problem Detector.
+ You can extend the Node Problem Detector to execute any monitor scripts written in any language by
+ developing a custom plugin. The monitor scripts must conform to the plugin protocol in exit code
+ and standard output. For more information, please refer to the
+ [plugin interface proposal](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#).
-->
- ### 配置内核日志设备的路径 {#kernel-log-device-path}

- 检查你的操作系统(OS)发行版本中的内核日志路径位置。
- Linux 内核[日志设备](https://www.kernel.org/doc/documentation/abi/testing/dev-kmsg)
- 通常呈现为 `/dev/kmsg`。
- 但是,日志路径位置因 OS 发行版本而异。
- `config/kernel-monitor.json` 中的 `log` 字段表示容器内的日志路径。
- 你可以配置 `log` 字段以匹配节点问题检测器所示的设备路径。
+ ### 添加自定义插件监视器

- <!--
- ### Add support for another log format {#support-other-log-format}
+ 你可以通过开发自定义插件来扩展节点问题检测器,以执行以任何语言编写的任何监控脚本。
+ 监控脚本必须符合退出码和标准输出的插件协议。
+ 有关更多信息,请参阅
+ [插件接口提案](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#)。
+
+ <!--
+ ## Exporter
+
+ An exporter reports the node problems and/or metrics to certain backends.
+ The following exporters are supported:

- Kernel monitor uses the
- [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go) plugin to translate the internal data structure of the kernel log.
- You can implement a new translator for a new log format.
+ - **Kubernetes exporter**: this exporter reports node problems to the Kubernetes API server.
+ Temporary problems are reported as Events and permanent problems are reported as Node Conditions.
+
+ - **Prometheus exporter**: this exporter reports node problems and metrics locally as Prometheus
+ (or OpenMetrics) metrics. You can specify the IP address and port for the exporter using command
+ line arguments.
+
+ - **Stackdriver exporter**: this exporter reports node problems and metrics to the Stackdriver
+ Monitoring API. The exporting behavior can be customized using a
+ [configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/exporter/stackdriver-exporter.json).
-->
- ### 添加对其它日志格式的支持 {#support-other-log-format}

- 内核监视器使用
- [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator.go)
- 插件转换内核日志的内部数据结构。
- 你可以为新的日志格式实现新的转换器。
+ ## 导出器
+
+ 导出器(Exporter)向特定后端报告节点问题和/或指标。
+ 支持下列导出器:
+
+ - **Kubernetes exporter**: 此导出器向 Kubernetes API 服务器报告节点问题。
+ 临时问题报告为事件,永久性问题报告为节点状况。
+
+ - **Prometheus exporter**: 此导出器在本地将节点问题和指标报告为 Prometheus(或 OpenMetrics)指标。
+ 你可以使用命令行参数指定导出器的 IP 地址和端口。
+
+ - **Stackdriver exporter**: 此导出器向 Stackdriver Monitoring API 报告节点问题和指标。
+ 可以使用[配置文件](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/exporter/stackdriver-exporter.json) 自定义导出行为。

<!-- discussion -->

- <!--
+ <!--
## Recommendations and restrictions

It is recommended to run the Node Problem Detector in your cluster to monitor node health.
@@ -299,7 +324,7 @@ Usually this is fine, because:
* The kernel log grows relatively slowly.
* A resource limit is set for the Node Problem Detector.
* Even under high load, the resource usage is acceptable. For more information, see the Node Problem Detector
- [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).
+ [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).
-->
## 建议和限制
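The custom plugin monitor text in this change states that monitor scripts "must conform to the plugin protocol in exit code and standard output". The sketch below shows what such a script might look like. It assumes the exit-code convention described in the Node Problem Detector plugin documentation (0 for OK, 1 for NonOK, 2 for Unknown) and a short single-line status message on standard output; both should be verified against the linked plugin interface proposal. The function name `check_path_exists` and the example path are hypothetical and do not appear in this change.

```shell
#!/bin/sh
# Sketch of a custom plugin check (hypothetical, not part of this change).
# Assumed protocol: exit status 0 = OK, 1 = NonOK, 2 = Unknown,
# plus one short human-readable line on standard output.

EXIT_OK=0
EXIT_NONOK=1
EXIT_UNKNOWN=2

# check_path_exists PATH
# Prints a one-line status message and returns one of the codes above.
check_path_exists() {
  target="$1"

  # Without an argument the check cannot decide anything: report Unknown.
  if [ -z "$target" ]; then
    echo "no path given"
    return "$EXIT_UNKNOWN"
  fi

  # The path exists: the node is healthy with respect to this check.
  if [ -e "$target" ]; then
    echo "$target is present"
    return "$EXIT_OK"
  fi

  # The path is absent: report a problem.
  echo "$target is missing"
  return "$EXIT_NONOK"
}
```

A real plugin file would finish by running the check and exiting with its status, for example `check_path_exists /var/run/example.sock; exit $?`. That final call is left out here so the function can be sourced and exercised on its own.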