Commit 12fee40

Merge pull request #38520 from tengqm/rev-npd
Revise the NPD task page
2 parents e5b34d3 + b940dcf commit 12fee40

1 file changed: +47 -44 lines changed

content/en/docs/tasks/debug/debug-cluster/monitor-node-health.md

Lines changed: 47 additions & 44 deletions
@@ -12,8 +12,8 @@ weight: 20
1212
*Node Problem Detector* is a daemon for monitoring and reporting about a node's health.
1313
You can run Node Problem Detector as a `DaemonSet` or as a standalone daemon.
1414
Node Problem Detector collects information about node problems from various daemons
15-
and reports these conditions to the API server as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
16-
and [Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core).
15+
and reports these conditions to the API server as Node [Condition](/docs/concepts/architecture/nodes/#condition)s
16+
or as [Event](/docs/reference/kubernetes-api/cluster-resources/event-v1)s.
1717

1818
To learn how to install and use Node Problem Detector, see
1919
[Node Problem Detector project documentation](https://github.com/kubernetes/node-problem-detector).
@@ -26,16 +26,13 @@ To learn how to install and use Node Problem Detector, see
2626

2727
## Limitations
2828

29-
* Node Problem Detector only supports file based kernel log.
30-
Log tools such as `journald` are not supported.
31-
3229
* Node Problem Detector uses the kernel log format for reporting kernel issues.
3330
To learn how to extend the kernel log format, see [Add support for another log format](#support-other-log-format).
3431

3532
## Enabling Node Problem Detector
3633

3734
Some cloud providers enable Node Problem Detector as an {{< glossary_tooltip text="Addon" term_id="addons" >}}.
38-
You can also enable Node Problem Detector with `kubectl` or by creating an Addon pod.
35+
You can also enable Node Problem Detector with `kubectl` or by creating an Addon DaemonSet.
3936

4037
### Using kubectl to enable Node Problem Detector {#using-kubectl}
4138

@@ -68,7 +65,7 @@ directory `/etc/kubernetes/addons/node-problem-detector` on a control plane node
6865

6966
## Overwrite the configuration
7067

71-
The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
68+
The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.8.12/config)
7269
is embedded when building the Docker image of Node Problem Detector.
7370

7471
However, you can use a [`ConfigMap`](/docs/tasks/configure-pod-container/configure-pod-configmap/)
@@ -100,54 +97,59 @@ This approach only applies to a Node Problem Detector started with `kubectl`.
10097
Overwriting a configuration is not supported if a Node Problem Detector runs as a cluster Addon.
10198
The Addon manager does not support `ConfigMap`.
10299

103-
## Kernel Monitor
100+
## Problem Daemons
101+
102+
A problem daemon is a sub-daemon of the Node Problem Detector. It monitors specific kinds of node
103+
problems and reports them to the Node Problem Detector.
104+
There are several types of supported problem daemons.
104105

105-
*Kernel Monitor* is a system log monitor daemon supported in the Node Problem Detector.
106-
Kernel monitor watches the kernel log and detects known kernel issues following predefined rules.
106+
- A `SystemLogMonitor` type of daemon monitors the system logs and reports problems and metrics
107+
according to predefined rules. You can customize the configurations for different log sources
108+
such as [filelog](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-filelog.json),
109+
[kmsg](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor.json),
110+
[kernel](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-counter.json),
111+
[abrt](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/abrt-adaptor.json),
112+
and [systemd](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/systemd-monitor-counter.json).
107113

108-
The Kernel Monitor matches kernel issues according to a set of predefined rule list in
109-
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json). The rule list is extensible. You can expand the rule list by overwriting the
110-
configuration.
114+
- A `SystemStatsMonitor` type of daemon collects various health-related system stats as metrics.
115+
You can customize its behavior by updating its
116+
[configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/system-stats-monitor.json).
111117

112-
### Add new NodeConditions
118+
- A `CustomPluginMonitor` type of daemon invokes and checks various node problems by running
119+
user-defined scripts. You can use different custom plugin monitors to monitor different
120+
problems and customize the daemon behavior by updating the
121+
[configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/custom-plugin-monitor.json).
113122

114-
To support a new `NodeCondition`, create a condition definition within the `conditions` field in
115-
`config/kernel-monitor.json`, for example:
123+
- A `HealthChecker` type of daemon checks the health of the kubelet and container runtime on a node.
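
As a rough illustration of how a `SystemLogMonitor` applies predefined rules, the sketch below matches log lines against regular-expression patterns. The rule entries and the `match_rules` helper are hypothetical; they only mirror the general shape of a rule (`type`, optional `condition`, `reason`, `pattern`) and are not taken from the Node Problem Detector source.

```python
import re

# Hypothetical rules, shaped like entries in a system log monitor's rule list.
RULES = [
    {"type": "temporary", "reason": "OOMKilling",
     "pattern": r"Killed process \d+ \(.+\)"},
    {"type": "permanent", "condition": "KernelDeadlock", "reason": "DockerHung",
     "pattern": r"task docker:\w+ blocked for more than \w+ seconds\."},
]


def match_rules(log_line, rules=RULES):
    """Return every rule whose pattern is found in the given log line."""
    return [rule for rule in rules if re.search(rule["pattern"], log_line)]
```

A line that matches no rule yields an empty list, so ordinary log traffic produces no report.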
116124

117-
```json
118-
{
119-
"type": "NodeConditionType",
120-
"reason": "CamelCaseDefaultNodeConditionReason",
121-
"message": "arbitrary default node condition message"
122-
}
123-
```
125+
### Adding support for other log format {#support-other-log-format}
124126

125-
### Detect new problems
127+
The system log monitor currently supports file-based logs, journald, and kmsg.
128+
Additional sources can be added by implementing a new
129+
[log watcher](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/pkg/systemlogmonitor/logwatchers/types/log_watcher.go).
126130

127-
To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
128-
with a new rule definition:
131+
### Adding custom plugin monitors
129132

130-
```json
131-
{
132-
"type": "temporary/permanent",
133-
"condition": "NodeConditionOfPermanentIssue",
134-
"reason": "CamelCaseShortReason",
135-
"message": "regexp matching the issue in the kernel log"
136-
}
137-
```
133+
You can extend the Node Problem Detector to execute any monitor scripts written in any language by
134+
developing a custom plugin. The monitor scripts must conform to the plugin protocol in exit code
135+
and standard output. For more information, please refer to the
136+
[plugin interface proposal](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#).
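
A minimal sketch of what such a monitor script might look like, assuming the plugin protocol uses exit code 0 for OK, 1 for a detected problem, and 2 for unknown, with a short status message on standard output. These exit-code values, the disk-space check, and the names `evaluate` and `main` are assumptions for illustration; confirm the protocol against the plugin interface proposal before relying on it.

```python
import shutil

# Assumed exit codes for the plugin protocol (verify against the proposal).
OK, NON_OK, UNKNOWN = 0, 1, 2


def evaluate(free_bytes, total_bytes, min_free_ratio=0.1):
    """Map disk usage figures to an (exit_code, message) pair."""
    if total_bytes <= 0:
        return UNKNOWN, "could not determine disk size"
    ratio = free_bytes / total_bytes
    if ratio < min_free_ratio:
        return NON_OK, f"free space {ratio:.0%} is below {min_free_ratio:.0%}"
    return OK, f"free space {ratio:.0%} is sufficient"


def main(path="/"):
    """Check one filesystem, print the message, and return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as err:
        print(f"cannot stat {path}: {err}")
        return UNKNOWN
    code, message = evaluate(usage.free, usage.total)
    print(message)
    return code
```

To run this as a standalone script, end the file with `import sys; sys.exit(main())` so that the return value becomes the process exit code.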
138137

139-
### Configure path for the kernel log device {#kernel-log-device-path}
138+
## Exporter
140139

141-
Check your kernel log path location in your operating system (OS) distribution.
142-
The Linux kernel [log device](https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg) is usually presented as `/dev/kmsg`. However, the log path location varies by OS distribution.
143-
The `log` field in `config/kernel-monitor.json` represents the log path inside the container.
144-
You can configure the `log` field to match the device path as seen by the Node Problem Detector.
140+
An exporter reports the node problems and/or metrics to certain backends.
141+
The following exporters are supported:
145142

146-
### Add support for another log format {#support-other-log-format}
143+
- **Kubernetes exporter**: this exporter reports node problems to the Kubernetes API server.
144+
Temporary problems are reported as Events and permanent problems are reported as Node Conditions.
147145

148-
Kernel monitor uses the
149-
[`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go) plugin to translate the internal data structure of the kernel log.
150-
You can implement a new translator for a new log format.
146+
- **Prometheus exporter**: this exporter reports node problems and metrics locally as Prometheus
147+
(or OpenMetrics) metrics. You can specify the IP address and port for the exporter using command
148+
line arguments.
149+
150+
- **Stackdriver exporter**: this exporter reports node problems and metrics to the Stackdriver
151+
Monitoring API. The exporting behavior can be customized using a
152+
[configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/exporter/stackdriver-exporter.json).
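
The routing rule described for the Kubernetes exporter (temporary problems become Events, permanent problems become Node Conditions) can be sketched as follows. The `route_problem` function and the dictionary layout are hypothetical and only illustrate the decision; they do not reflect the exporter's actual code or the Kubernetes API object schema.

```python
def route_problem(problem):
    """Decide how a detected problem would be reported to the API server."""
    kind = problem.get("type")
    if kind == "temporary":
        # Temporary problems are surfaced once, as an Event.
        return {"report_as": "Event",
                "reason": problem["reason"],
                "message": problem.get("message", "")}
    if kind == "permanent":
        # Permanent problems change the state of a Node Condition.
        return {"report_as": "NodeCondition",
                "condition": problem["condition"],
                "reason": problem["reason"],
                "message": problem.get("message", "")}
    raise ValueError(f"unknown problem type: {kind!r}")
```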
151153

152154
<!-- discussion -->
153155

@@ -160,4 +162,5 @@ Usually this is fine, because:
160162
* The kernel log grows relatively slowly.
161163
* A resource limit is set for the Node Problem Detector.
162164
* Even under high load, the resource usage is acceptable. For more information, see the Node Problem Detector
163-
[benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).
165+
[benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).
166+
