Merge pull request #38520 from tengqm/rev-npd

k8s-ci-robot · web-flow · commit 12fee40f61dc · 2022-12-22T07:43:26.000-08:00
Revise the NPD task page
diff --git a/content/en/docs/tasks/debug/debug-cluster/monitor-node-health.md b/content/en/docs/tasks/debug/debug-cluster/monitor-node-health.md
@@ -12,8 +12,8 @@ weight: 20
 *Node Problem Detector* is a daemon for monitoring and reporting about a node's health.
 You can run Node Problem Detector as a `DaemonSet` or as a standalone daemon.
 Node Problem Detector collects information about node problems from various daemons
-and reports these conditions to the API server as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
-and [Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core).
+and reports these conditions to the API server as Node [Condition](/docs/concepts/architecture/nodes/#condition)s
+or as [Event](/docs/reference/kubernetes-api/cluster-resources/event-v1)s.
 
 To learn how to install and use Node Problem Detector, see
 [Node Problem Detector project documentation](https://github.com/kubernetes/node-problem-detector).
@@ -26,16 +26,13 @@ To learn how to install and use Node Problem Detector, see
 
 ## Limitations
 
-* Node Problem Detector only supports file based kernel log.
-  Log tools such as `journald` are not supported.
-
 * Node Problem Detector uses the kernel log format for reporting kernel issues.
   To learn how to extend the kernel log format, see [Add support for another log format](#support-other-log-format).
 
 ## Enabling Node Problem Detector
 
 Some cloud providers enable Node Problem Detector as an {{< glossary_tooltip text="Addon" term_id="addons" >}}.
-You can also enable Node Problem Detector with `kubectl` or by creating an Addon pod.
+You can also enable Node Problem Detector with `kubectl` or by creating an Addon DaemonSet.
 
 ### Using kubectl to enable Node Problem Detector {#using-kubectl}
 
@@ -68,7 +65,7 @@ directory `/etc/kubernetes/addons/node-problem-detector` on a control plane node
 
 ## Overwrite the configuration
 
-The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
+The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.8.12/config)
 is embedded when building the Docker image of Node Problem Detector.
 
 However, you can use a [`ConfigMap`](/docs/tasks/configure-pod-container/configure-pod-configmap/)
@@ -100,54 +97,59 @@ This approach only applies to a Node Problem Detector started with `kubectl`.
 Overwriting a configuration is not supported if a Node Problem Detector runs as a cluster Addon.
 The Addon manager does not support `ConfigMap`.
 
-## Kernel Monitor
+## Problem Daemons
+
+A problem daemon is a sub-daemon of the Node Problem Detector. It monitors specific kinds of node
+problems and reports them to the Node Problem Detector.
+There are several types of supported problem daemons.
 
-*Kernel Monitor* is a system log monitor daemon supported in the Node Problem Detector.
-Kernel monitor watches the kernel log and detects known kernel issues following predefined rules.
+- A `SystemLogMonitor` type of daemon monitors the system logs and reports problems and metrics
+  according to predefined rules. You can customize the configurations for different log sources
+  such as [filelog](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-filelog.json),
+  [kmsg](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor.json),
+  [kernel](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-counter.json),
+  [abrt](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/abrt-adaptor.json),
+  and [systemd](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/systemd-monitor-counter.json).
 
-The Kernel Monitor matches kernel issues according to a set of predefined rule list in
-[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json). The rule list is extensible. You can expand the rule list by overwriting the
-configuration.
+- A `SystemStatsMonitor` type of daemon collects various health-related system stats as metrics.
+  You can customize its behavior by updating its
+  [configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/system-stats-monitor.json).
 
-### Add new NodeConditions
+- A `CustomPluginMonitor` type of daemon checks for various node problems by running
+  user-defined scripts. You can use different custom plugin monitors to monitor different
+  problems and customize the daemon behavior by updating the
+  [configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/custom-plugin-monitor.json).
 
-To support a new `NodeCondition`, create a condition definition within the `conditions` field in
-`config/kernel-monitor.json`, for example:
+- A `HealthChecker` type of daemon checks the health of the kubelet and container runtime on a node.
 
-```json
-{
-  "type": "NodeConditionType",
-  "reason": "CamelCaseDefaultNodeConditionReason",
-  "message": "arbitrary default node condition message"
-}
-```
+### Adding support for other log formats {#support-other-log-format}
 
-### Detect new problems
+The system log monitor currently supports file-based logs, journald, and kmsg.
+Additional sources can be added by implementing a new
+[log watcher](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/pkg/systemlogmonitor/logwatchers/types/log_watcher.go).
 
-To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
-with a new rule definition:
+### Adding custom plugin monitors
 
-```json
-{
-  "type": "temporary/permanent",
-  "condition": "NodeConditionOfPermanentIssue",
-  "reason": "CamelCaseShortReason",
-  "message": "regexp matching the issue in the kernel log"
-}
-```
+You can extend the Node Problem Detector to run monitor scripts written in any language by
+developing a custom plugin. Each monitor script must follow the plugin protocol for its exit code
+and standard output. For more information, see the
+[plugin interface proposal](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#).
 
-### Configure path for the kernel log device {#kernel-log-device-path}
+## Exporter
 
-Check your kernel log path location in your operating system (OS) distribution.
-The Linux kernel [log device](https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg) is usually presented as `/dev/kmsg`. However, the log path location varies by OS distribution.
-The `log` field in `config/kernel-monitor.json` represents the log path inside the container.
-You can configure the `log` field to match the device path as seen by the Node Problem Detector.
+An exporter reports node problems, metrics, or both to a backend.
+The following exporters are supported:
 
-### Add support for another log format {#support-other-log-format}
+- **Kubernetes exporter**: this exporter reports node problems to the Kubernetes API server.
+  Temporary problems are reported as Events and permanent problems are reported as Node Conditions.
 
-Kernel monitor uses the
-[`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go) plugin to translate the internal data structure of the kernel log.
-You can implement a new translator for a new log format.
+- **Prometheus exporter**: this exporter reports node problems and metrics locally as Prometheus
+  (or OpenMetrics) metrics. You can specify the IP address and port for the exporter using
+  command-line arguments.
+
+- **Stackdriver exporter**: this exporter reports node problems and metrics to the Stackdriver
+  Monitoring API. The exporting behavior can be customized using a
+  [configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/exporter/stackdriver-exporter.json).
 
 <!-- discussion -->
 
@@ -160,4 +162,5 @@ Usually this is fine, because:
 * The kernel log grows relatively slowly.
 * A resource limit is set for the Node Problem Detector.
 * Even under high load, the resource usage is acceptable. For more information, see the Node Problem Detector
-[benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).
+  [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).
+