
Commit 68e6b45

Merge pull request #24772 from kbhawkey/node-problem-detector-cleanup
clean up node problem detector task page
2 parents a8f7a93 + d3f374c commit 68e6b45

1 file changed: 81 additions, 100 deletions
@@ -1,133 +1,117 @@
 ---
+title: Monitor Node Health
+content_type: task
 reviewers:
 - Random-Liu
 - dchen1107
-content_type: task
-title: Monitor Node Health
 ---

 <!-- overview -->

-*Node problem detector* is a [DaemonSet](/docs/concepts/workloads/controllers/daemonset/) monitoring the
-node health. It collects node problems from various daemons and reports them
-to the apiserver as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
+*Node problem detector* is a daemon for monitoring and reporting on a node's health.
+You can run node problem detector as a `DaemonSet`
+or as a standalone daemon. Node problem detector collects information about node problems from various daemons
+and reports these conditions to the API server as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
 and [Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core).

-It supports some known kernel issue detection now, and will detect more and
-more node problems over time.
-
-Currently Kubernetes won't take any action on the node conditions and events
-generated by node problem detector. In the future, a remedy system could be
-introduced to deal with node problems.
-
-See more information
-[here](https://github.com/kubernetes/node-problem-detector).
-
-
+To learn how to install and use the node problem detector, see the
+[Node problem detector project documentation](https://github.com/kubernetes/node-problem-detector).

 ## {{% heading "prerequisites" %}}

-
-{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
-
-
+{{< include "task-tutorial-prereqs.md" >}}

 <!-- steps -->

 ## Limitations

-* The kernel issue detection of node problem detector only supports file-based
-  kernel logs now. It doesn't support log tools such as journald.
+* Node problem detector only supports file-based kernel logs.
+  Log tools such as `journald` are not supported.

-* The kernel issue detection of node problem detector makes assumptions about the kernel
-  log format, and currently it only works on Ubuntu and Debian. However, it is easy to extend
-  it to [support other log formats](/docs/tasks/debug-application-cluster/monitor-node-health/#support-other-log-format).
+* Node problem detector uses the kernel log format for reporting kernel issues.
+  To learn how to extend the kernel log format, see [Add support for another log format](#support-other-log-format).

-## Enable/Disable in GCE cluster
+## Enabling node problem detector

-Node problem detector is [running as a cluster addon](/docs/setup/best-practices/cluster-large/#addon-resources) enabled by default in the
-GCE cluster.
+Some cloud providers enable node problem detector as an {{< glossary_tooltip text="Addon" term_id="addons" >}}.
+You can also enable node problem detector with `kubectl` or by creating an Addon pod.

-You can enable/disable it by setting the environment variable
-`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before running `kube-up.sh`.
+### Using kubectl to enable node problem detector {#using-kubectl}

-## Use in Other Environment
+`kubectl` provides the most flexible management of node problem detector.
+You can overwrite the default configuration to fit it into your environment or
+to detect customized node problems. For example:

-To enable node problem detector in environments outside of GCE, you can use
-either `kubectl` or an addon pod.
+1. Create a node problem detector configuration similar to `node-problem-detector.yaml`:

-### Kubectl
+   {{< codenew file="debug/node-problem-detector.yaml" >}}

-This is the recommended way to start node problem detector outside of GCE. It
-provides more flexible management, such as overwriting the default
-configuration to fit it into your environment or detecting
-customized node problems.
+   {{< note >}}
+   You should verify that the system log directory is right for your operating system distribution.
+   {{< /note >}}

-* **Step 1:** Create `node-problem-detector.yaml`:
+1. Start node problem detector with `kubectl`:

-  {{< codenew file="debug/node-problem-detector.yaml" >}}
+   ```shell
+   kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
+   ```

+### Using an Addon pod to enable node problem detector {#using-addon-pod}

-***Notice that you should make sure the system log directory is right for your
-OS distro.***
-
-* **Step 2:** Start node problem detector with `kubectl`:
-
-  ```shell
-  kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
-  ```
-
-### Addon Pod
-
-This is for those who have their own cluster bootstrap solution and don't need
-to overwrite the default configuration. They could leverage the addon pod to
+If you are using a custom cluster bootstrap solution and don't need
+to overwrite the default configuration, you can leverage the Addon pod to
 further automate the deployment.

-Just create `node-problem-detector.yaml`, and put it under the addon pods directory
-`/etc/kubernetes/addons/node-problem-detector` on a master node.
+Create `node-problem-detector.yaml`, and save the configuration in the Addon pod's
+directory `/etc/kubernetes/addons/node-problem-detector` on a control plane node.

-## Overwrite the Configuration
+## Overwrite the configuration

 The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
 is embedded when building the Docker image of node problem detector.

-However, you can use a [ConfigMap](/docs/tasks/configure-pod-container/configure-pod-configmap/) to overwrite it
-by following these steps:
+However, you can use a [`ConfigMap`](/docs/tasks/configure-pod-container/configure-pod-configmap/)
+to overwrite the configuration:

-* **Step 1:** Change the config files in `config/`.
-* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap
-  node-problem-detector-config --from-file=config/`.
-* **Step 3:** Change the `node-problem-detector.yaml` to use the ConfigMap:
+1. Change the configuration files in `config/`.
+1. Create the `ConfigMap` `node-problem-detector-config`:

-  {{< codenew file="debug/node-problem-detector-configmap.yaml" >}}
+   ```shell
+   kubectl create configmap node-problem-detector-config --from-file=config/
+   ```
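
The `kubectl create configmap --from-file=config/` step above produces a `ConfigMap` whose `data` keys are the file names under `config/`. A minimal sketch of the result, assuming a single `kernel-monitor.json` file (the key name and abbreviated contents are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector-config
data:
  kernel-monitor.json: |
    {
      "conditions": [],
      "rules": []
    }
```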

+1. Change the `node-problem-detector.yaml` to use the `ConfigMap`:

-* **Step 4:** Re-create the node problem detector with the new yaml file:
+   {{< codenew file="debug/node-problem-detector-configmap.yaml" >}}

-  ```shell
-  kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml # If you have a node-problem-detector running
-  kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
-  ```
+1. Recreate the node problem detector with the new configuration file:

-***Notice that this approach only applies to node problem detector started with `kubectl`.***
+   ```shell
+   # If you have a node-problem-detector running, delete before recreating
+   kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
+   kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
+   ```

-For node problem detector running as a cluster addon, because the addon manager doesn't support
-ConfigMap, configuration overwriting is not supported now.
+{{< note >}}
+This approach only applies to a node problem detector started with `kubectl`.
+{{< /note >}}
+
+Overwriting a configuration is not supported if a node problem detector runs as a cluster Addon.
+The Addon manager does not support `ConfigMap`.

 ## Kernel Monitor

-*Kernel Monitor* is a problem daemon in node problem detector. It monitors the kernel log
-and detects known kernel issues following predefined rules.
+*Kernel Monitor* is a system log monitor daemon supported in the node problem detector.
+Kernel monitor watches the kernel log and detects known kernel issues following predefined rules.

 The Kernel Monitor matches kernel issues according to a set of predefined rules in
-[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
-The rule list is extensible, and you can always extend it by overwriting the
+[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json). The rule list is extensible. You can extend the rule list by overwriting the
 configuration.

-### Add New NodeConditions
+### Add new NodeConditions

-To support new node conditions, you can extend the `conditions` field in
-`config/kernel-monitor.json` with new condition definition:
+To support a new `NodeCondition`, you can extend the `conditions` field in
+`config/kernel-monitor.json` with a new condition definition such as:

 ```json
 {
@@ -137,10 +121,10 @@ To support new node conditions, you can extend the `conditions` field in
 }
 ```
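
The JSON bodies themselves are elided from the hunks on this page. For orientation, a `conditions` entry in the v0.1 config follows this general shape; the values below are illustrative, not taken from this commit:

```json
{
  "type": "NewProblemCondition",
  "reason": "NewProblemHasNotOccurred",
  "message": "the node is fine"
}
```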

-### Detect New Problems
+### Detect new problems

 To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
-with new rule definition:
+with a new rule definition:

 ```json
 {
@@ -151,31 +135,28 @@ with new rule definition:
 }
 ```
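
A rule's `pattern` field (elided from the hunk above) is a regular expression matched against kernel log messages. Before adding a rule, you can sanity-check a candidate pattern against a sample message with `grep -E`, whose extended-regex syntax is close enough for a rough check; the message and pattern here are hypothetical:

```shell
# Hypothetical kernel log message and candidate rule pattern
msg="task docker:7 blocked for more than 120 seconds."
pattern="task [[:alnum:]:]+ blocked for more than [0-9]+ seconds\."

# Exit status 0 from grep means the rule would fire on this message
echo "$msg" | grep -Eq "$pattern" && echo "pattern matches"
```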

-### Change Log Path
-
-The kernel log in different OS distros may be located in different paths. The `log`
-field in `config/kernel-monitor.json` is the log path inside the container.
-You can always configure it to match your OS distro.
-
-### Support Other Log Format
+### Configure path for the kernel log device {#kernel-log-device-path}

-Kernel monitor uses the [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
-plugin to translate the kernel log into its internal data structure. It is easy to
-implement a new translator for a new log format.
+Check your kernel log path location in your operating system (OS) distribution.
+The Linux kernel [log device](https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg) is usually presented as `/dev/kmsg`. However, the log path location varies by OS distribution.
+The `log` field in `config/kernel-monitor.json` represents the log path inside the container.
+You can configure the `log` field to match the device path as seen by the node problem detector.
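
A minimal sketch of how the `log` field might be set in `config/kernel-monitor.json`; the other fields are omitted, and the path shown is the common default rather than a value from this commit, so verify it for your distribution:

```json
{
  "log": "/dev/kmsg",
  "conditions": [],
  "rules": []
}
```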

+### Add support for another log format {#support-other-log-format}

+Kernel monitor uses the
+[`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go) plugin to translate the kernel log into the monitor's internal data structure.
+You can implement a new translator for a new log format.

 <!-- discussion -->

-## Caveats
-
-It is recommended to run the node problem detector in your cluster to monitor
-the node health. However, you should be aware that this will introduce extra
-resource overhead on each node. Usually this is fine, because:
-
-* The kernel log is generated relatively slowly.
-* A resource limit is set for node problem detector.
-* Even under high load, the resource usage is acceptable.
-  (see [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629))
+## Recommendations and restrictions

+It is recommended to run the node problem detector in your cluster to monitor node health.
+When running the node problem detector, you can expect extra resource overhead on each node.
+Usually this is fine, because:

+* The kernel log grows relatively slowly.
+* A resource limit is set for the node problem detector.
+* Even under high load, the resource usage is acceptable. For more information, see the node problem detector
+  [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).

0 commit comments