Commit 55af4e7

Add initial README.md for node-problem-detector.
1 parent db90081 commit 55af4e7

2 files changed

+104
-2
lines changed

README.md

Lines changed: 102 additions & 1 deletion
@@ -1,2 +1,103 @@
11
# node-problem-detector
2-
This is a place for various problem detectors running on the Kubernetes nodes.
2+
node-problem-detector aims to make various node problems visible to the upstream
3+
layers in the cluster management stack. It is a [DaemonSet](http://kubernetes.io/docs/admin/daemons/)
4+
that detects node problems and reports them to the apiserver. It currently runs as
5+
a [Kubernetes Addon](https://github.com/kubernetes/kubernetes/tree/master/cluster/addons)
6+
enabled by default in GCE clusters.
7+
8+
# Background
9+
Many node problems can affect the pods running on the
10+
node such as:
11+
* Hardware issues: Bad CPU, memory, or disk;
12+
* Kernel issues: Kernel deadlock, corrupted file system;
13+
* Container runtime issues: Unresponsive runtime daemon;
14+
* ...
15+
16+
Currently these problems are invisible to the upstream layers in the cluster management
17+
stack, so Kubernetes will continue scheduling pods to the bad nodes.
18+
19+
To solve this problem, we introduced this new daemon **node-problem-detector** to
20+
collect node problems from various daemons and make them visible to the upstream
21+
layers. Once the upstream layers have visibility into those problems, we can discuss the
22+
remedy system.
23+
24+
# Problem API
25+
node-problem-detector uses `Event` and `NodeCondition` to report problems to
26+
apiserver.
27+
* `NodeCondition`: A permanent problem that makes the node unavailable for pods should
28+
be reported as `NodeCondition`.
29+
* `Event`: A temporary problem that has limited impact on pods but is informative
30+
should be reported as `Event`.
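
As an illustration, a `NodeCondition` reported for a kernel deadlock might appear in the node's status roughly as shown below. This is only a sketch: the field names are those of the Kubernetes `NodeCondition` type, `KernelDeadlock` is the condition listed for KernelMonitor in this README, and the `reason`, `message`, and timestamp values are hypothetical placeholders rather than real detector output.

```yaml
# Illustrative fragment of a Node object's status.
# All values except the condition type are hypothetical.
status:
  conditions:
  - type: KernelDeadlock            # condition type reported by KernelMonitor
    status: "True"                  # "True" means the problem is present
    reason: ExampleReason           # hypothetical short machine-readable reason
    message: "example description of the detected problem"  # hypothetical
    lastHeartbeatTime: "2016-01-01T00:00:00Z"    # placeholder timestamp
    lastTransitionTime: "2016-01-01T00:00:00Z"   # placeholder timestamp
```

Temporary problems reported as `Event` do not show up in this list; they would be visible in the node's event stream instead, for example via `kubectl describe node`.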
31+
32+
# Problem Daemon
33+
A problem daemon is a sub-daemon of node-problem-detector. It monitors a specific
34+
kind of node problem and reports it to node-problem-detector.
35+
36+
A problem daemon could be:
37+
* A tiny daemon designed for a dedicated Kubernetes use case.
38+
* An existing node health monitoring daemon integrated with node-problem-detector.
39+
40+
Currently, a problem daemon runs as a goroutine in the node-problem-detector
41+
binary. In the future, we'll separate node-problem-detector and problem daemons into
42+
different containers, and compose them with a pod specification.
43+
44+
List of supported problem daemons:
45+
46+
| Problem Daemon | NodeCondition | Description |
47+
|----------------|:---------------:|:------------|
48+
| [KernelMonitor](https://github.com/kubernetes/node-problem-detector/tree/master/pkg/kernelmonitor) | KernelDeadlock | A problem daemon that monitors the kernel log and reports problems according to predefined rules. |
49+
50+
# Usage
51+
## Build Image
52+
Run `make` in the top directory. It will:
53+
* Build the binary.
54+
* Build the docker image. The binary and `config/` are copied into the docker image.
55+
* Upload the docker image to the registry. By default, the image will be uploaded to
56+
`gcr.io/google_containers`. It's easy to modify the `Makefile` to push the image
57+
to another registry.
58+
59+
## Start DaemonSet
60+
* Create a file node-problem-detector.yaml with the following yaml.
61+
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector
spec:
  template:
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1
        imagePullPolicy: Always
        env:
        # Config `host` and `port` of apiserver.
        - name: "KUBERNETES_SERVICE_HOST"
          value: "master-node-host-name"
        - name: "KUBERNETES_SERVICE_PORT"
          value: "443"
        securityContext:
          privileged: true
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        # Config `log` to your system log directory
        hostPath:
          path: /var/log/
```
92+
* Edit node-problem-detector.yaml to fit your environment:
93+
  * Set environment variables `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT`
94+
    to the apiserver host IP and port.
95+
  * Set the `log` volume to your system log directory. (Used by KernelMonitor)
96+
* Create the DaemonSet with `kubectl create -f node-problem-detector.yaml`
97+
* If needed, you can use [ConfigMap](http://kubernetes.io/docs/user-guide/configmap/)
98+
to override the contents of `config/`.
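
A minimal sketch of such a ConfigMap override, under stated assumptions: the ConfigMap name `node-problem-detector-config` is hypothetical, and `/config` is only assumed to be the path where the image keeps its `config/` directory, so verify both against your image and manifest. The ConfigMap could be created from a local copy of the directory with `kubectl create configmap node-problem-detector-config --from-file=config/`.

```yaml
# Fragment of the DaemonSet pod spec (spec.template.spec); merge into the
# manifest above rather than applying it on its own.
containers:
- name: node-problem-detector
  volumeMounts:
  - name: config
    mountPath: /config        # assumed location of config/ inside the image
    readOnly: true
volumes:
- name: config
  configMap:
    name: node-problem-detector-config   # hypothetical ConfigMap name
```

Mounting the ConfigMap over the directory hides the files baked into the image, so the ConfigMap would need to contain every config file the detector expects, not only the ones being changed.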
99+
100+
# Links
101+
* [Design Doc](https://docs.google.com/document/d/1cs1kqLziG-Ww145yN6vvlKguPbQQ0psrSBnEqpy0pzE/edit?usp=sharing)
102+
* [Slides](https://docs.google.com/presentation/d/1bkJibjwWXy8YnB5fna6p-Ltiy-N5p01zUsA22wCNkXA/edit?usp=sharing)
103+
* [Addon Manifest](https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/node-problem-detector)

node-problem-detector.yaml

Lines changed: 2 additions & 1 deletion
@@ -17,7 +17,7 @@ spec:
1717
image: gcr.io/google_containers/node-problem-detector:v0.1
1818
imagePullPolicy: Always
1919
env:
20-
# Config the host ip and port of apiserver.
20+
# Config `host` and `port` of apiserver.
2121
- name: "KUBERNETES_SERVICE_HOST"
2222
value: "e2e-test-lantaol-master"
2323
- name: "KUBERNETES_SERVICE_PORT"
@@ -33,6 +33,7 @@ spec:
3333
readOnly: true
3434
volumes:
3535
- name: log
36+
# Config `log` to your system log directory
3637
hostPath:
3738
path: /var/log/
3839
- name: config
