|
1 | 1 | # node-problem-detector
|
2 |
| -This is a place for various problem detectors running on the Kubernetes nodes. |
| 2 | +node-problem-detector aims to make various node problems visible to the upstream |
| 3 | +layers in the cluster management stack. It is a [DaemonSet](http://kubernetes.io/docs/admin/daemons/) |
| 4 | +that detects node problems and reports them to the apiserver. It currently runs as |
| 5 | +a [Kubernetes Addon](https://github.com/kubernetes/kubernetes/tree/master/cluster/addons) |
| 6 | +enabled by default in GCE clusters. |
| 7 | + |
| 8 | +# Background |
| 9 | +There are tons of node problems that could affect the pods running on the |
| 10 | +node, such as: |
| 11 | +* Hardware issues: Bad CPU, memory or disk; |
| 12 | +* Kernel issues: Kernel deadlock, corrupted file system; |
| 13 | +* Container runtime issues: Unresponsive runtime daemon; |
| 14 | +* ... |
| 15 | + |
| 16 | +Currently, these problems are invisible to the upstream layers in the cluster management |
| 17 | +stack, so Kubernetes will continue scheduling pods to the bad nodes. |
| 18 | + |
| 19 | +To solve this problem, we introduced a new daemon, **node-problem-detector**, to |
| 20 | +collect node problems from various daemons and make them visible to the upstream |
| 21 | +layers. Once the upstream layers have visibility into those problems, we can discuss the |
| 22 | +remedy system. |
| 23 | + |
| 24 | +# Problem API |
| 25 | +node-problem-detector uses `Event` and `NodeCondition` to report problems to |
| 26 | +the apiserver. |
| 27 | +* `NodeCondition`: A permanent problem that makes the node unavailable for pods should |
| 28 | +be reported as a `NodeCondition`. |
| 29 | +* `Event`: A temporary problem that has limited impact on pods but is informative |
| 30 | +should be reported as an `Event`. |
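
The rule above can be sketched in Go. This is only an illustration of the decision between the two report types: `Problem`, `ReportKind`, and `Classify` are hypothetical names made up for this example and are not part of node-problem-detector or the Kubernetes API.

```go
package main

import "fmt"

// ReportKind is a hypothetical label for the two reporting mechanisms
// described above. It is not a Kubernetes API type.
type ReportKind string

const (
	ReportEvent         ReportKind = "Event"
	ReportNodeCondition ReportKind = "NodeCondition"
)

// Problem is a hypothetical, simplified description of a detected problem.
type Problem struct {
	Reason    string
	Permanent bool // true if the problem makes the node unavailable for pods
}

// Classify applies the rule from the text: permanent problems become a
// NodeCondition, temporary but informative problems become an Event.
func Classify(p Problem) ReportKind {
	if p.Permanent {
		return ReportNodeCondition
	}
	return ReportEvent
}

func main() {
	problems := []Problem{
		{Reason: "KernelDeadlock", Permanent: true},
		{Reason: "TransientWarning", Permanent: false}, // made-up reason
	}
	for _, p := range problems {
		fmt.Printf("%s -> %s\n", p.Reason, Classify(p))
	}
}
```

Whether a real problem counts as permanent is a judgment made by each problem daemon's rules; the boolean field here only stands in for that decision.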
| 31 | + |
| 32 | +# Problem Daemon |
| 33 | +A problem daemon is a sub-daemon of node-problem-detector. It monitors a specific |
| 34 | +kind of node problem and reports it to node-problem-detector. |
| 35 | + |
| 36 | +A problem daemon could be: |
| 37 | +* A tiny daemon designed for a dedicated Kubernetes use case. |
| 38 | +* An existing node health monitoring daemon integrated with node-problem-detector. |
| 39 | + |
| 40 | +Currently, a problem daemon runs as a goroutine in the node-problem-detector |
| 41 | +binary. In the future, we'll separate node-problem-detector and problem daemons into |
| 42 | +different containers, and compose them with a pod specification. |
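
As a rough sketch of the goroutine model described above, a problem daemon can be pictured as a function that runs in its own goroutine and reports over a channel. The names `Status`, `runLogDaemon`, and `collect`, as well as the sample log lines, are assumptions made for this example and do not reflect the actual node-problem-detector code.

```go
package main

import (
	"fmt"
	"strings"
)

// Status is a hypothetical message a problem daemon sends to the detector.
type Status struct {
	Source  string
	Message string
}

// runLogDaemon is a hypothetical problem daemon. It scans the given log
// lines, reports every line containing pattern, and closes the channel
// once all lines have been processed.
func runLogDaemon(lines []string, pattern string, out chan<- Status) {
	defer close(out)
	for _, line := range lines {
		if strings.Contains(line, pattern) {
			out <- Status{Source: "log-daemon", Message: line}
		}
	}
}

// collect starts the daemon as a goroutine and gathers everything it
// reports, playing the role of the detector in this sketch.
func collect(lines []string, pattern string) []Status {
	out := make(chan Status)
	go runLogDaemon(lines, pattern, out)
	var got []Status
	for s := range out {
		got = append(got, s)
	}
	return got
}

func main() {
	// Sample log lines invented for the example.
	lines := []string{
		"boot ok",
		"task blocked for more than 120 seconds",
		"eth0 up",
	}
	for _, s := range collect(lines, "blocked for more than") {
		fmt.Printf("[%s] %s\n", s.Source, s.Message)
	}
}
```

Keeping the daemon behind a channel means the detector does not need to know how a daemon finds problems, which would also make it easier to move daemons into separate containers later.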
| 43 | + |
| 44 | +List of supported problem daemons: |
| 45 | + |
| 46 | +| Problem Daemon | NodeCondition | Description | |
| 47 | +|----------------|:---------------:|:------------| |
| 48 | +| [KernelMonitor](https://github.com/kubernetes/node-problem-detector/tree/master/pkg/kernelmonitor) | KernelDeadlock | A problem daemon that monitors the kernel log and reports problems according to predefined rules. | |
| 49 | + |
| 50 | +# Usage |
| 51 | +## Build Image |
| 52 | +Run `make` in the top directory. It will: |
| 53 | +* Build the binary. |
| 54 | +* Build the Docker image. The binary and `config/` are copied into the Docker image. |
| 55 | +* Upload the Docker image to the registry. By default, the image will be uploaded to |
| 56 | +`gcr.io/google_containers`. It's easy to modify the `Makefile` to push the image |
| 57 | +to another registry. |
| 58 | + |
| 59 | +## Start DaemonSet |
| 60 | +* Create a file `node-problem-detector.yaml` with the following YAML. |
| 61 | +```yaml |
| 62 | +apiVersion: extensions/v1beta1 |
| 63 | +kind: DaemonSet |
| 64 | +metadata: |
| 65 | +  name: node-problem-detector |
| 66 | +spec: |
| 67 | +  template: |
| 68 | +    spec: |
| 69 | +      hostNetwork: true |
| 70 | +      containers: |
| 71 | +      - name: node-problem-detector |
| 72 | +        image: gcr.io/google_containers/node-problem-detector:v0.1 |
| 73 | +        imagePullPolicy: Always |
| 74 | +        env: |
| 75 | +        # Config `host` and `port` of apiserver. |
| 76 | +        - name: "KUBERNETES_SERVICE_HOST" |
| 77 | +          value: "master-node-host-name" |
| 78 | +        - name: "KUBERNETES_SERVICE_PORT" |
| 79 | +          value: "443" |
| 80 | +        securityContext: |
| 81 | +          privileged: true |
| 82 | +        volumeMounts: |
| 83 | +        - name: log |
| 84 | +          mountPath: /log |
| 85 | +          readOnly: true |
| 86 | +      volumes: |
| 87 | +      - name: log |
| 88 | +        # Config `log` to your system log directory |
| 89 | +        hostPath: |
| 90 | +          path: /var/log/ |
| 91 | +``` |
| 92 | +* Edit `node-problem-detector.yaml` to fit your environment: |
| 93 | + * Set the environment variables `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` |
| 94 | + to the apiserver host IP and port. |
| 95 | + * Set the `log` volume to your system log directory. (Used by KernelMonitor) |
| 96 | +* Create the DaemonSet with `kubectl create -f node-problem-detector.yaml` |
| 97 | +* If needed, you can use [ConfigMap](http://kubernetes.io/docs/user-guide/configmap/) |
| 98 | +to overwrite the `config/`. |
| 99 | + |
| 100 | +# Links |
| 101 | +* [Design Doc](https://docs.google.com/document/d/1cs1kqLziG-Ww145yN6vvlKguPbQQ0psrSBnEqpy0pzE/edit?usp=sharing) |
| 102 | +* [Slides](https://docs.google.com/presentation/d/1bkJibjwWXy8YnB5fna6p-Ltiy-N5p01zUsA22wCNkXA/edit?usp=sharing) |
| 103 | +* [Addon Manifest](https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/node-problem-detector) |