|
6 | 6 | [id="machine-health-checks-about_{context}"]
|
7 | 7 | = About machine health checks
|
8 | 8 |
|
9 |
| -You can define conditions under which machines in a cluster are considered unhealthy by using a `MachineHealthCheck` resource. |
10 |
| -Machines matching the conditions are automatically remediated. |
| 9 | +Machine health checks automatically repair unhealthy machines in a particular machine pool. |
11 | 10 |
|
12 |
| -To monitor machine health, create a `MachineHealthCheck` custom resource (CR) that includes a label for the set of machines to monitor and a condition to check, such as staying in the `NotReady` status for 15 minutes or displaying a permanent condition in the node-problem-detector. |
| 11 | +To monitor machine health, create a `MachineHealthCheck` resource that defines the configuration for the controller. Set a condition to check, such as staying in the `NotReady` status for five minutes or displaying a permanent condition in the node-problem-detector, and a label for the set of machines to monitor. |
13 | 12 |
|
14 |
| -The controller that observes a `MachineHealthCheck` CR checks for the condition that you defined. If a machine fails the health check, the machine is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a `machine deleted` event. |
| 13 | +[NOTE] |
| 14 | +==== |
| 15 | +You cannot apply a machine health check to a machine with the master role. |
| 16 | +==== |
| 17 | + |
| 18 | +The controller that observes a `MachineHealthCheck` resource checks for the defined condition. If a machine fails the health check, the machine is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a `machine deleted` event. |
| 19 | + |
| 20 | +To limit the disruptive impact of machine deletion, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the `maxUnhealthy` threshold allows for in the targeted pool of machines, remediation stops so that you can intervene manually. |
15 | 21 |
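The `maxUnhealthy` rule described above can be sketched as a small calculation. This is a minimal illustration, not the controller's code: the function names are hypothetical, and it assumes that a percentage value is rounded down to a whole number of machines and that remediation proceeds while the unhealthy count does not exceed that limit. The sample values (`40%`, 3 expected machines, 1 healthy) come from the example output that this change removes.

```python
def max_unhealthy_count(max_unhealthy, expected_machines):
    """Resolve a maxUnhealthy value (integer or percentage string) to a machine count."""
    if isinstance(max_unhealthy, str) and max_unhealthy.endswith("%"):
        percent = int(max_unhealthy[:-1])
        # Assumption: a percentage is rounded down to a whole number of machines.
        return percent * expected_machines // 100
    return int(max_unhealthy)


def remediation_allowed(max_unhealthy, expected_machines, current_healthy):
    """Return True if the number of unhealthy machines is within the threshold."""
    unhealthy = expected_machines - current_healthy
    # Assumption: remediation proceeds when unhealthy <= limit, and stops otherwise.
    return unhealthy <= max_unhealthy_count(max_unhealthy, expected_machines)


if __name__ == "__main__":
    # 40% of 3 machines allows 1 unhealthy machine; 2 are unhealthy here.
    print(remediation_allowed("40%", 3, 1))
```

Under these assumptions, a pool of three machines with only one healthy machine exceeds a `40%` threshold, so remediation would stop until someone intervenes.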
|
16 | 22 | [NOTE]
|
17 | 23 | ====
|
18 |
| -For machines with the master role, the machine health check reports the number of unhealthy nodes, but the machine is not deleted. For example: |
19 |
| -
|
20 |
| -.Example output |
21 |
| -[source,terminal] |
22 |
| ----- |
23 |
| -$ oc get machinehealthcheck example -n openshift-machine-api |
24 |
| ----- |
25 |
| -[source,terminal] |
26 |
| ----- |
27 |
| -NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY |
28 |
| -example 40% 3 1 |
29 |
| ----- |
30 |
| -
|
31 |
| -To limit the disruptive impact of machine deletions, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the `maxUnhealthy` threshold allows for in the targeted pool of machines, the controller stops deleting machines and you must manually intervene. |
| 24 | +Consider the timeouts carefully, accounting for workloads and requirements. |
| 25 | +
|
| 26 | +* Long timeouts can result in long periods of downtime for the workload on the unhealthy machine. |
| 27 | +* Timeouts that are too short can result in a remediation loop. For example, the timeout for checking the `NotReady` status must be long enough to allow the machine to complete the startup process. |
32 | 28 | ====
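The remediation-loop risk in the note above can be illustrated with a simple comparison. This is a sketch under stated assumptions: the function name and the startup duration are hypothetical, and the model treats a replacement machine as `NotReady` for its whole startup period.

```python
def risks_remediation_loop(startup_seconds, not_ready_timeout_seconds):
    """Return True if a machine would be flagged unhealthy before it finishes starting.

    Assumption: the machine reports NotReady for the whole startup period, so a
    timeout shorter than that period marks every replacement machine unhealthy.
    """
    return startup_seconds > not_ready_timeout_seconds


if __name__ == "__main__":
    startup = 420  # hypothetical: seconds a new machine needs to become Ready
    for timeout in (300, 600):
        print(timeout, risks_remediation_loop(startup, timeout))
```

With these hypothetical numbers, a 300-second timeout would delete each replacement machine before it becomes ready, while a 600-second timeout would not.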
|
33 | 29 |
|
34 |
| -To stop the check, remove the custom resource. |
| 30 | +To stop the check, remove the resource. |
35 | 31 |
|
36 | 32 | [id="machine-health-checks-limitations_{context}"]
|
37 | 33 | == Limitations when deploying machine health checks
|
|