 [id="machine-health-checks-about_{context}"]
 = About machine health checks
 
-You can define conditions under which machines in a cluster are considered unhealthy by using a `MachineHealthCheck` resource.
-Machines matching the conditions are automatically remediated.
+Machine health checks automatically repair unhealthy machines in a particular machine pool.
 
-To monitor machine health, create a `MachineHealthCheck` custom resource (CR) that includes a label for the set of machines to monitor and a condition to check, such as staying in the `NotReady` status for 15 minutes or displaying a permanent condition in the node-problem-detector.
+To monitor machine health, create a resource to define the configuration for a controller. Set a condition to check, such as staying in the `NotReady` status for five minutes or displaying a permanent condition in the node-problem-detector, and a label for the set of machines to monitor.
 
-The controller that observes a `MachineHealthCheck` CR checks for the condition that you defined. If a machine fails the health check, the machine is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a `machine deleted` event.
+[NOTE]
+====
+You cannot apply a machine health check to a machine with the master role.
+====
+
+The controller that observes a `MachineHealthCheck` resource checks for the defined condition. If a machine fails the health check, the machine is automatically deleted and one is created to take its place. When a machine is deleted, you see a `machine deleted` event.
+
+To limit disruptive impact of the machine deletion, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the `maxUnhealthy` threshold allows for in the targeted pool of machines, remediation stops and therefore enables manual intervention.
 
 [NOTE]
 ====
-For machines with the master role, the machine health check reports the number of unhealthy nodes, but the machine is not deleted. For example:
-
-.Example output
-[source,terminal]
-----
-$ oc get machinehealthcheck example -n openshift-machine-api
-----
-[source,terminal]
-----
-NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
-example 40% 3 1
-----
-
-To limit the disruptive impact of machine deletions, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the `maxUnhealthy` threshold allows for in the targeted pool of machines, the controller stops deleting machines and you must manually intervene.
+Consider the timeouts carefully, accounting for workloads and requirements.
+
+* Long timeouts can result in long periods of downtime for the workload on the unhealthy machine.
+* Too short timeouts can result in a remediation loop. For example, the timeout for checking the `NotReady` status must be long enough to allow the machine to complete the startup process.
 ====
 
-To stop the check, remove the custom resource.
+To stop the check, remove the resource.
 
 [id="machine-health-checks-limitations_{context}"]
 == Limitations when deploying machine health checks
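
For readers of this change, the resource that the added text describes might look like the following. This is a minimal sketch, not part of the diff. It assumes the `machine.openshift.io/v1beta1` API and that the `NotReady` status maps to the `Ready` node condition being `False` or `Unknown`. The name `example`, the selector label, and the threshold values are illustrative: the five-minute timeout and the `40%` value echo figures that appear in the diff, and the worker role label is an assumption.

[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example                         # illustrative name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # Assumed label; selects the set of machines to monitor
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  # A machine is treated as unhealthy if its node is not Ready for five minutes
  - type: "Ready"
    status: "False"
    timeout: "300s"
  - type: "Ready"
    status: "Unknown"
    timeout: "300s"
  # Remediation stops if more than this share of targeted machines is unhealthy
  maxUnhealthy: "40%"
----

A timeout shorter than the machine's startup time would risk the remediation loop that the note in the diff warns about, so the `300s` value here should be adjusted to the workload.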