
Commit 4ad9f80

Merge pull request #31755 from jcfrye77/KNIDEPLOY-4016
KNIDEPLOY-4016
2 parents 64a2a55 + 54c4b9e commit 4ad9f80

1 file changed: +15 -19 lines changed


modules/machine-health-checks-about.adoc

Lines changed: 15 additions & 19 deletions
@@ -6,32 +6,28 @@
 [id="machine-health-checks-about_{context}"]
 = About machine health checks
 
-You can define conditions under which machines in a cluster are considered unhealthy by using a `MachineHealthCheck` resource.
-Machines matching the conditions are automatically remediated.
+Machine health checks automatically repair unhealthy machines in a particular machine pool.
 
-To monitor machine health, create a `MachineHealthCheck` custom resource (CR) that includes a label for the set of machines to monitor and a condition to check, such as staying in the `NotReady` status for 15 minutes or displaying a permanent condition in the node-problem-detector.
+To monitor machine health, create a resource to define the configuration for a controller. Set a condition to check, such as staying in the `NotReady` status for five minutes or displaying a permanent condition in the node-problem-detector, and a label for the set of machines to monitor.
 
-The controller that observes a `MachineHealthCheck` CR checks for the condition that you defined. If a machine fails the health check, the machine is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a `machine deleted` event.
+[NOTE]
+====
+You cannot apply a machine health check to a machine with the master role.
+====
+
+The controller that observes a `MachineHealthCheck` resource checks for the defined condition. If a machine fails the health check, the machine is automatically deleted and one is created to take its place. When a machine is deleted, you see a `machine deleted` event.
+
+To limit disruptive impact of the machine deletion, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the `maxUnhealthy` threshold allows for in the targeted pool of machines, remediation stops and therefore enables manual intervention.
 
 [NOTE]
 ====
-For machines with the master role, the machine health check reports the number of unhealthy nodes, but the machine is not deleted. For example:
-
-.Example output
-[source,terminal]
-----
-$ oc get machinehealthcheck example -n openshift-machine-api
-----
-[source,terminal]
-----
-NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
-example 40% 3 1
-----
-
-To limit the disruptive impact of machine deletions, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the `maxUnhealthy` threshold allows for in the targeted pool of machines, the controller stops deleting machines and you must manually intervene.
+Consider the timeouts carefully, accounting for workloads and requirements.
+
+* Long timeouts can result in long periods of downtime for the workload on the unhealthy machine.
+* Too short timeouts can result in a remediation loop. For example, the timeout for checking the `NotReady` status must be long enough to allow the machine to complete the startup process.
 ====
 
-To stop the check, remove the custom resource.
+To stop the check, remove the resource.
 
 [id="machine-health-checks-limitations_{context}"]
 == Limitations when deploying machine health checks
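
Reviewer's illustration (not part of the commit): the `maxUnhealthy` rule in the changed text can be sketched as a small check. This is a minimal sketch, not the controller's code. The function names are hypothetical, and rounding a percentage down to a whole machine count is an assumption. The sample values (`40%`, 3 expected machines, 1 healthy) come from the example output that this commit removes.

```python
def allowed_unhealthy(max_unhealthy: str, expected_machines: int) -> int:
    """Resolve a maxUnhealthy value such as "40%" or "2" to a machine count.

    Hypothetical helper for illustration only.
    """
    value = max_unhealthy.strip()
    if value.endswith("%"):
        percent = int(value[:-1])
        # Assumption: a percentage is rounded down to a whole number of machines.
        return (percent * expected_machines) // 100
    return int(value)


def remediation_allowed(max_unhealthy: str, expected_machines: int,
                        current_healthy: int) -> bool:
    """Return True while the unhealthy count stays within the threshold."""
    unhealthy = expected_machines - current_healthy
    return unhealthy <= allowed_unhealthy(max_unhealthy, expected_machines)


if __name__ == "__main__":
    # Values from the removed example output: 40%, 3 expected, 1 healthy.
    print(remediation_allowed("40%", 3, 1))
```

Under these assumptions, two of three machines being unhealthy exceeds a `40%` threshold, so remediation would stop and wait for manual intervention, whereas a single unhealthy machine would still be remediated.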
