In a bare metal cluster, remediation of nodes is critical to ensuring the overall health of the cluster. Physically remediating a cluster can be challenging, and any delay in putting the machine into a safe or operational state increases both the time the cluster remains degraded and the risk that subsequent failures bring the cluster offline. Power-based remediation helps counter such challenges.
Instead of reprovisioning the nodes, power-based remediation uses a power controller to power off an inoperable node. This type of remediation is also called power fencing.
@@ -17,7 +18,7 @@ Power-based remediation provides the following capabilities:
* Reduces the risk of data loss in hyperconverged environments
* Reduces the downtime associated with recovering physical machines
== Understanding the annotation-based remediation process
The remediation process operates as follows:
@@ -49,7 +50,7 @@ The remediation process operates as follows:
If the power operations do not complete, the bare metal machine controller triggers the reprovisioning of the unhealthy node, unless the node is a control plane node or was provisioned externally.
== Understanding the metal3-based remediation process
The remediation process operates as follows:
@@ -66,7 +67,7 @@ The remediation process operates as follows:
If the power operations do not complete, the metal3 remediation controller triggers the reprovisioning of the unhealthy node, unless the node is a control plane node or was provisioned externally.
== Creating a MachineHealthCheck resource for bare metal
.Prerequisites
@@ -76,9 +77,11 @@ If the power operations did not complete, the metal3 remediation controller trig
* Network access to the BMC interface of the unhealthy node.
.Procedure
+
. Create a `healthcheck.yaml` file that contains the definition of your machine health check.
-. Apply the `healthcheck.yaml` file to your cluster using the following command:

+. Apply the `healthcheck.yaml` file to your cluster using the following command:
++
[source,terminal]
----
$ oc apply -f healthcheck.yaml
@@ -110,7 +113,6 @@ spec:
  maxUnhealthy: "40%" <6>
  nodeStartupTimeout: "10m" <7>
----
-
<1> Specify the name of the machine health check to deploy.
<2> For bare metal clusters, you must include the `machine.openshift.io/remediation-strategy: external-baremetal` annotation in the `annotations` section to enable power-cycle remediation. With this remediation strategy, unhealthy hosts are rebooted instead of removed from the cluster.
<3> Specify a label for the machine pool that you want to check.
@@ -170,7 +172,7 @@ spec:
The `matchLabels` are examples only; you must map your machine groups based on your specific needs. The `annotations` section does not apply to metal3-based remediation. Annotation-based remediation and metal3-based remediation are mutually exclusive.
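
For orientation, an annotation-based machine health check might look roughly like the following sketch. Only the remediation-strategy annotation and the `maxUnhealthy` and `nodeStartupTimeout` values appear in this diff. The resource name, namespace, `apiVersion`, selector label, and `unhealthyConditions` entries are assumptions added for illustration and should be checked against the full module before use.

[source,yaml]
----
# Illustrative sketch only; fields marked "assumed" are not shown in this diff.
apiVersion: machine.openshift.io/v1beta1   # assumed API group and version
kind: MachineHealthCheck
metadata:
  name: example                            # hypothetical name
  namespace: openshift-machine-api         # assumed namespace
  annotations:
    # Enables power-cycle remediation on bare metal (from the callout text above).
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  selector:
    matchLabels:
      # Placeholder label; map this to your own machine groups.
      machine.openshift.io/cluster-api-machine-role: <role>
  unhealthyConditions:                     # assumed conditions and timeouts
  - type: "Ready"
    status: "False"
    timeout: "300s"
  - type: "Ready"
    status: "Unknown"
    timeout: "300s"
  maxUnhealthy: "40%"
  nodeStartupTimeout: "10m"
----

Because annotation-based and metal3-based remediation are mutually exclusive, a metal3-based health check would omit the `annotations` entry shown here.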