Commit 2de0055

Merge pull request #57986 from ogradyp/TELCODOCS-329
TELCODOCS-329: MachineHealthCheck: Implement metal3 remediation using new external API
2 parents a9e3b91 + 11c6f37 commit 2de0055

1 file changed: +73 -8 lines changed

modules/mgmt-power-remediation-baremetal-about.adoc

Lines changed: 73 additions & 8 deletions
@@ -14,7 +14,7 @@ Instead of reprovisioning the nodes, power-based remediation uses a power contro
 Power-based remediation provides the following capabilities:
 
 * Allows the recovery of control plane nodes
-* Reduces the risk data loss in hyperconverged environments
+* Reduces the risk of data loss in hyperconverged environments
 * Reduces the downtime associated with recovering physical machines
 
 [id="machine-health-checks-bare-metal_{context}"]
@@ -23,15 +23,17 @@ Power-based remediation provides the following capabilities:
 Machine deletion on bare metal cluster triggers reprovisioning of a bare metal host.
 Usually bare metal reprovisioning is a lengthy process, during which the cluster
 is missing compute resources and applications might be interrupted.
-To change the default remediation process from machine deletion to host power-cycle,
-annotate the `MachineHealthCheck` resource with the
+
+There are two ways to change the default remediation process from machine deletion to host power-cycle:
+
+. Annotate the `MachineHealthCheck` resource with the
 `machine.openshift.io/remediation-strategy: external-baremetal` annotation.
+. Create a `Metal3RemediationTemplate` resource, and refer to it in the `spec.remediationTemplate` of the `MachineHealthCheck`.
 
-After you set the annotation, unhealthy machines are power-cycled by using
-BMC credentials.
+After using one of these methods, unhealthy machines are power-cycled by using Baseboard Management Controller (BMC) credentials.
 
 [id="mgmt-understanding-remediation-process_{context}"]
-== Understanding the remediation process
+== Understanding the annotation-based remediation process
 
 The remediation process operates as follows:
 
@@ -47,13 +49,30 @@ The remediation process operates as follows:
 If the power operations did not complete, the bare metal machine controller triggers the reprovisioning of the unhealthy node unless this is a control plane node or a node that was provisioned externally.
 ====
 
+[id="mgmt-understanding-metal3-remediation-process_{context}"]
+== Understanding the metal3-based remediation process
+
+The remediation process operates as follows:
+
+. The MachineHealthCheck (MHC) controller detects that a node is unhealthy.
+. The MHC creates a metal3 remediation custom resource for the metal3 remediation controller, which requests to power-off the unhealthy node.
+. After the power is off, the node is deleted, which allows the cluster to reschedule the affected workload on other nodes.
+. The metal3 remediation controller requests to power on the node.
+. After the node is up, the node re-registers itself with the cluster, resulting in the creation of a new node.
+. After the node is recreated, the metal3 remediation controller restores the annotations and labels that existed on the unhealthy node before its deletion.
+
+[NOTE]
+====
+If the power operations did not complete, the metal3 remediation controller triggers the reprovisioning of the unhealthy node unless this is a control plane node or a node that was provisioned externally.
+====
+
 [id="mgmt-creating-mhc-baremetal_{context}"]
 == Creating a MachineHealthCheck resource for bare metal
 
 .Prerequisites
 
 * The {product-title} is installed using installer-provisioned infrastructure (IPI).
-* Access to Baseboard Management Controller (BMC) credentials (or BMC access to each node)
+* Access to BMC credentials (or BMC access to each node).
 * Network access to the BMC interface of the unhealthy node.
 
 .Procedure
@@ -65,7 +84,7 @@ If the power operations did not complete, the bare metal machine controller trig
 $ oc apply -f healthcheck.yaml
 ----
 
-.Sample `MachineHealthCheck` resource for bare metal
+.Sample `MachineHealthCheck` resource for bare metal, annotation-based remediation
 [source,yaml]
 ----
 apiVersion: machine.openshift.io/v1beta1
@@ -105,6 +124,52 @@ spec:
 The `matchLabels` are examples only; you must map your machine groups based on your specific needs.
 ====
 
+.Sample `MachineHealthCheck` resource for bare metal, metal3-based remediation
+[source,yaml]
+----
+apiVersion: machine.openshift.io/v1beta1
+kind: MachineHealthCheck
+metadata:
+  name: example
+  namespace: openshift-machine-api
+spec:
+  selector:
+    matchLabels:
+      machine.openshift.io/cluster-api-machine-role: <role>
+      machine.openshift.io/cluster-api-machine-type: <role>
+      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone>
+  remediationTemplate:
+    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
+    kind: Metal3RemediationTemplate
+    name: metal3-remediation-template
+    namespace: openshift-machine-api
+  unhealthyConditions:
+  - type: "Ready"
+    timeout: "300s"
+----
+
+.Sample `Metal3RemediationTemplate` resource for bare metal, metal3-based remediation
+[source,yaml]
+----
+apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
+kind: Metal3RemediationTemplate
+metadata:
+  name: metal3-remediation-template
+  namespace: openshift-machine-api
+spec:
+  template:
+    spec:
+      strategy:
+        type: Reboot
+        retryLimit: 1
+        timeout: 5m0s
+----
+
+[NOTE]
+====
+The `matchLabels` are examples only; you must map your machine groups based on your specific needs. The `annotations` section does not apply to metal3-based remediation. Annotation-based remediation and metal3-based remediation are mutually exclusive.
+====
+
 [mgmt-troubleshooting-issue-power-remediation_{context}]
 == Troubleshooting issues with power-based remediation
 
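
The added note states that annotation-based and metal3-based remediation are mutually exclusive. The following is a minimal sketch, not part of the commit, of how a manifest could be checked for that rule before it is applied. The function name `remediation_mode`, the returned labels, and the sample dictionaries are hypothetical; only the annotation key, its value, and the `spec.remediationTemplate` field come from the documentation text above. The manifests are modeled as plain Python dictionaries so the sketch needs no YAML library.

```python
# Hypothetical pre-apply check for a MachineHealthCheck manifest.
# The annotation key/value and spec.remediationTemplate come from the
# documentation; everything else here is an illustrative assumption.

ANNOTATION_KEY = "machine.openshift.io/remediation-strategy"
ANNOTATION_VALUE = "external-baremetal"


def remediation_mode(mhc: dict) -> str:
    """Return a label for the remediation path that a manifest selects.

    Raises ValueError if the manifest combines the annotation with a
    remediation template, which the documentation says is not allowed.
    """
    annotations = (mhc.get("metadata") or {}).get("annotations") or {}
    template = (mhc.get("spec") or {}).get("remediationTemplate")

    uses_annotation = annotations.get(ANNOTATION_KEY) == ANNOTATION_VALUE
    uses_template = bool(template)

    if uses_annotation and uses_template:
        raise ValueError(
            "annotation-based and metal3-based remediation are mutually exclusive"
        )
    if uses_template:
        if template.get("kind") == "Metal3RemediationTemplate":
            return "metal3"
        return "other-template"
    if uses_annotation:
        return "annotation"
    # Neither method is configured: the default is machine deletion.
    return "machine-deletion"


# Sample manifests as dictionaries (selector and conditions omitted).
annotated_mhc = {
    "kind": "MachineHealthCheck",
    "metadata": {
        "name": "example",
        "annotations": {ANNOTATION_KEY: ANNOTATION_VALUE},
    },
    "spec": {},
}

metal3_mhc = {
    "kind": "MachineHealthCheck",
    "metadata": {"name": "example"},
    "spec": {
        "remediationTemplate": {
            "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
            "kind": "Metal3RemediationTemplate",
            "name": "metal3-remediation-template",
            "namespace": "openshift-machine-api",
        }
    },
}
```

Calling `remediation_mode(metal3_mhc)` returns `"metal3"`, and a manifest that carries both the annotation and a `remediationTemplate` raises `ValueError`. Whether the cluster itself rejects such a manifest is not stated in the commit, so a check like this is only a client-side convenience.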
