
Commit e568a5a

committed
squashing commits
1 parent 28f1bee commit e568a5a

4 files changed

+118
-55
lines changed

machine_management/deploying-machine-health-checks.adoc

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,3 +22,7 @@ include::modules/machine-health-checks-resource.adoc[leveloffset=+1]
2222
.Additional resources
2323

2424
include::modules/machine-health-checks-creating.adoc[leveloffset=+1]
25+
26+
You can configure and deploy a machine health check to detect and repair unhealthy bare metal nodes.
27+
28+
include::modules/mgmt-power-remediation-baremetal-about.adoc[leveloffset=+1]

modules/machine-health-checks-about.adoc

Lines changed: 0 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -33,19 +33,6 @@ To limit the disruptive impact of machine deletions, the controller drains and d
3333

3434
To stop the check, remove the custom resource.
3535

36-
[id="machine-health-checks-bare-metal_{context}"]
37-
== MachineHealthChecks on Bare Metal
38-
39-
Machine deletion on bare metal cluster triggers reprovisioning of a bare metal host.
40-
Usually bare metal reprovisioning is a lengthy process, during which the cluster
41-
is missing compute resources and applications might be interrupted.
42-
To change the default remediation process from machine deletion to host power-cycle,
43-
annotate the MachineHealthCheck resource with the
44-
`machine.openshift.io/remediation-strategy: external-baremetal` annotation.
45-
46-
After you set the annotation, unhealthy machines are power-cycled by using
47-
BMC credentials.
48-
4936
[id="machine-health-checks-limitations_{context}"]
5037
== Limitations when deploying machine health checks
5138

modules/machine-health-checks-resource.adoc

Lines changed: 1 addition & 42 deletions
Original file line numberDiff line numberDiff line change
@@ -7,49 +7,8 @@
77
[id="machine-health-checks-resource_{context}"]
88
= Sample MachineHealthCheck resource
99

10-
The `MachineHealthCheck` resource resembles one of the following YAML files:
10+
The `MachineHealthCheck` resource for all cloud-based installation types, and for any installation type other than bare metal, resembles the following YAML file:
1111

12-
.`MachineHealthCheck` for bare metal
13-
[source,yaml]
14-
----
15-
apiVersion: machine.openshift.io/v1beta1
16-
kind: MachineHealthCheck
17-
metadata:
18-
name: example <1>
19-
namespace: openshift-machine-api
20-
annotations:
21-
machine.openshift.io/remediation-strategy: external-baremetal <2>
22-
spec:
23-
selector:
24-
matchLabels:
25-
machine.openshift.io/cluster-api-machine-role: <role> <3>
26-
machine.openshift.io/cluster-api-machine-type: <role> <3>
27-
machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> <4>
28-
unhealthyConditions:
29-
- type: "Ready"
30-
timeout: "300s" <5>
31-
status: "False"
32-
- type: "Ready"
33-
timeout: "300s" <5>
34-
status: "Unknown"
35-
maxUnhealthy: "40%" <6>
36-
nodeStartupTimeout: "10m" <7>
37-
----
38-
39-
<1> Specify the name of the machine health check to deploy.
40-
<2> For bare metal clusters, you must include the `machine.openshift.io/remediation-strategy: external-baremetal` annotation in the `annotations` section to enable power-cycle remediation. With this remediation strategy, unhealthy hosts are rebooted instead of removed from the cluster.
41-
<3> Specify a label for the machine pool that you want to check.
42-
<4> Specify the machine set to track in `<cluster_name>-<label>-<zone>` format. For example, `prod-node-us-east-1a`.
43-
<5> Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the machine will be remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
44-
<6> Specify the amount of unhealthy machines allowed in the targeted pool. This can be set as a percentage or an integer.
45-
<7> Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy.
46-
47-
[NOTE]
48-
====
49-
The `matchLabels` are examples only; you must map your machine groups based on your specific needs.
50-
====
51-
52-
.`MachineHealthCheck` for all other installation types
5312
[source,yaml]
5413
----
5514
apiVersion: machine.openshift.io/v1beta1

modules/mgmt-power-remediation-baremetal-about.adoc

Lines changed: 113 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,113 @@
1+
// Module included in the following assemblies:
2+
3+
// * machine_management/deploying-machine-health-checks.adoc
4+
5+
[id="mgmt-power-remediation-baremetal-about_{context}"]
6+
= About power-based remediation of bare metal
7+
In a bare metal cluster, remediation of nodes is critical to ensuring the overall health of the cluster. Physically remediating a cluster can be challenging, and any delay in putting the machine into a safe or operational state increases both the time the cluster remains in a degraded state and the risk that subsequent failures might bring the cluster offline. Power-based remediation helps counter such challenges.
8+
9+
Instead of reprovisioning the nodes, power-based remediation uses a power controller to power off an inoperable node. This type of remediation is also called power fencing.
10+
11+
{product-title} uses the `MachineHealthCheck` controller to detect faulty bare metal nodes. Power-based remediation is fast and reboots faulty nodes instead of removing them from the cluster.
12+
13+
Power-based remediation provides the following capabilities:
14+
15+
* Allows the recovery of control plane nodes
16+
* Reduces the risk of data loss in hyperconverged environments
17+
* Reduces the downtime associated with recovering physical machines
18+
19+
[id="machine-health-checks-bare-metal_{context}"]
20+
== MachineHealthChecks on bare metal
21+
22+
Machine deletion on a bare metal cluster triggers reprovisioning of a bare metal host.
23+
Usually, bare metal reprovisioning is a lengthy process, during which the cluster
24+
is missing compute resources and applications might be interrupted.
25+
To change the default remediation process from machine deletion to host power-cycle,
26+
annotate the `MachineHealthCheck` resource with the
27+
`machine.openshift.io/remediation-strategy: external-baremetal` annotation.
28+
29+
After you set the annotation, unhealthy machines are power-cycled by using
30+
BMC credentials.
31+
32+
[id="mgmt-understanding-remediation-process_{context}"]
33+
== Understanding the remediation process
34+
35+
The remediation process operates as follows:
36+
37+
. The MachineHealthCheck (MHC) controller detects that a node is unhealthy.
38+
. The MHC notifies the bare metal machine controller, which requests to power off the unhealthy node.
39+
. After the power is off, the node is deleted, which allows the cluster to reschedule the affected workload on other nodes.
40+
. The bare metal machine controller requests to power on the node.
41+
. After the node is up, the node re-registers itself with the cluster, resulting in the creation of a new node.
42+
. After the node is recreated, the bare metal machine controller restores the annotations and labels that existed on the unhealthy node before its deletion.
43+
44+
[NOTE]
45+
====
46+
If the power operations do not complete, the bare metal machine controller triggers reprovisioning of the unhealthy node, unless it is a master node or a node that was provisioned externally.
47+
====
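
The remediation steps above can be sketched as a small simulation. This is only an illustration of the documented order of operations, not the controller's implementation: the `Node` class, the `power_remediate` function, and the log messages are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a bare metal node; not an OpenShift type."""
    name: str
    labels: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)
    powered_on: bool = True
    registered: bool = True
    is_master: bool = False
    externally_provisioned: bool = False


def power_remediate(node, power_ok=True):
    """Walk through the documented steps and return a log of what happened."""
    log = ["unhealthy node detected"]
    # Remember metadata so it can be restored after the node is recreated.
    saved_labels = dict(node.labels)
    saved_annotations = dict(node.annotations)

    log.append("power-off requested")
    if not power_ok:
        # Fallback from the note: reprovision unless the node is a master
        # node or was provisioned externally.
        if node.is_master or node.externally_provisioned:
            log.append("power operation failed; no reprovisioning")
        else:
            log.append("power operation failed; reprovisioning triggered")
        return log

    # Power is off: the node object is deleted so workloads can move.
    node.powered_on = False
    node.registered = False
    node.labels, node.annotations = {}, {}
    log.append("node deleted; workloads can be rescheduled")

    node.powered_on = True
    log.append("power-on requested")

    # The node comes back and registers itself as a new node.
    node.registered = True
    log.append("node re-registered")

    node.labels, node.annotations = saved_labels, saved_annotations
    log.append("labels and annotations restored")
    return log
```

Calling `power_remediate(Node("worker-0", labels={"role": "worker"}))` returns the full sequence of steps, while passing `power_ok=False` exercises the fallback described in the note.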
48+
49+
[id="mgmt-creating-mhc-baremetal_{context}"]
50+
== Creating a MachineHealthCheck resource for bare metal
51+
52+
.Prerequisites
53+
54+
* {product-title} is installed using installer-provisioned infrastructure (IPI).
55+
* You have access to Baseboard Management Controller (BMC) credentials, or BMC access to each node.
56+
* Network access to the BMC interface of the unhealthy node.
57+
58+
.Procedure
59+
. Create a `healthcheck.yaml` file that contains the definition of your machine health check.
60+
. Apply the `healthcheck.yaml` file to your cluster using the following command:
61+
62+
[source,terminal]
63+
----
64+
$ oc apply -f healthcheck.yaml
65+
----
66+
67+
.Sample `MachineHealthCheck` resource for bare metal
68+
[source,yaml]
69+
----
70+
apiVersion: machine.openshift.io/v1beta1
71+
kind: MachineHealthCheck
72+
metadata:
73+
  name: example <1>
74+
  namespace: openshift-machine-api
75+
  annotations:
76+
    machine.openshift.io/remediation-strategy: external-baremetal <2>
77+
spec:
78+
  selector:
79+
    matchLabels:
80+
      machine.openshift.io/cluster-api-machine-role: <role> <3>
81+
      machine.openshift.io/cluster-api-machine-type: <role> <3>
82+
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> <4>
83+
  unhealthyConditions:
84+
  - type: "Ready"
85+
    timeout: "300s" <5>
86+
    status: "False"
87+
  - type: "Ready"
88+
    timeout: "300s" <5>
89+
    status: "Unknown"
90+
  maxUnhealthy: "40%" <6>
91+
  nodeStartupTimeout: "10m" <7>
92+
----
93+
94+
<1> Specify the name of the machine health check to deploy.
95+
<2> For bare metal clusters, you must include the `machine.openshift.io/remediation-strategy: external-baremetal` annotation in the `annotations` section to enable power-cycle remediation. With this remediation strategy, unhealthy hosts are rebooted instead of removed from the cluster.
96+
<3> Specify a label for the machine pool that you want to check.
97+
<4> Specify the machine set to track in `<cluster_name>-<label>-<zone>` format. For example, `prod-node-us-east-1a`.
98+
<5> Specify the timeout duration for the node condition. If the condition is met for the duration of the timeout, the machine will be remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
99+
<6> Specify the number of unhealthy machines allowed in the targeted pool. This can be set as a percentage or an integer.
100+
<7> Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy.
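
To make the `unhealthyConditions` and `maxUnhealthy` fields more concrete, the following sketch shows one way to read them. It is a simplified model under stated assumptions, not the controller's code: the function names are hypothetical, timeouts are passed as plain seconds, and rounding a percentage down to a whole number of machines is an assumption of this example.

```python
def allowed_unhealthy(max_unhealthy, total_machines):
    """Convert maxUnhealthy ("40%" or an integer) to a machine count.

    Assumption: a percentage is rounded down to a whole number of machines.
    """
    text = str(max_unhealthy).strip()
    if text.endswith("%"):
        percent = int(text[:-1])
        return total_machines * percent // 100
    return int(text)


def condition_is_unhealthy(condition, rules):
    """Return True if a node condition matches a rule for at least its timeout.

    condition: {"type": ..., "status": ..., "seconds_in_state": ...}
    rules:     [{"type": ..., "status": ..., "timeout_seconds": ...}, ...]
    """
    for rule in rules:
        if (
            condition["type"] == rule["type"]
            and condition["status"] == rule["status"]
            and condition["seconds_in_state"] >= rule["timeout_seconds"]
        ):
            return True
    return False


def remediation_allowed(max_unhealthy, total_machines, unhealthy_machines):
    """Remediate only while the unhealthy count stays within the limit."""
    return unhealthy_machines <= allowed_unhealthy(max_unhealthy, total_machines)


# Rules mirroring the sample resource: Ready is False or Unknown for 300s.
SAMPLE_RULES = [
    {"type": "Ready", "status": "False", "timeout_seconds": 300},
    {"type": "Ready", "status": "Unknown", "timeout_seconds": 300},
]
```

With these assumptions, a pool of five machines and `maxUnhealthy: "40%"` tolerates two unhealthy machines, so a third unhealthy machine would stop further remediation in this model.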
101+
102+
[NOTE]
103+
====
104+
The `matchLabels` are examples only; you must map your machine groups based on your specific needs.
105+
====
106+
107+
[id="mgmt-troubleshooting-issue-power-remediation_{context}"]
108+
== Troubleshooting issues with power-based remediation
109+
110+
To troubleshoot an issue with power-based remediation, verify the following:
111+
112+
* You have access to the BMC.
113+
* The BMC is connected to the master node that is responsible for running the remediation task.
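
As a rough aid for the connectivity check, the following sketch tests whether a TCP connection to a BMC address can be opened from the machine where it runs. It rests on assumptions: the `bmc_reachable` name is hypothetical, and port 443 presumes a BMC that exposes an HTTPS interface; a BMC that is reached over a different protocol or port needs a different check. A successful connection shows network reachability only and does not validate credentials.

```python
import socket


def bmc_reachable(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout.

    Assumption: the BMC listens on a TCP port (443 by default here).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run the check from the node that performs the remediation, passing the BMC address of the unhealthy node, so that the result reflects the network path the remediation task uses.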
