Commit 3dfd476

Merge pull request #51820 from ogradyp/TELCODOCS-502
TELCODOCS-502/830: SNR / NHC Operator - Control Plane Fencing
2 parents a552342 + 73733b3 commit 3dfd476

7 files changed, +150 -15 lines changed
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
+// Module included in the following assemblies:
+//
+// *nodes/nodes/eco-poison-pill-operator.adoc
+
+:_content-type: PROCEDURE
+[id="configuring-control-plane-machine-health-check-with-self-node-remediation-operator_{context}"]
+= Configuring control-plane machine health checks to use the Self Node Remediation Operator
+
+Use the following procedure to configure the control-plane machine health checks to use the Self Node Remediation Operator as a remediation provider.
+
+.Prerequisites
+
+* Install the OpenShift CLI (`oc`).
+* Log in as a user with `cluster-admin` privileges.
+
+.Procedure
+
+. Create a `SelfNodeRemediationTemplate` CR:
+
+.. Define the `SelfNodeRemediationTemplate` CR:
++
+[source,yaml]
+----
+apiVersion: self-node-remediation.medik8s.io/v1alpha1
+kind: SelfNodeRemediationTemplate
+metadata:
+  namespace: openshift-machine-api
+  name: selfnoderemediationtemplate-sample
+spec:
+  template:
+    spec:
+      remediationStrategy: ResourceDeletion <1>
+----
+<1> Specifies the remediation strategy. The default strategy is `ResourceDeletion`.
+
+.. To create the `SelfNodeRemediationTemplate` CR, run the following command:
++
+[source,terminal]
+----
+$ oc create -f <snrt-name>.yaml
+----
+
+. Create or update the `MachineHealthCheck` CR to point to the `SelfNodeRemediationTemplate` CR:
+
+.. Define or update the `MachineHealthCheck` CR:
++
+[source,yaml]
+----
+apiVersion: machine.openshift.io/v1beta1
+kind: MachineHealthCheck
+metadata:
+  name: machine-health-check
+  namespace: openshift-machine-api
+spec:
+  selector:
+    matchLabels:
+      machine.openshift.io/cluster-api-machine-role: "control-plane"
+      machine.openshift.io/cluster-api-machine-type: "control-plane"
+  unhealthyConditions:
+  - type: "Ready"
+    timeout: "300s"
+    status: "False"
+  - type: "Ready"
+    timeout: "300s"
+    status: "Unknown"
+  maxUnhealthy: "40%"
+  nodeStartupTimeout: "10m"
+  remediationTemplate: <1>
+    kind: SelfNodeRemediationTemplate
+    apiVersion: self-node-remediation.medik8s.io/v1alpha1
+    name: selfnoderemediationtemplate-sample
+----
+<1> Specifies the details for the remediation template.
+
++
+.. To create a `MachineHealthCheck` CR, run the following command:
++
+[source,terminal]
+----
+$ oc create -f <mhc-name>.yaml
+----
+
+.. To update a `MachineHealthCheck` CR, run the following command:
++
+[source,terminal]
+----
+$ oc apply -f <mhc-name>.yaml
+----
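
The nesting of the `MachineHealthCheck` CR defined in this procedure can be sketched as a plain Python dictionary. This is only an illustration: the helper name `build_machine_health_check` and its parameters are hypothetical and are not part of the commit or of any Operator API. The field names and values are copied from the YAML in the module. The sketch serializes the result as JSON, on the understanding that `oc create -f` accepts JSON manifests as well as YAML.

```python
import json


def build_machine_health_check(role, template_name="selfnoderemediationtemplate-sample"):
    """Return a MachineHealthCheck manifest as a plain dict.

    Hypothetical helper for illustration only. `role` is the machine label
    value used in the selector, for example "control-plane" or "worker".
    """
    return {
        "apiVersion": "machine.openshift.io/v1beta1",
        "kind": "MachineHealthCheck",
        "metadata": {
            "name": "machine-health-check",
            "namespace": "openshift-machine-api",
        },
        "spec": {
            # The selector decides which machines this health check covers.
            "selector": {
                "matchLabels": {
                    "machine.openshift.io/cluster-api-machine-role": role,
                    "machine.openshift.io/cluster-api-machine-type": role,
                }
            },
            "unhealthyConditions": [
                {"type": "Ready", "timeout": "300s", "status": "False"},
                {"type": "Ready", "timeout": "300s", "status": "Unknown"},
            ],
            "maxUnhealthy": "40%",
            "nodeStartupTimeout": "10m",
            # Points the health check at the SelfNodeRemediationTemplate CR.
            "remediationTemplate": {
                "kind": "SelfNodeRemediationTemplate",
                "apiVersion": "self-node-remediation.medik8s.io/v1alpha1",
                "name": template_name,
            },
        },
    }


if __name__ == "__main__":
    # Print the control-plane variant as JSON; redirect to a file to reuse it.
    print(json.dumps(build_machine_health_check("control-plane"), indent=2))
```

Calling the same helper with `"worker"` would produce the worker variant that the existing module documents, which is why the role is a parameter in this sketch.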

modules/eco-configuring-machine-health-check-with-self-node-remediation.adoc

Lines changed: 8 additions & 9 deletions
@@ -6,7 +6,7 @@
 [id="configuring-machine-health-check-with-self-node-remediation-operator_{context}"]
 = Configuring machine health checks to use the Self Node Remediation Operator

-Use the following procedure to configure the machine health checks to use the Self Node Remediation Operator as a remediation provider.
+Use the following procedure to configure the worker or control-plane machine health checks to use the Self Node Remediation Operator as a remediation provider.

 .Prerequisites

@@ -37,7 +37,7 @@ spec:
 +
 [source,terminal]
 ----
-$ oc create -f <snr-name>.yaml
+$ oc create -f <snrt-name>.yaml
 ----

 . Create or update the `MachineHealthCheck` CR to point to the `SelfNodeRemediationTemplate` CR:
@@ -53,7 +53,7 @@ metadata:
   namespace: openshift-machine-api
 spec:
   selector:
-    matchLabels:
+    matchLabels: <1>
       machine.openshift.io/cluster-api-machine-role: "worker"
       machine.openshift.io/cluster-api-machine-type: "worker"
   unhealthyConditions:
@@ -65,26 +65,25 @@ spec:
     status: "Unknown"
   maxUnhealthy: "40%"
   nodeStartupTimeout: "10m"
-  remediationTemplate: <1>
+  remediationTemplate: <2>
     kind: SelfNodeRemediationTemplate
     apiVersion: self-node-remediation.medik8s.io/v1alpha1
     name: selfnoderemediationtemplate-sample
 ----
-<1> Specifies the details for the remediation template.
+<1> Selects whether the machine health check is for `worker` or `control-plane` nodes. The label can also be user-defined.
+<2> Specifies the details for the remediation template.

 +
 .. To create a `MachineHealthCheck` CR, run the following command:
 +
 [source,terminal]
 ----
-$ oc create -f <file-name>.yaml
+$ oc create -f <mhc-name>.yaml
 ----

 .. To update a `MachineHealthCheck` CR, run the following command:
 +
 [source,terminal]
 ----
-$ oc apply -f <file-name>.yaml
+$ oc apply -f <mhc-name>.yaml
 ----
-
-
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes/eco-node-health-check-operator.adoc
+
+:_content-type: CONCEPT
+[id="control-plane-fencing-node-health-check-operator_{context}"]
+= Control plane fencing
+
+In earlier releases, you could enable Self Node Remediation and Node Health Check on worker nodes. In the event of node failure, you can now also follow remediation strategies on control plane nodes.
+
+Do not use the same `NodeHealthCheck` CR for worker nodes and control plane nodes. Grouping worker nodes and control plane nodes together can result in incorrect evaluation of the minimum healthy node count, and cause unexpected or missing remediations. This is because of the way the Node Health Check Operator handles control plane nodes. You should group the control plane nodes in their own group and the worker nodes in their own group. If required, you can also create multiple groups of worker nodes.
+
+Considerations for remediation strategies:
+
+* Avoid defining multiple Node Health Check configurations that overlap the same nodes, because overlapping configurations can result in unexpected behavior. This suggestion applies to both worker and control plane nodes.
+* The Node Health Check Operator implements a hardcoded limitation of remediating a maximum of one control plane node at a time. Multiple control plane nodes should not be remediated at the same time.
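
The advice about overlapping configurations can be checked mechanically once each configuration has been resolved to the set of node names it selects. The following is a minimal sketch under that assumption. The function name `find_overlapping_groups`, the configuration names, and the node names are all hypothetical, and resolving real `NodeHealthCheck` selectors to node names is out of scope here.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def find_overlapping_groups(groups: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Return the nodes shared by each pair of configurations.

    `groups` maps a configuration name to the node names it selects.
    Pairs that share no nodes are left out of the result.
    """
    overlaps = {}
    # Compare every unordered pair of configurations exactly once.
    for name_a, name_b in combinations(sorted(groups), 2):
        shared = groups[name_a] & groups[name_b]
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Hypothetical example: control plane nodes in their own group,
# and two worker groups that accidentally share one node.
example_groups = {
    "nhc-control-plane": {"master-0", "master-1", "master-2"},
    "nhc-workers-a": {"worker-0", "worker-1"},
    "nhc-workers-b": {"worker-1", "worker-2"},
}
example_overlaps = find_overlapping_groups(example_groups)
```

In this example only the two worker groups overlap, on `worker-1`. The control plane group is disjoint from both, which matches the recommendation to keep control plane nodes in their own group.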

modules/eco-self-node-remediation-operator-about.adoc

Lines changed: 7 additions & 5 deletions
@@ -89,12 +89,13 @@ The Self Node Remediation Operator also creates the `SelfNodeRemediationTemplate

 `ResourceDeletion`:: This remediation strategy removes the pods and associated volume attachments on the node rather than the node object. This strategy helps to recover workloads faster. `ResourceDeletion` is the default remediation strategy.

-`NodeDeletion`:: This remediation strategy removes the node object.
+`NodeDeletion`:: This remediation strategy is deprecated and will be removed in a future release. In the current release, the `ResourceDeletion` strategy is used even if the `NodeDeletion` strategy is selected.

-The Self Node Remediation Operator creates the following `SelfNodeRemediationTemplate` CRs for each strategy:
+
+The Self Node Remediation Operator creates the following `SelfNodeRemediationTemplate` CR for the strategy:

 * `self-node-remediation-resource-deletion-template`, which the `ResourceDeletion` remediation strategy uses
-* `self-node-remediation-node-deletion-template`, which the `NodeDeletion` remediation strategy uses
+//* `self-node-remediation-node-deletion-template`, which the `NodeDeletion` remediation strategy uses

 The `SelfNodeRemediationTemplate` CR resembles the following YAML file:

@@ -111,5 +112,6 @@ spec:
     spec:
       remediationStrategy: <remediation_strategy> <2>
 ----
-<1> Specifies the type of remediation template based on the remediation strategy. Replace `<remediation_object>` with either `resource` or `node`, for example, `self-node-remediation-resource-deletion-template`.
-<2> Specifies the remediation strategy. The remediation strategy can either be `ResourceDeletion` or `NodeDeletion`.
+<1> Specifies the type of remediation template based on the remediation strategy. Replace `<remediation_object>` with either `resource` or `node`; for example, `self-node-remediation-resource-deletion-template`.
+//<2> Specifies the remediation strategy. The remediation strategy can either be `ResourceDeletion` or `NodeDeletion`.
+<2> Specifies the remediation strategy. The remediation strategy is `ResourceDeletion`.
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes/eco-node-health-check-operator.adoc
+
+:_content-type: CONCEPT
+[id="control-plane-fencing-self-node-remediation-operator_{context}"]
+= Control plane fencing
+
+In earlier releases, you could enable Self Node Remediation and Node Health Check on worker nodes. In the event of node failure, you can now also follow remediation strategies on control plane nodes.
+
+Self Node Remediation occurs in two primary scenarios.
+
+* API Server Connectivity
+** In this scenario, the control plane node to be remediated is not isolated. It can be directly connected to the API Server, or it can be indirectly connected to the API Server through worker nodes or control-plane nodes that are directly connected to the API Server.
+** When there is API Server Connectivity, the control plane node is remediated only if the Node Health Check Operator has created a `SelfNodeRemediation` custom resource (CR) for the node.
+
+* No API Server Connectivity
+** In this scenario, the control plane node to be remediated is isolated from the API Server. The node cannot connect directly or indirectly to the API Server.
+** When there is no API Server Connectivity, the control plane node will be remediated as outlined in these steps:
+
+
+*** Check the status of the control plane node with the majority of the peer worker nodes. If its status is unhealthy or unknown, even if the control plane node can communicate with the peer worker nodes, the node will be analyzed further.
+**** Self-diagnose the status of the control plane node
+***** If self diagnostics passed, no action will be taken.
+***** If self diagnostics failed, the node will be fenced and remediated.
+*** If the node did not manage to communicate with most of its worker peers, check the connectivity of the control plane node with other control plane nodes. If the node can communicate with any other control plane peer, no action will be taken. Otherwise, the node will be fenced and remediated.
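
The steps for the "No API Server Connectivity" scenario read as a small decision tree, which can be sketched in Python as follows. This is an illustrative model of the prose only, not the Operator's implementation. The function name, parameter names, and the `Action` enum are hypothetical. The module does not say what happens when a majority of worker peers is reachable and reports the node as healthy, so the sketch assumes no action in that case and marks the assumption in a comment.

```python
from enum import Enum


class Action(Enum):
    NO_ACTION = "no action"
    FENCE_AND_REMEDIATE = "fence and remediate"


def decide_without_api_server(
    reached_worker_majority: bool,
    peers_report_unhealthy_or_unknown: bool,
    self_diagnostics_passed: bool,
    reached_any_control_plane_peer: bool,
) -> Action:
    """Model the documented steps for an isolated control plane node."""
    if reached_worker_majority:
        if not peers_report_unhealthy_or_unknown:
            # Assumption: this case is not described in the module text.
            return Action.NO_ACTION
        # Peers report unhealthy or unknown: fall back to self diagnostics.
        if self_diagnostics_passed:
            return Action.NO_ACTION
        return Action.FENCE_AND_REMEDIATE
    # Most worker peers were unreachable: check control plane peers instead.
    if reached_any_control_plane_peer:
        return Action.NO_ACTION
    return Action.FENCE_AND_REMEDIATE
```

For example, a node that cannot reach most worker peers and cannot reach any control plane peer maps to `Action.FENCE_AND_REMEDIATE`, while the same node with at least one reachable control plane peer maps to `Action.NO_ACTION`.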

nodes/nodes/eco-node-health-check-operator.adoc

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@ xref:../../nodes/nodes/eco-self-node-remediation-operator.adoc#self-node-remedia

 include::modules/eco-node-health-check-operator-about.adoc[leveloffset=+1]

+include::modules/eco-node-health-check-operator-control-plane-fencing.adoc[leveloffset=+1]
+
 include::modules/eco-node-health-check-operator-installation-web-console.adoc[leveloffset=+1]

 include::modules/eco-node-health-check-operator-installation-cli.adoc[leveloffset=+1]

nodes/nodes/eco-self-node-remediation-operator.adoc

Lines changed: 3 additions & 1 deletion
@@ -15,6 +15,8 @@ include::modules/eco-self-node-remediation-about-watchdog.adoc[leveloffset=+2]
 .Additional resources
 xref:../../virt/virtual_machines/advanced_vm_management/virt-configuring-a-watchdog.adoc#virt-configuring-a-watchdog[Configuring a watchdog]

+include::modules/eco-self-node-remediation-operator-control-plane-fencing.adoc[leveloffset=+1]
+
 include::modules/eco-self-node-remediation-operator-installation-web-console.adoc[leveloffset=+1]

 include::modules/eco-self-node-remediation-operator-installation-cli.adoc[leveloffset=+1]
@@ -32,4 +34,4 @@ To collect debugging information about the Self Node Remediation Operator, use t
 == Additional resources

 * The Self Node Remediation Operator is supported in a restricted network environment. For more information, see xref:../../operators/admin/olm-restricted-networks.adoc#olm-restricted-networks[Using Operator Lifecycle Manager on restricted networks].
-* xref:../../operators/admin/olm-deleting-operators-from-cluster.adoc#olm-deleting-operators-from-a-cluster[Deleting Operators from a cluster]
+* xref:../../operators/admin/olm-deleting-operators-from-cluster.adoc#olm-deleting-operators-from-a-cluster[Deleting Operators from a cluster]
