Commit c2f4aff

Merge pull request #54678 from mburke5678/53866
TELCODOCS-843: Remediation, Fencing, and Maintentance concept details added
2 parents 8b10c39 + ee7acb8 commit c2f4aff

15 files changed

+194
-125
lines changed

_topic_maps/_topic_map.yml

Lines changed: 13 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -2142,12 +2142,19 @@ Topics:
21422142
File: nodes-nodes-managing-max-pods
21432143
- Name: Using the Node Tuning Operator
21442144
File: nodes-node-tuning-operator
2145-
- Name: Remediating nodes with the Self Node Remediation Operator
2146-
File: eco-self-node-remediation-operator
2147-
- Name: Deploying node health checks by using the Node Health Check Operator
2148-
File: eco-node-health-check-operator
2149-
- Name: Using the Node Maintenance Operator to place nodes in maintenance mode
2150-
File: eco-node-maintenance-operator
2145+
- Name: Remediation, fencing, and maintenance
2146+
Dir: ecosystems
2147+
Topics:
2148+
- Name: About node remediation, fencing, and maintenance
2149+
File: eco-about-remediation-fencing-maintenance
2150+
- Name: Using Self Node Remediation
2151+
File: eco-self-node-remediation-operator
2152+
- Name: Remediating nodes with Machine Health Checks
2153+
File: eco-machine-health-checks
2154+
- Name: Remediating nodes with Node Health Checks
2155+
File: eco-node-health-check-operator
2156+
- Name: Placing nodes in maintenance mode with Node Maintenance Operator
2157+
File: eco-node-maintenance-operator
21512158
- Name: Understanding node rebooting
21522159
File: nodes-nodes-rebooting
21532160
- Name: Freeing node resources using garbage collection

modules/eco-self-node-remediation-operator-about.adoc

Lines changed: 0 additions & 90 deletions
@@ -25,93 +25,3 @@ status:
2525
<1> Displays the last error that occurred during remediation. When remediation succeeds or if no errors occur, the field is left empty.
2626

2727
The Self Node Remediation Operator minimizes downtime for stateful applications and restores compute capacity if transient failures occur. You can use this Operator regardless of the management interface, such as IPMI or an API to provision a node, and regardless of the cluster installation type, such as installer-provisioned infrastructure or user-provisioned infrastructure.
28-
29-
[id="understanding-self-node-remediation-operator-config_{context}"]
30-
== Understanding the Self Node Remediation Operator configuration
31-
32-
The Self Node Remediation Operator creates the `SelfNodeRemediationConfig` CR with the name `self-node-remediation-config`. The CR is created in the namespace of the Self Node Remediation Operator.
33-
34-
A change in the `SelfNodeRemediationConfig` CR re-creates the Self Node Remediation daemon set.
35-
36-
The `SelfNodeRemediationConfig` CR resembles the following YAML file:
37-
38-
[source,yaml]
39-
----
40-
apiVersion: self-node-remediation.medik8s.io/v1alpha1
41-
kind: SelfNodeRemediationConfig
42-
metadata:
43-
name: self-node-remediation-config
44-
namespace: openshift-operators
45-
spec:
46-
safeTimeToAssumeNodeRebootedSeconds: 180 <1>
47-
watchdogFilePath: /dev/watchdog <2>
48-
isSoftwareRebootEnabled: true <3>
49-
apiServerTimeout: 15s <4>
50-
apiCheckInterval: 5s <5>
51-
maxApiErrorThreshold: 3 <6>
52-
peerApiServerTimeout: 5s <7>
53-
peerDialTimeout: 5s <8>
54-
peerRequestTimeout: 5s <9>
55-
peerUpdateInterval: 15m <10>
56-
----
57-
58-
<1> Specify the timeout duration for the surviving peer, after which the Operator can assume that an unhealthy node has been rebooted. The Operator automatically calculates the lower limit for this value. However, if different nodes have different watchdog timeouts, you must change this value to a higher value.
59-
<2> Specify the file path of the watchdog device in the nodes. If you enter an incorrect path to the watchdog device, the Self Node Remediation Operator automatically detects the softdog device path.
60-
+
61-
If a watchdog device is unavailable, the `SelfNodeRemediationConfig` CR uses a software reboot.
62-
<3> Specify if you want to enable software reboot of the unhealthy nodes. By default, the value of `isSoftwareRebootEnabled` is set to `true`. To disable the software reboot, set the parameter value to `false`.
63-
<4> Specify the timeout duration to check connectivity with each API server. When this duration elapses, the Operator starts remediation. The timeout duration must be more than or equal to 10 milliseconds.
64-
<5> Specify the frequency to check connectivity with each API server. The timeout duration must be more than or equal to 1 second.
65-
<6> Specify a threshold value. After reaching this threshold, the node starts contacting its peers. The threshold value must be more than or equal to 1 second.
66-
<7> Specify the duration of the timeout for the peer to connect the API server. The timeout duration must be more than or equal to 10 milliseconds.
67-
<8> Specify the duration of the timeout for establishing connection with the peer. The timeout duration must be more than or equal to 10 milliseconds.
68-
<9> Specify the duration of the timeout to get a response from the peer. The timeout duration must be more than or equal to 10 milliseconds.
69-
<10> Specify the frequency to update peer information, such as IP address. The timeout duration must be more than or equal to 10 seconds.
70-
71-
[NOTE]
72-
====
73-
You can edit the `self-node-remediation-config` CR that is created by the Self Node Remediation Operator. However, when you try to create a new CR for the Self Node Remediation Operator, the following message is displayed in the logs:
74-
75-
[source,text]
76-
----
77-
controllers.SelfNodeRemediationConfig
78-
ignoring selfnoderemediationconfig CRs that are not named 'self-node-remediation-config'
79-
or not in the namespace of the operator:
80-
'openshift-operators' {"selfnoderemediationconfig":
81-
"openshift-operators/selfnoderemediationconfig-copy"}
82-
----
83-
====
84-
85-
[id="understanding-self-node-remediation-remediation-template-config_{context}"]
86-
== Understanding the Self Node Remediation Template configuration
87-
88-
The Self Node Remediation Operator also creates the `SelfNodeRemediationTemplate` Custom Resource Definition (CRD). This CRD defines the remediation strategy for the nodes. The following remediation strategies are available:
89-
90-
`ResourceDeletion`:: This remediation strategy removes the pods and associated volume attachments on the node rather than the node object. This strategy helps to recover workloads faster. `ResourceDeletion` is the default remediation strategy.
91-
92-
`NodeDeletion`:: This remediation strategy is deprecated and will be removed in a future release. In the current release, the `ResourceDeletion` strategy is used even if the `NodeDeletion` strategy is selected.
93-
94-
95-
The Self Node Remediation Operator creates the following `SelfNodeRemediationTemplate` CR for the strategy:
96-
97-
* `self-node-remediation-resource-deletion-template`, which the `ResourceDeletion` remediation strategy uses
98-
//* `self-node-remediation-node-deletion-template`, which the `NodeDeletion` remediation strategy uses
99-
100-
The `SelfNodeRemediationTemplate` CR resembles the following YAML file:
101-
102-
[source,yaml]
103-
----
104-
apiVersion: self-node-remediation.medik8s.io/v1alpha1
105-
kind: SelfNodeRemediationTemplate
106-
metadata:
107-
creationTimestamp: "2022-03-02T08:02:40Z"
108-
name: self-node-remediation-<remediation_object>-deletion-template <1>
109-
namespace: openshift-operators
110-
spec:
111-
template:
112-
spec:
113-
remediationStrategy: <remediation_strategy> <2>
114-
----
115-
<1> Specifies the type of remediation template based on the remediation strategy. Replace `<remediation_object>` with either `resource` or `node`; for example, `self-node-remediation-resource-deletion-template`.
116-
//<2> Specifies the remediation strategy. The remediation strategy can either be `ResourceDeletion` or `NodeDeletion`.
117-
<2> Specifies the remediation strategy. The remediation strategy is `ResourceDeletion`.
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes/eco-self-node-remediation-operator.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="configuring-self-node-remediation-operator_{context}"]
7+
= Configuring the Self Node Remediation Operator
8+
9+
The Self Node Remediation Operator creates the `SelfNodeRemediationConfig` custom resource (CR) and the `SelfNodeRemediationTemplate` custom resource definition (CRD).
10+
11+
[id="understanding-self-node-remediation-operator-config_{context}"]
12+
== Understanding the Self Node Remediation Operator configuration
13+
14+
The Self Node Remediation Operator creates the `SelfNodeRemediationConfig` CR with the name `self-node-remediation-config`. The CR is created in the namespace of the Self Node Remediation Operator.
15+
16+
A change in the `SelfNodeRemediationConfig` CR re-creates the Self Node Remediation daemon set.
17+
18+
The `SelfNodeRemediationConfig` CR resembles the following YAML file:
19+
20+
[source,yaml]
21+
----
22+
apiVersion: self-node-remediation.medik8s.io/v1alpha1
23+
kind: SelfNodeRemediationConfig
24+
metadata:
25+
name: self-node-remediation-config
26+
namespace: openshift-operators
27+
spec:
28+
safeTimeToAssumeNodeRebootedSeconds: 180 <1>
29+
watchdogFilePath: /dev/watchdog <2>
30+
isSoftwareRebootEnabled: true <3>
31+
apiServerTimeout: 15s <4>
32+
apiCheckInterval: 5s <5>
33+
maxApiErrorThreshold: 3 <6>
34+
peerApiServerTimeout: 5s <7>
35+
peerDialTimeout: 5s <8>
36+
peerRequestTimeout: 5s <9>
37+
peerUpdateInterval: 15m <10>
38+
----
39+
40+
<1> Specify the timeout duration for the surviving peer, after which the Operator can assume that an unhealthy node has been rebooted. The Operator automatically calculates the lower limit for this value. However, if different nodes have different watchdog timeouts, you must change this value to a higher value.
41+
<2> Specify the file path of the watchdog device in the nodes. If you enter an incorrect path to the watchdog device, the Self Node Remediation Operator automatically detects the softdog device path.
42+
+
43+
If a watchdog device is unavailable, the `SelfNodeRemediationConfig` CR uses a software reboot.
44+
<3> Specify if you want to enable software reboot of the unhealthy nodes. By default, the value of `isSoftwareRebootEnabled` is set to `true`. To disable the software reboot, set the parameter value to `false`.
45+
<4> Specify the timeout duration to check connectivity with each API server. When this duration elapses, the Operator starts remediation. The timeout duration must be greater than or equal to 10 milliseconds.
46+
<5> Specify the frequency to check connectivity with each API server. The interval must be greater than or equal to 1 second.
47+
<6> Specify a threshold value for the number of consecutive API errors. After reaching this threshold, the node starts contacting its peers. The threshold value must be greater than or equal to 1.
48+
<7> Specify the duration of the timeout for the peer to connect to the API server. The timeout duration must be greater than or equal to 10 milliseconds.
49+
<8> Specify the duration of the timeout for establishing a connection with the peer. The timeout duration must be greater than or equal to 10 milliseconds.
50+
<9> Specify the duration of the timeout to get a response from the peer. The timeout duration must be greater than or equal to 10 milliseconds.
51+
<10> Specify the frequency to update peer information, such as IP address. The interval must be greater than or equal to 10 seconds.
52+
53+
[NOTE]
54+
====
55+
You can edit the `self-node-remediation-config` CR that is created by the Self Node Remediation Operator. However, when you try to create a new CR for the Self Node Remediation Operator, the following message is displayed in the logs:
56+
57+
[source,text]
58+
----
59+
controllers.SelfNodeRemediationConfig
60+
ignoring selfnoderemediationconfig CRs that are not named 'self-node-remediation-config'
61+
or not in the namespace of the operator:
62+
'openshift-operators' {"selfnoderemediationconfig":
63+
"openshift-operators/selfnoderemediationconfig-copy"}
64+
----
65+
====
66+
67+
[id="understanding-self-node-remediation-remediation-template-config_{context}"]
68+
== Understanding the Self Node Remediation Template configuration
69+
70+
The Self Node Remediation Operator also creates the `SelfNodeRemediationTemplate` Custom Resource Definition (CRD). This CRD defines the remediation strategy for the nodes. The following remediation strategies are available:
71+
72+
`ResourceDeletion`:: This remediation strategy removes the pods and associated volume attachments on the node rather than the node object. This strategy helps to recover workloads faster. `ResourceDeletion` is the default remediation strategy.
73+
74+
`NodeDeletion`:: This remediation strategy is deprecated and will be removed in a future release. In the current release, the `ResourceDeletion` strategy is used even if the `NodeDeletion` strategy is selected.
75+
76+
The Self Node Remediation Operator creates a `SelfNodeRemediationTemplate` CR named `self-node-remediation-resource-deletion-template`, which the `ResourceDeletion` remediation strategy uses.
77+
78+
The `SelfNodeRemediationTemplate` CR resembles the following YAML file:
79+
80+
[source,yaml]
81+
----
82+
apiVersion: self-node-remediation.medik8s.io/v1alpha1
83+
kind: SelfNodeRemediationTemplate
84+
metadata:
85+
creationTimestamp: "2022-03-02T08:02:40Z"
86+
name: self-node-remediation-<remediation_object>-deletion-template <1>
87+
namespace: openshift-operators
88+
spec:
89+
template:
90+
spec:
91+
remediationStrategy: <remediation_strategy> <2>
92+
----
93+
<1> Specifies the type of remediation template based on the remediation strategy. Replace `<remediation_object>` with either `resource` or `node`; for example, `self-node-remediation-resource-deletion-template`.
94+
//<2> Specifies the remediation strategy. The remediation strategy can either be `ResourceDeletion` or `NodeDeletion`.
95+
<2> Specifies the remediation strategy. The remediation strategy is `ResourceDeletion`.

modules/eco-self-node-remediation-operator-control-plane-fencing.adoc

Lines changed: 2 additions & 1 deletion
@@ -19,8 +19,9 @@ Self Node Remediation occurs in two primary scenarios.
1919
** When there is no API Server Connectivity, the control plane node will be remediated as outlined with these steps:
2020
2121
22-
*** Check the status of the control plane node with the majority of the peer worker nodes. If its status is unhealthy or unknown, even if the control plane node can communicate with the peer worker nodes, the node will be analyzed further.
22+
*** Check the status of the control plane node with the majority of the peer worker nodes. If the majority of the peer worker nodes cannot be reached, the node will be analyzed further.
2323
**** Self-diagnose the status of the control plane node
2424
***** If self diagnostics passed, no action will be taken.
2525
***** If self diagnostics failed, the node will be fenced and remediated.
26+
***** The self diagnostics currently supported are checking the `kubelet` service status and checking endpoint availability by using an opt-in configuration.
2627
*** If the node did not manage to communicate to most of its worker peers, check the connectivity of the control plane node with other control plane nodes. If the node can communicate with any other control plane peer, no action will be taken. Otherwise, the node will be fenced and remediated.

modules/eco-self-node-remediation-operator-installation-web-console.adoc

Lines changed: 6 additions & 1 deletion
@@ -8,6 +8,11 @@
88

99
You can use the {product-title} web console to install the Self Node Remediation Operator.
1010

11+
[NOTE]
12+
====
13+
The Node Health Check Operator also installs the Self Node Remediation Operator as a default remediation provider.
14+
====
15+
1116
.Prerequisites
1217

1318
* Log in as a user with `cluster-admin` privileges.
@@ -29,4 +34,4 @@ To confirm that the installation is successful:
2934
If the Operator is not installed successfully:
3035

3136
. Navigate to the *Operators* -> *Installed Operators* page and inspect the `Status` column for any errors or failures.
32-
. Navigate to the *Workloads* -> *Pods* page and check the logs in any pods in the `self-node-remediation-controller-manager` project that are reporting issues.
37+
. Navigate to the *Workloads* -> *Pods* page and check the logs in any pods in the `self-node-remediation-controller-manager` project that are reporting issues.

modules/machine-health-checks-about.adoc

Lines changed: 0 additions & 2 deletions
@@ -7,8 +7,6 @@
77
[id="machine-health-checks-about_{context}"]
88
= About machine health checks
99

10-
Machine health checks automatically repair unhealthy machines in a particular machine pool.
11-
1210
[NOTE]
1311
====
1412
You can only apply a machine health check to control plane machines on clusters that use control plane machine sets.
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
:_content-type: ASSEMBLY
2+
[id="about-remediation-fencing-maintenance"]
3+
= About node remediation, fencing, and maintenance
4+
include::_attributes/common-attributes.adoc[]
5+
:context: about-node-remediation-fencing-maintenance
6+
7+
toc::[]
8+
9+
Hardware is imperfect and software contains bugs. When node-level failures occur, such as a kernel hang or a network interface controller (NIC) failure, the work required from the cluster does not decrease, and workloads from affected nodes need to be restarted somewhere. However, some workloads, such as those that use ReadWriteOnce (RWO) volumes or StatefulSets, might require at-most-one semantics.
10+
11+
Failures affecting these workloads risk data loss, corruption, or both. It is important to ensure that the node reaches a safe state, known as `fencing`, before initiating recovery of the workload, known as `remediation`, and ideally, recovery of the node as well.
12+
13+
It is not always practical to depend on administrator intervention to confirm the true status of the nodes and workloads. To reduce the need for such intervention, {product-title} provides multiple components that automate failure detection, fencing, and remediation.
14+
15+
[id="about-remediation-fencing-maintenance-snr"]
16+
== Self Node Remediation
17+
18+
The Self Node Remediation Operator is an {product-title} add-on Operator that implements an external system of fencing and remediation that reboots unhealthy nodes and deletes resources, such as pods and volume attachments. The reboot ensures that the workloads are fenced, and the resource deletion accelerates the rescheduling of affected workloads. Unlike other external systems, Self Node Remediation does not require a management interface, such as Intelligent Platform Management Interface (IPMI), or an API for node provisioning.
19+
20+
Self Node Remediation can be used by failure detection systems, such as Machine Health Check or Node Health Check.
21+
22+
[id="about-remediation-fencing-maintenance-mhc"]
23+
== Machine Health Check
24+
25+
Machine Health Check uses an {product-title} built-in failure detection, fencing, and remediation system, which monitors the status of machines and the conditions of nodes. Machine Health Checks can be configured to trigger external fencing and remediation systems, such as Self Node Remediation.
26+
27+
[id="about-remediation-fencing-maintenance-nhc"]
28+
== Node Health Check
29+
30+
The Node Health Check Operator is an {product-title} add-on operator which implements a failure detection system that monitors node conditions. It does not have a built-in fencing or remediation system and so must be configured with an external system that provides such features. By default, it is configured to utilize the Self Node Remediation system.
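
As a minimal sketch of how a failure detection system is paired with an external remediation provider, a `NodeHealthCheck` CR might reference the Self Node Remediation template as follows. The CR name, the node selector, the `minHealthy` value, and the durations are illustrative assumptions, not values taken from this change:

[source,yaml]
----
# Hypothetical example: names, selector, and thresholds are assumptions for illustration.
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-sample
spec:
  minHealthy: 51%  # remediate only while at least this share of selected nodes is healthy
  remediationTemplate:  # external fencing and remediation provider
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    name: self-node-remediation-resource-deletion-template
    namespace: openshift-operators
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  unhealthyConditions:  # node conditions that mark a node as unhealthy
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
----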
31+
32+
[id="about-remediation-fencing-maintenance-node"]
33+
== Node Maintenance
34+
35+
Administrators face situations where they need to interrupt the cluster, for example, to replace a drive, RAM, or a NIC.
36+
37+
In advance of this maintenance, affected nodes should be cordoned and drained. When a node is cordoned, new workloads cannot be scheduled on that node. When a node is drained, workloads on the affected node are transferred to other nodes to avoid or minimize downtime.
38+
39+
While this maintenance can be achieved using command line tools, the Node Maintenance Operator offers a declarative approach to achieve this by using a custom resource. When such a resource exists for a node, the operator cordons and drains the node until the resource is deleted.
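
The declarative approach can be sketched with a `NodeMaintenance` CR such as the following. The CR name, the node name, and the reason are hypothetical placeholders chosen for illustration:

[source,yaml]
----
# Hypothetical example: the CR name, node name, and reason are placeholders.
apiVersion: nodemaintenance.medik8s.io/v1beta1
kind: NodeMaintenance
metadata:
  name: nodemaintenance-sample
spec:
  nodeName: node-1.example.com  # node to cordon and drain
  reason: "NIC replacement"  # free-text reason for the maintenance
----

Under these assumptions, creating the CR places the node in maintenance mode, and deleting the CR makes the node schedulable again.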
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
:_content-type: ASSEMBLY
2+
[id="machine-health-checks"]
3+
= Remediating nodes with Machine Health Checks
4+
include::_attributes/common-attributes.adoc[]
5+
:context: machine-health-checks
6+
7+
toc::[]
8+
9+
Machine health checks automatically repair unhealthy machines in a particular machine pool.
10+
11+
include::modules/machine-health-checks-about.adoc[leveloffset=+1]
12+
13+
include::modules/eco-configuring-machine-health-check-with-self-node-remediation.adoc[leveloffset=+1]
