openshift
diff --git a/_topic_maps/_topic_map.yml b/_topic_maps/_topic_map.yml
Lines changed: 13 additions & 6 deletions
diff --git a/modules/eco-self-node-remediation-operator-about.adoc b/modules/eco-self-node-remediation-operator-about.adoc
Lines changed: 0 additions & 90 deletions
diff --git a/modules/eco-self-node-remediation-operator-configuring.adoc b/modules/eco-self-node-remediation-operator-configuring.adoc
Lines changed: 95 additions & 0 deletions
diff --git a/modules/eco-self-node-remediation-operator-control-plane-fencing.adoc b/modules/eco-self-node-remediation-operator-control-plane-fencing.adoc
Lines changed: 2 additions & 1 deletion
diff --git a/modules/eco-self-node-remediation-operator-installation-web-console.adoc b/modules/eco-self-node-remediation-operator-installation-web-console.adoc
Lines changed: 6 additions & 1 deletion
diff --git a/modules/machine-health-checks-about.adoc b/modules/machine-health-checks-about.adoc
Lines changed: 0 additions & 2 deletions
diff --git a/nodes/nodes/ecosystems/eco-about-remediation-fencing-maintenance.adoc b/nodes/nodes/ecosystems/eco-about-remediation-fencing-maintenance.adoc
Lines changed: 39 additions & 0 deletions
diff --git a/nodes/nodes/ecosystems/eco-machine-health-checks.adoc b/nodes/nodes/ecosystems/eco-machine-health-checks.adoc
Lines changed: 13 additions & 0 deletions
@@ -2142,12 +2142,19 @@ Topics:
     File: nodes-nodes-managing-max-pods
   - Name: Using the Node Tuning Operator
     File: nodes-node-tuning-operator
-  - Name: Remediating nodes with the Self Node Remediation Operator
-    File: eco-self-node-remediation-operator
-  - Name: Deploying node health checks by using the Node Health Check Operator
-    File: eco-node-health-check-operator
-  - Name: Using the Node Maintenance Operator to place nodes in maintenance mode
-    File: eco-node-maintenance-operator
+  - Name: Remediation, fencing, and maintenance
+    Dir: ecosystems
+    Topics:
+    - Name: About node remediation, fencing, and maintenance
+      File: eco-about-remediation-fencing-maintenance
+    - Name: Using Self Node Remediation
+      File: eco-self-node-remediation-operator
+    - Name: Remediating nodes with Machine Health Checks
+      File: eco-machine-health-checks
+    - Name: Remediating nodes with Node Health Checks
+      File: eco-node-health-check-operator
+    - Name: Placing nodes in maintenance mode with Node Maintenance Operator
+      File: eco-node-maintenance-operator
   - Name: Understanding node rebooting
     File: nodes-nodes-rebooting
   - Name: Freeing node resources using garbage collection
 
@@ -25,93 +25,3 @@ status:
 <1> Displays the last error that occurred during remediation. When remediation succeeds or if no errors occur, the field is left empty.
 
 The Self Node Remediation Operator minimizes downtime for stateful applications and restores compute capacity if transient failures occur. You can use this Operator regardless of the management interface, such as IPMI or an API to provision a node, and regardless of the cluster installation type, such as installer-provisioned infrastructure or user-provisioned infrastructure.
-
-[id="understanding-self-node-remediation-operator-config_{context}"]
-== Understanding the Self Node Remediation Operator configuration
-
-The Self Node Remediation Operator creates the `SelfNodeRemediationConfig` CR with the name `self-node-remediation-config`. The CR is created in the namespace of the Self Node Remediation Operator.
-
-A change in the `SelfNodeRemediationConfig` CR re-creates the Self Node Remediation daemon set.
-
-The `SelfNodeRemediationConfig` CR resembles the following YAML file:
-
-[source,yaml]
-----
-apiVersion: self-node-remediation.medik8s.io/v1alpha1
-kind: SelfNodeRemediationConfig
-metadata:
-  name: self-node-remediation-config
-  namespace: openshift-operators
-spec:
-  safeTimeToAssumeNodeRebootedSeconds: 180 <1>
-  watchdogFilePath: /dev/watchdog <2>
-  isSoftwareRebootEnabled: true <3>
-  apiServerTimeout: 15s <4>
-  apiCheckInterval: 5s <5>
-  maxApiErrorThreshold: 3 <6>
-  peerApiServerTimeout: 5s <7>
-  peerDialTimeout: 5s <8>
-  peerRequestTimeout: 5s <9>
-  peerUpdateInterval: 15m <10>
-----
-
-<1> Specify the timeout duration for the surviving peer, after which the Operator can assume that an unhealthy node has been rebooted. The Operator automatically calculates the lower limit for this value. However, if different nodes have different watchdog timeouts, you must change this value to a higher value.
-<2> Specify the file path of the watchdog device in the nodes. If you enter an incorrect path to the watchdog device, the Self Node Remediation Operator automatically detects the softdog device path.
-+
-If a watchdog device is unavailable, the `SelfNodeRemediationConfig` CR uses a software reboot.
-<3> Specify if you want to enable software reboot of the unhealthy nodes. By default, the value of `isSoftwareRebootEnabled` is set to `true`. To disable the software reboot, set the parameter value to `false`.
-<4> Specify the timeout duration to check connectivity with each API server. When this duration elapses, the Operator starts remediation. The timeout duration must be more than or equal to 10 milliseconds.
-<5> Specify the frequency to check connectivity with each API server. The timeout duration must be more than or equal to 1 second.
-<6> Specify a threshold value. After reaching this threshold, the node starts contacting its peers. The threshold value must be more than or equal to 1 second.
-<7> Specify the duration of the timeout for the peer to connect the API server. The timeout duration must be more than or equal to 10 milliseconds.
-<8> Specify the duration of the timeout for establishing connection with the peer. The timeout duration must be more than or equal to 10 milliseconds.
-<9> Specify the duration of the timeout to get a response from the peer. The timeout duration must be more than or equal to 10 milliseconds.
-<10> Specify the frequency to update peer information, such as IP address. The timeout duration must be more than or equal to 10 seconds.
-
-[NOTE]
-====
-You can edit the `self-node-remediation-config` CR that is created by the Self Node Remediation Operator. However, when you try to create a new CR for the Self Node Remediation Operator, the following message is displayed in the logs:
-
-[source,text]
-----
-controllers.SelfNodeRemediationConfig
-ignoring selfnoderemediationconfig CRs that are not named 'self-node-remediation-config'
-or not in the namespace of the operator:
-'openshift-operators' {"selfnoderemediationconfig":
-"openshift-operators/selfnoderemediationconfig-copy"}
-----
-====
-
-[id="understanding-self-node-remediation-remediation-template-config_{context}"]
-== Understanding the Self Node Remediation Template configuration
-
-The Self Node Remediation Operator also creates the `SelfNodeRemediationTemplate` Custom Resource Definition (CRD). This CRD defines the remediation strategy for the nodes. The following remediation strategies are available:
-
-`ResourceDeletion`:: This remediation strategy removes the pods and associated volume attachments on the node rather than the node object. This strategy helps to recover workloads faster. `ResourceDeletion` is the default remediation strategy.
-
-`NodeDeletion`:: This remediation strategy is deprecated and will be removed in a future release. In the current release, the `ResourceDeletion` strategy is used even if the `NodeDeletion` strategy is selected.
-
-
-The Self Node Remediation Operator creates the following `SelfNodeRemediationTemplate` CR for the strategy:
-
-* `self-node-remediation-resource-deletion-template`, which the `ResourceDeletion` remediation strategy uses
-//* `self-node-remediation-node-deletion-template`, which the `NodeDeletion` remediation strategy uses
-
-The `SelfNodeRemediationTemplate` CR resembles the following YAML file:
-
-[source,yaml]
-----
-apiVersion: self-node-remediation.medik8s.io/v1alpha1
-kind: SelfNodeRemediationTemplate
-metadata:
-  creationTimestamp: "2022-03-02T08:02:40Z"
-  name: self-node-remediation-<remediation_object>-deletion-template <1>
-  namespace: openshift-operators
-spec:
-  template:
-    spec:
-      remediationStrategy: <remediation_strategy>  <2>
-----
-<1> Specifies the type of remediation template based on the remediation strategy. Replace `<remediation_object>` with either `resource` or `node`; for example, `self-node-remediation-resource-deletion-template`.
-//<2> Specifies the remediation strategy. The remediation strategy can either be `ResourceDeletion` or `NodeDeletion`.
-<2> Specifies the remediation strategy. The remediation strategy is `ResourceDeletion`.
@@ -0,0 +1,95 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes/eco-self-node-remediation-operator.adoc
+
+:_content-type: CONCEPT
+[id="configuring-self-node-remediation-operator_{context}"]
+= Configuring the Self Node Remediation Operator
+
+The Self Node Remediation Operator creates the `SelfNodeRemediationConfig` CR and the `SelfNodeRemediationTemplate` Custom Resource Definition (CRD).
+
+[id="understanding-self-node-remediation-operator-config_{context}"]
+== Understanding the Self Node Remediation Operator configuration
+
+The Self Node Remediation Operator creates the `SelfNodeRemediationConfig` CR with the name `self-node-remediation-config`. The CR is created in the namespace of the Self Node Remediation Operator.
+
+A change in the `SelfNodeRemediationConfig` CR re-creates the Self Node Remediation daemon set.
+
+The `SelfNodeRemediationConfig` CR resembles the following YAML file:
+
+[source,yaml]
+----
+apiVersion: self-node-remediation.medik8s.io/v1alpha1
+kind: SelfNodeRemediationConfig
+metadata:
+  name: self-node-remediation-config
+  namespace: openshift-operators
+spec:
+  safeTimeToAssumeNodeRebootedSeconds: 180 <1>
+  watchdogFilePath: /dev/watchdog <2>
+  isSoftwareRebootEnabled: true <3>
+  apiServerTimeout: 15s <4>
+  apiCheckInterval: 5s <5>
+  maxApiErrorThreshold: 3 <6>
+  peerApiServerTimeout: 5s <7>
+  peerDialTimeout: 5s <8>
+  peerRequestTimeout: 5s <9>
+  peerUpdateInterval: 15m <10>
+----
+
+<1> Specify the timeout duration for the surviving peer, after which the Operator can assume that an unhealthy node has been rebooted. The Operator automatically calculates the lower limit for this value. However, if different nodes have different watchdog timeouts, you must change this value to a higher value.
+<2> Specify the file path of the watchdog device in the nodes. If you enter an incorrect path to the watchdog device, the Self Node Remediation Operator automatically detects the softdog device path.
++
+If a watchdog device is unavailable, the `SelfNodeRemediationConfig` CR uses a software reboot.
+<3> Specify if you want to enable software reboot of the unhealthy nodes. By default, the value of `isSoftwareRebootEnabled` is set to `true`. To disable the software reboot, set the parameter value to `false`.
+<4> Specify the timeout duration to check connectivity with each API server. When this duration elapses, the Operator starts remediation. The timeout duration must be greater than or equal to 10 milliseconds.
+<5> Specify the frequency to check connectivity with each API server. The interval must be greater than or equal to 1 second.
+<6> Specify a threshold value. After reaching this threshold, the node starts contacting its peers. The threshold value must be greater than or equal to 1.
+<7> Specify the duration of the timeout for the peer to connect to the API server. The timeout duration must be greater than or equal to 10 milliseconds.
+<8> Specify the duration of the timeout for establishing a connection with the peer. The timeout duration must be greater than or equal to 10 milliseconds.
+<9> Specify the duration of the timeout to get a response from the peer. The timeout duration must be greater than or equal to 10 milliseconds.
+<10> Specify the frequency to update peer information, such as the IP address. The interval must be greater than or equal to 10 seconds.
+
+[NOTE]
+====
+You can edit the `self-node-remediation-config` CR that is created by the Self Node Remediation Operator. However, when you try to create a new CR for the Self Node Remediation Operator, the following message is displayed in the logs:
+
+[source,text]
+----
+controllers.SelfNodeRemediationConfig
+ignoring selfnoderemediationconfig CRs that are not named 'self-node-remediation-config'
+or not in the namespace of the operator:
+'openshift-operators' {"selfnoderemediationconfig":
+"openshift-operators/selfnoderemediationconfig-copy"}
+----
+====
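+
+For example, if nodes in the cluster have different watchdog timeouts, you might edit the existing CR to raise the safe reboot time. The following is a minimal sketch that shows only the changed field and omits the other `spec` fields for brevity. The value `300` is an illustrative assumption, not a recommended setting:
+
+[source,yaml]
+----
+apiVersion: self-node-remediation.medik8s.io/v1alpha1
+kind: SelfNodeRemediationConfig
+metadata:
+  name: self-node-remediation-config # CRs with any other name are ignored
+  namespace: openshift-operators     # must be the namespace of the Operator
+spec:
+  safeTimeToAssumeNodeRebootedSeconds: 300 # assumed value; choose one that covers the longest watchdog timeout
+----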
+
+[id="understanding-self-node-remediation-remediation-template-config_{context}"]
+== Understanding the Self Node Remediation Template configuration
+
+The Self Node Remediation Operator also creates the `SelfNodeRemediationTemplate` Custom Resource Definition (CRD). This CRD defines the remediation strategy for the nodes. The following remediation strategies are available:
+
+`ResourceDeletion`:: This remediation strategy removes the pods and associated volume attachments on the node rather than the node object. This strategy helps to recover workloads faster. `ResourceDeletion` is the default remediation strategy.
+
+`NodeDeletion`:: This remediation strategy is deprecated and will be removed in a future release. In the current release, the `ResourceDeletion` strategy is used even if the `NodeDeletion` strategy is selected.
+
+The Self Node Remediation Operator creates the `SelfNodeRemediationTemplate` CR named `self-node-remediation-resource-deletion-template`, which the `ResourceDeletion` remediation strategy uses.
+
+The `SelfNodeRemediationTemplate` CR resembles the following YAML file:
+
+[source,yaml]
+----
+apiVersion: self-node-remediation.medik8s.io/v1alpha1
+kind: SelfNodeRemediationTemplate
+metadata:
+  creationTimestamp: "2022-03-02T08:02:40Z"
+  name: self-node-remediation-<remediation_object>-deletion-template <1>
+  namespace: openshift-operators
+spec:
+  template:
+    spec:
+      remediationStrategy: <remediation_strategy>  <2>
+----
+<1> Specifies the type of remediation template based on the remediation strategy. Replace `<remediation_object>` with either `resource` or `node`; for example, `self-node-remediation-resource-deletion-template`.
+//<2> Specifies the remediation strategy. The remediation strategy can either be `ResourceDeletion` or `NodeDeletion`.
+<2> Specifies the remediation strategy. The remediation strategy is `ResourceDeletion`.
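+
+With the placeholders filled in for the default `ResourceDeletion` strategy, the CR would resemble the following sketch. The `creationTimestamp` field is omitted here because it is set by the cluster:
+
+[source,yaml]
+----
+apiVersion: self-node-remediation.medik8s.io/v1alpha1
+kind: SelfNodeRemediationTemplate
+metadata:
+  name: self-node-remediation-resource-deletion-template
+  namespace: openshift-operators
+spec:
+  template:
+    spec:
+      remediationStrategy: ResourceDeletion
+----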
@@ -19,8 +19,9 @@ Self Node Remediation occurs in two primary scenarios.
 ** When there is no API Server Connectivity, the control plane node will be remediated as outlined with these steps:
 
 
-*** Check the status of the control plane node with the majority of the peer worker nodes. If its status is unhealthy or unknown, even if the control plane node can communicate with the peer worker nodes, the node will be analyzed further.
+*** Check the status of the control plane node with the majority of the peer worker nodes. If the majority of the peer worker nodes cannot be reached, the node will be analyzed further.
 **** Self-diagnose the status of the control plane node
 ***** If self diagnostics passed, no action will be taken.
 ***** If self diagnostics failed, the node will be fenced and remediated.
+***** The currently supported self diagnostics are checking the `kubelet` service status and checking endpoint availability by using `opt in` configuration.
 *** If the node did not manage to communicate to most of its worker peers, check the connectivity of the control plane node with other control plane nodes. If the node can communicate with any other control plane peer, no action will be taken. Otherwise, the node will be fenced and remediated.
@@ -8,6 +8,11 @@
 
 You can use the {product-title} web console to install the Self Node Remediation Operator.
 
+[NOTE]
+====
+The Node Health Check Operator also installs the Self Node Remediation Operator as a default remediation provider.
+====
+
 .Prerequisites
 
 * Log in as a user with `cluster-admin` privileges.
@@ -29,4 +34,4 @@ To confirm that the installation is successful:
 If the Operator is not installed successfully:
 
 . Navigate to the *Operators* -> *Installed Operators* page and inspect the `Status` column for any errors or failures.
-. Navigate to the *Workloads* -> *Pods* page and check the logs in any pods in the `self-node-remediation-controller-manager` project that are reporting issues.
+. Navigate to the *Workloads* -> *Pods* page and check the logs in any pods in the `self-node-remediation-controller-manager` project that are reporting issues.
@@ -7,8 +7,6 @@
 [id="machine-health-checks-about_{context}"]
 = About machine health checks
 
-Machine health checks automatically repair unhealthy machines in a particular machine pool.
-
 [NOTE]
 ====
 You can only apply a machine health check to control plane machines on clusters that use control plane machine sets.
 
@@ -0,0 +1,39 @@
+:_content-type: ASSEMBLY
+[id="about-remediation-fencing-maintenance"]
+= About node remediation, fencing, and maintenance
+include::_attributes/common-attributes.adoc[]
+:context: about-node-remediation-fencing-maintenance
+
+toc::[]
+
+Hardware is imperfect and software contains bugs. When node-level failures occur, such as kernel hangs or network interface controller (NIC) failures, the work required from the cluster does not decrease, and workloads from affected nodes need to be restarted somewhere. However, some workloads, such as ReadWriteOnce (RWO) volumes and StatefulSets, might require at-most-one semantics.
+
+Failures affecting these workloads risk data loss, corruption, or both. It is important to ensure that the node reaches a safe state, known as `fencing`, before initiating recovery of the workload, known as `remediation`, and ideally, recovery of the node as well.
+
+It is not always practical to depend on administrator intervention to confirm the true status of the nodes and workloads. To reduce the need for such intervention, {product-title} provides multiple components for the automation of failure detection, fencing, and remediation.
+
+[id="about-remediation-fencing-maintenance-snr"]
+== Self Node Remediation
+
+The Self Node Remediation Operator is an {product-title} add-on operator that implements an external system of fencing and remediation that reboots unhealthy nodes and deletes resources, such as Pods and VolumeAttachments. The reboot ensures that the workloads are fenced, and the resource deletion accelerates the rescheduling of affected workloads. Unlike other external systems, Self Node Remediation does not require any management interface, such as Intelligent Platform Management Interface (IPMI) or an API for node provisioning.
+
+Self Node Remediation can be used by failure detection systems, such as Machine Health Check or Node Health Check.
+
+[id="about-remediation-fencing-maintenance-mhc"]
+== Machine Health Check
+
+Machine Health Check uses an {product-title} built-in failure detection, fencing, and remediation system, which monitors the status of machines and the conditions of nodes. Machine Health Checks can be configured to trigger external fencing and remediation systems, such as Self Node Remediation.
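+
+As an illustration of triggering an external remediation system, the following sketch shows a `MachineHealthCheck` resource that references the default Self Node Remediation template. The resource name, label selector, timeouts, and `maxUnhealthy` value are illustrative assumptions, and the template is assumed to exist in the `openshift-operators` namespace:
+
+[source,yaml]
+----
+apiVersion: machine.openshift.io/v1beta1
+kind: MachineHealthCheck
+metadata:
+  name: machine-health-check # hypothetical name
+  namespace: openshift-machine-api
+spec:
+  selector:
+    matchLabels: # assumed labels that select worker machines
+      machine.openshift.io/cluster-api-machine-role: "worker"
+      machine.openshift.io/cluster-api-machine-type: "worker"
+  unhealthyConditions: # assumed conditions and timeouts
+  - type: "Ready"
+    timeout: "300s"
+    status: "False"
+  - type: "Ready"
+    timeout: "300s"
+    status: "Unknown"
+  maxUnhealthy: "40%" # assumed limit
+  remediationTemplate: # delegates fencing and remediation to Self Node Remediation
+    kind: SelfNodeRemediationTemplate
+    apiVersion: self-node-remediation.medik8s.io/v1alpha1
+    name: self-node-remediation-resource-deletion-template
+    namespace: openshift-operators
+----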
+
+[id="about-remediation-fencing-maintenance-nhc"]
+== Node Health Check
+
+The Node Health Check Operator is an {product-title} add-on operator which implements a failure detection system that monitors node conditions. It does not have a built-in fencing or remediation system and so must be configured with an external system that provides such features. By default, it is configured to utilize the Self Node Remediation system.
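+
+A `NodeHealthCheck` resource that uses Self Node Remediation might look like the following minimal sketch. The resource name, selector, `minHealthy` value, and durations are illustrative assumptions, and the template is assumed to exist in the `openshift-operators` namespace:
+
+[source,yaml]
+----
+apiVersion: remediation.medik8s.io/v1alpha1
+kind: NodeHealthCheck
+metadata:
+  name: nodehealthcheck-sample # hypothetical name
+spec:
+  minHealthy: 51% # assumed threshold
+  selector:
+    matchExpressions: # assumed selector that targets worker nodes
+    - key: node-role.kubernetes.io/worker
+      operator: Exists
+  unhealthyConditions: # assumed conditions and durations
+  - type: Ready
+    status: "False"
+    duration: 300s
+  - type: Ready
+    status: Unknown
+    duration: 300s
+  remediationTemplate: # the external system that performs fencing and remediation
+    apiVersion: self-node-remediation.medik8s.io/v1alpha1
+    kind: SelfNodeRemediationTemplate
+    name: self-node-remediation-resource-deletion-template
+    namespace: openshift-operators
+----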
+
+[id="about-remediation-fencing-maintenance-node"]
+== Node Maintenance
+
+Administrators face situations where they need to interrupt the cluster, for example, to replace a drive, RAM, or a NIC.
+
+In advance of this maintenance, affected nodes should be cordoned and drained. When a node is cordoned, new workloads cannot be scheduled on that node. When a node is drained, workloads on the affected node are transferred to other nodes to avoid or minimize downtime.
+
+While this maintenance can be achieved using command line tools, the Node Maintenance Operator offers a declarative approach to achieve this by using a custom resource. When such a resource exists for a node, the operator cordons and drains the node until the resource is deleted.
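+
+The declarative approach can be sketched with a custom resource such as the following. The resource name, node name, and reason are hypothetical values chosen for illustration:
+
+[source,yaml]
+----
+apiVersion: nodemaintenance.medik8s.io/v1beta1
+kind: NodeMaintenance
+metadata:
+  name: nodemaintenance-sample # hypothetical name
+spec:
+  nodeName: node-1.example.com # hypothetical node to cordon and drain
+  reason: "NIC replacement"    # free-text reason, assumed for this example
+----
+
+Deleting the resource ends the maintenance for that node.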
@@ -0,0 +1,13 @@
+:_content-type: ASSEMBLY
+[id="machine-health-checks"]
+= Remediating nodes with Machine Health Checks
+include::_attributes/common-attributes.adoc[]
+:context: machine-health-checks
+
+toc::[]
+
+Machine health checks automatically repair unhealthy machines in a particular machine pool.
+
+include::modules/machine-health-checks-about.adoc[leveloffset=+1]
+
+include::modules/eco-configuring-machine-health-check-with-self-node-remediation.adoc[leveloffset=+1]