Commit 7f1b25f

Port changes from previously merged Poison Pill PRs to updated SNR files
1 parent 5cecc5a commit 7f1b25f

1 file changed, +24 -8 lines changed

modules/eco-self-node-remediation-operator-about.adoc

Lines changed: 24 additions & 8 deletions
@@ -8,6 +8,22 @@
 
 The Self Node Remediation Operator runs on the cluster nodes and reboots nodes that are identified as unhealthy. The Operator uses the `MachineHealthCheck` or `NodeHealthCheck` controller to detect the health of a node in the cluster. When a node is identified as unhealthy, the `MachineHealthCheck` or the `NodeHealthCheck` resource creates the `SelfNodeRemediation` custom resource (CR), which triggers the Self Node Remediation Operator.
 
+The `SelfNodeRemediation` CR resembles the following YAML file:
+
+[source,yaml]
+----
+apiVersion: self-node-remediation.medik8s.io/v1alpha1
+kind: SelfNodeRemediation
+metadata:
+  name: selfnoderemediation-sample
+  namespace: openshift-operators
+spec:
+status:
+  lastError: <last_error_message> <1>
+----
+
+<1> Displays the last error that occurred during remediation. When remediation succeeds or if no errors occur, the field is left empty.
+
 The Self Node Remediation Operator minimizes downtime for stateful applications and restores compute capacity if transient failures occur. You can use this Operator regardless of the management interface, such as IPMI or an API to provision a node, and regardless of the cluster installation type, such as installer-provisioned infrastructure or user-provisioned infrastructure.
 
 [id="understanding-self-node-remediation-operator-config_{context}"]
@@ -28,7 +44,7 @@ metadata:
   namespace: openshift-operators
 spec:
   safeTimeToAssumeNodeRebootedSeconds: 180 <1>
-  watchdogFilePath: /dev/watchdog1 <2>
+  watchdogFilePath: /dev/watchdog <2>
   isSoftwareRebootEnabled: true <3>
   apiServerTimeout: 15s <4>
   apiCheckInterval: 5s <5>
@@ -44,13 +60,13 @@ spec:
 +
 If a watchdog device is unavailable, the `SelfNodeRemediationConfig` CR uses a software reboot.
 <3> Specify if you want to enable software reboot of the unhealthy nodes. By default, the value of `isSoftwareRebootEnabled` is set to `true`. To disable the software reboot, set the parameter value to `false`.
-<4> Specify the timeout duration to check connectivity with each API server. When this duration elapses, the Operator starts remediation.
-<5> Specify the frequency to check connectivity with each API server.
-<6> Specify a threshold value. After reaching this threshold, the node starts contacting its peers.
-<7> Specify the timeout duration for the peer to connect the API server.
-<8> Specify the timeout duration for establishing connection with the peer.
-<9> Specify the timeout duration to get a response from the peer.
-<10> Specify the frequency to update peer information, such as IP address.
+<4> Specify the timeout duration to check connectivity with each API server. When this duration elapses, the Operator starts remediation. The timeout duration must be more than or equal to 10 milliseconds.
+<5> Specify the frequency to check connectivity with each API server. The timeout duration must be more than or equal to 1 second.
+<6> Specify a threshold value. After reaching this threshold, the node starts contacting its peers. The threshold value must be more than or equal to 1 second.
+<7> Specify the duration of the timeout for the peer to connect the API server. The timeout duration must be more than or equal to 10 milliseconds.
+<8> Specify the duration of the timeout for establishing connection with the peer. The timeout duration must be more than or equal to 10 milliseconds.
+<9> Specify the duration of the timeout to get a response from the peer. The timeout duration must be more than or equal to 10 milliseconds.
+<10> Specify the frequency to update peer information, such as IP address. The timeout duration must be more than or equal to 10 seconds.
 
 [NOTE]
 ====
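
The minimum values that this commit adds to callouts 4, 5, and 7 through 10 can be checked before a `SelfNodeRemediationConfig` spec is applied. The following Python sketch is only an illustration and is not part of the Operator. Only `apiServerTimeout` and `apiCheckInterval` appear in the hunks above; the names `peerApiServerTimeout`, `peerDialTimeout`, `peerRequestTimeout`, and `peerUpdateInterval` are assumed names for callouts 7 through 10. The threshold in callout 6 is left out because its field name does not appear in the visible hunks. The helpers `parse_duration` and `find_violations` are hypothetical.

```python
import re
from datetime import timedelta

# Microseconds per unit for duration strings such as "15s", "10ms", "1m30s".
_UNIT_US = {"ms": 1_000, "s": 1_000_000, "m": 60_000_000, "h": 3_600_000_000}
_TOKEN = re.compile(r"(\d+)(ms|s|m|h)")

# Minimums quoted from the callouts in the diff.
# Field names marked "assumed" do not appear in the visible hunks.
MINIMUMS = {
    "apiServerTimeout": timedelta(milliseconds=10),      # callout 4
    "apiCheckInterval": timedelta(seconds=1),            # callout 5
    "peerApiServerTimeout": timedelta(milliseconds=10),  # callout 7, assumed name
    "peerDialTimeout": timedelta(milliseconds=10),       # callout 8, assumed name
    "peerRequestTimeout": timedelta(milliseconds=10),    # callout 9, assumed name
    "peerUpdateInterval": timedelta(seconds=10),         # callout 10, assumed name
}


def parse_duration(text):
    """Parse a duration made of integer/unit pairs, for example "1m30s"."""
    position = 0
    total_us = 0
    for match in _TOKEN.finditer(text):
        if match.start() != position:
            raise ValueError(f"cannot parse duration: {text!r}")
        total_us += int(match.group(1)) * _UNIT_US[match.group(2)]
        position = match.end()
    if position == 0 or position != len(text):
        raise ValueError(f"cannot parse duration: {text!r}")
    return timedelta(microseconds=total_us)


def find_violations(spec):
    """Return one message for each field in spec that is below its minimum."""
    problems = []
    for field, minimum in MINIMUMS.items():
        if field not in spec:
            continue  # fields that are not set are not checked here
        if parse_duration(spec[field]) < minimum:
            problems.append(
                f"{field}={spec[field]} is below the minimum of "
                f"{minimum.total_seconds()} seconds"
            )
    return problems
```

For the values shown in the second hunk, `find_violations({"apiServerTimeout": "15s", "apiCheckInterval": "5s"})` returns an empty list.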
