Commit d69e827

Bob Furu
authored
Merge pull request #35974 from abhatt-rh/telcodocs-53
2 parents 579f7d3 + 0377411 commit d69e827

7 files changed

+366
-1
lines changed

_topic_map.yml

Lines changed: 3 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1687,7 +1687,7 @@ Topics:
16871687
- Name: Adding compute machines to bare metal
16881688
File: adding-bare-metal-compute-user-infra
16891689
- Name: Deploying machine health checks
1690-
File: deploying-machine-health-checks
1690+
File: deploying-machine-health-checks
16911691
---
16921692
Name: Nodes
16931693
Dir: nodes
@@ -1772,6 +1772,8 @@ Topics:
17721772
File: nodes-nodes-managing-max-pods
17731773
- Name: Using the Node Tuning Operator
17741774
File: nodes-node-tuning-operator
1775+
- Name: Remediating nodes with the Poison Pill Operator
1776+
File: eco-poison-pill-operator
17751777
- Name: Understanding node rebooting
17761778
File: nodes-nodes-rebooting
17771779
- Name: Freeing node resources using garbage collection
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes/eco-poison-pill-operator.adoc
4+
5+
[id="configuring-machine-health-check-with-poison-pill_{context}"]
6+
= Configuring machine health checks to use the Poison Pill Operator
7+
8+
Use the following procedure to configure machine health checks to use the Poison Pill Operator as a remediation provider.
9+
10+
.Prerequisites
11+
12+
* Install the OpenShift CLI (`oc`).
13+
* Log in as a user with `cluster-admin` privileges.
14+
15+
.Procedure
16+
17+
. Create a `PoisonPillRemediationTemplate` CR:
18+
19+
.. Define the `PoisonPillRemediationTemplate` CR:
20+
+
21+
[source,yaml]
22+
----
23+
apiVersion: poison-pill.medik8s.io/v1alpha1
24+
kind: PoisonPillRemediationTemplate
25+
metadata:
26+
namespace: openshift-machine-api
27+
name: poisonpillremediationtemplate-sample
28+
spec:
29+
template:
30+
spec: {}
31+
----
32+
33+
.. To create the `PoisonPillRemediationTemplate` CR, run the following command:
34+
+
35+
[source,terminal]
36+
----
37+
$ oc create -f <ppr-name>.yaml
38+
----
39+
40+
. Create or update the `MachineHealthCheck` CR to point to the `PoisonPillRemediationTemplate` CR:
41+
42+
.. Define or update the `MachineHealthCheck` CR:
43+
+
44+
[source,yaml]
45+
----
46+
apiVersion: machine.openshift.io/v1beta1
47+
kind: MachineHealthCheck
48+
metadata:
49+
name: machine-health-check
50+
namespace: openshift-machine-api
51+
spec:
52+
selector:
53+
matchLabels:
54+
machine.openshift.io/cluster-api-machine-role: "worker"
55+
machine.openshift.io/cluster-api-machine-type: "worker"
56+
unhealthyConditions:
57+
- type: "Ready"
58+
timeout: "300s"
59+
status: "False"
60+
- type: "Ready"
61+
timeout: "300s"
62+
status: "Unknown"
63+
maxUnhealthy: "40%"
64+
nodeStartupTimeout: "10m"
65+
remediationTemplate: <1>
66+
kind: PoisonPillRemediationTemplate
67+
apiVersion: poison-pill.medik8s.io/v1alpha1
68+
name: <poison-pill-remediation-template-sample>
69+
----
70+
<1> Specify the `kind`, `apiVersion`, and `name` of the `PoisonPillRemediationTemplate` CR that the machine health check uses for remediation.
71+
72+
.. To create a `MachineHealthCheck` CR, run the following command:
73+
+
74+
[source,terminal]
75+
----
76+
$ oc create -f <file-name>.yaml
77+
----
78+
79+
.. To update a `MachineHealthCheck` CR, run the following command:
80+
+
81+
[source,terminal]
82+
----
83+
$ oc apply -f <file-name>.yaml
84+
----
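After the `MachineHealthCheck` CR is created or updated, it can be useful to confirm that it points to the intended template. The following is a minimal sketch, not part of the documented procedure: the helper name `mhc_remediation_template` is hypothetical, and it assumes the CR lives in the `openshift-machine-api` namespace as in the example above.

```shell
# Hypothetical helper: print the remediation template that a MachineHealthCheck references.
# $1 = MachineHealthCheck name (for example, machine-health-check).
# Assumption: the CR is in the openshift-machine-api namespace, as in the example YAML.
mhc_remediation_template() {
  oc get machinehealthcheck "$1" \
    -n openshift-machine-api \
    -o jsonpath='{.spec.remediationTemplate.kind}/{.spec.remediationTemplate.name}'
}
```

For example, `mhc_remediation_template machine-health-check` prints the kind and name of the referenced template. If either part is empty, the `remediationTemplate` stanza is probably missing or incomplete.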
85+
86+
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes/eco-poison-pill-operator.adoc
4+
5+
[id="about-poison-pill-operator_{context}"]
6+
= About the Poison Pill Operator
7+
8+
The Poison Pill Operator runs on the cluster nodes and reboots nodes that are identified as unhealthy. The Operator relies on the `MachineHealthCheck` controller to detect the health of a node in the cluster. When a node is identified as unhealthy, the `MachineHealthCheck` controller creates a `PoisonPillRemediation` custom resource (CR), which triggers the Poison Pill Operator.
9+
10+
The Poison Pill Operator provides the following capabilities:
11+
12+
* Minimizes downtime for stateful applications and restores compute capacity if transient failures occur.
13+
* Operates independently of any management interface, such as IPMI or an API to provision a node.
14+
15+
[id="understanding-poison-pill-operator-config_{context}"]
16+
== Understanding the Poison Pill Operator configuration
17+
18+
The Poison Pill Operator creates the `PoisonPillConfig` CR with the name `poison-pill-config` in the Poison Pill Operator's namespace. You can edit this CR. However, you cannot create a new CR for the Poison Pill Operator.
19+
20+
A change in the `PoisonPillConfig` CR re-creates the Poison Pill daemon set.
21+
22+
The `PoisonPillConfig` CR resembles the following YAML file:
23+
24+
[source,yaml]
25+
----
26+
apiVersion: poison-pill.medik8s.io/v1alpha1
27+
kind: PoisonPillConfig
28+
metadata:
29+
name: poison-pill-config
30+
namespace: poison-pill
31+
spec:
32+
safeTimeToAssumeNodeRebootedSeconds: 180 <1>
33+
watchdogFilePath: /test/watchdog1 <2>
34+
----
35+
36+
<1> Specify the timeout duration for the surviving peer, after which the Operator can assume that an unhealthy node has been rebooted. The Operator automatically calculates the lower limit for this value. However, if different nodes have different watchdog timeouts, you must increase this value.
37+
<2> Specify the file path of the watchdog device in the nodes. If a watchdog device is unavailable, the Operator uses a software reboot.
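Because the `PoisonPillConfig` CR can be edited but not created anew, one way to change a single field is a merge patch against the existing `poison-pill-config` object. The following is a sketch under stated assumptions: the helper name `set_safe_reboot_time` is hypothetical, and the namespace must be the one in which the Operator is actually installed.

```shell
# Hypothetical helper: set safeTimeToAssumeNodeRebootedSeconds on the existing CR.
# $1 = new timeout in seconds
# $2 = namespace where the Poison Pill Operator is installed
set_safe_reboot_time() {
  oc patch PoisonPillConfig poison-pill-config \
    -n "$2" \
    --type merge \
    -p "{\"spec\":{\"safeTimeToAssumeNodeRebootedSeconds\":$1}}"
}
```

For example, `set_safe_reboot_time 240 poison-pill` raises the timeout to 240 seconds. As noted above, a change in the CR re-creates the Poison Pill daemon set.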
Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes/eco-poison-pill-operator.adoc
4+
5+
[id="installing-poison-pill-operator-using-cli_{context}"]
6+
= Installing the Poison Pill Operator by using the CLI
7+
8+
You can use the OpenShift CLI (`oc`) to install the Poison Pill Operator.
9+
10+
.Prerequisites
11+
12+
* Install the OpenShift CLI (`oc`).
13+
* Log in as a user with `cluster-admin` privileges.
14+
15+
.Procedure
16+
17+
. Create a `Namespace` object for the Poison Pill Operator:
18+
.. Define the `Namespace` object and save the YAML file, for example, `poison-pill-namespace.yaml`:
19+
+
20+
[source,yaml]
21+
----
22+
apiVersion: v1
23+
kind: Namespace
24+
metadata:
25+
name: poison-pill
26+
----
27+
.. To create the `Namespace` object, run the following command:
28+
+
29+
[source,terminal]
30+
----
31+
$ oc create -f poison-pill-namespace.yaml
32+
----
33+
34+
. Create an `OperatorGroup` CR:
35+
.. Define the `OperatorGroup` CR and save the YAML file, for example, `poison-pill-operator-group.yaml`:
36+
+
37+
[source,yaml]
38+
----
39+
apiVersion: operators.coreos.com/v1
40+
kind: OperatorGroup
41+
metadata:
42+
name: poison-pill-manager
43+
namespace: poison-pill
44+
spec:
45+
targetNamespaces:
46+
- poison-pill
47+
----
48+
.. To create the `OperatorGroup` CR, run the following command:
49+
+
50+
[source,terminal]
51+
----
52+
$ oc create -f poison-pill-operator-group.yaml
53+
----
54+
55+
. Create a `Subscription` CR:
56+
.. Define the `Subscription` CR and save the YAML file, for example, `poison-pill-subscription.yaml`:
57+
+
58+
[source,yaml]
59+
----
60+
apiVersion: operators.coreos.com/v1alpha1
61+
kind: Subscription
62+
metadata:
63+
name: poison-pill-manager
64+
namespace: poison-pill
65+
spec:
66+
channel: alpha
67+
name: poison-pill-manager
68+
source: redhat-operators
69+
sourceNamespace: openshift-marketplace
70+
package: poison-pill-manager
71+
----
72+
.. To create the `Subscription` CR, run the following command:
73+
+
74+
[source,terminal]
75+
----
76+
$ oc create -f poison-pill-subscription.yaml
77+
----
78+
79+
.Verification
80+
81+
. Verify that the installation succeeded by inspecting the CSV resource:
82+
+
83+
[source,terminal]
84+
----
85+
$ oc get csv -n poison-pill
86+
----
87+
+
88+
.Example output
89+
[source,terminal]
90+
----
91+
NAME                 DISPLAY                VERSION   REPLACES   PHASE
92+
poison-pill.v0.1.4   Poison Pill Operator   0.1.4                Succeeded
93+
----
94+
95+
. Verify that the Poison Pill Operator is up and running:
96+
+
97+
[source,terminal]
98+
----
99+
$ oc get deploy -n poison-pill
100+
----
101+
+
102+
.Example output
103+
[source,terminal]
104+
----
105+
NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
106+
poison-pill-controller-manager   1/1     1            1           10d
107+
----
108+
109+
. Verify that the Poison Pill Operator created the `PoisonPillConfig` CR:
110+
+
111+
[source,terminal]
112+
----
113+
$ oc get PoisonPillConfig -n poison-pill
114+
----
115+
+
116+
.Example output
117+
[source,terminal]
118+
----
119+
NAME                 AGE
120+
poison-pill-config   10d
121+
----
122+
. Verify that a Poison Pill pod is scheduled and running on each worker node:
123+
+
124+
[source,terminal]
125+
----
126+
$ oc get daemonset -n poison-pill
127+
----
128+
+
129+
.Example output
130+
[source,terminal]
131+
----
132+
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
133+
poison-pill-ds   2         2         2       2            2           <none>          10d
134+
----
135+
+
136+
[NOTE]
137+
====
138+
This command is unsupported for the control plane nodes.
139+
====
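The daemon set check can also be scripted by comparing the desired and ready pod counts. The following is a minimal sketch: the helper name `daemonset_ready` is hypothetical, and it assumes the daemon set name `poison-pill-ds` and the `poison-pill` namespace shown in the example output.

```shell
# Hypothetical helper: succeed only if every desired Poison Pill pod is ready.
# Assumptions: the daemon set is named poison-pill-ds in the poison-pill namespace.
daemonset_ready() {
  counts="$(oc get daemonset poison-pill-ds -n poison-pill \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}')"
  desired="${counts% *}"   # text before the space
  ready="${counts#* }"     # text after the space
  [ -n "$desired" ] && [ "$desired" = "$ready" ]
}
```

A typical use is `daemonset_ready && echo "all Poison Pill pods are ready"`. The function returns a non-zero status when the counts differ or the daemon set cannot be read.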
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes/eco-poison-pill-operator.adoc
4+
5+
[id="installing-poison-pill-operator-using-web-console_{context}"]
6+
= Installing the Poison Pill Operator by using the web console
7+
8+
You can use the {product-title} web console to install the Poison Pill Operator.
9+
10+
.Prerequisites
11+
12+
* Log in as a user with `cluster-admin` privileges.
13+
14+
.Procedure
15+
16+
. In the {product-title} web console, navigate to *Operators* -> *OperatorHub*.
17+
. Find the Poison Pill Operator in the list of available Operators, and then click *Install*.
18+
. Keep the default selection of *Installation mode* and *namespace* to ensure that the Operator is installed to the `poison-pill` namespace.
19+
. Click *Install*.
20+
21+
.Verification
22+
23+
To confirm that the installation is successful:
24+
25+
. Navigate to the *Operators* -> *Installed Operators* page.
26+
. Check that the Operator is installed in the `poison-pill` namespace and its status is `Succeeded`.
27+
28+
If the Operator is not installed successfully:
29+
30+
. Navigate to the *Operators* -> *Installed Operators* page and inspect the `Status` column for any errors or failures.
31+
. Navigate to the *Workloads* -> *Pods* page and check the logs of any pods in the `poison-pill` project that are reporting issues.
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes/eco-poison-pill-operator.adoc
4+
5+
[id="troubleshooting-poison-pill-operator_{context}"]
6+
= Troubleshooting the Poison Pill Operator
7+
8+
[id="general-troubleshooting-poison-pill-operator_{context}"]
9+
== General troubleshooting
10+
11+
Issue::
12+
You want to troubleshoot issues with the Poison Pill Operator.
13+
14+
Resolution::
15+
Check the Operator logs.
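One way to read the Operator logs is through the controller deployment. The following is a sketch, not a documented command sequence: the helper name `operator_logs` is hypothetical, and it assumes the `poison-pill-controller-manager` deployment and the `poison-pill` namespace used in the installation procedure.

```shell
# Hypothetical helper: print recent Poison Pill Operator logs.
# $1 = namespace where the Operator is installed (defaults to poison-pill, an assumption).
operator_logs() {
  oc logs deployment/poison-pill-controller-manager \
    -n "${1:-poison-pill}" \
    --all-containers=true \
    --tail=200
}
```

Run `operator_logs` with no arguments for the default namespace, or pass a different namespace if the Operator is installed elsewhere.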
16+
17+
[id="checking-daemon-set_{context}"]
18+
== Checking the daemon set
19+
Issue:: The Poison Pill Operator is installed but the daemon set is not available.
20+
21+
Resolution:: Check the Operator logs for errors or warnings.
22+
23+
[id="unsuccessful_remediation_{context}"]
24+
== Unsuccessful remediation
25+
Issue:: An unhealthy node was not remediated.
26+
27+
Resolution:: Verify that the `PoisonPillRemediation` CR was created by running the following command:
28+
+
29+
[source,terminal]
30+
----
31+
$ oc get ppr -A
32+
----
33+
+
34+
If the `MachineHealthCheck` controller did not create the `PoisonPillRemediation` CR when the node became unhealthy, check the logs of the `MachineHealthCheck` controller. Additionally, ensure that the `MachineHealthCheck` CR includes the required specification to use the remediation template.
35+
+
36+
If the `PoisonPillRemediation` CR was created, ensure that its name matches the unhealthy node or the machine object.
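To narrow the `MachineHealthCheck` controller logs down to one node, the log output can be filtered by node name. The following is a sketch under assumptions that you should confirm on your cluster: the helper name `mhc_logs_for_node` is hypothetical, and the controller is assumed to run as the `machine-healthcheck-controller` container of the `machine-api-controllers` deployment in the `openshift-machine-api` namespace.

```shell
# Hypothetical helper: show MachineHealthCheck controller log lines that mention a node.
# $1 = name of the unhealthy node.
# Assumption: the controller runs as the machine-healthcheck-controller container
# of the machine-api-controllers deployment; verify these names on your cluster.
mhc_logs_for_node() {
  oc logs deployment/machine-api-controllers \
    -c machine-healthcheck-controller \
    -n openshift-machine-api | grep -F -- "$1"
}
```

For example, `mhc_logs_for_node <node-name>` prints only the log lines that contain the given node name.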
37+
38+
[id="daemon-set-exists_{context}"]
39+
== Daemon set exists even after uninstalling the Poison Pill Operator
40+
Issue:: The Poison Pill daemon set exists even after uninstalling the Operator.
41+
42+
Resolution:: To remove the Poison Pill daemon set, delete it manually by running the following command:
43+
+
44+
[source,terminal]
45+
----
46+
$ oc delete ds <poison-pill-daemonset> -n <namespace>
47+
----
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
1+
[id="poison-pill-operator-remediate-nodes"]
2+
= Remediating nodes with the Poison Pill Operator
3+
include::modules/common-attributes.adoc[]
4+
:context: poison-pill-operator-remediate-nodes
5+
6+
toc::[]
7+
8+
You can use the Poison Pill Operator to automatically reboot unhealthy nodes. This remediation strategy minimizes downtime for stateful applications and ReadWriteOnce (RWO) volumes, and restores compute capacity if transient failures occur.
9+
10+
include::modules/eco-poison-pill-operator-about.adoc[leveloffset=+1]
11+
12+
include::modules/eco-poison-pill-operator-installation-web-console.adoc[leveloffset=+1]
13+
14+
include::modules/eco-poison-pill-operator-installation-cli.adoc[leveloffset=+1]
15+
16+
include::modules/eco-configuring-machine-health-check-with-poison-pill.adoc[leveloffset=+1]
17+
18+
include::modules/eco-poison-pill-operator-troubleshooting.adoc[leveloffset=+1]
19+
20+
[id="additional-resources-poison-pill-operator-installation"]
21+
== Additional resources
22+
23+
The Poison Pill Operator is supported in a restricted network environment. For more information, see xref:../../operators/admin/olm-restricted-networks.adoc#olm-restricted-networks[Using Operator Lifecycle Manager on restricted networks].
