Commit 0209d2b

[OSDOCS-4239]: CPMSO troubleshooting
addressing feedback
1 parent 2e4c83f commit 0209d2b

9 files changed

+182
-16
lines changed

_topic_maps/_topic_map.yml

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1994,8 +1994,10 @@ Topics:
19941994
File: cpmso-using
19951995
- Name: Control plane resiliency and recovery
19961996
File: cpmso-resiliency
1997-
#- Name: Troubleshooting the Control Plane Machine Set Operator
1998-
# File: cpmso-troubleshooting
1997+
- Name: Troubleshooting the control plane machine set
1998+
File: cpmso-troubleshooting
1999+
- Name: Disabling the control plane machine set
2000+
File: cpmso-disabling
19992001
- Name: Deploying machine health checks
20002002
File: deploying-machine-health-checks
20012003
---

machine_management/control_plane_machine_management/cpmso-disabling.adoc

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
:_content-type: ASSEMBLY
2+
[id="cpmso-disabling"]
3+
= Disabling the control plane machine set
4+
include::_attributes/common-attributes.adoc[]
5+
:context: cpmso-disabling
6+
7+
toc::[]
8+
9+
The `.spec.state` field in an activated `ControlPlaneMachineSet` custom resource (CR) cannot be changed from `Active` to `Inactive`. To disable the control plane machine set, you must delete the CR so that it is removed from the cluster.
10+
11+
When you delete the CR, the Control Plane Machine Set Operator performs cleanup operations and disables the control plane machine set. The Operator then removes the CR from the cluster and creates an inactive control plane machine set with default settings.
12+
13+
//Deleting the control plane machine set
14+
include::modules/cpmso-deleting.adoc[leveloffset=+1]
15+
16+
//Checking the control plane machine set custom resource status
17+
include::modules/cpmso-checking-status.adoc[leveloffset=+1]
18+
19+
[id="cpmso-reenabling_{context}"]
20+
== Re-enabling the control plane machine set
21+
22+
To re-enable the control plane machine set, you must ensure that the configuration in the CR is correct for your cluster and activate it.
23+
24+
[role="_additional-resources"]
25+
.Additional resources
26+
* xref:../../machine_management/control_plane_machine_management/cpmso-getting-started.adoc#cpmso-activating_cpmso-getting-started[Activating the control plane machine set custom resource]

machine_management/control_plane_machine_management/cpmso-getting-started.adoc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -57,7 +57,7 @@ The status of the Operator after installation depends on your cloud provider and
5757
//Checking the Control Plane Machine Set Operator status
5858
include::modules/cpmso-checking-status.adoc[leveloffset=+1]
5959

60-
//Activating the Control Plane Machine Set Operator
60+
//Activating the control plane machine set custom resource
6161
include::modules/cpmso-activating.adoc[leveloffset=+1]
6262

6363
[role="_additional-resources"]

machine_management/control_plane_machine_management/cpmso-troubleshooting.adoc

Lines changed: 20 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,19 +1,31 @@
11
:_content-type: ASSEMBLY
22
[id="cpmso-troubleshooting"]
3-
= Troubleshooting the Control Plane Machine Set Operator
3+
= Troubleshooting the control plane machine set
44
include::_attributes/common-attributes.adoc[]
55
:context: cpmso-troubleshooting
66

77
toc::[]
88

9-
//todo: add Checking the Control Plane Machine Set Operator status
10-
//include::modules/cpmso-checking-status.adoc[leveloffset=+1]
9+
Use the information in this section to understand and recover from issues you might encounter.
1110

12-
[id="cpmso_ts_ilb_missing_{context}"]
13-
== Internal load balancer missing form Azure provider specification
11+
//Checking the control plane machine set custom resource status
12+
include::modules/cpmso-checking-status.adoc[leveloffset=+1]
1413

15-
The `internalLoadBalancer` parameter is required in both the `ControlPlaneMachineSet` and control plane `Machine` CRs, but might not be prepopulated. If this parameter is not populated in those CRs on your cluster, you must populate it on both CRs.
14+
[role="_additional-resources"]
15+
.Additional resources
16+
* xref:../../machine_management/control_plane_machine_management/cpmso-getting-started.adoc#cpmso-activating_cpmso-getting-started[Activating the control plane machine set custom resource]
17+
* xref:../../machine_management/control_plane_machine_management/cpmso-getting-started.adoc#cpmso-creating-cr_cpmso-getting-started[Creating a control plane machine set custom resource]
1618
17-
For more information on where this parameter is located in the Azure provider specification, see xref:../../machine_management/control_plane_machine_management/cpmso-configuration.adoc#cpmso-yaml-provider-spec-azure_cpmso-configuration[Sample Azure provider specification].
19+
//Adding a missing Azure internal load balancer
20+
include::modules/cpmso-ts-ilb-missing.adoc[leveloffset=+1]
1821

19-
//Would like to include some detail about the machine CR too, need to see if we have tha structure somewhere.
22+
[role="_additional-resources"]
23+
.Additional resources
24+
* xref:../../machine_management/control_plane_machine_management/cpmso-configuration.adoc#cpmso-yaml-provider-spec-azure_cpmso-configuration[Sample Azure provider specification]
25+
26+
//Recovering a degraded etcd Operator after a machine health check operation
27+
include::modules/cpmso-ts-mhc-etcd-degraded.adoc[leveloffset=+1]
28+
29+
[role="_additional-resources"]
30+
.Additional resources
31+
* xref:../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]

modules/cpmso-checking-status.adoc

Lines changed: 16 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,18 @@
11
// Module included in the following assemblies:
22
//
33
// * machine_management/cpmso-getting-started.adoc
4+
// * machine_management/cpmso-troubleshooting.adoc
5+
// * machine_management/cpmso-disabling.adoc
6+
7+
ifeval::["{context}" == "cpmso-disabling"]
8+
:cpmso-disabling:
9+
endif::[]
410

511
:_content-type: PROCEDURE
612
[id="cpmso-checking-status_{context}"]
7-
= Checking the control plane machine set custom resource status
13+
= Checking the control plane machine set custom resource state
814

9-
You can verify the existence and status of the `ControlPlaneMachineSet` custom resource (CR).
15+
You can verify the existence and state of the `ControlPlaneMachineSet` custom resource (CR).
1016

1117
.Procedure
1218

@@ -23,10 +29,16 @@ $ oc get controlplanemachineset.machine.openshift.io cluster --namespace openshi
2329

2430
** A result of `NotFound` indicates that there is no existing `ControlPlaneMachineSet` CR.
2531

32+
ifndef::cpmso-disabling[]
2633
.Next steps
2734

28-
Before using the Operator, you must ensure that a `ControlPlaneMachineSet` CR with the correct settings for your cluster exists.
35+
To use the control plane machine set, you must ensure that a `ControlPlaneMachineSet` CR with the correct settings for your cluster exists.
2936

3037
* If your cluster has an existing CR, you must verify that the configuration in the CR is correct for your cluster.
3138
32-
* If your cluster does not have an existing CR, you must create one with the correct configuration for your cluster.
39+
* If your cluster does not have an existing CR, you must create one with the correct configuration for your cluster.
40+
endif::[]
41+
42+
ifeval::["{context}" == "cpmso-disabling"]
43+
:!cpmso-disabling:
44+
endif::[]

modules/cpmso-deleting.adoc

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * machine_management/cpmso-disabling.adoc
4+
5+
:_content-type: PROCEDURE
6+
[id="cpmso-deleting_{context}"]
7+
= Deleting the control plane machine set
8+
9+
To stop managing control plane machines with the control plane machine set on your cluster, you must delete the `ControlPlaneMachineSet` custom resource (CR).
10+
11+
.Procedure
12+
13+
* Delete the control plane machine set CR by running the following command:
14+
+
15+
[source,terminal]
16+
----
17+
$ oc delete controlplanemachineset.machine.openshift.io cluster --namespace openshift-machine-api
18+
----
19+
20+
.Verification
21+
22+
* Check the control plane machine set custom resource state. A result of `Inactive` indicates that the removal and replacement process is successful. A `ControlPlaneMachineSet` CR exists but is not activated.
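
The verification step above can also be scripted. The following is a minimal sketch and is not part of this commit. It wraps the `oc get` call in a hypothetical helper named `cpms_state` and assumes that the state is exposed in the `.spec.state` field, as the assembly text in this change describes.

```shell
# Hypothetical helper (not part of the commit): print the state of the
# control plane machine set. Assumes the state is stored in .spec.state.
cpms_state() {
  oc get controlplanemachineset.machine.openshift.io cluster \
    --namespace openshift-machine-api \
    -o jsonpath='{.spec.state}'
}
```

Running `cpms_state` after the deletion should print `Inactive` once the Operator has created the replacement CR. If no CR exists, `oc` reports a `NotFound` error instead.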

modules/cpmso-ts-ilb-missing.adoc

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * machine_management/cpmso-troubleshooting.adoc
4+
5+
:_content-type: PROCEDURE
6+
[id="cpmso-ts-ilb-missing_{context}"]
7+
= Adding a missing Azure internal load balancer
8+
9+
The `internalLoadBalancer` parameter is required in both the `ControlPlaneMachineSet` and control plane `Machine` custom resources (CRs) for Azure. If this parameter is not preconfigured on your cluster, you must add it to both CRs.
10+
11+
For more information about where this parameter is located in the Azure provider specification, see the sample Azure provider specification. The placement in the control plane `Machine` CR is similar.
12+
13+
.Procedure
14+
15+
. List the control plane machines in your cluster by running the following command:
16+
+
17+
[source,terminal]
18+
----
19+
$ oc get machines -l machine.openshift.io/cluster-api-machine-role==master -n openshift-machine-api
20+
----
21+
22+
. For each control plane machine, edit the CR by running the following command:
23+
+
24+
[source,terminal]
25+
----
26+
$ oc edit machine <control_plane_machine_name>
27+
----
28+
29+
. Add the `internalLoadBalancer` parameter with the correct details for your cluster and save your changes.
30+
31+
. Edit your control plane machine set CR by running the following command:
32+
+
33+
[source,terminal]
34+
----
35+
$ oc --namespace openshift-machine-api edit controlplanemachineset.machine.openshift.io cluster
36+
----
37+
38+
. Add the `internalLoadBalancer` parameter with the correct details for your cluster and save your changes.
39+
40+
.Next steps
41+
42+
* For clusters that use the default `RollingUpdate` update strategy, the Operator automatically propagates the changes to your control plane configuration.
43+
44+
* For clusters that are configured to use the `OnDelete` update strategy, you must replace your control plane machines manually.
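
Before and after editing, it can help to confirm whether the parameter is set. The following is a minimal sketch and is not part of this commit. The helper names `machine_ilb` and `cpms_ilb` are hypothetical, and both JSONPath expressions are assumptions based on the layout of the sample Azure provider specification. The namespace is passed explicitly in case the current project is not `openshift-machine-api`.

```shell
# Hypothetical helper: print the internalLoadBalancer value of one control
# plane Machine CR. The field path is an assumption; an empty result
# suggests that the parameter is missing.
machine_ilb() {
  oc get machine "$1" \
    --namespace openshift-machine-api \
    -o jsonpath='{.spec.providerSpec.value.internalLoadBalancer}'
}

# Hypothetical helper: the same check for the ControlPlaneMachineSet CR.
# The template path is likewise an assumption.
cpms_ilb() {
  oc get controlplanemachineset.machine.openshift.io cluster \
    --namespace openshift-machine-api \
    -o jsonpath='{.spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.internalLoadBalancer}'
}
```

For example, `machine_ilb <control_plane_machine_name>` can be run for each machine returned by the list command in the procedure, followed by `cpms_ilb` for the control plane machine set.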

modules/cpmso-ts-mhc-etcd-degraded.adoc

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * machine_management/cpmso-troubleshooting.adoc
4+
5+
:_content-type: PROCEDURE
6+
[id="cpmso-ts-etcd-degraded_{context}"]
7+
= Recovering a degraded etcd Operator
8+
9+
Certain situations can cause the etcd Operator to become degraded.
10+
11+
For example, while performing remediation, the machine health check might delete a control plane machine that is hosting etcd. If the etcd member is not reachable at that time, the etcd Operator becomes degraded.
12+
13+
When the etcd Operator is degraded, manual intervention is required to force the Operator to remove the failed member and restore the cluster state.
14+
15+
.Procedure
16+
17+
. List the control plane machines in your cluster by running the following command:
18+
+
19+
[source,terminal]
20+
----
21+
$ oc get machines -l machine.openshift.io/cluster-api-machine-role==master -n openshift-machine-api -o wide
22+
----
23+
+
24+
Any of the following conditions might indicate a failed control plane machine:
25+
+
26+
--
27+
** The `STATE` value is `stopped`.
28+
** The `PHASE` value is `Failed`.
29+
** The `PHASE` value is `Deleting` for more than ten minutes.
30+
--
31+
+
32+
[IMPORTANT]
33+
====
34+
Before continuing, ensure that your cluster has two healthy control plane machines. Performing the actions in this procedure on more than one control plane machine risks losing etcd quorum and can cause data loss.
35+
36+
If you have lost the majority of your control plane hosts, leading to etcd quorum loss, then you must follow the disaster recovery procedure "Restoring to a previous cluster state" instead of this procedure.
37+
====
38+
39+
. Edit the machine CR for the failed control plane machine by running the following command:
40+
+
41+
[source,terminal]
42+
----
43+
$ oc edit machine <control_plane_machine_name>
44+
----
45+
46+
. Remove the contents of the `lifecycleHooks` parameter from the failed control plane machine and save your changes.
47+
+
48+
The etcd Operator removes the failed machine from the cluster and can then safely add new etcd members.
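
The effect of the procedure above can be inspected from the command line. The following is a minimal sketch and is not part of this commit. The helper names are hypothetical, and the sketch assumes that the hooks are stored in the `.spec.lifecycleHooks` field of the `Machine` CR. The namespace is passed explicitly in case the current project is not `openshift-machine-api`.

```shell
# Hypothetical helper: print the lifecycle hooks that are still set on a
# Machine CR. Assumes the hooks are stored under .spec.lifecycleHooks.
machine_hooks() {
  oc get machine "$1" \
    --namespace openshift-machine-api \
    -o jsonpath='{.spec.lifecycleHooks}'
}

# Hypothetical helper: print the status columns of the etcd cluster Operator.
etcd_operator_status() {
  oc get clusteroperator etcd
}
```

Running `machine_hooks <control_plane_machine_name>` before and after the edit shows whether the hooks were removed. `etcd_operator_status` can then be used to watch whether the `DEGRADED` column returns to `False`.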

modules/cpmso-yaml-provider-spec-azure.adoc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -59,7 +59,7 @@ providerSpec:
5959
<1> Specifies the secret name for the cluster. Do not change this value.
6060
<2> Specifies the image details for your control plane machine set.
6161
<3> Specifies an image that is compatible with your instance type. The Hyper-V generation V2 images created by the installation program have a `-gen2` suffix, while V1 images have the same name without the suffix.
62-
<4> Specifies the internal load balancer for the control plane. This field might not be prepopulated but is required in both the `ControlPlaneMachineSet` and control plane `Machine` CRs.
62+
<4> Specifies the internal load balancer for the control plane. This field might not be preconfigured but is required in both the `ControlPlaneMachineSet` and control plane `Machine` CRs.
6363
<5> Specifies the cloud provider platform type. Do not change this value.
6464
<6> Specifies the region to place control plane machines on.
6565
<7> Specifies the disk configuration for the control plane.
