Commit 4e767ac

Merge pull request #53246 from jeana-redhat/OSDOCS-4235-CPMS-failure-recovery
[OSDOCS-4235]: CPMS resiliency and recovery
2 parents 2ac3b49 + 9fe46fa commit 4e767ac

14 files changed (+131, -29 lines changed)

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1977,6 +1977,8 @@ Topics:
19771977
File: cpmso-configuration
19781978
#- Name: Using the Control Plane Machine Set Operator
19791979
# File: cpmso-using
1980+
- Name: Control plane resiliency and recovery
1981+
File: cpmso-resiliency
19801982
#- Name: Troubleshooting the Control Plane Machine Set Operator
19811983
# File: cpmso-troubleshooting
19821984
- Name: Deploying machine health checks
Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
:_content-type: ASSEMBLY
2+
[id="cpmso-resiliency"]
3+
= Control plane resiliency and recovery
4+
include::_attributes/common-attributes.adoc[]
5+
:context: cpmso-resiliency
6+
7+
toc::[]
8+
9+
You can use the Control Plane Machine Set Operator to improve the resiliency of the control plane for your {product-title} cluster.
10+
11+
[id="cpmso-failure-domains_{context}"]
12+
== High availability and fault tolerance with failure domains
13+
14+
When possible, the control plane machine set spreads the control plane machines across multiple failure domains. This configuration provides high availability and fault tolerance within the control plane and can help protect it when issues arise within the infrastructure provider.
15+
16+
//Failure domain platform support and configuration
17+
include::modules/cpmso-failure-domains-provider.adoc[leveloffset=+2]
18+
19+
[role="_additional-resources"]
20+
.Additional resources
21+
22+
* xref:../../machine_management/control_plane_machine_management/cpmso-configuration.adoc#cpmso-yaml-failure-domain-aws_cpmso-configuration[Sample Amazon Web Services failure domain configuration]
23+
24+
* xref:../../machine_management/control_plane_machine_management/cpmso-configuration.adoc#cpmso-yaml-failure-domain-azure_cpmso-configuration[Sample Microsoft Azure failure domain configuration]
25+
26+
//Balancing control plane machines
27+
include::modules/cpmso-failure-domains-balancing.adoc[leveloffset=+2]
28+
29+
//Recovery of the failed control plane machines
30+
include::modules/cpmso-control-plane-recovery.adoc[leveloffset=+1]
31+
32+
[role="_additional-resources"]
33+
.Additional resources
34+
35+
* xref:../../machine_management/deploying-machine-health-checks.adoc#deploying-machine-health-checks[Deploying machine health checks]

machine_management/control_plane_machine_management/cpmso-using.adoc

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -11,9 +11,6 @@ The Control Plane Machine Set Operator automates the following capabilities:
1111
//Vertical resizing of the control plane
1212
include::modules/cpmso-feat-vertical-resize.adoc[leveloffset=+1]
1313

14-
//Recovery of the failed control plane machines
15-
include::modules/cpmso-feat-failure-recovery.adoc[leveloffset=+1]
16-
1714
//Updating the control plane configuration
1815
include::modules/cpmso-feat-config-update.adoc[leveloffset=+1]
1916

machine_management/deploying-machine-health-checks.adoc

Lines changed: 3 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -11,21 +11,14 @@ You can configure and deploy a machine health check to automatically repair dama
1111
include::modules/machine-user-provisioned-limitations.adoc[leveloffset=+1]
1212

1313
include::modules/machine-health-checks-about.adoc[leveloffset=+1]
14-
1514
[role="_additional-resources"]
1615
.Additional resources
17-
18-
* For more information about the node conditions you can define in a `MachineHealthCheck` CR, see xref:../nodes/nodes/nodes-nodes-viewing.html#nodes-nodes-viewing-listing_nodes-nodes-viewing[About listing all the nodes in a cluster].
19-
20-
* For more information about short-circuiting, see xref:../machine_management/deploying-machine-health-checks.adoc#machine-health-checks-short-circuiting_deploying-machine-health-checks[Short-circuiting machine health check remediation].
16+
* xref:../nodes/nodes/nodes-nodes-viewing.adoc#nodes-nodes-viewing-listing_nodes-nodes-viewing[About listing all the nodes in a cluster]
17+
* xref:../machine_management/deploying-machine-health-checks.adoc#machine-health-checks-short-circuiting_deploying-machine-health-checks[Short-circuiting machine health check remediation]
18+
* xref:../machine_management/control_plane_machine_management/cpmso-about.adoc#cpmso-about[About the Control Plane Machine Set Operator]
2119
2220
include::modules/machine-health-checks-resource.adoc[leveloffset=+1]
2321
24-
////
25-
[role="_additional-resources"]
26-
.Additional resources
27-
////
28-
2922
include::modules/machine-health-checks-creating.adoc[leveloffset=+1]
3023
3124
You can configure and deploy a machine health check to detect and repair unhealthy bare metal nodes.
Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * machine_management/cpmso-resiliency.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="cpmso-control-plane-recovery_{context}"]
7+
= Recovery of failed control plane machines
8+
9+
The Control Plane Machine Set Operator automates the recovery of control plane machines. When a control plane machine is deleted, the Operator creates a replacement with the configuration that is specified in the `ControlPlaneMachineSet` custom resource (CR).
10+
11+
For clusters that use control plane machine sets, you can configure a machine health check. The machine health check deletes unhealthy control plane machines so that they are replaced.
12+
13+
[IMPORTANT]
14+
====
15+
If you configure a `MachineHealthCheck` resource for the control plane, set the value of `maxUnhealthy` to `1`.
16+
17+
This configuration ensures that the machine health check takes no action when multiple control plane machines appear to be unhealthy. Multiple unhealthy control plane machines can indicate that the etcd cluster is degraded or that a scaling operation to replace a failed machine is in progress.
18+
19+
If the etcd cluster is degraded, manual intervention might be required. If a scaling operation is in progress, the machine health check should allow it to finish.
20+
====
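
As a minimal sketch of the guidance above, a `MachineHealthCheck` resource for control plane machines might look like the following. The resource name, the selector labels, and the timeout values are assumptions for illustration and are not part of this change. Verify them against your cluster before use.

[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: control-plane-health # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels: # assumed labels for selecting control plane machines
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  unhealthyConditions: # assumed conditions and timeouts
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  maxUnhealthy: 1 # remediate only when no more than one machine is unhealthy
----

With `maxUnhealthy: 1`, the health check remediates a single unhealthy control plane machine but takes no action when two or more selected machines are unhealthy at the same time.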
Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * machine_management/cpmso-resiliency.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="cpmso-failure-domains-balancing_{context}"]
7+
= Balancing control plane machines
8+
9+
The control plane machine set balances control plane machines across the failure domains that are specified in the custom resource (CR).
10+
11+
//If failure domains must be reused, they are selected alphabetically by name.
12+
When possible, the control plane machine set uses each failure domain equally to ensure appropriate fault tolerance. If there are fewer failure domains than control plane machines, failure domains are selected for reuse alphabetically by name. For clusters with no failure domains specified, all control plane machines are placed within a single failure domain.
13+
14+
Some changes to the failure domain configuration cause the control plane machine set to rebalance the control plane machines. For example, if you add failure domains to a cluster with fewer failure domains than control plane machines, the control plane machine set rebalances the machines across all available failure domains.
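
To illustrate the reuse behavior, the following sketch shows a `ControlPlaneMachineSet` fragment with three replicas and only two AWS failure domains. The zone names and subnet filter values are assumptions chosen for illustration. Because there are fewer failure domains than machines, one domain must host two machines. Under the alphabetical reuse rule described above, `us-east-1a` would be expected to be reused first.

[source,yaml]
----
spec:
  replicas: 3
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        platform: AWS
        aws:
        - placement:
            availabilityZone: us-east-1a # assumed zone; expected to host two machines
          subnet:
            type: Filters
            filters:
            - name: tag:Name
              values:
              - <cluster_id>-private-us-east-1a # placeholder value
        - placement:
            availabilityZone: us-east-1b # assumed zone; expected to host one machine
          subnet:
            type: Filters
            filters:
            - name: tag:Name
              values:
              - <cluster_id>-private-us-east-1b # placeholder value
----

Adding a third failure domain to this fragment is an example of a change that would cause the control plane machine set to rebalance the machines so that each domain hosts one machine.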
Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * machine_management/cpmso-resiliency.adoc
4+
5+
:_content-type: REFERENCE
6+
[id="cpmso-failure-domains-provider_{context}"]
7+
= Failure domain platform support and configuration
8+
9+
The control plane machine set concept of a failure domain is analogous to existing concepts on cloud providers. Not all platforms support the use of failure domains.
10+
11+
.Failure domain support matrix
12+
[cols="<.^,^.^,^.^"]
13+
|====
14+
|Cloud provider |Support for failure domains |Provider nomenclature
15+
16+
|Amazon Web Services (AWS)
17+
|X
18+
|link:https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones[Availability Zone (AZ)]
19+
20+
|Microsoft Azure
21+
|X
22+
|link:https://learn.microsoft.com/en-us/azure/azure-web-pubsub/concept-availability-zones[Azure availability zone]
23+
24+
|VMware vSphere
25+
|
26+
|Not applicable
27+
|====
28+
29+
The failure domain configuration in the control plane machine set custom resource (CR) is platform-specific. For more information about failure domain parameters in the CR, see the sample failure domain configuration for your provider.

modules/cpmso-feat-failure-recovery.adoc

Lines changed: 0 additions & 7 deletions
This file was deleted.

modules/cpmso-yaml-failure-domain-aws.adoc

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,9 @@
66
[id="cpmso-yaml-failure-domain-aws_{context}"]
77
= Sample AWS failure domain configuration
88

9-
The `ControlPlaneMachineSet` CR spreads control plane machines across multiple failure domains when possible.
9+
The control plane machine set concept of a failure domain is analogous to the existing AWS concept of an link:https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones[_Availability Zone (AZ)_]. The `ControlPlaneMachineSet` CR spreads control plane machines across multiple failure domains when possible.
10+
11+
When configuring AWS failure domains in the control plane machine set, you must specify the availability zone name and the subnet to use.
1012

1113
.Sample AWS failure domain values
1214
[source,yaml]

modules/cpmso-yaml-failure-domain-azure.adoc

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,9 @@
66
[id="cpmso-yaml-failure-domain-azure_{context}"]
77
= Sample Azure failure domain configuration
88

9-
The `ControlPlaneMachineSet` CR spreads control plane machines across multiple failure domains when possible.
9+
The control plane machine set concept of a failure domain is analogous to the existing Azure concept of an link:https://learn.microsoft.com/en-us/azure/azure-web-pubsub/concept-availability-zones[_Azure availability zone_]. The `ControlPlaneMachineSet` CR spreads control plane machines across multiple failure domains when possible.
10+
11+
When configuring Azure failure domains in the control plane machine set, you must specify the availability zone name.
1012

1113
.Sample Azure failure domain values
1214
[source,yaml]
