Commit e29f386

OSDOCS#8867: Adding failure domains to Nutanix docs

1 parent 96ac9a4 commit e29f386

11 files changed: +473 −2 lines changed

_topic_maps/_topic_map.yml

Lines changed: 5 additions & 0 deletions

@@ -307,6 +307,8 @@ Topics:
   Topics:
   - Name: Preparing to install on Nutanix
     File: preparing-to-install-on-nutanix
+  - Name: Fault tolerant deployments
+    File: nutanix-failure-domains
   - Name: Installing a cluster on Nutanix
     File: installing-nutanix-installer-provisioned
   - Name: Installing a cluster on Nutanix in a restricted network
@@ -606,6 +608,9 @@ Topics:
   - Name: AWS Local Zone tasks
     File: aws-compute-edge-tasks
     Distros: openshift-enterprise
+  - Name: Adding failure domains to an existing Nutanix cluster
+    File: adding-nutanix-failure-domains
+    Distros: openshift-origin,openshift-enterprise
 ---
 Name: Updating clusters
 Dir: updating

installing/installing_nutanix/installing-nutanix-installer-provisioned.adoc

Lines changed: 1 addition & 0 deletions

@@ -47,6 +47,7 @@ include::modules/installation-initializing.adoc[leveloffset=+1]
 * xref:../../installing/installing_nutanix/installation-config-parameters-nutanix.adoc#installation-config-parameters-nutanix[Installation configuration parameters for Nutanix]

 include::modules/installation-nutanix-config-yaml.adoc[leveloffset=+2]
+include::modules/installation-configuring-nutanix-failure-domains.adoc[leveloffset=+2]
 include::modules/installation-configure-proxy.adoc[leveloffset=+2]

 include::modules/cli-installing-cli.adoc[leveloffset=+1]

installing/installing_nutanix/installing-restricted-networks-nutanix-installer-provisioned.adoc

Lines changed: 1 addition & 0 deletions

@@ -46,6 +46,7 @@ include::modules/installation-initializing.adoc[leveloffset=+1]
 * xref:../../installing/installing_nutanix/installation-config-parameters-nutanix.adoc#installation-config-parameters-nutanix[Installation configuration parameters for Nutanix]

 include::modules/installation-nutanix-config-yaml.adoc[leveloffset=+2]
+include::modules/installation-configuring-nutanix-failure-domains.adoc[leveloffset=+2]
 include::modules/installation-configure-proxy.adoc[leveloffset=+2]

 include::modules/cli-installing-cli.adoc[leveloffset=+1]
installing/installing_nutanix/nutanix-failure-domains.adoc (new file)

Lines changed: 26 additions & 0 deletions

@@ -0,0 +1,26 @@
+:_mod-docs-content-type: ASSEMBLY
+[id="nutanix-failure-domains"]
+= Fault tolerant deployments using multiple Prism Elements
+include::_attributes/common-attributes.adoc[]
+:context: nutanix-failure-domains
+
+toc::[]
+
+By default, the installation program installs control plane and compute machines into a single Nutanix Prism Element (cluster). To improve the fault tolerance of your {product-title} cluster, you can specify that these machines be distributed across multiple Nutanix clusters by configuring failure domains.
+
+A failure domain represents an additional Prism Element instance that is available to {product-title} machine pools during and after installation.
+
+include::modules/installation-nutanix-failure-domains-req.adoc[leveloffset=+1]
+
+== Installation method and failure domain configuration
+
+The {product-title} installation method determines how and when you configure failure domains:
+
+* If you deploy using installer-provisioned infrastructure, you can configure failure domains in the installation configuration file before deploying the cluster. For more information, see xref:../../installing/installing_nutanix/installing-nutanix-installer-provisioned.adoc#installation-configuring-nutanix-failure-domains_installing-nutanix-installer-provisioned[Configuring failure domains].
++
+You can also configure failure domains after the cluster is deployed.
+* If you deploy using the {ai-full}, you configure failure domains after the cluster is deployed.
++
+For more information about configuring failure domains post-installation, see xref:../../post_installation_configuration/adding-nutanix-failure-domains.adoc#adding-failure-domains-to-an-existing-nutanix-cluster[Adding failure domains to an existing Nutanix cluster].
+
+* If you deploy using infrastructure that you manage (user-provisioned infrastructure), no additional configuration is required. After the cluster is deployed, you can manually distribute control plane and compute machines across failure domains.

modules/installation-configuration-parameters.adoc

Lines changed: 52 additions & 2 deletions

@@ -2956,6 +2956,17 @@ Additional Nutanix configuration parameters are described in the following table
 |The value of a prism category key-value pair to apply to compute VMs. This parameter must be accompanied by the `key` parameter, and both `key` and `value` parameters must exist in Prism Central.
 |String

+|compute:
+  platform:
+    nutanix:
+      failureDomains:
+d|The failure domains that apply to only compute machines.
+
+Failure domains are specified in `platform.nutanix.failureDomains`.
+d|List.
+
+The name of one or more failure domains.
+
 |compute:
   platform:
     nutanix:
@@ -2995,6 +3006,17 @@ Additional Nutanix configuration parameters are described in the following table
 |The value of a prism category key-value pair to apply to control plane VMs. This parameter must be accompanied by the `key` parameter, and both `key` and `value` parameters must exist in Prism Central.
 |String

+|controlPlane:
+  platform:
+    nutanix:
+      failureDomains:
+d|The failure domains that apply to only control plane machines.
+
+Failure domains are specified in `platform.nutanix.failureDomains`.
+d|List.
+
+The name of one or more failure domains.
+
 |controlPlane:
   platform:
     nutanix:
@@ -3027,6 +3049,17 @@ Additional Nutanix configuration parameters are described in the following table
 |The value of a prism category key-value pair to apply to all VMs. This parameter must be accompanied by the `key` parameter, and both `key` and `value` parameters must exist in Prism Central.
 |String

+|platform:
+  nutanix:
+    defaultMachinePlatform:
+      failureDomains:
+d|The failure domains that apply to both control plane and compute machines.
+
+Failure domains are specified in `platform.nutanix.failureDomains`.
+d|List.
+
+The name of one or more failure domains.
+
 |platform:
   nutanix:
     defaultMachinePlatform:
@@ -3056,6 +3089,23 @@ Additional Nutanix configuration parameters are described in the following table
 |The virtual IP (VIP) address that you configured for control plane API access.
 |IP address

+|platform:
+  nutanix:
+    failureDomains:
+    - name:
+      prismElement:
+        name:
+        uuid:
+      subnetUUIDs:
+      -
+a|By default, the installation program installs cluster machines to a single Prism Element instance. You can specify additional Prism Element instances for fault tolerance, and then apply them to:
+
+* The cluster's default machine configuration
+* Only control plane or compute machine pools
+d|A list of configured failure domains.
+
+For more information on usage, see "Configuring failure domains" in "Installing a cluster on Nutanix".
+
 |platform:
   nutanix:
     ingressVIP:
@@ -3129,8 +3179,8 @@ Additional Nutanix configuration parameters are described in the following table
 |====
 [.small]
 --
-1. The `prismElements` section holds a list of Prism Elements (clusters). A Prism Element encompasses all of the Nutanix resources, for example virtual machines and subnets, that are used to host the {product-title} cluster. Only a single Prism Element is supported.
-2. Only one subnet per {product-title} cluster is supported.
+1. The `prismElements` section holds a list of Prism Elements (clusters). A Prism Element encompasses all of the Nutanix resources, for example virtual machines and subnets, that are used to host the {product-title} cluster.
+2. Only one subnet per Prism Element in an {product-title} cluster is supported.
 --
 endif::nutanix[]
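The three `failureDomains` parameters above reference, by name, domains that are defined once under `platform.nutanix.failureDomains`. A minimal sketch of how the pieces fit together (the domain names, Prism Element names, and UUID placeholders are illustrative, not taken from the commit):

```yaml
platform:
  nutanix:
    defaultMachinePlatform:
      failureDomains:          # referenced by name
      - failure-domain-1
      - failure-domain-2
    failureDomains:            # defined once here
    - name: failure-domain-1
      prismElement:
        name: pe-1
        uuid: <prism_element_uuid_1>
      subnetUUIDs:
      - <network_uuid_1>
    - name: failure-domain-2
      prismElement:
        name: pe-2
        uuid: <prism_element_uuid_2>
      subnetUUIDs:
      - <network_uuid_2>
```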

modules/installation-configuring-nutanix-failure-domains.adoc (new file)

Lines changed: 99 additions & 0 deletions

@@ -0,0 +1,99 @@
+// Module included in the following assemblies:
+//
+// * installing/installing_nutanix/installing-nutanix-installer-provisioned.adoc
+// * installing/installing_nutanix/installing-restricted-networks-nutanix-installer-provisioned.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="installation-configuring-nutanix-failure-domains_{context}"]
+= Configuring failure domains
+
+Failure domains improve the fault tolerance of an {product-title} cluster by distributing control plane and compute machines across multiple Nutanix Prism Elements (clusters).
+
+[TIP]
+====
+It is recommended that you configure three failure domains to ensure high availability.
+====
+
+.Prerequisites
+
+* You have an installation configuration file (`install-config.yaml`).
+
+.Procedure
+
+. Edit the `install-config.yaml` file and add the following stanza to configure the first failure domain:
++
+[source,yaml]
+----
+apiVersion: v1
+baseDomain: example.com
+compute:
+# ...
+platform:
+  nutanix:
+    failureDomains:
+    - name: <failure_domain_name>
+      prismElement:
+        name: <prism_element_name>
+        uuid: <prism_element_uuid>
+      subnetUUIDs:
+      - <network_uuid>
+# ...
+----
++
+where:
+
+`<failure_domain_name>`:: Specifies a unique name for the failure domain. The name must be 64 characters or fewer, and can include lowercase letters, digits, and dashes (`-`). A dash cannot be the first or last character of the name.
+`<prism_element_name>`:: Optional. Specifies the name of the Prism Element.
+`<prism_element_uuid>`:: Specifies the UUID of the Prism Element.
+`<network_uuid>`:: Specifies the UUID of the Prism Element subnet object. The subnet's IP address prefix (CIDR) should contain the virtual IP addresses that the {product-title} cluster uses. Only one subnet per failure domain (Prism Element) in an {product-title} cluster is supported.
+
+. As required, configure additional failure domains.
+. To distribute control plane and compute machines across the failure domains, do one of the following:
+
+** If compute and control plane machines can share the same set of failure domains, add the failure domain names under the cluster's default machine configuration.
++
+.Example of control plane and compute machines sharing a set of failure domains
++
+[source,yaml]
+----
+apiVersion: v1
+baseDomain: example.com
+compute:
+# ...
+platform:
+  nutanix:
+    defaultMachinePlatform:
+      failureDomains:
+      - failure-domain-1
+      - failure-domain-2
+      - failure-domain-3
+# ...
+----
+** If compute and control plane machines must use different failure domains, add the failure domain names under the respective machine pools.
++
+.Example of control plane and compute machines using different failure domains
++
+[source,yaml]
+----
+apiVersion: v1
+baseDomain: example.com
+controlPlane:
+  platform:
+    nutanix:
+      failureDomains:
+      - failure-domain-1
+      - failure-domain-2
+      - failure-domain-3
+# ...
+compute:
+  platform:
+    nutanix:
+      failureDomains:
+      - failure-domain-1
+      - failure-domain-2
+# ...
+----
+
+. Save the file.
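The `<failure_domain_name>` constraints described in the procedure above can be sketched as a quick validation check. This is a hypothetical helper for illustration only, not part of the installation program:

```python
import re

# Documented constraints: 64 characters or fewer; lowercase letters,
# digits, and dashes only; a dash cannot be the first or last character.
_NAME_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")

def is_valid_failure_domain_name(name: str) -> bool:
    """Return True if `name` satisfies the documented naming rules."""
    return len(name) <= 64 and _NAME_RE.fullmatch(name) is not None
```

For example, `is_valid_failure_domain_name("failure-domain-1")` is `True`, while a name with a leading dash or an uppercase letter is rejected.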
modules/installation-nutanix-failure-domains-req.adoc (new file)

Lines changed: 14 additions & 0 deletions

@@ -0,0 +1,14 @@
+// Module included in the following assemblies:
+//
+// * installing/installing_nutanix/nutanix-failure-domains.adoc
+// * post_installation_configuration/adding-nutanix-failure-domains.adoc
+
+:_mod-docs-content-type: CONCEPT
+[id="installation-nutanix-failure-domains-req_{context}"]
+= Failure domain requirements
+
+When planning to use failure domains, consider the following requirements:
+
+* All Nutanix Prism Element instances must be managed by the same instance of Prism Central. A deployment that comprises multiple Prism Central instances is not supported.
+* The machines that make up the Prism Element clusters must reside on the same Ethernet network for failure domains to be able to communicate with each other.
+* A subnet is required in each Prism Element that is used as a failure domain in the {product-title} cluster. When defining these subnets, they must share the same IP address prefix (CIDR) and should contain the virtual IP addresses that the {product-title} cluster uses.
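The subnet requirement in the last bullet can be sketched with Python's standard `ipaddress` module. This is an illustrative check under the stated requirements, not a supported validation tool:

```python
import ipaddress

def subnets_meet_requirements(subnet_cidrs, vips):
    """Check that all failure domain subnets share one CIDR and that
    the CIDR contains the cluster's virtual IP addresses."""
    networks = {ipaddress.ip_network(cidr) for cidr in subnet_cidrs}
    if len(networks) != 1:  # subnets must share the same IP address prefix
        return False
    network = networks.pop()
    return all(ipaddress.ip_address(vip) in network for vip in vips)
```

For example, two failure domain subnets both defined as `10.40.142.0/24` satisfy the check when the API and ingress VIPs fall inside that range; subnets with different prefixes do not.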
(new file)

Lines changed: 120 additions & 0 deletions

@@ -0,0 +1,120 @@
+// Module included in the following assemblies:
+//
+// * post_installation_configuration/adding-nutanix-failure-domains.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="post-installation-adding-nutanix-failure-domains-compute-machines_{context}"]
+= Distributing compute machines across failure domains
+
+You can distribute compute machines across Nutanix failure domains by performing either of the following tasks:
+
+* Modifying existing compute machine sets.
+* Creating new compute machine sets.
+
+The following procedure details how to distribute compute machines across failure domains by modifying existing compute machine sets. For more information on creating a compute machine set, see "Additional resources".
+
+.Prerequisites
+
+* You have configured the failure domains in the cluster's Infrastructure custom resource (CR).
+
+.Procedure
+
+. View the cluster's Infrastructure CR by running the following command:
++
+[source,terminal]
+----
+$ oc describe infrastructures.config.openshift.io cluster
+----
+. For each failure domain (`platformSpec.nutanix.failureDomains`), note the cluster's UUID, name, and subnet object UUID. These values are required to add a failure domain to a compute machine set.
+. List the compute machine sets in your cluster by running the following command:
++
+[source,terminal]
+----
+$ oc get machinesets -n openshift-machine-api
+----
+. Edit the first compute machine set by running the following command:
++
+[source,terminal]
+----
+$ oc edit machineset <machine_set_name> -n openshift-machine-api
+----
+. Configure the compute machine set to use the first failure domain by adding the following to the `spec.template.spec.providerSpec.value` stanza:
++
+[NOTE]
+====
+Be sure that the values you specify for the `cluster` and `subnets` fields match the values that were configured in the `failureDomains` stanza in the cluster's Infrastructure CR.
+====
++
+.Example compute machine set with Nutanix failure domains
+[source,yaml]
+----
+apiVersion: machine.openshift.io/v1beta1
+kind: MachineSet
+metadata:
+  creationTimestamp: null
+  labels:
+    machine.openshift.io/cluster-api-cluster: <cluster_name>
+  name: <machine_set_name>
+  namespace: openshift-machine-api
+spec:
+  replicas: 2
+# ...
+  template:
+    spec:
+# ...
+      providerSpec:
+        value:
+          apiVersion: machine.openshift.io/v1
+          failureDomain:
+            name: <failure_domain_name_1>
+          cluster:
+            type: uuid
+            uuid: <prism_element_uuid_1>
+          subnets:
+          - type: uuid
+            uuid: <prism_element_network_uuid_1>
+# ...
+----
+. Note the value of `spec.replicas`, as you need it when scaling the machine set to apply the changes.
+. Save your changes.
+. List the machines that are managed by the updated compute machine set by running the following command:
++
+[source,terminal]
+----
+$ oc get -n openshift-machine-api machines -l machine.openshift.io/cluster-api-machineset=<machine_set_name>
+----
+. For each machine that is managed by the updated compute machine set, set the `delete` annotation by running the following command:
++
+[source,terminal]
+----
+$ oc annotate machine/<machine_name_original_1> \
+  -n openshift-machine-api \
+  machine.openshift.io/delete-machine="true"
+----
+. Scale the compute machine set to twice the number of replicas by running the following command:
++
+[source,terminal]
+----
+$ oc scale --replicas=<twice_the_number_of_replicas> \ <1>
+  machineset <machine_set_name> \
+  -n openshift-machine-api
+----
+<1> For example, if the original number of replicas in the compute machine set is `2`, scale the replicas to `4`.
+. List the machines that are managed by the updated compute machine set by running the following command:
++
+[source,terminal]
+----
+$ oc get -n openshift-machine-api machines -l machine.openshift.io/cluster-api-machineset=<machine_set_name>
+----
++
+When the new machines are in the `Running` phase, you can scale the compute machine set down to the original number of replicas.
+. Scale the compute machine set to the original number of replicas by running the following command:
++
+[source,terminal]
+----
+$ oc scale --replicas=<original_number_of_replicas> \ <1>
+  machineset <machine_set_name> \
+  -n openshift-machine-api
+----
+<1> For example, if the original number of replicas in the compute machine set is `2`, scale the replicas to `2`.
+. As required, continue to modify machine sets to reference the additional failure domains that are available to the deployment.
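The final step, walking the remaining machine sets through the available failure domains one by one, amounts to a round-robin assignment. A rough sketch of that idea (machine set and domain names are illustrative):

```python
from itertools import cycle

def assign_failure_domains(machine_sets, failure_domains):
    """Map each compute machine set to a failure domain, cycling
    through the domains so machines spread evenly across them."""
    domains = cycle(failure_domains)
    return {machine_set: next(domains) for machine_set in machine_sets}
```

For example, assigning three machine sets across two failure domains places the first and third machine set in `failure-domain-1` and the second in `failure-domain-2`.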
