
Commit 00dce91

Merge pull request #53348 from ahardin-rh/recommended-host-practices-revisions
OSDOCS-4320: Recommended host practices and Recommended cluster scaling practices section reviews
2 parents 54f65ee + 5ab4430 commit 00dce91

12 files changed, +36 -89 lines changed

_topic_maps/_topic_map.yml

Lines changed: 1 addition & 4 deletions
@@ -2453,15 +2453,12 @@ Name: Scalability and performance
 Dir: scalability_and_performance
 Distros: openshift-origin,openshift-enterprise,openshift-webscale,openshift-dpu
 Topics:
-- Name: Recommended host practices
+- Name: Recommended performance and scalability practices
   File: recommended-host-practices
   Distros: openshift-origin,openshift-enterprise
 - Name: Recommended host practices for IBM Z & LinuxONE environments
   File: ibm-z-recommended-host-practices
   Distros: openshift-enterprise
-- Name: Recommended cluster scaling practices
-  File: recommended-cluster-scaling-practices
-  Distros: openshift-origin,openshift-enterprise
 - Name: Using the Node Tuning Operator
   File: using-node-tuning-operator
   Distros: openshift-origin,openshift-enterprise

getting_started/openshift-web-console.adoc

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ include::modules/getting-started-web-console-examining-pod.adoc[leveloffset=+2]
 include::modules/getting-started-web-console-scaling-app.adoc[leveloffset=+2]
 [role="_additional-resources"]
 .Additional resources
-* xref:../scalability_and_performance/recommended-cluster-scaling-practices.adoc#recommended-scale-practices_cluster-scaling[Recommended practices for scaling the cluster]
+* xref:../scalability_and_performance/recommended-host-practices.adoc#recommended-scale-practices_cluster-scaling[Recommended practices for scaling the cluster]
 * xref:../nodes/pods/nodes-pods-autoscaling.adoc#nodes-pods-autoscaling-about_nodes-pods-autoscaling[Understanding horizontal pod autoscalers]
 * xref:../nodes/pods/nodes-pods-vertical-autoscaler.adoc#nodes-pods-vertical-autoscaler-about_nodes-pods-vertical-autoscaler[About the Vertical Pod Autoscaler Operator]

modules/create-a-kubeletconfig-crd-to-edit-kubelet-parameters.adoc

Lines changed: 0 additions & 1 deletion
@@ -1,6 +1,5 @@
 // Module included in the following assemblies:
 //
-// * scalability_and_performance/recommended-host-practices.adoc
 // * post_installation_configuration/node-tasks.adoc
 // * post_installation_configuration/machine-configuration-tasks.adoc

modules/infrastructure-node-sizing.adoc

Lines changed: 11 additions & 7 deletions
@@ -5,25 +5,29 @@
 [id="infrastructure-node-sizing_{context}"]
 = Infrastructure node sizing
 
-_Infrastructure nodes_ are nodes that are labeled to run pieces of the {product-title} environment. The infrastructure node resource requirements depend on the cluster age, nodes, and objects in the cluster, as these factors can lead to an increase in the number of metrics or time series in Prometheus. The following infrastructure node size recommendations are based on the results of cluster maximums and control plane density focused testing.
+_Infrastructure nodes_ are nodes that are labeled to run pieces of the {product-title} environment. The infrastructure node resource requirements depend on the cluster age, nodes, and objects in the cluster, as these factors can lead to an increase in the number of metrics or time series in Prometheus. The following infrastructure node size recommendations are based on the results observed in cluster-density testing detailed in the *Control plane node sizing* section, where the monitoring stack and the default ingress-controller were moved to these nodes.
 
-[options="header",cols="3*"]
+[options="header",cols="4*"]
 |===
-| Number of worker nodes |CPU cores |Memory (GB)
+| Number of worker nodes |Cluster density, or number of namespaces |CPU cores |Memory (GB)
 
 | 25
+| 500
 | 4
-| 16
+| 48
 
 | 100
+| 1000
 | 8
-| 32
+| 96
 
-| 250
+| 252
+| 4000
 | 16
 | 128
 
-| 500
+| 501
+| 4000
 | 32
 | 128

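For orientation, the paragraph revised above notes that infrastructure nodes are simply nodes labeled for that role. The following is a minimal sketch, not part of this commit, of how that label is typically applied; `<node_name>` is a placeholder for an existing worker node, and relocating the monitoring stack and default Ingress Controller onto the node additionally requires updating those components' node selectors, which is out of scope here.

[source,terminal]
----
$ oc label node <node_name> node-role.kubernetes.io/infra=
----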
modules/machineset-modifying.adoc

Lines changed: 0 additions & 1 deletion
@@ -2,7 +2,6 @@
 //
 //
 // * machine_management/modifying-machineset.adoc
-// * scalability_and_performance/recommended-cluster-scaling-practices.adoc
 
 :_content-type: PROCEDURE
 [id="machineset-modifying_{context}"]

modules/master-node-sizing.adoc

Lines changed: 7 additions & 5 deletions
@@ -33,24 +33,26 @@ The control plane node resource requirements depend on the number and type of no
 
 | 252
 | 4000
-| 16
-| 64
+| 16, but 24 if using the OVN-Kubernetes network plug-in
+| 64, but 128 if using the OVN-Kubernetes network plug-in
 
-| 501
+| 501, but untested with the OVN-Kubernetes network plug-in
 | 4000
 | 16
 | 96
 
 |===
 
-On a large and dense cluster with three masters or control plane nodes, the CPU and memory usage will spike up when one of the nodes is stopped, rebooted or fails. The failures can be due to unexpected issues with power, network or underlying infrastructure in addition to intentional cases where the cluster is restarted after shutting it down to save costs. The remaining two control plane nodes must handle the load in order to be highly available which leads to increase in the resource usage. This is also expected during upgrades because the masters are cordoned, drained, and rebooted serially to apply the operating system updates, as well as the control plane Operators update. To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most 60% of all available capacity to handle the resource usage spikes. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources.
+The data in the preceding table is based on {product-title} running on top of AWS, using r5.4xlarge instances as control plane nodes and m5.2xlarge instances as worker nodes.
+
+On a large and dense cluster with three control plane nodes, the CPU and memory usage spikes when one of the nodes is stopped, rebooted, or fails. The failures can be due to unexpected issues with power, network, or underlying infrastructure, or to intentional cases where the cluster is restarted after being shut down to save costs. The remaining two control plane nodes must handle the load in order to be highly available, which leads to an increase in resource usage. This is also expected during upgrades because the control plane nodes are cordoned, drained, and rebooted serially to apply the operating system updates and the control plane Operator updates. To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most 60% of all available capacity so that they can absorb the resource usage spikes. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources.
 
 [IMPORTANT]
 ====
 The node sizing varies depending on the number of nodes and object counts in the cluster. It also depends on whether the objects are actively being created on the cluster. During object creation, the control plane is more active in terms of resource usage compared to when the objects are in the `running` phase.
 ====
 
-Operator Lifecycle Manager (OLM ) runs on the control plane nodes and it's memory footprint depends on the number of namespaces and user installed operators that OLM needs to manage on the cluster. Control plane nodes need to be sized accordingly to avoid OOM kills. Following data points are based on the results from cluster maximums testing.
+Operator Lifecycle Manager (OLM) runs on the control plane nodes and its memory footprint depends on the number of namespaces and user-installed Operators that OLM needs to manage on the cluster. Size the control plane nodes accordingly to avoid OOM kills. The following data points are based on the results from cluster maximums testing.
 
 [options="header",cols="3*"]
 |===
modules/modify-unavailable-workers.adoc

Lines changed: 0 additions & 1 deletion
@@ -1,6 +1,5 @@
 // Module included in the following assemblies:
 //
-// * scalability_and_performance/recommended-host-practices.adoc
 // * post_installation_configuration/node-tasks.adoc
 
 :_content-type: PROCEDURE

modules/recommended-node-host-practices.adoc

Lines changed: 3 additions & 4 deletions
@@ -1,6 +1,5 @@
 // Module included in the following assemblies:
 //
-// * scalability_and_performance/recommended-host-practices.adoc
 // * post_installation_configuration/node-tasks.adoc
 
 [id="recommended-node-host-practices_{context}"]
@@ -29,9 +28,9 @@ have 20 containers running.
 
 [NOTE]
 ====
-Disk IOPS throttling from the cloud provider might have an impact on CRI-O and kubelet.
-They might get overloaded when there are large number of I/O intensive pods running on
-the nodes. It is recommended that you monitor the disk I/O on the nodes and use volumes
+Disk IOPS throttling from the cloud provider might have an impact on CRI-O and kubelet.
+They might get overloaded when there are large number of I/O intensive pods running on
+the nodes. It is recommended that you monitor the disk I/O on the nodes and use volumes
 with sufficient throughput for the workload.
 ====

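The second hunk above sits in this module's discussion of the `podsPerCore` and `maxPods` kubelet parameters (the "have 20 containers running" context line). For context only, the following is a minimal `KubeletConfig` sketch of how those parameters are typically set; the pool label `custom-kubelet: small-pods` and the values shown are illustrative assumptions, not part of this commit.

[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods   # label assumed to be added to the target MachineConfigPool beforehand
  kubeletConfig:
    podsPerCore: 10                # per-core pod limit; setting 0 disables this limit
    maxPods: 250                   # absolute per-node pod limit; the lower of the two limits applies
----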
Lines changed: 8 additions & 10 deletions
@@ -1,10 +1,14 @@
 // Module included in the following assemblies:
 //
-// * scalability_and_performance/recommended-cluster-scaling-practices.adoc
+// * scalability_and_performance/recommended-host-practices.adoc
 
 [id="recommended-scale-practices_{context}"]
 = Recommended practices for scaling the cluster
 
+The guidance in this section is only relevant for installations with cloud provider integration.
+
+Apply the following best practices to scale the number of worker machines in your {product-title} cluster. You scale the worker machines by increasing or decreasing the number of replicas that are defined in the worker machine set.
+
 When scaling up the cluster to higher node counts:
 
 * Spread nodes across all of the available zones for higher availability.
@@ -16,17 +20,11 @@ When scaling up the cluster to higher node counts:
 Cloud providers might implement a quota for API services. Therefore, gradually scale the cluster.
 ====
 
-The controller might not be able to create the machines if the replicas in the compute machine sets are set to higher numbers all at one time. The number of requests the cloud platform, which {product-title} is deployed on top of, is able to handle impacts the process. The controller will start to query more while trying to create, check, and update the machines with the status. The cloud platform on which {product-title} is deployed has API request limits and excessive queries might lead to machine creation failures due to cloud platform limitations.
+The controller might not be able to create the machines if the replicas in the compute machine sets are set to higher numbers all at one time. The number of requests that the cloud platform on which {product-title} is deployed can handle impacts the process. The controller starts to issue more queries while trying to create, check, and update the machines with the status. The cloud platform on which {product-title} is deployed has API request limits; excessive queries might lead to machine creation failures due to cloud platform limitations.
 
-Enable machine health checks when scaling to large node counts. In case of failures,
-the health checks monitor the condition and automatically repair unhealthy machines.
+Enable machine health checks when scaling to large node counts. In case of failures, the health checks monitor the condition and automatically repair unhealthy machines.
 
 [NOTE]
 ====
-When scaling large and dense clusters to lower node counts, it might take large
-amounts of time as the process involves draining or evicting the objects running on
-the nodes being terminated in parallel. Also, the client might start to throttle the
-requests if there are too many objects to evict. The default client QPS and burst
-rates are currently set to `5` and `10` respectively and they cannot be modified
-in {product-title}.
+When scaling large and dense clusters to lower node counts, it might take large amounts of time because the process involves draining or evicting the objects running on the nodes being terminated in parallel. Also, the client might start to throttle the requests if there are too many objects to evict. The default client queries per second (QPS) and burst rates are currently set to `5` and `10` respectively. These values cannot be modified in {product-title}.
 ====
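The new introductory paragraph added in this module describes scaling workers by changing the replica count of a compute machine set. As a minimal illustration, not part of this commit, assuming a machine set named `<machineset_name>` in the `openshift-machine-api` namespace and an arbitrary target of 6 replicas:

[source,terminal]
----
$ oc scale --replicas=6 machineset <machineset_name> -n openshift-machine-api
----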

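The module also advises enabling machine health checks when scaling to large node counts. The following is a hedged sketch of a `MachineHealthCheck` resource; the selector labels, timeouts, and `maxUnhealthy` value are illustrative assumptions and must be adjusted to match your own machine sets.

[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-worker-healthcheck
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: "300s"
  - type: Ready
    status: "Unknown"
    timeout: "300s"
  maxUnhealthy: "40%"          # stop automated remediation if too many machines are unhealthy at once
  nodeStartupTimeout: "10m"    # how long to wait for a new node to join before remediating the machine
----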
nodes/nodes/nodes-sno-worker-nodes.adoc

Lines changed: 1 addition & 2 deletions
@@ -27,7 +27,7 @@ include::modules/ai-sno-requirements-for-installing-worker-nodes.adoc[leveloffse
 
 * xref:../../installing/installing_bare_metal/installing-restricted-networks-bare-metal.adoc#installation-minimum-resource-requirements_installing-restricted-networks-bare-metal[Minimum resource requirements for cluster installation]
 
-* xref:../../scalability_and_performance/recommended-cluster-scaling-practices.adoc#recommended-scale-practices_cluster-scaling[Recommended practices for scaling the cluster]
+* xref:../../scalability_and_performance/recommended-host-practices.adoc#recommended-scale-practices_cluster-scaling[Recommended practices for scaling the cluster]
 
 * xref:../../installing/installing_bare_metal/installing-bare-metal-network-customizations.adoc#installation-dns-user-infra_installing-bare-metal-network-customizations[User-provisioned DNS requirements]
@@ -71,4 +71,3 @@ include::modules/sno-adding-worker-nodes-to-sno-clusters-manually.adoc[leveloffs
 * xref:../../nodes/nodes/nodes-sno-worker-nodes.adoc#installation-approve-csrs_add-workers[Approving the certificate signing requests for your machines]
 
 include::modules/installation-approve-csrs.adoc[leveloffset=+1]
-
