
Commit 00dce91

Merge pull request #53348 from ahardin-rh/recommended-host-practices-revisions
OSDOCS-4320: Recommended host practices and Recommended cluster scaling practices section reviews
2 parents 54f65ee + 5ab4430 commit 00dce91

12 files changed, +36 -89 lines changed

_topic_maps/_topic_map.yml

Lines changed: 1 addition & 4 deletions
@@ -2453,15 +2453,12 @@ Name: Scalability and performance
 Dir: scalability_and_performance
 Distros: openshift-origin,openshift-enterprise,openshift-webscale,openshift-dpu
 Topics:
-- Name: Recommended host practices
+- Name: Recommended performance and scalability practices
   File: recommended-host-practices
   Distros: openshift-origin,openshift-enterprise
 - Name: Recommended host practices for IBM Z & LinuxONE environments
   File: ibm-z-recommended-host-practices
   Distros: openshift-enterprise
-- Name: Recommended cluster scaling practices
-  File: recommended-cluster-scaling-practices
-  Distros: openshift-origin,openshift-enterprise
 - Name: Using the Node Tuning Operator
   File: using-node-tuning-operator
   Distros: openshift-origin,openshift-enterprise

getting_started/openshift-web-console.adoc

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ include::modules/getting-started-web-console-examining-pod.adoc[leveloffset=+2]
 include::modules/getting-started-web-console-scaling-app.adoc[leveloffset=+2]
 [role="_additional-resources"]
 .Additional resources
-* xref:../scalability_and_performance/recommended-cluster-scaling-practices.adoc#recommended-scale-practices_cluster-scaling[Recommended practices for scaling the cluster]
+* xref:../scalability_and_performance/recommended-host-practices.adoc#recommended-scale-practices_cluster-scaling[Recommended practices for scaling the cluster]
 * xref:../nodes/pods/nodes-pods-autoscaling.adoc#nodes-pods-autoscaling-about_nodes-pods-autoscaling[Understanding horizontal pod autoscalers]
 * xref:../nodes/pods/nodes-pods-vertical-autoscaler.adoc#nodes-pods-vertical-autoscaler-about_nodes-pods-vertical-autoscaler[About the Vertical Pod Autoscaler Operator]

modules/create-a-kubeletconfig-crd-to-edit-kubelet-parameters.adoc

Lines changed: 0 additions & 1 deletion
@@ -1,6 +1,5 @@
 // Module included in the following assemblies:
 //
-// * scalability_and_performance/recommended-host-practices.adoc
 // * post_installation_configuration/node-tasks.adoc
 // * post_installation_configuration/machine-configuration-tasks.adoc

modules/infrastructure-node-sizing.adoc

Lines changed: 11 additions & 7 deletions
@@ -5,25 +5,29 @@
 [id="infrastructure-node-sizing_{context}"]
 = Infrastructure node sizing
 
-_Infrastructure nodes_ are nodes that are labeled to run pieces of the {product-title} environment. The infrastructure node resource requirements depend on the cluster age, nodes, and objects in the cluster, as these factors can lead to an increase in the number of metrics or time series in Prometheus. The following infrastructure node size recommendations are based on the results of cluster maximums and control plane density focused testing.
+_Infrastructure nodes_ are nodes that are labeled to run pieces of the {product-title} environment. The infrastructure node resource requirements depend on the cluster age, nodes, and objects in the cluster, as these factors can lead to an increase in the number of metrics or time series in Prometheus. The following infrastructure node size recommendations are based on the results observed in cluster-density testing detailed in the *Control plane node sizing* section, where the monitoring stack and the default ingress-controller were moved to these nodes.
 
-[options="header",cols="3*"]
+[options="header",cols="4*"]
 |===
-| Number of worker nodes |CPU cores |Memory (GB)
+| Number of worker nodes |Cluster density, or number of namespaces |CPU cores |Memory (GB)
 
 | 25
+| 500
 | 4
-| 16
+| 48
 
 | 100
+| 1000
 | 8
-| 32
+| 96
 
-| 250
+| 252
+| 4000
 | 16
 | 128
 
-| 500
+| 501
+| 4000
 | 32
 | 128

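For orientation, the paragraph revised above notes that infrastructure nodes are simply nodes labeled for that role. The following is a minimal sketch, not part of this commit, of how that label is typically applied; `<node_name>` is a placeholder for an existing worker node, and relocating the monitoring stack and default Ingress Controller onto the node additionally requires updating those components' node selectors, which is out of scope here.

[source,terminal]
----
$ oc label node <node_name> node-role.kubernetes.io/infra=
----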
modules/machineset-modifying.adoc

Lines changed: 0 additions & 1 deletion
@@ -2,7 +2,6 @@
 //
 //
 // * machine_management/modifying-machineset.adoc
-// * scalability_and_performance/recommended-cluster-scaling-practices.adoc
 
 :_content-type: PROCEDURE
 [id="machineset-modifying_{context}"]

modules/master-node-sizing.adoc

Lines changed: 7 additions & 5 deletions
@@ -33,24 +33,26 @@ The control plane node resource requirements depend on the number and type of no
 
 | 252
 | 4000
-| 16
-| 64
+| 16, but 24 if using the OVN-Kubernetes network plug-in
+| 64, but 128 if using the OVN-Kubernetes network plug-in
 
-| 501
+| 501, but untested with the OVN-Kubernetes network plug-in
 | 4000
 | 16
 | 96
 
 |===
 
-On a large and dense cluster with three masters or control plane nodes, the CPU and memory usage will spike up when one of the nodes is stopped, rebooted or fails. The failures can be due to unexpected issues with power, network or underlying infrastructure in addition to intentional cases where the cluster is restarted after shutting it down to save costs. The remaining two control plane nodes must handle the load in order to be highly available which leads to increase in the resource usage. This is also expected during upgrades because the masters are cordoned, drained, and rebooted serially to apply the operating system updates, as well as the control plane Operators update. To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most 60% of all available capacity to handle the resource usage spikes. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources.
+The data in the preceding table is based on {product-title} running on top of AWS, using r5.4xlarge instances as control plane nodes and m5.2xlarge instances as worker nodes.
+
+On a large and dense cluster with three control plane nodes, the CPU and memory usage spikes when one of the nodes is stopped, rebooted, or fails. The failures can be due to unexpected issues with power, network, or underlying infrastructure, or to intentional cases where the cluster is restarted after being shut down to save costs. The remaining two control plane nodes must handle the load in order to be highly available, which leads to an increase in resource usage. This is also expected during upgrades because the control plane nodes are cordoned, drained, and rebooted serially to apply the operating system updates and the control plane Operator updates. To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most 60% of all available capacity so that they can absorb the resource usage spikes. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources.
 
 [IMPORTANT]
 ====
 The node sizing varies depending on the number of nodes and object counts in the cluster. It also depends on whether the objects are actively being created on the cluster. During object creation, the control plane is more active in terms of resource usage compared to when the objects are in the `running` phase.
 ====
 
-Operator Lifecycle Manager (OLM ) runs on the control plane nodes and it's memory footprint depends on the number of namespaces and user installed operators that OLM needs to manage on the cluster. Control plane nodes need to be sized accordingly to avoid OOM kills. Following data points are based on the results from cluster maximums testing.
+Operator Lifecycle Manager (OLM) runs on the control plane nodes and its memory footprint depends on the number of namespaces and user-installed Operators that OLM needs to manage on the cluster. Size the control plane nodes accordingly to avoid OOM kills. The following data points are based on the results from cluster maximums testing.
 
 [options="header",cols="3*"]
 |===
modules/modify-unavailable-workers.adoc

Lines changed: 0 additions & 1 deletion
@@ -1,6 +1,5 @@
 // Module included in the following assemblies:
 //
-// * scalability_and_performance/recommended-host-practices.adoc
 // * post_installation_configuration/node-tasks.adoc
 
 :_content-type: PROCEDURE

modules/recommended-node-host-practices.adoc

Lines changed: 3 additions & 4 deletions
@@ -1,6 +1,5 @@
 // Module included in the following assemblies:
 //
-// * scalability_and_performance/recommended-host-practices.adoc
 // * post_installation_configuration/node-tasks.adoc
 
 [id="recommended-node-host-practices_{context}"]
@@ -29,9 +28,9 @@ have 20 containers running.
 
 [NOTE]
 ====
-Disk IOPS throttling from the cloud provider might have an impact on CRI-O and kubelet.
-They might get overloaded when there are large number of I/O intensive pods running on
-the nodes. It is recommended that you monitor the disk I/O on the nodes and use volumes
+Disk IOPS throttling from the cloud provider might have an impact on CRI-O and kubelet.
+They might get overloaded when there are large number of I/O intensive pods running on
+the nodes. It is recommended that you monitor the disk I/O on the nodes and use volumes
 with sufficient throughput for the workload.
 ====

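The second hunk above sits in this module's discussion of the `podsPerCore` and `maxPods` kubelet parameters (the "have 20 containers running" context line). For context only, the following is a minimal `KubeletConfig` sketch of how those parameters are typically set; the pool label `custom-kubelet: small-pods` and the values shown are illustrative assumptions, not part of this commit.

[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods   # label assumed to be added to the target MachineConfigPool beforehand
  kubeletConfig:
    podsPerCore: 10                # per-core pod limit; setting 0 disables this limit
    maxPods: 250                   # absolute per-node pod limit; the lower of the two limits applies
----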
Lines changed: 8 additions & 10 deletions
@@ -1,10 +1,14 @@
 // Module included in the following assemblies:
 //
-// * scalability_and_performance/recommended-cluster-scaling-practices.adoc
+// * scalability_and_performance/recommended-host-practices.adoc
 
 [id="recommended-scale-practices_{context}"]
 = Recommended practices for scaling the cluster
 
+The guidance in this section is only relevant for installations with cloud provider integration.
+
+Apply the following best practices to scale the number of worker machines in your {product-title} cluster. You scale the worker machines by increasing or decreasing the number of replicas that are defined in the worker machine set.
+
 When scaling up the cluster to higher node counts:
 
 * Spread nodes across all of the available zones for higher availability.
@@ -16,17 +20,11 @@ When scaling up the cluster to higher node counts:
 Cloud providers might implement a quota for API services. Therefore, gradually scale the cluster.
 ====
 
-The controller might not be able to create the machines if the replicas in the compute machine sets are set to higher numbers all at one time. The number of requests the cloud platform, which {product-title} is deployed on top of, is able to handle impacts the process. The controller will start to query more while trying to create, check, and update the machines with the status. The cloud platform on which {product-title} is deployed has API request limits and excessive queries might lead to machine creation failures due to cloud platform limitations.
+The controller might not be able to create the machines if the replicas in the compute machine sets are set to higher numbers all at one time. The number of requests that the cloud platform on which {product-title} is deployed can handle impacts the process. The controller starts to issue more queries while trying to create, check, and update the machines with the status. The cloud platform on which {product-title} is deployed has API request limits; excessive queries might lead to machine creation failures due to cloud platform limitations.
 
-Enable machine health checks when scaling to large node counts. In case of failures,
-the health checks monitor the condition and automatically repair unhealthy machines.
+Enable machine health checks when scaling to large node counts. In case of failures, the health checks monitor the condition and automatically repair unhealthy machines.
 
 [NOTE]
 ====
-When scaling large and dense clusters to lower node counts, it might take large
-amounts of time as the process involves draining or evicting the objects running on
-the nodes being terminated in parallel. Also, the client might start to throttle the
-requests if there are too many objects to evict. The default client QPS and burst
-rates are currently set to `5` and `10` respectively and they cannot be modified
-in {product-title}.
+When scaling large and dense clusters to lower node counts, it might take large amounts of time because the process involves draining or evicting the objects running on the nodes being terminated in parallel. Also, the client might start to throttle the requests if there are too many objects to evict. The default client queries per second (QPS) and burst rates are currently set to `5` and `10` respectively. These values cannot be modified in {product-title}.
 ====
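The new introductory paragraph added in this module describes scaling workers by changing the replica count of a compute machine set. As a minimal illustration, not part of this commit, assuming a machine set named `<machineset_name>` in the `openshift-machine-api` namespace and an arbitrary target of 6 replicas:

[source,terminal]
----
$ oc scale --replicas=6 machineset <machineset_name> -n openshift-machine-api
----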

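The module also advises enabling machine health checks when scaling to large node counts. The following is a hedged sketch of a `MachineHealthCheck` resource; the selector labels, timeouts, and `maxUnhealthy` value are illustrative assumptions and must be adjusted to match your own machine sets.

[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-worker-healthcheck
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: "300s"
  - type: Ready
    status: "Unknown"
    timeout: "300s"
  maxUnhealthy: "40%"          # stop automated remediation if too many machines are unhealthy at once
  nodeStartupTimeout: "10m"    # how long to wait for a new node to join before remediating the machine
----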
nodes/nodes/nodes-sno-worker-nodes.adoc

Lines changed: 1 addition & 2 deletions
@@ -27,7 +27,7 @@ include::modules/ai-sno-requirements-for-installing-worker-nodes.adoc[leveloffse
 
 * xref:../../installing/installing_bare_metal/installing-restricted-networks-bare-metal.adoc#installation-minimum-resource-requirements_installing-restricted-networks-bare-metal[Minimum resource requirements for cluster installation]
 
-* xref:../../scalability_and_performance/recommended-cluster-scaling-practices.adoc#recommended-scale-practices_cluster-scaling[Recommended practices for scaling the cluster]
+* xref:../../scalability_and_performance/recommended-host-practices.adoc#recommended-scale-practices_cluster-scaling[Recommended practices for scaling the cluster]
 
 * xref:../../installing/installing_bare_metal/installing-bare-metal-network-customizations.adoc#installation-dns-user-infra_installing-bare-metal-network-customizations[User-provisioned DNS requirements]
@@ -71,4 +71,3 @@ include::modules/sno-adding-worker-nodes-to-sno-clusters-manually.adoc[leveloffs
 * xref:../../nodes/nodes/nodes-sno-worker-nodes.adoc#installation-approve-csrs_add-workers[Approving the certificate signing requests for your machines]
 
 include::modules/installation-approve-csrs.adoc[leveloffset=+1]
-
