[id="cluster-maximums-major-releases_{context}"]
= {product-title} tested cluster maximums for major releases

[NOTE]
====
Red Hat does not provide direct guidance on sizing your {product-title} cluster. This is because determining whether your cluster is within the supported bounds of {product-title} requires careful consideration of all the multidimensional factors that limit the cluster scale.
====

{product-title} supports tested cluster maximums rather than absolute cluster maximums. Not every combination of {product-title} version, control plane workload, and network plugin is tested, so the following table does not represent an absolute expectation of scale for all deployments. It might not be possible to scale to a maximum on all dimensions simultaneously. The table contains tested maximums for specific workload and deployment configurations, and serves as a scale guide for what can be expected with similar deployments.

[options="header",cols="2*"]
|===
| Maximum type |4.x tested maximum

| Number of nodes
| 2,000 ^[1]^

| Number of pods ^[2]^
| 150,000

| Number of pods per node
| 500 ^[3]^

| Number of pods per core
| There is no default value.

| Number of namespaces ^[4]^
| 10,000

| Number of builds
| 10,000 (Default pod RAM 512 Mi) - Source-to-Image (S2I) build strategy

| Number of pods per namespace ^[5]^
| 25,000

| Number of routes and back ends per Ingress Controller
| 2,000 per router

| Number of secrets
| 80,000

| Number of config maps
| 90,000

| Number of services ^[6]^
| 10,000

| Number of services per namespace
| 5,000

| Number of back-ends per service
| 5,000

| Number of deployments per namespace ^[5]^
| 2,000

| Number of build configs
| 12,000

| Number of custom resource definitions (CRD)
| 512 ^[7]^

|===
[.small]
--
1. Pause pods were deployed to stress the control plane components of {product-title} at 2,000-node scale. The ability to scale to similar numbers varies depending on the specific deployment and workload parameters.
2. The pod count displayed here is the number of test pods. The actual number of pods depends on the application's memory, CPU, and storage requirements.
3. This was tested on a cluster with 100 worker nodes with 500 pods per worker node. The default `maxPods` is still 250. To get to 500 `maxPods`, the cluster must be created with `maxPods` set to `500` by using a custom kubelet config; see the example `KubeletConfig` sketch after these notes. If you need 500 user pods, you need a `hostPrefix` of `22` because there are 10-15 system pods already running on the node. The maximum number of pods with attached persistent volume claims (PVC) depends on the storage backend from which the PVCs are allocated. In our tests, only {rh-storage} v4 (OCS v4) was able to satisfy the number of pods per node discussed in this document.
4. When there are a large number of active projects, etcd might suffer from poor performance if the keyspace grows excessively large and exceeds the space quota. Periodic maintenance of etcd, including defragmentation, is highly recommended to free etcd storage.
5. There are a number of control loops in the system that must iterate over all objects in a given namespace as a reaction to some changes in state. Having a large number of objects of a given type in a single namespace can make those loops expensive and slow down processing of the state changes. The limit assumes that the system has enough CPU, memory, and disk to satisfy the application requirements.
6. Each service port and each service back-end has a corresponding entry in iptables. The number of back-ends of a given service impacts the size of the endpoints objects, which in turn impacts the size of the data that is sent across the system.
7. {product-title} has a limit of 512 total custom resource definitions (CRD), including those installed by {product-title}, products integrating with {product-title}, and user-created CRDs. If more than 512 CRDs are created, it is possible that `oc` command requests are throttled. A quick way to count the installed CRDs is shown after these notes.
--
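
The 500 pods-per-node figure in note 3 assumes a custom kubelet configuration. A minimal sketch of such a `KubeletConfig` object follows; the object name and the `custom-kubelet: large-pods` label are illustrative assumptions, and the machine config pool that you target must carry a matching label (for example, by running `oc label machineconfigpool worker custom-kubelet=large-pods`):

[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods               # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: large-pods   # assumed label on the worker machine config pool
  kubeletConfig:
    maxPods: 500                   # raises the per-node pod limit from the default of 250
----

The `hostPrefix` of `22` mentioned in the same note is set under `networking.clusterNetwork` in `install-config.yaml` at installation time and is generally not changeable on a running cluster. For the CRD limit in note 7, a rough count of the definitions currently installed on a cluster can be obtained with `oc get crds --no-headers | wc -l`.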

[id="cluster-maximums-major-releases-example-scenario_{context}"]
== Example scenario

As an example, 500 worker nodes (m5.2xl) were tested, and are supported, using {product-title} 4.12, the OVN-Kubernetes network plugin, and the following workload objects:

* 200 namespaces, in addition to the defaults
* 60 pods per node; 30 server and 30 client pods (30k total)
* 57 image streams/ns (11.4k total)
* 15 services/ns backed by the server pods (3k total)
* 15 routes/ns backed by the previous services (3k total)
* 20 secrets/ns (4k total)
* 10 config maps/ns (2k total)
* 6 network policies/ns, including deny-all, allow-from ingress and intra-namespace rules
* 57 builds/ns

The following factors are known to affect cluster workload scaling, positively or negatively, and should be factored into the scale numbers when planning a deployment. For additional information and guidance, contact your sales representative or link:https://access.redhat.com/support/[Red Hat support].

* Number of pods per node
* Number of containers per pod
* Type of probes used (for example, liveness/readiness, exec/http)
* Number of network policies
* Number of projects, or namespaces
* Number of image streams per project
* Number of builds per project
* Number of services/endpoints and type
* Number of routes
* Number of shards
* Number of secrets
* Number of config maps
* Rate of API calls, or the cluster "churn", which is an estimate of how quickly things change in the cluster configuration.
** Prometheus query for pod creation requests per second over 5-minute windows: `sum(irate(apiserver_request_count{resource="pods",verb="POST"}[5m]))`
** Prometheus query for all API requests per second over 5-minute windows: `sum(irate(apiserver_request_count{}[5m]))`
* Cluster node resource consumption of CPU
* Cluster node resource consumption of memory (see the example queries after this list)
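
For the last two items, per-node CPU and memory consumption during a scale test can be tracked from the cluster monitoring stack, for example with `sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))` for CPU use and `node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes` for memory in use. These queries assume the default node-exporter metrics are collected; adjust the metric names to match your monitoring configuration.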