Commit de3ba2a

modules/master-node-sizing: Raise CPU/memory cap from <=50% to <=60%
The outgoing "half" is from 7eb9504 (Recommendation around master node sizing to handle upgrades, 2021-05-06, #32230), with the sense fixed in 4c7a955 (modules/master-node-sizing: Fix "at least" to "at most" typo, 2021-07-06, #34338). But HighOverallControlPlaneCPU has a >60 trigger [1] since it landed [2] in 4.8 [3]. It doesn't seem like we have a high-memory alert set yet.

The idea is that if you take down one of three control plane nodes, the CPU usage on the others can be expected to rise by 50% (e.g. from 66% to 100%). 60 and 66 are pretty close together, probably not worth worrying about that. 50 and 60 aren't all that far apart either. So I'm not all that particular about where the threshold is, but it makes sense to have the docs and the alerts agree on the chosen threshold.

[1]: https://github.com/openshift/cluster-kube-apiserver-operator/blame/4b25a7a948fd53b96a815c6012db911721c43744/bindata/assets/alerts/cpu-utilization.yaml#L29
[2]: openshift/cluster-kube-apiserver-operator@21fedd4#diff-aeaff49462b4a42ff78d32aa7c7fefeb9b772b9b8ba65067f4bed9d283a6a847R31
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1949306#c7
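The redistribution arithmetic behind that "rise by 50%" claim can be sketched as follows. `post_failure_utilization` is a hypothetical helper for illustration only, not part of any OpenShift tooling; it assumes the failed node's load spreads evenly across the survivors:

```python
def post_failure_utilization(current_pct: float,
                             total_nodes: int = 3,
                             failed_nodes: int = 1) -> float:
    """Estimate per-node utilization on the surviving control plane nodes,
    assuming the failed node's load redistributes evenly among them."""
    surviving = total_nodes - failed_nodes
    # Total cluster load (current_pct * total_nodes) now divided over fewer nodes.
    return current_pct * total_nodes / surviving

# With three masters, losing one multiplies per-node usage by 3/2, i.e. +50%:
print(post_failure_utilization(66))  # 99.0 -- roughly the "66% -> 100%" case above
print(post_failure_utilization(60))  # 90.0 -- the 60% cap still leaves headroom
```

This is why a cap of at most 60% on three masters keeps the two survivors under 100% when one node is cordoned, drained, or fails.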
1 parent: 2085023

File tree

1 file changed (+1, −1 lines changed)


modules/master-node-sizing.adoc

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ The control plane node resource requirements depend on the number of nodes in th
 
 |===
 
-On a large and dense cluster with three masters or control plane nodes, the CPU and memory usage will spike up when one of the nodes is stopped, rebooted or fails. The failures can be due to unexpected issues with power, network or underlying infrastructure in addition to intentional cases where the cluster is restarted after shutting it down to save costs. The remaining two control plane nodes must handle the load in order to be highly available which leads to increase in the resource usage. This is also expected during upgrades because the masters are cordoned, drained, and rebooted serially to apply the operating system updates, as well as the control plane Operators update. To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most half of all available capacity to handle the resource usage spikes. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources.
+On a large and dense cluster with three masters or control plane nodes, the CPU and memory usage will spike up when one of the nodes is stopped, rebooted or fails. The failures can be due to unexpected issues with power, network or underlying infrastructure in addition to intentional cases where the cluster is restarted after shutting it down to save costs. The remaining two control plane nodes must handle the load in order to be highly available which leads to increase in the resource usage. This is also expected during upgrades because the masters are cordoned, drained, and rebooted serially to apply the operating system updates, as well as the control plane Operators update. To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most 60% of all available capacity to handle the resource usage spikes. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources.
 
 [IMPORTANT]
 ====

0 commit comments
