Commit de3ba2a

modules/master-node-sizing: Raise CPU/memory cap from <=50% to <=60%
The outgoing "half" is from 7eb9504 (Recommendation around master node sizing to handle upgrades, 2021-05-06, #32230), with the sense fixed in 4c7a955 (modules/master-node-sizing: Fix "at least" to "at most" typo, 2021-07-06, #34338). But HighOverallControlPlaneCPU has a >60 trigger [1] since it landed [2] in 4.8 [3]. It doesn't seem like we have a high-memory alert set yet.

The idea is that if you take down one of three control plane nodes, the CPU usage on the others can be expected to rise by 50% (e.g. from 66% to 100%). 60 and 66 are pretty close together, probably not worth worrying about that. 50 and 60 aren't all that far apart either. So I'm not all that particular about where the threshold is, but it makes sense to have the docs and the alerts agree on the chosen threshold.

[1]: https://github.com/openshift/cluster-kube-apiserver-operator/blame/4b25a7a948fd53b96a815c6012db911721c43744/bindata/assets/alerts/cpu-utilization.yaml#L29
[2]: openshift/cluster-kube-apiserver-operator@21fedd4#diff-aeaff49462b4a42ff78d32aa7c7fefeb9b772b9b8ba65067f4bed9d283a6a847R31
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1949306#c7
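The redistribution arithmetic behind that "rise by 50%" claim can be sketched as follows. `post_failure_utilization` is a hypothetical helper for illustration only, not part of any OpenShift tooling; it assumes the failed node's load spreads evenly across the survivors:

```python
def post_failure_utilization(current_pct: float,
                             total_nodes: int = 3,
                             failed_nodes: int = 1) -> float:
    """Estimate per-node utilization on the surviving control plane nodes,
    assuming the failed node's load redistributes evenly among them."""
    surviving = total_nodes - failed_nodes
    # Total cluster load (current_pct * total_nodes) now divided over fewer nodes.
    return current_pct * total_nodes / surviving

# With three masters, losing one multiplies per-node usage by 3/2, i.e. +50%:
print(post_failure_utilization(66))  # 99.0 -- roughly the "66% -> 100%" case above
print(post_failure_utilization(60))  # 90.0 -- the 60% cap still leaves headroom
```

This is why a cap of at most 60% on three masters keeps the two survivors under 100% when one node is cordoned, drained, or fails.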
1 parent: 2085023

File tree

1 file changed (+1, −1 lines changed)


modules/master-node-sizing.adoc

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ The control plane node resource requirements depend on the number of nodes in th
 
 |===
 
-On a large and dense cluster with three masters or control plane nodes, the CPU and memory usage will spike up when one of the nodes is stopped, rebooted or fails. The failures can be due to unexpected issues with power, network or underlying infrastructure in addition to intentional cases where the cluster is restarted after shutting it down to save costs. The remaining two control plane nodes must handle the load in order to be highly available which leads to increase in the resource usage. This is also expected during upgrades because the masters are cordoned, drained, and rebooted serially to apply the operating system updates, as well as the control plane Operators update. To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most half of all available capacity to handle the resource usage spikes. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources.
+On a large and dense cluster with three masters or control plane nodes, the CPU and memory usage will spike up when one of the nodes is stopped, rebooted or fails. The failures can be due to unexpected issues with power, network or underlying infrastructure in addition to intentional cases where the cluster is restarted after shutting it down to save costs. The remaining two control plane nodes must handle the load in order to be highly available which leads to increase in the resource usage. This is also expected during upgrades because the masters are cordoned, drained, and rebooted serially to apply the operating system updates, as well as the control plane Operators update. To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most 60% of all available capacity to handle the resource usage spikes. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources.
 
 [IMPORTANT]
 ====

0 commit comments
