
Commit c61b139

Merge pull request #29375 from mikemckiernan/alloc-resources-4x
BZ#1853249: remove kube-reserved
2 parents 96d4c20 + f00f00f

5 files changed: +27 −78 lines changed


modules/nodes-nodes-managing-about.adoc

Lines changed: 2 additions & 5 deletions
@@ -58,11 +58,8 @@ spec:
     podsPerCore: 10
     maxPods: 250
     systemReserved:
-      cpu: 1000m
-      memory: 500Mi
-    kubeReserved:
-      cpu: 1000m
-      memory: 500Mi
+      cpu: 2000m
+      memory: 1Gi
 ----
 <1> Assign a name to CR.
 <2> Specify the label to apply the configuration change, this is the label you added to the machine config pool.
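For context, the `systemReserved` fragment above sits inside a `KubeletConfig` custom resource. The following is a minimal sketch of the surrounding CR and is not part of this commit; the metadata name is illustrative, and the `custom-kubelet: small-pods` label is borrowed from the setting module later in this change:

[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-allocatable             # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods    # label added to the target machine config pool
  kubeletConfig:
    podsPerCore: 10
    maxPods: 250
    systemReserved:                 # CPU and memory reserved for node and system components
      cpu: 2000m
      memory: 1Gi
----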

modules/nodes-nodes-resources-configuring-about.adoc

Lines changed: 17 additions & 54 deletions
@@ -9,14 +9,13 @@ CPU and memory resources reserved for node components in {product-title} are bas
 
 [options="header",cols="1,2"]
 |===
-
 |Setting |Description
 
 |`kube-reserved`
-| Resources reserved for node components. Default is none.
+| This setting is not used with {product-title}. Add the CPU and memory resources that you planned to reserve to the `system-reserved` setting.
 
 |`system-reserved`
-| Resources reserved for the remaining system components. Default settings depend on the {product-title} and Machine Config Operator versions. Confirm the default `systemReserved` parameter on the `machine-config-operator` repository.
+| This setting identifies the resources to reserve for the node components and system components. The default settings depend on the {product-title} and Machine Config Operator versions. Confirm the default `systemReserved` parameter on the `machine-config-operator` repository.
 |===
 
 If a flag is not set, the defaults are used. If none of the flags are set, the
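One way to confirm the `systemReserved` values that a running node actually applies is to query the kubelet configuration endpoint through the API server proxy. This is a sketch, assuming cluster-admin access and that the kubelet `configz` endpoint is enabled (the default); `<node>` is a placeholder for the node name:

[source,terminal]
----
$ oc get --raw "/api/v1/nodes/<node>/proxy/configz"
----

The `kubeletconfig.systemReserved` field in the JSON response shows the active reservation.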
@@ -29,65 +28,33 @@ introduction of allocatable resources.
 An allocated amount of a resource is computed based on the following formula:
 
 ----
-[Allocatable] = [Node Capacity] - [kube-reserved] - [system-reserved] - [Hard-Eviction-Thresholds]
+[Allocatable] = [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds]
 ----
 
 [NOTE]
 ====
-The withholding of `Hard-Eviction-Thresholds` from allocatable is a change in behavior to improve
-system reliability now that allocatable is enforced for end-user pods at the node level.
-The `experimental-allocatable-ignore-eviction` setting is available to preserve legacy behavior,
-but it will be deprecated in a future release.
+The withholding of `Hard-Eviction-Thresholds` from `Allocatable` improves system reliability because the value for `Allocatable` is enforced for pods at the node level.
 ====
 
-If `[Allocatable]` is negative, it is set to *0*.
+If `Allocatable` is negative, it is set to `0`.
 
-Each node reports system resources utilized by the container runtime and kubelet.
-To better aid your ability to configure `--system-reserved` and `--kube-reserved`,
-you can introspect corresponding node's resource usage using the node summary API,
-which is accessible at `/api/v1/nodes/<node>/proxy/stats/summary`.
+Each node reports the system resources that are used by the container runtime and kubelet. To simplify configuring the `system-reserved` parameter, view the resource use for the node by using the node summary API. The node summary is available at `/api/v1/nodes/<node>/proxy/stats/summary`.
 
 [id="allocate-node-enforcement_{context}"]
 == How nodes enforce resource constraints
 
-The node is able to limit the total amount of resources that pods
-may consume based on the configured allocatable value. This feature significantly
-improves the reliability of the node by preventing pods from starving
-system services (for example: container runtime, node agent, etc.) for resources.
-It is strongly encouraged that administrators reserve
-resources based on the desired node utilization target
-in order to improve node reliability.
-
-The node enforces resource constraints using a new *cgroup* hierarchy
-that enforces quality of service. All pods are launched in a
-dedicated cgroup hierarchy separate from system daemons.
-
-Optionally, the node can be made to enforce kube-reserved and system-reserved by
-specifying those tokens in the enforce-node-allocatable flag. If specified, the
-corresponding `--kube-reserved-cgroup` or `--system-reserved-cgroup` needs to be provided.
-In future releases, the node and container runtime will be packaged in a common cgroup
-separate from `system.slice`. Until that time, we do not recommend users
-change the default value of enforce-node-allocatable flag.
-
-Administrators should treat system daemons similar to Guaranteed pods. System daemons
-can burst within their bounding control groups and this behavior needs to be managed
-as part of cluster deployments. Enforcing system-reserved limits
-can lead to critical system services being CPU starved or OOM killed on the node. The
-recommendation is to enforce system-reserved only if operators have profiled their nodes
-exhaustively to determine precise estimates and are confident in their ability to
-recover if any process in that group is OOM killed.
-
-As a result, we strongly recommended that users only enforce node allocatable for
-`pods` by default, and set aside appropriate reservations for system daemons to maintain
-overall node reliability.
+The node is able to limit the total amount of resources that pods can consume based on the configured allocatable value. This feature significantly improves the reliability of the node by preventing pods from using CPU and memory resources that are needed by system services such as the container runtime and node agent. To improve node reliability, administrators should reserve resources based on a target for resource use.
+
+The node enforces resource constraints by using a new cgroup hierarchy that enforces quality of service. All pods are launched in a dedicated cgroup hierarchy that is separate from system daemons.
+
+Administrators should treat system daemons similar to pods that have a guaranteed quality of service. System daemons can burst within their bounding control groups and this behavior must be managed as part of cluster deployments. Reserve CPU and memory resources for system daemons by specifying the amount of CPU and memory resources in `system-reserved`.
+
+Enforcing `system-reserved` limits can prevent critical system services from receiving CPU and memory resources. As a result, a critical system service can be ended by the out-of-memory killer. The recommendation is to enforce `system-reserved` only if you have profiled the nodes exhaustively to determine precise estimates and you are confident that critical system services can recover if any process in that group is ended by the out-of-memory killer.
 
 [id="allocate-eviction-thresholds_{context}"]
 == Understanding Eviction Thresholds
 
-If a node is under memory pressure, it can impact the entire node and all pods running on
-it. If a system daemon is using more than its reserved amount of memory, an OOM
-event may occur that can impact the entire node and all pods running on it. To avoid
-(or reduce the probability of) system OOMs the node provides out-of-resource handling.
+If a node is under memory pressure, it can impact the entire node and all pods running on the node. For example, a system daemon that uses more than its reserved amount of memory can trigger an out-of-memory event. To avoid or reduce the probability of system out-of-memory events, the node provides out-of-resource handling.
 
 You can reserve some memory using the `--eviction-hard` flag. The node attempts to evict
 pods whenever memory availability on the node drops below the absolute value or percentage.
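The node summary API referenced in this hunk can be queried through the API server proxy. A minimal sketch, assuming an authenticated `oc` session with sufficient privileges; `<node>` is a placeholder for the node name:

[source,terminal]
----
$ oc get --raw /api/v1/nodes/<node>/proxy/stats/summary
----

The `node.systemContainers` entries in the response typically report CPU and memory use for the kubelet and other system components, which is the data to base a `system-reserved` value on.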
@@ -98,16 +65,12 @@ before reaching out of memory conditions are not available for pods.
 The following is an example to illustrate the impact of node allocatable for memory:
 
 * Node capacity is `32Gi`
-* --kube-reserved is `2Gi`
-* --system-reserved is `1Gi`
+* --system-reserved is `3Gi`
 * --eviction-hard is set to `100Mi`.
 
-For this node, the effective node allocatable value is `28.9Gi`. If the node
-and system components use up all their reservation, the memory available for pods is `28.9Gi`,
-and kubelet will evict pods when it exceeds this usage.
+For this node, the effective node allocatable value is `28.9Gi`. If the node and system components use all their reservation, the memory available for pods is `28.9Gi`, and kubelet evicts pods when it exceeds this threshold.
 
-If you enforce node allocatable (`28.9Gi`) via top level cgroups, then pods can never exceed `28.9Gi`.
-Evictions would not be performed unless system daemons are consuming more than `3.1Gi` of memory.
+If you enforce node allocatable, `28.9Gi`, with top-level cgroups, then pods can never exceed `28.9Gi`. Evictions are not performed unless system daemons consume more than `3.1Gi` of memory.
 
 If system daemons do not use up all their reservation, with the above example,
 pods would face memcg OOM kills from their bounding cgroup before node evictions kick in.
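As a cross-check, the `28.9Gi` figure in the updated example follows directly from the allocatable formula, using binary units (1Gi = 1024Mi):

----
[Allocatable] = [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds]
              = 32Gi - 3Gi - 100Mi
              = 32768Mi - 3072Mi - 100Mi
              = 29596Mi ≈ 28.9Gi
----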

modules/nodes-nodes-resources-configuring-setting.adoc

Lines changed: 3 additions & 7 deletions
@@ -12,8 +12,7 @@ As an administrator, you can set these using a custom resource (CR) through a se
 
 .Prerequisites
 
-. To help you determine setting for `--system-reserved` and `--kube-reserved` you can introspect the corresponding node's resource usage
-using the node summary API, which is accessible at `/api/v1/nodes/<node>/proxy/stats/summary`. Enter the following command for your node:
+. To help you determine values for the `system-reserved` setting, you can introspect the resource use for a node by using the node summary API. Enter the following command for your node:
 +
 [source,terminal]
 ----
@@ -117,11 +116,8 @@ spec:
       custom-kubelet: small-pods <2>
   kubeletConfig:
     systemReserved:
-      cpu: 500m
-      memory: 512Mi
-    kubeReserved:
-      cpu: 500m
-      memory: 512Mi
+      cpu: 1000m
+      memory: 1Gi
 ----
 <1> Assign a name to CR.
 <2> Specify the label from the Machine Config Pool.
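Once a `KubeletConfig` like the one above is applied and the selected nodes have restarted, the reservation is reflected in the node status. A minimal check, assuming cluster-admin access; `<node>` is a placeholder:

[source,terminal]
----
$ oc describe node <node> | grep -A 8 "Allocatable:"
----

The `Allocatable` values should equal the node capacity minus `system-reserved` and the hard eviction thresholds.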

modules/setting-up-cpu-manager.adoc

Lines changed: 2 additions & 2 deletions
@@ -80,7 +80,7 @@ This adds the CPU Manager feature to the kubelet config and, if needed, the Mach
         "name": "cpumanager-enabled",
         "uid": "7ed5616d-6b72-11e9-aae1-021e1ce18878"
       }
-    ],
+    ]
 ----
 
 . Check the worker for the updated `kubelet.conf`:
@@ -241,7 +241,7 @@ Allocated resources:
   cpu                1440m (96%)   1 (66%)
 ----
 +
-This VM has two CPU cores. You set `kube-reserved` to 500 millicores, meaning half of one core is subtracted from the total capacity of the node to arrive at the `Node Allocatable` amount. You can see that `Allocatable CPU` is 1500 millicores. This means you can run one of the CPU Manager pods since each will take one whole core. A whole core is equivalent to 1000 millicores. If you try to schedule a second pod, the system will accept the pod, but it will never be scheduled:
+This VM has two CPU cores. The `system-reserved` setting reserves 500 millicores, meaning that half of one core is subtracted from the total capacity of the node to arrive at the `Node Allocatable` amount. You can see that `Allocatable CPU` is 1500 millicores. This means you can run one of the CPU Manager pods since each will take one whole core. A whole core is equivalent to 1000 millicores. If you try to schedule a second pod, the system will accept the pod, but it will never be scheduled:
 +
 [source, terminal]
 ----
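For reference, the arithmetic behind the updated sentence, using only the values already given in the example:

----
Node CPU capacity:       2 cores = 2000m
system-reserved CPU:                500m
Node Allocatable CPU:              1500m

First CPU Manager pod:             1000m  (schedules; 500m remains)
Second CPU Manager pod:            1000m  (more than the remaining 500m, so it is accepted but never scheduled)
----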

nodes/nodes/nodes-nodes-resources-configuring.adoc

Lines changed: 3 additions & 10 deletions
@@ -1,18 +1,11 @@
-
-:context: nodes-nodes-resources-configuring
 [id="nodes-nodes-resources-configuring"]
 = Allocating resources for nodes in an {product-title} cluster
 include::modules/common-attributes.adoc[]
+:context: nodes-nodes-resources-configuring
 
 toc::[]
 
-
-To provide more reliable scheduling and minimize node resource overcommitment,
-each node can reserve a portion of its resources for use by all underlying node
-components (such as kubelet, kube-proxy) and the remaining system
-components (such as *sshd*, *NetworkManager*) on the host. Once specified, the
-scheduler has more information about the resources (e.g., memory, CPU) a node
-has allocated for pods.
+To provide more reliable scheduling and minimize node resource overcommitment, reserve a portion of the CPU and memory resources for use by the underlying node components, such as `kubelet` and `kube-proxy`, and the remaining system components, such as `sshd` and `NetworkManager`. By specifying the resources to reserve, you provide the scheduler with more information about the remaining CPU and memory resources that a node has available for use by pods.
 
 // The following include statements pull in the module files that comprise
 // the assembly. Include any combination of concept, procedure, or reference
@@ -27,7 +20,7 @@ include::modules/nodes-nodes-resources-configuring-setting.adoc[leveloffset=+1]
 == Additional resources
 
 The ephemeral storage management feature is disabled by default. To enable this
-feature,
+feature,
 
 See /install_config/configuring_ephemeral.adoc#install-config-configuring-ephemeral-storage[configuring for
 ephemeral storage].
