Unless resources are set aside for these system daemons, pods and system
daemons compete for resources and lead to resource starvation issues on the
node.
The `kubelet` exposes a feature named `Node Allocatable` that helps to reserve
compute resources for system daemons. Kubernetes recommends that cluster
administrators configure `Node Allocatable` based on their workload density
on each node.
## {{% heading "prerequisites" %}}
{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}

Your Kubernetes server must be at or later than version 1.17 to use
the kubelet command line option `--reserved-cpus` to set an
[explicitly reserved CPU list](#explicitly-reserved-cpu-list).
<!-- steps -->
## Node Allocatable
![node capacity](/images/docs/node-capacity.svg)

`Allocatable` on a Kubernetes node is defined as the amount of compute resources
that are available for pods. The scheduler does not over-subscribe
`Allocatable`. `CPU`, `memory` and `ephemeral-storage` are currently supported.

Node Allocatable is exposed as part of the `v1.Node` object in the API and as part
of `kubectl describe node` in the CLI.
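
Allocatable is visible directly on the node object. The fragment below is a
sketch of what `kubectl get node <node-name> -o yaml` might show; all
quantities here are illustrative, and real nodes typically report memory in
`Ki` units.

```yaml
# Illustrative fragment of a v1.Node object (values are made up for this sketch)
status:
  capacity:
    cpu: "16"
    ephemeral-storage: 100Gi
    memory: 32Gi
    pods: "110"
  allocatable:
    cpu: 14500m
    ephemeral-storage: 88Gi
    memory: 29184Mi
    pods: "110"
```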

### Kube Reserved

It is recommended that the Kubernetes system daemons are placed under a top
level control group (`runtime.slice` on systemd machines for example). Each
system daemon should ideally run within its own child control group. Refer to
[the design proposal](https://git.k8s.io/community/contributors/design-proposals/node/node-allocatable.md#recommended-cgroups-setup)
for more details on recommended control group hierarchy.

Note that the `kubelet` **does not** create `--kube-reserved-cgroup` if it doesn't
exist. The `kubelet` will fail if an invalid cgroup is specified.
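
If you manage the kubelet through its configuration file rather than flags,
the same reservation can be sketched as below. This assumes the
`kubelet.config.k8s.io/v1beta1` config API; the values and the `/kube.slice`
cgroup name are illustrative assumptions, and you must create that cgroup
yourself.

```yaml
# Sketch: config-file equivalent of --kube-reserved and --kube-reserved-cgroup.
# Values are illustrative; the kubelet will not create /kube.slice for you.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: "100m"
  memory: "100Mi"
  ephemeral-storage: "1Gi"
kubeReservedCgroup: /kube.slice
```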

### System Reserved

- **Kubelet Flag**: `--system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]`
- **Kubelet Flag**: `--system-reserved-cgroup=`

`system-reserved` is meant to capture resource reservation for OS system daemons
like `sshd`, `udev`, etc. `system-reserved` should reserve `memory` for the
`kernel` too since `kernel` memory is not accounted to pods in Kubernetes at this time.
It is recommended that the OS system daemons are placed under a top level
control group (`system.slice` on systemd machines for example).

Note that the `kubelet` **does not** create `--system-reserved-cgroup` if it doesn't
exist. The `kubelet` will fail if an invalid cgroup is specified.
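
A matching sketch for the system reservation, under the same assumed
`kubelet.config.k8s.io/v1beta1` config API, pointing enforcement at the
existing `system.slice` (values are illustrative):

```yaml
# Sketch: reserve resources for OS daemons and point enforcement at the
# top-level systemd slice they already run in (illustrative values).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: "100m"
  memory: "100Mi"
  pid: "1000"
systemReservedCgroup: /system.slice
```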

### Explicitly Reserved CPU List

{{< feature-state for_k8s_version="v1.17" state="stable" >}}
- **Kubelet Flag**: `--reserved-cpus=0-3`

`reserved-cpus` is meant to define an explicit CPU set for OS system daemons and
Kubernetes system daemons. `reserved-cpus` is for systems that do not intend to
define separate top level cgroups for the OS and Kubernetes system daemons with
regard to the cpuset resource.
For example: in CentOS, you can do this using the tuned toolset.
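
On the kubelet side, the reservation is a single setting. A minimal sketch,
again assuming the `kubelet.config.k8s.io/v1beta1` config API and an
illustrative CPU list:

```yaml
# Sketch: pin the OS and Kubernetes system daemon reservation to CPUs 0-3.
# This is the config-file form of --reserved-cpus (illustrative CPU list).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
reservedSystemCPUs: "0-3"
```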
### Eviction Thresholds

- **Kubelet Flag**: `--eviction-hard=[memory.available<500Mi]`

Memory pressure at the node level leads to system OOMs which affect the entire
node and all pods running on it. Nodes can go offline temporarily until memory
has been reclaimed. To avoid (or reduce the probability of) system OOMs, the kubelet
provides [out of resource](/docs/concepts/scheduling-eviction/node-pressure-eviction/)
management. Evictions are supported for `memory` and `ephemeral-storage` only.
By reserving some memory via the `--eviction-hard` flag, the `kubelet` attempts
to evict pods whenever memory availability on the node drops below the reserved
value. Hypothetically, if system daemons did not exist on a node, pods cannot
use more than `capacity - eviction-hard`. For this reason, resources reserved
for evictions are not available for pods.
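
For instance, the hard eviction threshold above might be expressed in the
assumed config file format like this (the threshold value is illustrative):

```yaml
# Sketch: evict pods once free memory drops below 500Mi
# (config-file form of --eviction-hard; illustrative threshold).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
```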
### Enforcing Node Allocatable

- **Kubelet Flag**: `--enforce-node-allocatable=pods[,][system-reserved][,][kube-reserved]`

The scheduler treats `Allocatable` as the available `capacity` for pods.

`kubelet` enforces `Allocatable` across pods by default. Enforcement is performed
by evicting pods whenever the overall usage across all pods exceeds
`Allocatable`. More details on eviction policy can be found
on the [node pressure eviction](/docs/concepts/scheduling-eviction/node-pressure-eviction/)
page. This enforcement is controlled by
specifying the `pods` value to the kubelet flag `--enforce-node-allocatable`.

Optionally, `kubelet` can be made to enforce `kube-reserved` and
`system-reserved` by specifying `kube-reserved` & `system-reserved` values in
the same flag. Note that to enforce `kube-reserved` or `system-reserved`,
`--kube-reserved-cgroup` or `--system-reserved-cgroup` needs to be specified
respectively.
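
Putting this together, here is a sketch of full enforcement under the same
assumed `kubelet.config.k8s.io/v1beta1` config API; the cgroup names are
illustrative and must already exist:

```yaml
# Sketch: enforce Allocatable on pods plus both reservations.
# kube-reserved/system-reserved enforcement requires the matching
# *ReservedCgroup fields to be set (cgroup names are illustrative).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
enforceNodeAllocatable:
  - pods
  - kube-reserved
  - system-reserved
kubeReservedCgroup: /kube.slice
systemReservedCgroup: /system.slice
```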
## General Guidelines

System daemons are expected to be treated similar to `Guaranteed` pods. System
daemons can burst within their bounding control groups and this behavior needs
to be managed as part of Kubernetes deployments. For example, `kubelet` should
have its own control group and share `kube-reserved` resources with the
container runtime. However, `kubelet` cannot burst and use up all available node
resources if `kube-reserved` is enforced.

Be extra careful while enforcing `system-reserved` reservation since it can lead
to critical system services being CPU starved, OOM killed, or unable
to fork on the node. The
recommendation is to enforce `system-reserved` only if a user has profiled their
nodes exhaustively to come up with precise estimates and is confident in their
ability to recover if any process in that group is oom-killed.

* To begin with, enforce `Allocatable` on `pods`.
* Once adequate monitoring and alerting is in place to track kube system
  daemons, attempt to enforce `kube-reserved` based on usage heuristics.
* If absolutely necessary, enforce `system-reserved` over time.

The resource requirements of kube system daemons may grow over time as
more features are added. Over time, the Kubernetes project will attempt to bring
down utilization of node system daemons, but that is not a priority as of now.
So expect a drop in `Allocatable` capacity in future releases.
<!-- discussion -->
## Example Scenario

Here is an example to illustrate Node Allocatable computation:

* `--system-reserved` is set to `cpu=500m,memory=1Gi,ephemeral-storage=1Gi`
* `--eviction-hard` is set to `memory.available<500Mi,nodefs.available<10%`
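
The first bullets of this scenario are elided in this excerpt. Working the
arithmetic below backwards, they are consistent with a node that has 16 CPUs,
32Gi of memory and 100Gi of storage, with `--kube-reserved` set to
`cpu=1,memory=2Gi,ephemeral-storage=1Gi`; treat those values as assumptions.
A config file sketch of the whole scenario, under the same assumed
`kubelet.config.k8s.io/v1beta1` API:

```yaml
# Sketch of the scenario as a kubelet config file. The kubeReserved values
# are inferred from the resulting numbers below, not stated in this excerpt.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:            # assumption, consistent with the arithmetic below
  cpu: "1"
  memory: "2Gi"
  ephemeral-storage: "1Gi"
systemReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "1Gi"
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
```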

Under this scenario, `Allocatable` will be `14.5 CPUs`, `28.5Gi` of memory and
`88Gi` of local storage.
The scheduler ensures that the total memory `requests` across all pods on this node does
not exceed `28.5Gi` and storage doesn't exceed `88Gi`.
The kubelet evicts pods whenever the overall memory usage across pods exceeds `28.5Gi`,
or if overall disk usage exceeds `88Gi`. If all processes on the node consume as
much CPU as they can, pods together cannot consume more than `14.5 CPUs`.

If `kube-reserved` and/or `system-reserved` is not enforced and system daemons
exceed their reservation, `kubelet` evicts pods whenever the overall node memory
usage is higher than `31.5Gi` or `storage` is greater than `90Gi`.