@@ -5,7 +5,6 @@ reviewers:
 - dashpole
 title: Reserve Compute Resources for System Daemons
 content_type: task
-min-kubernetes-server-version: 1.8
 weight: 290
 ---

@@ -25,10 +24,10 @@ on each node.

 ## {{% heading "prerequisites" %}}

-{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
-Your Kubernetes server must be at or later than version 1.17 to use
-the kubelet command line option `--reserved-cpus` to set an
-[explicitly reserved CPU list](#explicitly-reserved-cpu-list).
+{{< include "task-tutorial-prereqs.md" >}}
+
+You can configure the kubelet [configuration settings](/docs/reference/config-api/kubelet-config.v1beta1/)
+described below using the [kubelet configuration file](/docs/tasks/administer-cluster/kubelet-config-file/).

 <!-- steps -->

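All of the settings discussed on this page live in that configuration file. For orientation, here is a minimal sketch of such a file; the path and the exact way your node tooling points the kubelet at it (for example via `--config`) depend on your deployment:

```yaml
# Minimal sketch of a kubelet configuration file (path and contents are assumptions).
# The kubelet reads it when started with --config pointing at this file.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# The settings described in this task (cgroupsPerQOS, cgroupDriver, kubeReserved,
# systemReserved, reservedSystemCPUs, evictionHard, enforceNodeAllocatable)
# are top-level fields of this file.
```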
@@ -48,15 +47,14 @@ Resources can be reserved for two categories of system daemons in the `kubelet`.
 ### Enabling QoS and Pod level cgroups

 To properly enforce node allocatable constraints on the node, you must
-enable the new cgroup hierarchy via the `--cgroups-per-qos` flag. This flag is
+enable the new cgroup hierarchy via the `cgroupsPerQOS` setting. This setting is
 enabled by default. When enabled, the `kubelet` will parent all end-user pods
 under a cgroup hierarchy managed by the `kubelet`.

 ### Configuring a cgroup driver

 The `kubelet` supports manipulation of the cgroup hierarchy on
-the host using a cgroup driver. The driver is configured via the
-`--cgroup-driver` flag.
+the host using a cgroup driver. The driver is configured via the `cgroupDriver` setting.

 The supported values are the following:

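The supported values (`cgroupfs` and `systemd`) are listed in the unchanged text that follows this hunk. As an illustrative sketch, the two settings above could appear in the kubelet configuration file like this; choosing `systemd` assumes the node actually runs systemd:

```yaml
# Fragment of a KubeletConfiguration file (illustrative).
cgroupsPerQOS: true    # default; pods are parented under kubelet-managed QoS cgroups
cgroupDriver: systemd  # assumes the host's init system and container runtime use systemd
```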
@@ -73,41 +71,41 @@ be configured to use the `systemd` cgroup driver.

 ### Kube Reserved

-- **Kubelet Flag**: `--kube-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]`
-- **Kubelet Flag**: `--kube-reserved-cgroup=`
+- **KubeletConfiguration Setting**: `kubeReserved: {}`. Example value: `{cpu: 100m, memory: 100Mi, ephemeral-storage: 1Gi, pid: 1000}`
+- **KubeletConfiguration Setting**: `kubeReservedCgroup: ""`

-`kube-reserved` is meant to capture resource reservation for kubernetes system
-daemons like the `kubelet`, `container runtime`, `node problem detector`, etc.
+`kubeReserved` is meant to capture resource reservation for kubernetes system
+daemons like the `kubelet`, `container runtime`, etc.
 It is not meant to reserve resources for system daemons that are run as pods.
-`kube-reserved` is typically a function of `pod density` on the nodes.
+`kubeReserved` is typically a function of `pod density` on the nodes.

 In addition to `cpu`, `memory`, and `ephemeral-storage`, `pid` may be
 specified to reserve the specified number of process IDs for
 kubernetes system daemons.

-To optionally enforce `kube-reserved` on kubernetes system daemons, specify the parent
-control group for kube daemons as the value for `--kube-reserved-cgroup` kubelet
-flag.
+To optionally enforce `kubeReserved` on kubernetes system daemons, specify the parent
+control group for kube daemons as the value for the `kubeReservedCgroup` setting,
+and [add `kube-reserved` to `enforceNodeAllocatable`](#enforcing-node-allocatable).

 It is recommended that the kubernetes system daemons are placed under a top
 level control group (`runtime.slice` on systemd machines for example). Each
 system daemon should ideally run within its own child control group. Refer to
 [the design proposal](https://git.k8s.io/design-proposals-archive/node/node-allocatable.md#recommended-cgroups-setup)
 for more details on recommended control group hierarchy.

-Note that Kubelet **does not** create `--kube-reserved-cgroup` if it doesn't
+Note that Kubelet **does not** create `kubeReservedCgroup` if it doesn't
 exist. The kubelet will fail to start if an invalid cgroup is specified. With `systemd`
 cgroup driver, you should follow a specific pattern for the name of the cgroup you
-define: the name should be the value you set for `--kube-reserved-cgroup`,
+define: the name should be the value you set for `kubeReservedCgroup`,
 with `.slice` appended.

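As a sketch, the two settings above might look like this in the kubelet configuration file; the amounts and the `runtime.slice` cgroup name are illustrative assumptions, not recommendations:

```yaml
# Fragment of a KubeletConfiguration file (illustrative values).
kubeReserved:
  cpu: 100m
  memory: 100Mi
  ephemeral-storage: 1Gi
  pid: "1000"
# Pre-existing parent cgroup for kubernetes daemons; the kubelet does not create it.
kubeReservedCgroup: runtime.slice
```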
 ### System Reserved

-- **Kubelet Flag**: `--system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]`
-- **Kubelet Flag**: `--system-reserved-cgroup=`
+- **KubeletConfiguration Setting**: `systemReserved: {}`. Example value: `{cpu: 100m, memory: 100Mi, ephemeral-storage: 1Gi, pid: 1000}`
+- **KubeletConfiguration Setting**: `systemReservedCgroup: ""`

-`system-reserved` is meant to capture resource reservation for OS system daemons
-like `sshd`, `udev`, etc. `system-reserved` should reserve `memory` for the
+`systemReserved` is meant to capture resource reservation for OS system daemons
+like `sshd`, `udev`, etc. `systemReserved` should reserve `memory` for the
 `kernel` too since `kernel` memory is not accounted to pods in Kubernetes at this time.
 Reserving resources for user login sessions is also recommended (`user.slice` in
 systemd world).
@@ -116,33 +114,32 @@ In addition to `cpu`, `memory`, and `ephemeral-storage`, `pid` may be
 specified to reserve the specified number of process IDs for OS system
 daemons.

-To optionally enforce `system-reserved` on system daemons, specify the parent
-control group for OS system daemons as the value for `--system-reserved-cgroup`
-kubelet flag.
+To optionally enforce `systemReserved` on system daemons, specify the parent
+control group for OS system daemons as the value for the `systemReservedCgroup` setting,
+and [add `system-reserved` to `enforceNodeAllocatable`](#enforcing-node-allocatable).

 It is recommended that the OS system daemons are placed under a top level
 control group (`system.slice` on systemd machines for example).

-Note that `kubelet` **does not** create `--system-reserved-cgroup` if it doesn't
+Note that `kubelet` **does not** create `systemReservedCgroup` if it doesn't
 exist. `kubelet` will fail if an invalid cgroup is specified. With `systemd`
 cgroup driver, you should follow a specific pattern for the name of the cgroup you
-define: the name should be the value you set for `--system-reserved-cgroup`,
+define: the name should be the value you set for `systemReservedCgroup`,
 with `.slice` appended.

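A matching sketch for the settings above in the kubelet configuration file; again the amounts and the `system.slice` name are illustrative assumptions:

```yaml
# Fragment of a KubeletConfiguration file (illustrative values).
systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
# Pre-existing parent cgroup for OS daemons; the kubelet does not create it.
systemReservedCgroup: system.slice
```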
 ### Explicitly Reserved CPU List

 {{< feature-state for_k8s_version="v1.17" state="stable" >}}

-**Kubelet Flag**: `--reserved-cpus=0-3`
-**KubeletConfiguration Flag**: `reservedSystemCPUs: 0-3`
+**KubeletConfiguration Setting**: `reservedSystemCPUs:`. Example value: `0-3`

-`reserved-cpus` is meant to define an explicit CPU set for OS system daemons and
-kubernetes system daemons. `reserved-cpus` is for systems that do not intend to
+`reservedSystemCPUs` is meant to define an explicit CPU set for OS system daemons and
+kubernetes system daemons. `reservedSystemCPUs` is for systems that do not intend to
 define separate top level cgroups for OS system daemons and kubernetes system daemons
 with regard to cpuset resource.
-If the Kubelet **does not** have `--system-reserved-cgroup` and `--kube-reserved-cgroup`,
-the explicit cpuset provided by `reserved-cpus` will take precedence over the CPUs
-defined by `--kube-reserved` and `--system-reserved` options.
+If the Kubelet **does not** have `kubeReservedCgroup` and `systemReservedCgroup`,
+the explicit cpuset provided by `reservedSystemCPUs` will take precedence over the CPUs
+defined by the `kubeReserved` and `systemReserved` options.

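A one-line sketch of this setting in the kubelet configuration file; the CPU IDs are an example, not a recommendation:

```yaml
# Fragment of a KubeletConfiguration file (CPU list is illustrative).
reservedSystemCPUs: "0-3"
```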
 This option is specifically designed for Telco/NFV use cases where uncontrolled
 interrupts/timers may impact the workload performance. You can use this option
@@ -155,23 +152,23 @@ For example: in Centos, you can do this using the tuned toolset.

 ### Eviction Thresholds

-**Kubelet Flag**: `--eviction-hard=[memory.available<500Mi]`
+**KubeletConfiguration Setting**: `evictionHard: {memory.available: "100Mi", nodefs.available: "10%", nodefs.inodesFree: "5%", imagefs.available: "15%"}`. Example value: `{memory.available: "500Mi"}`

 Memory pressure at the node level leads to System OOMs which affects the entire
 node and all pods running on it. Nodes can go offline temporarily until memory
 has been reclaimed. To avoid (or reduce the probability of) system OOMs kubelet
 provides [out of resource](/docs/concepts/scheduling-eviction/node-pressure-eviction/)
 management. Evictions are
 supported for `memory` and `ephemeral-storage` only. By reserving some memory via
-`--eviction-hard` flag, the `kubelet` attempts to evict pods whenever memory
+the `evictionHard` setting, the `kubelet` attempts to evict pods whenever memory
 availability on the node drops below the reserved value. Hypothetically, if
 system daemons did not exist on a node, pods cannot use more than `capacity -
 eviction-hard`. For this reason, resources reserved for evictions are not
 available for pods.

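As a sketch, a hard eviction threshold for memory (plus one for node filesystem space) could be expressed in the kubelet configuration file like this; the values are illustrative:

```yaml
# Fragment of a KubeletConfiguration file (thresholds are illustrative).
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
```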
 ### Enforcing Node Allocatable

-**Kubelet Flag**: `--enforce-node-allocatable=pods[,][system-reserved][,][kube-reserved]`
+**KubeletConfiguration Setting**: `enforceNodeAllocatable: [pods]`. Example value: `[pods, system-reserved, kube-reserved]`

 The scheduler treats 'Allocatable' as the available `capacity` for pods.

@@ -180,35 +177,35 @@ by evicting pods whenever the overall usage across all pods exceeds
 'Allocatable'. More details on eviction policy can be found
 on the [node pressure eviction](/docs/concepts/scheduling-eviction/node-pressure-eviction/)
 page. This enforcement is controlled by
-specifying `pods` value to the kubelet flag `--enforce-node-allocatable`.
+specifying the `pods` value for the KubeletConfiguration setting `enforceNodeAllocatable`.

-Optionally, `kubelet` can be made to enforce `kube-reserved` and
-`system-reserved` by specifying `kube-reserved` & `system-reserved` values in
-the same flag. Note that to enforce `kube-reserved` or `system-reserved`,
-`--kube-reserved-cgroup` or `--system-reserved-cgroup` needs to be specified
+Optionally, `kubelet` can be made to enforce `kubeReserved` and
+`systemReserved` by specifying `kube-reserved` & `system-reserved` values in
+the same setting. Note that to enforce `kubeReserved` or `systemReserved`,
+`kubeReservedCgroup` or `systemReservedCgroup` needs to be specified
 respectively.

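Putting the previous sections together, a sketch of full enforcement in the kubelet configuration file; enforcing `system-reserved` is shown only for illustration and comes with the caveats discussed under General Guidelines below:

```yaml
# Fragment of a KubeletConfiguration file (sketch).
enforceNodeAllocatable:
- pods
- kube-reserved    # requires kubeReservedCgroup to be set
- system-reserved  # requires systemReservedCgroup to be set
```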
 ## General Guidelines

-System daemons are expected to be treated similar to
-[Guaranteed pods](/docs/tasks/configure-pod-container/quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-guaranteed).
+System daemons are expected to be treated similar to
+[Guaranteed pods](/docs/tasks/configure-pod-container/quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-guaranteed).
 System daemons can burst within their bounding control groups and this behavior needs
 to be managed as part of kubernetes deployments. For example, `kubelet` should
-have its own control group and share `kube-reserved` resources with the
+have its own control group and share `kubeReserved` resources with the
 container runtime. However, Kubelet cannot burst and use up all available Node
-resources if `kube-reserved` is enforced.
+resources if `kubeReserved` is enforced.

-Be extra careful while enforcing `system-reserved` reservation since it can lead
+Be extra careful while enforcing `systemReserved` reservation since it can lead
 to critical system services being CPU starved, OOM killed, or unable
 to fork on the node. The
-recommendation is to enforce `system-reserved` only if a user has profiled their
+recommendation is to enforce `systemReserved` only if a user has profiled their
 nodes exhaustively to come up with precise estimates and is confident in their
 ability to recover if any process in that group is oom-killed.

 * To begin with enforce 'Allocatable' on `pods`.
 * Once adequate monitoring and alerting is in place to track kube system
-daemons, attempt to enforce `kube-reserved` based on usage heuristics.
-* If absolutely necessary, enforce `system-reserved` over time.
+daemons, attempt to enforce `kubeReserved` based on usage heuristics.
+* If absolutely necessary, enforce `systemReserved` over time.

 The resource requirements of kube system daemons may grow over time as more and
 more features are added. Over time, kubernetes project will attempt to bring
@@ -222,9 +219,9 @@ So expect a drop in `Allocatable` capacity in future releases.
 Here is an example to illustrate Node Allocatable computation:

 * Node has `32Gi` of `memory`, `16 CPUs` and `100Gi` of `Storage`
-* `--kube-reserved` is set to `cpu=1,memory=2Gi,ephemeral-storage=1Gi`
-* `--system-reserved` is set to `cpu=500m,memory=1Gi,ephemeral-storage=1Gi`
-* `--eviction-hard` is set to `memory.available<500Mi,nodefs.available<10%`
+* `kubeReserved` is set to `{cpu: 1000m, memory: 2Gi, ephemeral-storage: 1Gi}`
+* `systemReserved` is set to `{cpu: 500m, memory: 1Gi, ephemeral-storage: 1Gi}`
+* `evictionHard` is set to `{memory.available: "500Mi", nodefs.available: "10%"}`

 Under this scenario, 'Allocatable' will be 14.5 CPUs, 28.5Gi of memory and
 `88Gi` of local storage.
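The same scenario expressed as a kubelet configuration fragment, with the resulting arithmetic spelled out; the numbers simply restate the list above:

```yaml
# Fragment of a KubeletConfiguration file for the scenario above.
kubeReserved:
  cpu: 1000m
  memory: 2Gi
  ephemeral-storage: 1Gi
systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
# Allocatable CPU     = 16    - 1   - 0.5          = 14.5 CPUs
# Allocatable memory  = 32Gi  - 2Gi - 1Gi - 0.5Gi  = 28.5Gi
# Allocatable storage = 100Gi - 1Gi - 1Gi - 10Gi   = 88Gi   (10% of 100Gi nodefs)
```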
@@ -234,7 +231,7 @@ Kubelet evicts pods whenever the overall memory usage across pods exceeds 28.5Gi
 or if overall disk usage exceeds 88Gi. If all processes on the node consume as
 much CPU as they can, pods together cannot consume more than 14.5 CPUs.

-If `kube-reserved` and/or `system-reserved` is not enforced and system daemons
+If `kubeReserved` and/or `systemReserved` is not enforced and system daemons
 exceed their reservation, `kubelet` evicts pods whenever the overall node memory
 usage is higher than 31.5Gi or `storage` is greater than 90Gi.