Kubernetes v1.27, released in April 2023, introduced changes to
Memory QoS (alpha) to improve memory management capabilities on Linux nodes.

Support for Memory QoS was initially added in Kubernetes v1.22, and later some
[limitations](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos#reasons-for-changing-the-formula-of-memoryhigh-calculation-in-alpha-v127)
around the formula for calculating `memory.high` were identified. These limitations are
addressed in Kubernetes v1.27.
## Background

Kubernetes allows you to optionally specify how much of each resource a container needs
in the Pod specification. The most common resources to specify are CPU and memory.

For example, a Pod manifest that defines container resource requirements could look like:
```yaml
# Illustrative example; the name, image, and resource values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "64Mi"
        cpu: "500m"
```
* `spec.containers[].resources.requests`

  When you specify the resource request for containers in a Pod, the
  [Kubernetes scheduler](/docs/concepts/scheduling-eviction/kube-scheduler/#kube-scheduler)
  uses this information to decide which node to place the Pod on. The scheduler
  ensures that for each resource type, the sum of the resource requests of the
  scheduled containers is less than the total allocatable resources on the node.
* `spec.containers[].resources.limits`

  When you specify the resource limit for containers in a Pod, the kubelet enforces
  those limits so that the running containers are not allowed to use more of those
  resources than the limits you set.

  When the kubelet starts a container as part of a Pod, the kubelet passes the
  container's requests and limits for CPU and memory to the container runtime.
  The container runtime assigns both the CPU request and CPU limit to a container.
  Provided the system has free CPU time, the containers are guaranteed to be
  allocated as much CPU as they request. A container cannot use more CPU than
  the configured limit, i.e. a container's CPU usage will be throttled if it
  uses more CPU than the specified limit within a given time slice.
Prior to the Memory QoS feature, the container runtime only used the memory
limit and discarded the memory `request` (requests were, and still are,
also used to influence [scheduling](/docs/concepts/scheduling-eviction/#scheduling)).
If a container uses more memory than the configured limit,
the Linux Out Of Memory (OOM) killer will be invoked.

Let's compare how the container runtime on Linux typically configures memory
request and limit in cgroups, with and without the Memory QoS feature:
* **Memory request**

  The memory request is mainly used by kube-scheduler during (Kubernetes) Pod
  scheduling. In cgroups v1, there are no controls to specify the minimum amount
  of memory the cgroups must always retain. Hence, the container runtime did not
  use the value of requested memory set in the Pod spec.

  cgroups v2 introduced a `memory.min` setting, used to specify the minimum
  amount of memory that should remain available to the processes within
  a given cgroup. If the memory usage of a cgroup is within its effective
  min boundary, the cgroup's memory won't be reclaimed under any conditions.
  If the kernel cannot maintain at least `memory.min` bytes of memory for the
  processes within the cgroup, the kernel invokes its OOM killer. In other words,
  the kernel guarantees at least this much memory is available or terminates
  processes (which may be outside the cgroup) in order to make memory more available.
  Memory QoS maps `memory.min` to `spec.containers[].resources.requests.memory`
  to ensure the availability of memory for containers in Kubernetes Pods.
* **Memory limit**

  The `memory.limit` specifies the memory limit, beyond which if the container tries
  to allocate more memory, the Linux kernel will terminate a process with an
  OOM (Out of Memory) kill. If the terminated process was the main (or only) process
  inside the container, the container may exit.

  In cgroups v1, the `memory.limit_in_bytes` interface is used to
  specify the hard limit for memory usage. If the memory consumption goes above this
  level, the kernel invokes its OOM Killer.

  cgroups v2 also added the `memory.high` configuration. Memory QoS uses `memory.high`
  to set a memory usage throttle limit. If the `memory.high` limit is breached,
  the offending cgroups are throttled, and the kernel tries to reclaim memory,
  which may avoid an OOM kill.
### Cgroups v2 memory controller interfaces & Kubernetes container resources mapping

Memory QoS uses the memory controller of cgroups v2 to guarantee memory resources in
Kubernetes. The cgroupv2 interfaces that this feature uses are:

* `memory.max`
* `memory.min`
* `memory.high`
{{< figure src="/blog/2023/05/05/qos-memory-resources/memory-qos-cal.svg" title="Memory QoS Levels" alt="Memory QoS Levels" >}}
`memory.max` is mapped to `limits.memory` specified in the Pod spec. The kubelet and
the container runtime configure the limit in the respective cgroup. The kernel
enforces the limit to prevent the container from using more than the configured
resource limit. If a process in a container tries to consume more than the
specified limit, the kernel terminates the process with an Out of Memory (OOM) error.

```formula
memory.max = pod.spec.containers[i].resources.limits[memory]
```
`memory.min` is mapped to `requests.memory`, which results in reservation of memory resources
that should never be reclaimed by the kernel. This is how Memory QoS ensures the availability of
memory for Kubernetes pods. If there's no unprotected reclaimable memory available, the OOM
killer is invoked to make more memory available.

```formula
memory.min = pod.spec.containers[i].resources.requests[memory]
```
For memory protection, in addition to the original way of limiting memory usage, Memory QoS
throttles workloads approaching their memory limit, ensuring that the system is not overwhelmed
by sporadic increases in memory usage. A new field, `memoryThrottlingFactor`, is available in
the KubeletConfiguration when you enable the MemoryQoS feature. It is set to 0.9 by default.
`memory.high` is mapped to a throttling limit calculated by using `memoryThrottlingFactor`,
`requests.memory` and `limits.memory` as in the formula below, and rounding down the
value to the nearest page size:

```formula
memory.high = pod.spec.containers[i].resources.requests[memory] + memoryThrottlingFactor *
  {(pod.spec.containers[i].resources.limits[memory] or NodeAllocatableMemory) - pod.spec.containers[i].resources.requests[memory]}
```

{{< note >}}
If a container has no memory limit specified, node allocatable memory is substituted for `limits.memory` in the formula.
{{< /note >}}
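As a concrete illustration of the formula above, here is a small sketch of the throttle-limit arithmetic. This is not kubelet source code; the request and limit values and the 4 KiB page size are assumptions for the example:

```python
# Illustrative sketch of the memory.high calculation; the 4 KiB page size
# and the request/limit values below are assumptions, not kubelet code.
PAGE_SIZE = 4096

def memory_high(request_bytes: int, limit_bytes: int, factor: float = 0.9) -> int:
    # request + factor * (limit - request), rounded down to the nearest page.
    raw = request_bytes + factor * (limit_bytes - request_bytes)
    return (int(raw) // PAGE_SIZE) * PAGE_SIZE

MiB = 1024 * 1024
print(memory_high(64 * MiB, 128 * MiB))  # 127504384 (~121.6 MiB)
```

For a container requesting 64Mi with a 128Mi limit, throttling starts at roughly 121.6Mi, well before the hard `memory.max` limit would trigger an OOM kill.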
**Summary:**
<table>
  <tr>
    <th>cgroups v2 interface</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>memory.max</td>
    <td><code>memory.max</code> specifies the maximum memory limit
    a container is allowed to use. If a process within the container
    tries to consume more memory than the configured limit,
    the kernel terminates the process with an Out of Memory (OOM) error.
    <br>
    <br>
    <i>It is mapped to the container's memory limit specified in the Pod manifest.</i>
    </td>
  </tr>
  <tr>
    <td>memory.min</td>
    <td><code>memory.min</code> specifies a minimum amount of memory
    the cgroups must always retain, i.e., memory that should never be
    reclaimed by the system.
    If there's no unprotected reclaimable memory available, an OOM kill is invoked.
    <br>
    <br>
    <i>It is mapped to the container's memory request specified in the Pod manifest.</i>
    </td>
  </tr>
  <tr>
    <td>memory.high</td>
    <td><code>memory.high</code> specifies the memory usage throttle limit.
    This is the main mechanism to control a cgroup's memory use. If a
    cgroup's memory use goes over the high boundary specified here,
    the cgroup's processes are throttled and put under heavy reclaim pressure.
    <br>
    <br>
    <i>It is mapped to a throttle limit calculated from <code>memoryThrottlingFactor</code>,
    the container's memory request, and its memory limit, as described above.</i>
    </td>
  </tr>
</table>
{{< note >}}
`memory.high` is set only on container level cgroups, while `memory.min` is set on
container, pod, and node level cgroups.
{{< /note >}}

### `memory.min` calculations for the cgroups hierarchy
When container memory requests are made, kubelet passes `memory.min` to the back-end
CRI runtime (such as containerd or CRI-O) via the `Unified` field in CRI during
container creation. For every i<sup>th</sup> container in a pod, the `memory.min`
in container level cgroups will be set to:

```formula
memory.min = pod.spec.containers[i].resources.requests[memory]
```

Since the `memory.min` interface requires that the ancestor cgroups directories are all
set, the pod and node cgroups directories need to be set correctly.

Summing over every i<sup>th</sup> container in a pod, `memory.min` in the pod level cgroup is:

```formula
memory.min = \sum_{i=0}^{no. of containers}pod.spec.containers[i].resources.requests[memory]
```

Summing over every j<sup>th</sup> container in every i<sup>th</sup> pod on a node, `memory.min` in the node level cgroup is:

```formula
memory.min = \sum_{i=0}^{no. of pods}\sum_{j=0}^{no. of containers}pod[i].spec.containers[j].resources.requests[memory]
```

Kubelet will manage the cgroups hierarchy of the pod level and node level cgroups
directly using the libcontainer library (from the runc project), while container
cgroups limits are managed by the container runtime.
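To make the aggregation concrete, here is a small sketch of how the pod level and node level `memory.min` values roll up from container requests. It is illustrative only; the pod names and request sizes are invented:

```python
# Illustrative roll-up of memory.min across cgroup levels (not kubelet code).
# Requests are in bytes; the pods and values below are invented.
MiB = 1024 * 1024
pods = {
    "pod-a": [64 * MiB, 32 * MiB],  # two containers' memory requests in pod-a
    "pod-b": [128 * MiB],           # single container in pod-b
}

# Pod level cgroup: sum of the container requests within that pod.
pod_level = {name: sum(requests) for name, requests in pods.items()}

# Node level cgroup: sum over all pods scheduled on the node.
node_level = sum(pod_level.values())

print(pod_level["pod-a"])  # 100663296 (96 MiB)
print(node_level)          # 234881024 (224 MiB)
```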
### Support for Pod QoS classes

Based on user feedback for the Alpha feature in Kubernetes v1.22, some users would like
to opt out of MemoryQoS on a per-pod basis to ensure there is no early memory throttling.
Therefore, in Kubernetes v1.27 Memory QoS also supports setting `memory.high` per
Quality of Service (QoS) class for Pods. The following are the different cases for
`memory.high` per QoS class:
1. **Guaranteed pods** by their QoS definition require memory requests=memory limits and are
   not overcommitted. Hence the MemoryQoS feature is disabled on those pods by not setting
   `memory.high`. This ensures that Guaranteed pods can fully use their memory requests up
   to their set limit, and not hit any throttling.
2. **Burstable pods** by their QoS definition require at least one container in the Pod with
   a CPU or memory request or limit set.

   * When requests.memory and limits.memory are set, the formula is used as-is:

     ```formula
     memory.high = pod.spec.containers[i].resources.requests[memory] + memoryThrottlingFactor *
       {(pod.spec.containers[i].resources.limits[memory]) - pod.spec.containers[i].resources.requests[memory]}
     ```
   * When requests.memory is set and limits.memory is not set, node allocatable memory is
     substituted for limits.memory in the formula:

     ```formula
     memory.high = pod.spec.containers[i].resources.requests[memory] + memoryThrottlingFactor *
       {(NodeAllocatableMemory) - pod.spec.containers[i].resources.requests[memory]}
     ```
3. **BestEffort pods** by their QoS definition do not require any memory or CPU limits or
   requests. For this case, Kubernetes sets requests.memory = 0 and substitutes node
   allocatable memory for limits.memory, so the formula reduces to:

   ```formula
   memory.high = memoryThrottlingFactor * NodeAllocatableMemory
   ```
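The three QoS cases above can be summarized in a small sketch. This is illustrative only, not kubelet source; the 8Gi node allocatable value is an assumption:

```python
# Illustrative memory.high per QoS class; not kubelet source code.
FACTOR = 0.9                    # default memoryThrottlingFactor
NODE_ALLOCATABLE = 8 * 1024**3  # assumed node allocatable memory (8 Gi)

def memory_high(request_bytes, limit_bytes):
    # Guaranteed (request == limit): memory.high is not set at all.
    if request_bytes is not None and request_bytes == limit_bytes:
        return None
    request = request_bytes or 0             # BestEffort: request defaults to 0
    limit = limit_bytes or NODE_ALLOCATABLE  # no limit: use node allocatable
    return int(request + FACTOR * (limit - request))

GiB = 1024**3
print(memory_high(GiB, GiB))      # None (Guaranteed: no throttling)
print(memory_high(GiB, 2 * GiB))  # Burstable: throttle between request and limit
print(memory_high(None, None))    # BestEffort: 0.9 * NODE_ALLOCATABLE
```

(The real kubelet additionally rounds the result down to the nearest page size, as described earlier.)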
**Summary**: Only Pods in Burstable and BestEffort QoS classes will set `memory.high`.
Guaranteed QoS pods do not set `memory.high`, as their memory is guaranteed.

The prerequisites for enabling the Memory QoS feature on your Linux node are:
1. Verify the [requirements](/docs/concepts/architecture/cgroups/#requirements)
   related to [Kubernetes support for cgroups v2](/docs/concepts/architecture/cgroups)
   are met.
2. Ensure the CRI runtime supports Memory QoS. At the time of writing, only containerd
   and CRI-O provide support compatible with Memory QoS (alpha). This was implemented
   in the following PRs:
   * Containerd: [Feature: containerd-cri support LinuxContainerResources.Unified #5627](https://github.com/containerd/containerd/pull/5627).
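Once these prerequisites are met, one way to enable the feature is through the kubelet configuration. The following is a sketch: the `MemoryQoS` feature gate and the `memoryThrottlingFactor` field are as described in this article, but verify the exact configuration against the kubelet documentation for your Kubernetes version:

```yaml
# Sketch of a KubeletConfiguration enabling Memory QoS (alpha).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
memoryThrottlingFactor: 0.9  # default; lower values throttle earlier
```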
* David Porter ([bobbypage](https://github.com/bobbypage))
* Mrunal Patel ([mrunalp](https://github.com/mrunalp))

For those interested in getting involved in future discussions on the Memory QoS feature,
you can reach out to SIG Node by several means:

- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)