
Commit 6f6b78d

Try to improve 2023-05-05-memory-qos-cgroups-v2
1 parent 8662094 commit 6f6b78d

File tree

  • content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2

1 file changed: +99 -75 lines


content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/index.md

Lines changed: 99 additions & 75 deletions
@@ -10,18 +10,19 @@ slug: qos-memory-resources
 Kubernetes v1.27, released in April 2023, introduced changes to
 Memory QoS (alpha) to improve memory management capabilities in Linux nodes.

-Support for Memory QoS was initially added in Kubernetes v1.22, and later some
-[limitations](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos#reasons-for-changing-the-formula-of-memoryhigh-calculation-in-alpha-v127)
-around the formula for calculating `memory.high` were identified. These limitations are
+Support for Memory QoS was initially added in Kubernetes v1.22, and later some
+[limitations](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos#reasons-for-changing-the-formula-of-memoryhigh-calculation-in-alpha-v127)
+around the formula for calculating `memory.high` were identified. These limitations are
 addressed in Kubernetes v1.27.

 ## Background

 Kubernetes allows you to optionally specify how much of each resource a container needs
-in the Pod specification. The most common resources to specify are CPU and Memory.
+in the Pod specification. The most common resources to specify are CPU and Memory.

 For example, a Pod manifest that defines container resource requirements could look like:
-```
+
+```yaml
 apiVersion: v1
 kind: Pod
 metadata:
@@ -40,19 +41,19 @@ spec:

 * `spec.containers[].resources.requests`

-When you specify the resource request for containers in a Pod, the
+When you specify the resource request for containers in a Pod, the
 [Kubernetes scheduler](/docs/concepts/scheduling-eviction/kube-scheduler/#kube-scheduler)
 uses this information to decide which node to place the Pod on. The scheduler
-ensures that for each resource type, the sum of the resource requests of the
+ensures that for each resource type, the sum of the resource requests of the
 scheduled containers is less than the total allocatable resources on the node.

 * `spec.containers[].resources.limits`

-When you specify the resource limit for containers in a Pod, the kubelet enforces
-those limits so that the running containers are not allowed to use more of those
+When you specify the resource limit for containers in a Pod, the kubelet enforces
+those limits so that the running containers are not allowed to use more of those
 resources than the limits you set.

-When the kubelet starts a container as a part of a Pod, kubelet passes the
+When the kubelet starts a container as a part of a Pod, kubelet passes the
 container's requests and limits for CPU and memory to the container runtime.
 The container runtime assigns both CPU request and CPU limit to a container.
 Provided the system has free CPU time, the containers are guaranteed to be
@@ -61,36 +62,36 @@ the configured limit i.e. containers CPU usage will be throttled if they
 use more CPU than the specified limit within a given time slice.

 Prior to Memory QoS feature, the container runtime only used the memory
-limit and discarded the memory `request` (requests were, and still are,
+limit and discarded the memory `request` (requests were, and still are,
 also used to influence [scheduling](/docs/concepts/scheduling-eviction/#scheduling)).
-If a container uses more memory than the configured limit,
+If a container uses more memory than the configured limit,
 the Linux Out Of Memory (OOM) killer will be invoked.

 Let's compare how the container runtime on Linux typically configures memory
 request and limit in cgroups, with and without Memory QoS feature:

 * **Memory request**

-The memory request is mainly used by kube-scheduler during (Kubernetes) Pod
+The memory request is mainly used by kube-scheduler during (Kubernetes) Pod
 scheduling. In cgroups v1, there are no controls to specify the minimum amount
 of memory the cgroups must always retain. Hence, the container runtime did not
 use the value of requested memory set in the Pod spec.

-cgroups v2 introduced a `memory.min` setting, used to specify the minimum
+cgroups v2 introduced a `memory.min` setting, used to specify the minimum
 amount of memory that should remain available to the processes within
 a given cgroup. If the memory usage of a cgroup is within its effective
 min boundary, the cgroup’s memory won’t be reclaimed under any conditions.
-If the kernel cannot maintain at least `memory.min` bytes of memory for the
+If the kernel cannot maintain at least `memory.min` bytes of memory for the
 processes within the cgroup, the kernel invokes its OOM killer. In other words,
-the kernel guarantees at least this much memory is available or terminates
+the kernel guarantees at least this much memory is available or terminates
 processes (which may be outside the cgroup) in order to make memory more available.
 Memory QoS maps `memory.min` to `spec.containers[].resources.requests.memory`
-to ensure the availability of memory for containers in Kubernetes Pods.
+to ensure the availability of memory for containers in Kubernetes Pods.

 * **Memory limit**

 The `memory.limit` specifies the memory limit, beyond which if the container tries
-to allocate more memory, Linux kernel will terminate a process with an
+to allocate more memory, Linux kernel will terminate a process with an
 OOM (Out of Memory) kill. If the terminated process was the main (or only) process
 inside the container, the container may exit.

@@ -103,7 +104,7 @@ request and limit in cgroups, with and without Memory QoS feature:
 specify the hard limit for memory usage. If the memory consumption goes above this
 level, the kernel invokes its OOM Killer.

-cgroups v2 also added `memory.high` configuration . Memory QoS uses `memory.high`
+cgroups v2 also added `memory.high` configuration. Memory QoS uses `memory.high`
 to set memory usage throttle limit. If the `memory.high` limit is breached,
 the offending cgroups are throttled, and the kernel tries to reclaim memory
 which may avoid an OOM kill.
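
To make the throttling threshold concrete, here is a purely illustrative calculation using the `memory.high` formula that appears further down in this diff, assuming a container with a 512Mi memory request, a 1Gi memory limit, and the default `memoryThrottlingFactor` of 0.9:

```formula
memory.high = 512Mi + 0.9 * (1Gi - 512Mi)
            = 512Mi + 460.8Mi
            = 972.8Mi, rounded down to the nearest page size
```

With these assumed numbers, reclaim pressure begins at roughly 973Mi, well before the 1Gi hard limit (`memory.max`) at which the OOM killer would act.
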
@@ -113,40 +114,49 @@ request and limit in cgroups, with and without Memory QoS feature:
 ### Cgroups v2 memory controller interfaces & Kubernetes container resources mapping

 Memory QoS uses the memory controller of cgroups v2 to guarantee memory resources in
-Kubernetes. cgroupv2 interfaces that this feature uses are:
+Kubernetes. cgroupv2 interfaces that this feature uses are:
+
 * `memory.max`
 * `memory.min`
 * `memory.high`.

 {{< figure src="/blog/2023/05/05/qos-memory-resources/memory-qos-cal.svg" title="Memory QoS Levels" alt="Memory QoS Levels" >}}

-`memory.max` is mapped to `limits.memory` specified in the Pod spec. The kubelet and
-the container runtime configure the limit in the respective cgroup. The kernel
+`memory.max` is mapped to `limits.memory` specified in the Pod spec. The kubelet and
+the container runtime configure the limit in the respective cgroup. The kernel
 enforces the limit to prevent the container from using more than the configured
-resource limit. If a process in a container tries to consume more than the
-specified limit, kernel terminates a process(es) with an out of
-memory Out of Memory (OOM) error.
+resource limit. If a process in a container tries to consume more than the
+specified limit, kernel terminates a process(es) with an Out of Memory (OOM) error.

-{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-max.svg" title="memory.max maps to limits.memory" alt="memory.max maps to limits.memory" >}}
+```formula
+memory.max = pod.spec.containers[i].resources.limits[memory]
+```

 `memory.min` is mapped to `requests.memory`, which results in reservation of memory resources
-that should never be reclaimed by the kernel. This is how Memory QoS ensures the availability of
-memory for Kubernetes pods. If there's no unprotected reclaimable memory available, the OOM
+that should never be reclaimed by the kernel. This is how Memory QoS ensures the availability of
+memory for Kubernetes pods. If there's no unprotected reclaimable memory available, the OOM
 killer is invoked to make more memory available.

-{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-min.svg" title="memory.min maps to requests.memory" alt="memory.min maps to requests.memory" >}}
+```formula
+memory.min = pod.spec.containers[i].resources.requests[memory]
+```

 For memory protection, in addition to the original way of limiting memory usage, Memory QoS
 throttles workload approaching its memory limit, ensuring that the system is not overwhelmed
 by sporadic increases in memory usage. A new field, `memoryThrottlingFactor`, is available in
-the KubeletConfiguration when you enable MemoryQoS feature. It is set to 0.9 by default.
+the KubeletConfiguration when you enable MemoryQoS feature. It is set to 0.9 by default.
 `memory.high` is mapped to throttling limit calculated by using `memoryThrottlingFactor`,
-`requests.memory` and `limits.memory` as in the formula below, and rounding down the
+`requests.memory` and `limits.memory` as in the formula below, and rounding down the
 value to the nearest page size:

-{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high.svg" title="memory.high formula" alt="memory.high formula" >}}
+```formula
+memory.high = pod.spec.containers[i].resources.requests[memory] + MemoryThrottlingFactor *
+{(pod.spec.containers[i].resources.limits[memory] or NodeAllocatableMemory) - pod.spec.containers[i].resources.requests[memory]}
+```

-**Note**: If a container has no memory limits specified, `limits.memory` is substituted for node allocatable memory.
+{{< note >}}
+If a container has no memory limits specified, `limits.memory` is substituted for node allocatable memory.
+{{< /note >}}

 **Summary:**
 <table>
@@ -158,16 +168,16 @@ value to the nearest page size:
 <td>memory.max</td>
 <td><code>memory.max</code> specifies the maximum memory limit,
 a container is allowed to use. If a process within the container
-tries to consume more memory than the configured limit,
-the kernel terminates the process with an Out of Memory (OOM) error.
+tries to consume more memory than the configured limit,
+the kernel terminates the process with an Out of Memory (OOM) error.
 <br>
 <br>
 <i>It is mapped to the container's memory limit specified in Pod manifest.</i>
 </td>
 </tr>
 <tr>
 <td>memory.min</td>
-<td><code>memory.min</code> specifies a minimum amount of memory
+<td><code>memory.min</code> specifies a minimum amount of memory
 the cgroups must always retain, i.e., memory that should never be
 reclaimed by the system.
 If there's no unprotected reclaimable memory available, OOM kill is invoked.
@@ -178,8 +188,8 @@ value to the nearest page size:
 </tr>
 <tr>
 <td>memory.high</td>
-<td><code>memory.high</code> specifies the memory usage throttle limit.
-This is the main mechanism to control a cgroup's memory use. If
+<td><code>memory.high</code> specifies the memory usage throttle limit.
+This is the main mechanism to control a cgroup's memory use. If
 cgroups memory use goes over the high boundary specified here,
 the cgroups processes are throttled and put under heavy reclaim pressure.
 <br>
@@ -193,66 +203,79 @@ value to the nearest page size:
 </tr>
 </table>

-**Note** `memory.high` is set only on container level cgroups while `memory.min` is set on
+{{< note >}}
+`memory.high` is set only on container level cgroups while `memory.min` is set on
 container, pod, and node level cgroups.
+{{< /note >}}

 ### `memory.min` calculations for cgroups hierarchy

 When container memory requests are made, kubelet passes `memory.min` to the back-end
 CRI runtime (such as containerd or CRI-O) via the `Unified` field in CRI during
-container creation. The `memory.min` in container level cgroups will be set to:
-
-$memory.min = pod.spec.containers[i].resources.requests[memory]$
-<sub>for every i<sup>th</sup> container in a pod</sub>
-<br>
-<br>
-Since the `memory.min` interface requires that the ancestor cgroups directories are all
-set, the pod and node cgroups directories need to be set correctly.
-
-`memory.min` in pod level cgroup:
-$memory.min = \sum_{i=0}^{no. of pods}pod.spec.containers[i].resources.requests[memory]$
-<sub>for every i<sup>th</sup> container in a pod</sub>
-<br>
-<br>
-`memory.min` in node level cgroup:
-$memory.min = \sum_{i}^{no. of nodes}\sum_{j}^{no. of pods}pod[i].spec.containers[j].resources.requests[memory]$
-<sub>for every j<sup>th</sup> container in every i<sup>th</sup> pod on a node</sub>
-<br>
-<br>
-Kubelet will manage the cgroups hierarchy of the pod level and node level cgroups
+container creation. For every i<sup>th</sup> container in a pod, the `memory.min`
+in container level cgroups will be set to:
+
+```formula
+memory.min = pod.spec.containers[i].resources.requests[memory]
+```
+
+Since the `memory.min` interface requires that the ancestor cgroups directories are all
+set, the pod and node cgroups directories need to be set correctly.
+
+For every i<sup>th</sup> container in a pod, `memory.min` in pod level cgroup:
+
+```formula
+memory.min = \sum_{i=0}^{no. of pods}pod.spec.containers[i].resources.requests[memory]
+```
+
+For every j<sup>th</sup> container in every i<sup>th</sup> pod on a node, `memory.min` in node level cgroup:
+
+```formula
+memory.min = \sum_{i}^{no. of nodes}\sum_{j}^{no. of pods}pod[i].spec.containers[j].resources.requests[memory]
+```
+
+Kubelet will manage the cgroups hierarchy of the pod level and node level cgroups
 directly using the libcontainer library (from the runc project), while container
 cgroups limits are managed by the container runtime.

 ### Support for Pod QoS classes

-Based on user feedback for the Alpha feature in Kubernetes v1.22, some users would like
+Based on user feedback for the Alpha feature in Kubernetes v1.22, some users would like
 to opt out of MemoryQoS on a per-pod basis to ensure there is no early memory throttling.
-Therefore, in Kubernetes v1.27 Memory QOS also supports memory.high to be set as per
+Therefore, in Kubernetes v1.27 Memory QOS also supports memory.high to be set as per
 Quality of Service(QoS) for Pod classes. Following are the different cases for memory.high
 as per QOS classes:

-1. **Guaranteed pods** by their QoS definition require memory requests=memory limits and are
-not overcommitted. Hence MemoryQoS feature is disabled on those pods by not setting
-memory.high. This ensures that Guaranteed pods can fully use their memory requests up
-to their set limit, and not hit any throttling.
+1. **Guaranteed pods** by their QoS definition require memory requests=memory limits and are
+not overcommitted. Hence MemoryQoS feature is disabled on those pods by not setting
+memory.high. This ensures that Guaranteed pods can fully use their memory requests up
+to their set limit, and not hit any throttling.

-2. **Burstable pods** by their QoS definition require at least one container in the Pod with
-CPU or memory request or limit set.
+1. **Burstable pods** by their QoS definition require at least one container in the Pod with
+CPU or memory request or limit set.

 * When requests.memory and limits.memory are set, the formula is used as-is:

-{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high-limit.svg" title="memory.high when requests and limits are set" alt="memory.high when requests and limits are set" >}}
+```formula
+memory.high = pod.spec.containers[i].resources.requests[memory] + MemoryThrottlingFactor *
+{(pod.spec.containers[i].resources.limits[memory]) - pod.spec.containers[i].resources.requests[memory]}
+```

 * When requests.memory is set and limits.memory is not set, limits.memory is substituted
 for node allocatable memory in the formula:

-{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high-no-limits.svg" title="memory.high when requests and limits are not set" alt="memory.high when requests and limits are not set" >}}
+```formula
+memory.high = pod.spec.containers[i].resources.requests[memory] + MemoryThrottlingFactor *
+{(NodeAllocatableMemory) - pod.spec.containers[i].resources.requests[memory]}
+```

-3. **BestEffort** by their QoS definition do not require any memory or CPU limits or requests.
+1. **BestEffort** by their QoS definition do not require any memory or CPU limits or requests.
 For this case, Kubernetes sets requests.memory = 0 and substitutes limits.memory for node allocatable
 memory in the formula:

-{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high-best-effort.svg" title="memory.high for BestEffort Pod" alt="memory.high for BestEffort Pod" >}}
+```formula
+memory.high = MemoryThrottlingFactor * NodeAllocatableMemory
+```

 **Summary**: Only Pods in Burstable and BestEffort QoS classes will set `memory.high`.
 Guaranteed QoS pods do not set `memory.high` as their memory is guaranteed.
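
As a worked illustration of the hierarchy calculations above (the pod layout and values are assumed, not taken from the post), consider a node running a single Burstable pod with two containers that request 256Mi and 512Mi of memory:

```formula
container level:  memory.min = 256Mi and 512Mi for the respective container cgroups
pod level:        memory.min = 256Mi + 512Mi = 768Mi
node level:       memory.min = sum of the memory requests of all pods on the node
```
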
@@ -261,10 +284,10 @@ Guaranteed QoS pods do not set `memory.high` as their memory is guaranteed.

 The prerequisites for enabling Memory QoS feature on your Linux node are:

-1. Verify the [requirements](/docs/concepts/architecture/cgroups/#requirements)
+1. Verify the [requirements](/docs/concepts/architecture/cgroups/#requirements)
 related to [Kubernetes support for cgroups v2](/docs/concepts/architecture/cgroups)
-are met.
-2. Ensure CRI Runtime supports Memory QoS. At the time of writing, only containerd
+are met.
+1. Ensure CRI Runtime supports Memory QoS. At the time of writing, only containerd
 and CRI-O provide support compatible with Memory QoS (alpha). This was implemented
 in the following PRs:
 * Containerd: [Feature: containerd-cri support LinuxContainerResources.Unified #5627](https://github.com/containerd/containerd/pull/5627).
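
For readers who want to try out the feature described in this diff, the following kubelet configuration sketch shows where the settings discussed above live; it assumes a node that already meets the cgroups v2 requirements, and the 0.9 value simply restates the default mentioned earlier:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true              # alpha feature gate for Memory QoS
memoryThrottlingFactor: 0.9    # used to derive memory.high for Burstable/BestEffort pods
```
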
@@ -291,8 +314,9 @@ and review of this feature:
 * David Porter([bobbypage](https://github.com/bobbypage))
 * Mrunal Patel([mrunalp](https://github.com/mrunalp))

-For those interested in getting involved in future discussions on Memory QoS feature,
+For those interested in getting involved in future discussions on Memory QoS feature,
 you can reach out SIG Node by several means:
+
 - Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
 - [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
 - [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)
