
Commit 41a64bd

Tim Bannister authored
Merge pull request #39853 from ndixita/dev-1.27
Add Memory QOS Alpha 2 KEP 2570 blog post
2 parents 73437c2 + be1de2e commit 41a64bd

File tree

8 files changed: +1240, -0 lines changed

content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-best-effort.svg

Lines changed: 87 additions & 0 deletions

content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-limit.svg

Lines changed: 226 additions & 0 deletions

content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-no-limits.svg

Lines changed: 203 additions & 0 deletions

content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high.svg

Lines changed: 252 additions & 0 deletions

content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-max.svg

Lines changed: 86 additions & 0 deletions

content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-min.svg

Lines changed: 87 additions & 0 deletions
Lines changed: 298 additions & 0 deletions
@@ -0,0 +1,298 @@
---
layout: blog
title: 'Kubernetes 1.27: Quality-of-Service for Memory Resources (alpha)'
date: 2023-05-05
slug: qos-memory-resources
---

**Authors:** Dixita Narang (Google)

Kubernetes v1.27, released in April 2023, introduced changes to
Memory QoS (alpha) to improve memory management capabilities in Linux nodes.

Support for Memory QoS was initially added in Kubernetes v1.22, and later some
[limitations](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos#reasons-for-changing-the-formula-of-memoryhigh-calculation-in-alpha-v127)
around the formula for calculating `memory.high` were identified. These limitations are
addressed in Kubernetes v1.27.

## Background

Kubernetes allows you to optionally specify how much of each resource a container needs
in the Pod specification. The most common resources to specify are CPU and memory.

For example, a Pod manifest that defines container resource requirements could look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "64Mi"
        cpu: "500m"
```

* `spec.containers[].resources.requests`

  When you specify the resource request for containers in a Pod, the
  [Kubernetes scheduler](/docs/concepts/scheduling-eviction/kube-scheduler/#kube-scheduler)
  uses this information to decide which node to place the Pod on. The scheduler
  ensures that for each resource type, the sum of the resource requests of the
  scheduled containers is less than the total allocatable resources on the node.

* `spec.containers[].resources.limits`

  When you specify the resource limit for containers in a Pod, the kubelet enforces
  those limits so that the running containers are not allowed to use more of those
  resources than the limits you set.

When the kubelet starts a container as a part of a Pod, the kubelet passes the
container's requests and limits for CPU and memory to the container runtime.
The container runtime assigns both the CPU request and the CPU limit to a container.
Provided the system has free CPU time, the containers are guaranteed to be
allocated as much CPU as they request. Containers cannot use more CPU than
the configured limit; that is, a container's CPU usage will be throttled if it
uses more CPU than the specified limit within a given time slice.

Prior to the Memory QoS feature, the container runtime only used the memory
limit and discarded the memory `request` (requests were, and still are,
also used to influence [scheduling](/docs/concepts/scheduling-eviction/#scheduling)).
If a container uses more memory than the configured limit,
the Linux Out Of Memory (OOM) killer will be invoked.

Let's compare how the container runtime on Linux typically configures the memory
request and limit in cgroups, with and without the Memory QoS feature:

* **Memory request**

  The memory request is mainly used by kube-scheduler during (Kubernetes) Pod
  scheduling. In cgroups v1, there are no controls to specify the minimum amount
  of memory a cgroup must always retain. Hence, the container runtime did not
  use the value of the requested memory set in the Pod spec.

  cgroups v2 introduced a `memory.min` setting, used to specify the minimum
  amount of memory that should remain available to the processes within
  a given cgroup. If the memory usage of a cgroup is within its effective
  min boundary, the cgroup's memory won't be reclaimed under any conditions.
  If the kernel cannot maintain at least `memory.min` bytes of memory for the
  processes within the cgroup, the kernel invokes its OOM killer. In other words,
  the kernel guarantees at least this much memory is available or terminates
  processes (which may be outside the cgroup) in order to make memory more available.
  Memory QoS maps `memory.min` to `spec.containers[].resources.requests.memory`
  to ensure the availability of memory for containers in Kubernetes Pods.

* **Memory limit**

  The memory limit specifies an upper bound on memory usage; if the container tries
  to allocate memory beyond this limit, the Linux kernel will terminate a process with an
  OOM (Out of Memory) kill. If the terminated process was the main (or only) process
  inside the container, the container may exit.

  In cgroups v1, the `memory.limit_in_bytes` interface is used to set the memory usage limit.
  However, unlike CPU, it was not possible to apply memory throttling: as soon as a
  container crossed the memory limit, it would be OOM killed.

  In cgroups v2, `memory.max` is analogous to `memory.limit_in_bytes` in cgroups v1.
  Memory QoS maps `memory.max` to `spec.containers[].resources.limits.memory` to
  specify the hard limit for memory usage. If the memory consumption goes above this
  level, the kernel invokes its OOM killer.

  cgroups v2 also added the `memory.high` configuration. Memory QoS uses `memory.high`
  to set the memory usage throttle limit. If the `memory.high` limit is breached,
  the offending cgroups are throttled, and the kernel tries to reclaim memory,
  which may avoid an OOM kill.

## How it works

### Cgroups v2 memory controller interfaces & Kubernetes container resources mapping

Memory QoS uses the memory controller of cgroups v2 to guarantee memory resources in
Kubernetes. The cgroups v2 interfaces that this feature uses are:

* `memory.max`
* `memory.min`
* `memory.high`

{{< figure src="/blog/2023/05/05/qos-memory-resources/memory-qos-cal.svg" title="Memory QoS Levels" alt="Memory QoS Levels" >}}

`memory.max` is mapped to `limits.memory` specified in the Pod spec. The kubelet and
the container runtime configure the limit in the respective cgroup. The kernel
enforces the limit to prevent the container from using more than the configured
resource limit. If a process in a container tries to consume more than the
specified limit, the kernel terminates the process with an Out of Memory (OOM) error.

{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-max.svg" title="memory.max maps to limits.memory" alt="memory.max maps to limits.memory" >}}

`memory.min` is mapped to `requests.memory`, which results in reservation of memory resources
that should never be reclaimed by the kernel. This is how Memory QoS ensures the availability of
memory for Kubernetes pods. If there's no unprotected reclaimable memory available, the OOM
killer is invoked to make more memory available.

{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-min.svg" title="memory.min maps to requests.memory" alt="memory.min maps to requests.memory" >}}

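As an illustration of these two mappings (the numbers are hypothetical; 1Mi = 1048576 bytes),
a container that requests 64Mi of memory and is limited to 128Mi would end up with
container-level cgroup values along these lines:

```yaml
# Sketch only: the cgroups v2 values Memory QoS derives from a container's
# memory request and limit (illustrative numbers, shown in bytes).
resources:
  requests:
    memory: "64Mi"    # -> memory.min = 67108864
  limits:
    memory: "128Mi"   # -> memory.max = 134217728
```
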
For memory protection, in addition to the original way of limiting memory usage, Memory QoS
throttles a workload approaching its memory limit, ensuring that the system is not overwhelmed
by sporadic increases in memory usage. A new field, `memoryThrottlingFactor`, is available in
the KubeletConfiguration when you enable the MemoryQoS feature. It is set to 0.9 by default.
`memory.high` is mapped to a throttling limit calculated from `memoryThrottlingFactor`,
`requests.memory` and `limits.memory` as in the formula below, with the result rounded down
to the nearest page size:

{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high.svg" title="memory.high formula" alt="memory.high formula" >}}

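In text form, the calculation shown in the figure (described in more detail in the
[KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos)) is roughly:

$memory.high = \lfloor (requests.memory + memoryThrottlingFactor \times (limits.memory - requests.memory)) / pageSize \rfloor \times pageSize$
<sub>with node allocatable memory used in place of limits.memory when the container has no memory limit</sub>
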
**Note**: If a container has no memory limit specified, node allocatable memory is substituted
for `limits.memory` in the formula.

**Summary:**
<table>
  <tr>
    <th style="text-align:center">File</th>
    <th style="text-align:center">Description</th>
  </tr>
  <tr>
    <td>memory.max</td>
    <td><code>memory.max</code> specifies the maximum memory limit
    a container is allowed to use. If a process within the container
    tries to consume more memory than the configured limit,
    the kernel terminates the process with an Out of Memory (OOM) error.
    <br>
    <br>
    <i>It is mapped to the container's memory limit specified in the Pod manifest.</i>
    </td>
  </tr>
  <tr>
    <td>memory.min</td>
    <td><code>memory.min</code> specifies a minimum amount of memory
    the cgroup must always retain, i.e., memory that should never be
    reclaimed by the system.
    If there's no unprotected reclaimable memory available, the OOM killer is invoked.
    <br>
    <br>
    <i>It is mapped to the container's memory request specified in the Pod manifest.</i>
    </td>
  </tr>
  <tr>
    <td>memory.high</td>
    <td><code>memory.high</code> specifies the memory usage throttle limit.
    This is the main mechanism to control a cgroup's memory use. If a
    cgroup's memory use goes over the high boundary specified here,
    the cgroup's processes are throttled and put under heavy reclaim pressure.
    <br>
    <br>
    <i>Kubernetes uses a formula to calculate <code>memory.high</code>,
    depending on the container's memory request, its memory limit or node allocatable memory
    (if the container's memory limit is empty), and a throttling factor.
    Please refer to the <a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos">KEP</a>
    for more details on the formula.</i>
    </td>
  </tr>
</table>

**Note**: `memory.high` is set only on container-level cgroups, while `memory.min` is set on
container-, pod-, and node-level cgroups.

### `memory.min` calculations for the cgroups hierarchy

When container memory requests are made, the kubelet passes `memory.min` to the back-end
CRI runtime (such as containerd or CRI-O) via the `Unified` field in CRI during
container creation. The `memory.min` in container-level cgroups will be set to:

$memory.min = pod.spec.containers[i].resources.requests[memory]$
<sub>for every i<sup>th</sup> container in a pod</sub>
<br>
<br>
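As an example of what this looks like on the wire (a sketch only: the exact message shape depends
on the CRI runtime, and the value is simply a hypothetical 64Mi request expressed in bytes), the
per-container cgroups v2 settings travel in the CRI `Unified` map of cgroup file names to values:

```yaml
# Sketch of the CRI LinuxContainerResources "Unified" map the kubelet sends
# for a container that requests 64Mi of memory (values are strings, in bytes).
unified:
  memory.min: "67108864"   # 64Mi
```
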
Since the `memory.min` interface requires that the ancestor cgroup directories are all
set, the pod and node cgroup directories need to be set correctly.

`memory.min` in the pod level cgroup:
$memory.min = \sum_{i}^{no. of containers}pod.spec.containers[i].resources.requests[memory]$
<sub>for every i<sup>th</sup> container in a pod</sub>
<br>
<br>
`memory.min` in the node level cgroup:
$memory.min = \sum_{i}^{no. of pods}\sum_{j}^{no. of containers}pod[i].spec.containers[j].resources.requests[memory]$
<sub>for every j<sup>th</sup> container in every i<sup>th</sup> pod on a node</sub>
<br>
<br>
The kubelet will manage the cgroups hierarchy of the pod level and node level cgroups
directly using the libcontainer library (from the runc project), while container
cgroup limits are managed by the container runtime.

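As a rough sketch of that hierarchy (the actual paths depend on the cgroup driver and the
container runtime; the systemd-style names below are only illustrative):

```
/sys/fs/cgroup/kubepods.slice/                      # node-level cgroup: memory.min aggregated across pods
└── kubepods-burstable.slice/
    └── kubepods-burstable-pod<UID>.slice/          # pod-level cgroup: memory.min summed over its containers
        └── cri-containerd-<container-id>.scope/    # container-level cgroup: memory.min, memory.high, memory.max
```
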
### Support for Pod QoS classes

Based on user feedback for the alpha feature in Kubernetes v1.22, some users would like
to opt out of Memory QoS on a per-pod basis to ensure there is no early memory throttling.
Therefore, in Kubernetes v1.27, Memory QoS also supports setting `memory.high` based on a
Pod's Quality of Service (QoS) class. The following are the different cases for `memory.high`
per QoS class (worked examples follow this list):

1. **Guaranteed pods**, by their QoS definition, require memory requests to equal memory limits
   and are not overcommitted. Hence, the Memory QoS feature is disabled on those pods by not
   setting `memory.high`. This ensures that Guaranteed pods can fully use their memory requests
   up to their set limit, and not hit any throttling.

2. **Burstable pods**, by their QoS definition, require at least one container in the Pod with
   a CPU or memory request or limit set.

   * When requests.memory and limits.memory are set, the formula is used as-is:

     {{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high-limit.svg" title="memory.high when requests and limits are set" alt="memory.high when requests and limits are set" >}}

   * When requests.memory is set and limits.memory is not set, node allocatable memory is
     substituted for limits.memory in the formula:

     {{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high-no-limits.svg" title="memory.high when limits are not set" alt="memory.high when limits are not set" >}}

3. **BestEffort pods**, by their QoS definition, do not require any memory or CPU limits or
   requests. For this case, Kubernetes sets requests.memory = 0 and substitutes node allocatable
   memory for limits.memory in the formula:

   {{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high-best-effort.svg" title="memory.high for BestEffort Pod" alt="memory.high for BestEffort Pod" >}}

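To make these cases concrete, here are some hypothetical worked examples that assume the default
`memoryThrottlingFactor` of 0.9 (in practice the result is also rounded down to the nearest page size):

* A Burstable container with requests.memory = 64Mi and limits.memory = 128Mi:
  $memory.high = 64Mi + 0.9 \times (128Mi - 64Mi) = 121.6Mi$
* A Burstable container with requests.memory = 64Mi and no memory limit, on a node with
  8Gi (8192Mi) of allocatable memory:
  $memory.high = 64Mi + 0.9 \times (8192Mi - 64Mi) = 7379.2Mi$
* A BestEffort container (requests.memory = 0) on the same node:
  $memory.high = 0 + 0.9 \times 8192Mi = 7372.8Mi$
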
**Summary**: Only Pods in the Burstable and BestEffort QoS classes will set `memory.high`.
Guaranteed QoS pods do not set `memory.high`, as their memory is guaranteed.

## How do I use it?

The prerequisites for enabling the Memory QoS feature on your Linux node are:

1. Verify the [requirements](/docs/concepts/architecture/cgroups/#requirements)
   related to [Kubernetes support for cgroups v2](/docs/concepts/architecture/cgroups)
   are met.
2. Ensure the CRI runtime supports Memory QoS. At the time of writing, only containerd
   and CRI-O provide support compatible with Memory QoS (alpha). This was implemented
   in the following PRs:
   * containerd: [Feature: containerd-cri support LinuxContainerResources.Unified #5627](https://github.com/containerd/containerd/pull/5627).
   * CRI-O: [implement kube alpha features for 1.22 #5207](https://github.com/cri-o/cri-o/pull/5207).

Memory QoS remains an alpha feature for Kubernetes v1.27. You can enable the feature by setting
the `MemoryQoS` feature gate to `true` in the kubelet configuration file:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
```
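If you manage the kubelet configuration yourself, this is typically the file passed to the kubelet
via its `--config` flag. Once the kubelet restarts with the feature gate enabled, newly created
containers on that node are configured with the `memory.min`, `memory.high`, and `memory.max`
values described earlier in this post.
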
## How do I get involved?

A huge thank you to all the contributors who helped with the design, implementation,
and review of this feature:

* Dixita Narang ([ndixita](https://github.com/ndixita))
* Tim Xu ([xiaoxubeii](https://github.com/xiaoxubeii))
* Paco Xu ([pacoxu](https://github.com/pacoxu))
* David Porter ([bobbypage](https://github.com/bobbypage))
* Mrunal Patel ([mrunalp](https://github.com/mrunalp))

For those interested in getting involved in future discussions on the Memory QoS feature,
you can reach out to SIG Node through several channels:

- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)

content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/memory-qos-cal.svg

Lines changed: 1 addition & 0 deletions
