Commit 9383dd8

Merge pull request #28958 from rajula96reddy/memory-manager
Add memory manager moves to beta feature blog post 1.22
2 parents 7c2e229 + a783b05 commit 9383dd8

4 files changed (+153 −0)

---
layout: blog
title: "Kubernetes Memory Manager moves to beta"
date: 2021-08-11
slug: kubernetes-1-22-feature-memory-manager-moves-to-beta
---

**Authors:** Artyom Lukianov (Red Hat), Cezary Zukowski (Samsung)

This blog post explains some of the internals of the _Memory Manager_, a beta feature
of Kubernetes 1.22. In Kubernetes, the Memory Manager is a
[kubelet](https://kubernetes.io/docs/concepts/overview/components/#kubelet) subcomponent.
The Memory Manager provides guaranteed memory (and hugepages)
allocation for pods in the `Guaranteed` [QoS class](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#qos-classes).

This blog post covers:

1. [Why do you need it?](#why-do-you-need-it)
2. [The internal details of how the **MemoryManager** works](#how-does-it-work)
3. [Current limitations of the **MemoryManager**](#current-limitations)
4. [Future work for the **MemoryManager**](#future-work-for-the-memory-manager)

## Why do you need it?

Some Kubernetes workloads run on nodes with
[non-uniform memory access](https://en.wikipedia.org/wiki/Non-uniform_memory_access) (NUMA).
Suppose you have NUMA nodes in your cluster. In that case, you'll know about the potential for extra latency when
compute resources need to access memory on a different NUMA locality.

To get the best performance and latency for your workload, container CPUs,
peripheral devices, and memory should all be aligned to the same NUMA
locality.
Before Kubernetes v1.22, the kubelet already provided a set of managers to
align CPUs and PCI devices, but you did not have a way to align memory.
The Linux kernel could make a best-effort attempt to allocate
memory for tasks from the same NUMA node where the container is
executing, but without any guarantee about that placement.

## How does it work?

The Memory Manager does two main things:

- it provides topology hints to the Topology Manager
- it allocates memory for containers and updates the state

The overall sequence of the Memory Manager under the kubelet:

![MemoryManagerDiagram](/images/blog/2021-08-11-memory-manager-moves-to-beta/MemoryManagerDiagram.svg "MemoryManagerDiagram")

During the Admission phase:

1. When first handling a new pod, the kubelet calls the Topology Manager's `Admit()` method.
2. The Topology Manager calls `GetTopologyHints()` for every hint provider, including the Memory Manager.
3. The Memory Manager calculates all possible NUMA node combinations for every container inside the pod and returns the hints to the Topology Manager.
4. The Topology Manager calls `Allocate()` for every hint provider, including the Memory Manager.
5. The Memory Manager allocates memory in its internal state according to the hint that the Topology Manager chose.

During Pod creation:

1. The kubelet calls `PreCreateContainer()`.
2. For each container, the Memory Manager looks up the NUMA nodes where it allocated memory for the container and then returns that information to the kubelet.
3. The kubelet creates the container, via CRI, using a container specification that incorporates the Memory Manager's NUMA affinity information.

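To make this sequence more concrete, here is a minimal, hypothetical pod specification that would go through the flow above. The pod name, image, and resource quantities are purely illustrative; the important part is that every container sets limits equal to requests (placing the pod in the `Guaranteed` QoS class) and asks for both memory and hugepages, so the Memory Manager produces NUMA hints for it and records the allocation in its state.

```yaml
# Hypothetical example, for illustration only: a Guaranteed QoS pod that
# requests both regular memory and 1Gi hugepages. The node must have 1Gi
# hugepages pre-allocated for this pod to schedule.
apiVersion: v1
kind: Pod
metadata:
  name: numa-aligned-app      # illustrative name
spec:
  containers:
  - name: app
    image: nginx              # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: 2Gi
        hugepages-1Gi: 2Gi
      limits:                 # limits equal requests => Guaranteed QoS class
        cpu: "2"
        memory: 2Gi
        hugepages-1Gi: 2Gi
    volumeMounts:
    - name: hugepages
      mountPath: /hugepages-1Gi
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages       # backs the hugepages-1Gi request
```

With the `Static` policy described in the next section, the Memory Manager accounts for this pod's memory and hugepages against a specific set of NUMA nodes.
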
### Let's talk about the configuration

By default, the Memory Manager runs with the `None` policy, meaning it will just
relax and not do anything. To make use of the Memory Manager, you should set
two command line options for the kubelet:

- `--memory-manager-policy=Static`
- `--reserved-memory="<numaNodeID>:<resourceName>=<quantity>"`

The value for `--memory-manager-policy` is straightforward: `Static`. Deciding what to specify for `--reserved-memory` takes more thought. To configure it correctly, you should follow two main rules:

- The amount of reserved memory for the `memory` resource must be greater than zero.
- The amount of reserved memory for a resource type must be equal to the node's
  overall reservation for that resource, `kube-reserved + system-reserved + eviction-hard`
  (see [Node Allocatable](/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable)).

You can read more about memory reservations in [Reserve Compute Resources for System Daemons](/docs/tasks/administer-cluster/reserve-compute-resources/).

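As an example (all quantities here are illustrative), suppose a node reserves 500Mi of memory through `kube-reserved`, another 500Mi through `system-reserved`, and uses a 100Mi hard eviction threshold; the memory reserved via `--reserved-memory` must then total 1100Mi. A sketch of the equivalent settings in a kubelet configuration file might look like the following; the matching command-line flags are shown in comments, and a Topology Manager policy is included because the Memory Manager is typically paired with a policy other than `none` so that its hints are actually enforced.

```yaml
# Illustrative KubeletConfiguration sketch; all quantities are examples.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Node memory reservations: 500Mi + 500Mi + 100Mi = 1100Mi in total.
kubeReserved:
  memory: "500Mi"
systemReserved:
  memory: "500Mi"
evictionHard:
  memory.available: "100Mi"
# Equivalent to --memory-manager-policy=Static
memoryManagerPolicy: Static
# Equivalent to --reserved-memory="0:memory=1100Mi"; the total must match
# the reservations above.
reservedMemory:
- numaNode: 0
  limits:
    memory: "1100Mi"
# The Memory Manager is commonly combined with a non-"none" Topology
# Manager policy, such as single-numa-node.
topologyManagerPolicy: single-numa-node
```
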
![Reserved memory](/images/blog/2021-08-11-memory-manager-moves-to-beta/ReservedMemory.svg)

## Current limitations

The 1.22 release and the promotion to beta bring along enhancements and fixes, but the Memory Manager still has several limitations.

### Single vs Cross NUMA node allocation

A NUMA node cannot have both single and cross NUMA node allocations. When a container's memory is pinned to two or more NUMA nodes, we cannot know from which NUMA node the container will consume the memory.

![Single vs Cross NUMA allocation](/images/blog/2021-08-11-memory-manager-moves-to-beta/SingleCrossNUMAAllocation.svg "SingleCrossNUMAAllocation")

1. `container1` starts on NUMA node 0 and requests *5Gi* of memory, but currently consumes only *3Gi* of it.
2. For `container2`, the memory request is *10Gi*, and no single NUMA node can satisfy it.
3. `container2` consumes *3.5Gi* of memory from NUMA node 0, but once `container1` requires more memory, it will not have it, and the kernel will kill one of the containers with an *OOM* error.

To prevent such issues, the Memory Manager will fail the admission of `container2` until the machine has two NUMA nodes without a single NUMA node allocation.

### Works only for Guaranteed pods

The Memory Manager cannot guarantee memory allocation for Burstable pods,
even when the Burstable pod has specified equal memory limit and request.

Let's assume you have two Burstable pods: `pod1` has containers with
equal memory request and limits, and `pod2` has containers only with a
memory request set. You want to guarantee memory allocation for `pod1`.
To the Linux kernel, processes in either pod have the same *OOM score*,
so once the kernel finds that it does not have enough memory, it can kill
processes that belong to `pod1`.

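A hypothetical sketch of those two pods may make the distinction clearer. The names, image, and quantities are illustrative; `pod1` is assumed to set no CPU limit, which is what keeps it in the Burstable class even though its memory request equals its memory limit.

```yaml
# pod1: the memory request equals the memory limit, but no CPU limit is
# set, so the pod is Burstable and gets no memory allocation guarantee.
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  containers:
  - name: app
    image: nginx        # placeholder image
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        memory: 2Gi
---
# pod2: only a memory request is set; this pod is Burstable as well.
apiVersion: v1
kind: Pod
metadata:
  name: pod2
spec:
  containers:
  - name: app
    image: nginx        # placeholder image
    resources:
      requests:
        memory: 2Gi
```

To get guaranteed, NUMA-aware memory allocation, every container in a pod has to set limits equal to requests for both CPU and memory, which places the pod in the `Guaranteed` QoS class.
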
### Memory fragmentation

The sequence of Pods and containers that start and stop can fragment the memory on NUMA nodes.
The current implementation of the Memory Manager does not have any mechanism to balance pods and defragment memory back.

## Future work for the Memory Manager

We do not want to stop with the current state of the Memory Manager and are looking to
make improvements, including in the following areas.

### Make the Memory Manager allocation algorithm smarter

The current algorithm ignores distances between NUMA nodes during the
calculation of the allocation. If same-node placement isn't available, we can still
provide better performance compared to the current implementation by changing the
Memory Manager to prefer the closest NUMA nodes for cross-node allocation.

### Reduce the number of admission errors

The default Kubernetes scheduler is not aware of the node's NUMA topology, which can cause many pod admission errors at startup.
We're hoping to add a KEP (Kubernetes Enhancement Proposal) to cover improvements in this area.
Follow [Topology aware scheduler plugin in kube-scheduler](https://github.com/kubernetes/enhancements/issues/2044) to see how this idea progresses.

## Conclusion

With the promotion of the Memory Manager to beta in 1.22, we encourage everyone to give it a try and look forward to any feedback you may have. While there are still several limitations, we have a set of enhancements planned to address them and look forward to providing you with many new features in upcoming releases.
If you have ideas for additional enhancements or a desire for certain features, please let us know. The team is always open to suggestions to enhance and improve the Memory Manager.
We hope you have found this blog informative and helpful! Let us know if you have any questions or comments.

You can contact us via:

- The Kubernetes [#sig-node](https://kubernetes.slack.com/messages/sig-node)
  channel in Slack (visit https://slack.k8s.io/ for an invitation if you need one)
- The SIG Node mailing list, [[email protected]](https://groups.google.com/g/kubernetes-sig-node)
