---
layout: blog
title: 'Kubernetes 1.27: Quality-of-Service for Memory Resources (alpha)'
date: 2023-05-05
slug: qos-memory-resources
---

**Authors:** Dixita Narang (Google)

Kubernetes v1.27, released in April 2023, introduced changes to
Memory QoS (alpha) to improve memory management capabilities on Linux nodes.

Support for Memory QoS was initially added in Kubernetes v1.22, and later some
[limitations](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos#reasons-for-changing-the-formula-of-memoryhigh-calculation-in-alpha-v127)
around the formula for calculating `memory.high` were identified. These limitations are
addressed in Kubernetes v1.27.

## Background

Kubernetes allows you to optionally specify how much of each resource a container needs
in the Pod specification. The most common resources to specify are CPU and memory.

For example, a Pod manifest that defines container resource requirements could look like:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "64Mi"
        cpu: "500m"
```

* `spec.containers[].resources.requests`

  When you specify the resource request for containers in a Pod, the
  [Kubernetes scheduler](/docs/concepts/scheduling-eviction/kube-scheduler/#kube-scheduler)
  uses this information to decide which node to place the Pod on. The scheduler
  ensures that for each resource type, the sum of the resource requests of the
  scheduled containers is less than the total allocatable resources on the node.

* `spec.containers[].resources.limits`

  When you specify the resource limit for containers in a Pod, the kubelet enforces
  those limits so that the running containers are not allowed to use more of those
  resources than the limits you set.

When the kubelet starts a container as a part of a Pod, the kubelet passes the
container's requests and limits for CPU and memory to the container runtime.
The container runtime assigns both the CPU request and the CPU limit to a container.
Provided the system has free CPU time, the containers are guaranteed to be
allocated as much CPU as they request. Containers cannot use more CPU than
the configured limit, i.e. a container's CPU usage will be throttled if it
uses more CPU than the specified limit within a given time slice.

Prior to the Memory QoS feature, the container runtime only used the memory
limit and discarded the memory `request` (requests were, and still are,
also used to influence [scheduling](/docs/concepts/scheduling-eviction/#scheduling)).
If a container uses more memory than the configured limit,
the Linux Out Of Memory (OOM) killer will be invoked.

Let's compare how the container runtime on Linux typically configures the memory
request and limit in cgroups, with and without the Memory QoS feature (a short
worked example follows the comparison):

* **Memory request**

  The memory request is mainly used by kube-scheduler during (Kubernetes) Pod
  scheduling. In cgroups v1, there are no controls to specify the minimum amount
  of memory the cgroups must always retain. Hence, the container runtime did not
  use the value of requested memory set in the Pod spec.

  cgroups v2 introduced a `memory.min` setting, used to specify the minimum
  amount of memory that should remain available to the processes within
  a given cgroup. If the memory usage of a cgroup is within its effective
  min boundary, the cgroup’s memory won’t be reclaimed under any conditions.
  If the kernel cannot maintain at least `memory.min` bytes of memory for the
  processes within the cgroup, the kernel invokes its OOM killer. In other words,
  the kernel guarantees at least this much memory is available or terminates
  processes (which may be outside the cgroup) in order to make memory more available.
  Memory QoS maps `memory.min` to `spec.containers[].resources.requests.memory`
  to ensure the availability of memory for containers in Kubernetes Pods.

* **Memory limit**

  The memory limit specifies the amount of memory beyond which, if the container
  tries to allocate more, the Linux kernel will terminate a process with an
  OOM (Out of Memory) kill. If the terminated process was the main (or only) process
  inside the container, the container may exit.

  In cgroups v1, the `memory.limit_in_bytes` interface is used to set the memory usage limit.
  However, unlike CPU, it was not possible to apply memory throttling: as soon as a
  container crossed the memory limit, it would be OOM killed.

  In cgroups v2, `memory.max` is analogous to `memory.limit_in_bytes` in cgroups v1.
  Memory QoS maps `memory.max` to `spec.containers[].resources.limits.memory` to
  specify the hard limit for memory usage. If the memory consumption goes above this
  level, the kernel invokes its OOM killer.

  cgroups v2 also added a `memory.high` configuration. Memory QoS uses `memory.high`
  to set a memory usage throttle limit. If the `memory.high` limit is breached,
  the offending cgroups are throttled, and the kernel tries to reclaim memory,
  which may avoid an OOM kill.

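To make these mappings concrete, consider the example manifest from the Background
section, which sets a 64Mi memory request and a 64Mi memory limit. With the Memory QoS
feature enabled, the container runtime would configure the container's cgroup v2
interface files roughly as follows (the raw files hold byte values, so 64Mi is written
as 67108864):

$memory.min = requests.memory = 64Mi$
<br>
$memory.max = limits.memory = 64Mi$
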
## How it works

### Cgroups v2 memory controller interfaces & Kubernetes container resources mapping

Memory QoS uses the memory controller of cgroups v2 to guarantee memory resources in
Kubernetes. The cgroups v2 interfaces that this feature uses are:
* `memory.max`
* `memory.min`
* `memory.high`

{{< figure src="/blog/2023/05/05/qos-memory-resources/memory-qos-cal.svg" title="Memory QoS Levels" alt="Memory QoS Levels" >}}

`memory.max` is mapped to `limits.memory` specified in the Pod spec. The kubelet and
the container runtime configure the limit in the respective cgroup. The kernel
enforces the limit to prevent the container from using more than the configured
resource limit. If a process in a container tries to consume more than the
specified limit, the kernel terminates a process (or processes) with an
Out of Memory (OOM) error.

{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-max.svg" title="memory.max maps to limits.memory" alt="memory.max maps to limits.memory" >}}

`memory.min` is mapped to `requests.memory`, which results in reservation of memory resources
that should never be reclaimed by the kernel. This is how Memory QoS ensures the availability of
memory for Kubernetes pods. If there's no unprotected reclaimable memory available, the OOM
killer is invoked to make more memory available.

{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-min.svg" title="memory.min maps to requests.memory" alt="memory.min maps to requests.memory" >}}

For memory protection, in addition to the original way of limiting memory usage, Memory QoS
throttles a workload approaching its memory limit, ensuring that the system is not overwhelmed
by sporadic increases in memory usage. A new field, `memoryThrottlingFactor`, is available in
the KubeletConfiguration when you enable the Memory QoS feature. It is set to 0.9 by default.
`memory.high` is mapped to a throttling limit calculated by using `memoryThrottlingFactor`,
`requests.memory` and `limits.memory` as in the formula below, and rounding down the
value to the nearest page size:

{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high.svg" title="memory.high formula" alt="memory.high formula" >}}

**Note**: If a container has no memory limit specified, node allocatable memory is substituted for `limits.memory` in the formula.
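
For a rough sense of the numbers, here is a hypothetical worked example, assuming the formula
$memory.high = requests.memory + memoryThrottlingFactor \times (limits.memory - requests.memory)$
described in the
[KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos)
and shown in the figure above. For a container with a 100Mi memory request, a 500Mi memory limit,
and the default `memoryThrottlingFactor` of 0.9:

$memory.high = 100Mi + 0.9 \times (500Mi - 100Mi) = 460Mi$
<sub>rounded down to the nearest page size</sub>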

**Summary:**
<table>
  <tr>
    <th style="text-align:center">File</th>
    <th style="text-align:center">Description</th>
  </tr>
  <tr>
    <td>memory.max</td>
    <td><code>memory.max</code> specifies the maximum memory limit
     a container is allowed to use. If a process within the container
     tries to consume more memory than the configured limit,
     the kernel terminates the process with an Out of Memory (OOM) error.
     <br>
     <br>
     <i>It is mapped to the container's memory limit specified in the Pod manifest.</i>
    </td>
  </tr>
  <tr>
    <td>memory.min</td>
    <td><code>memory.min</code> specifies a minimum amount of memory
     the cgroup must always retain, i.e., memory that should never be
     reclaimed by the system.
     If there's no unprotected reclaimable memory available, the OOM killer is invoked.
     <br>
     <br>
     <i>It is mapped to the container's memory request specified in the Pod manifest.</i>
    </td>
  </tr>
  <tr>
    <td>memory.high</td>
    <td><code>memory.high</code> specifies the memory usage throttle limit.
     This is the main mechanism to control a cgroup's memory use. If
     a cgroup's memory use goes over the high boundary specified here,
     the cgroup's processes are throttled and put under heavy reclaim pressure.
     <br>
     <br>
     <i>Kubernetes uses a formula to calculate <code>memory.high</code>,
     depending on the container's memory request, its memory limit or node allocatable memory
     (if the container's memory limit is empty), and a throttling factor.
     Please refer to the <a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos">KEP</a>
     for more details on the formula.</i>
    </td>
  </tr>
</table>

**Note**: `memory.high` is set only on container-level cgroups, while `memory.min` is set on
container-, pod-, and node-level cgroups.

### `memory.min` calculations for the cgroups hierarchy

When container memory requests are made, the kubelet passes `memory.min` to the back-end
CRI runtime (such as containerd or CRI-O) via the `Unified` field in CRI during
container creation. The `memory.min` in container-level cgroups will be set to:

$memory.min = pod.spec.containers[i].resources.requests[memory]$
<sub>for every i<sup>th</sup> container in a pod</sub>
<br>
<br>
Since the `memory.min` interface requires that the ancestor cgroup directories are all
set, the pod and node cgroup directories need to be set correctly.

`memory.min` in the pod-level cgroup:
$memory.min = \sum_{i=0}^{no. of containers}pod.spec.containers[i].resources.requests[memory]$
<sub>for every i<sup>th</sup> container in a pod</sub>
<br>
<br>
`memory.min` in the node-level cgroup:
$memory.min = \sum_{i}^{no. of pods}\sum_{j}^{no. of containers}pod[i].spec.containers[j].resources.requests[memory]$
<sub>for every j<sup>th</sup> container in every i<sup>th</sup> pod on a node</sub>
<br>
<br>
The kubelet manages the pod-level and node-level cgroups in this hierarchy
directly using the libcontainer library (from the runc project), while container
cgroup limits are managed by the container runtime.
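
As a small, hypothetical example of how these values roll up: for a pod with two containers
requesting 64Mi and 128Mi of memory, the container-level cgroups get `memory.min` values of
64Mi and 128Mi respectively, and the pod-level cgroup gets

$memory.min = 64Mi + 128Mi = 192Mi$

If two such pods were the only pods running on a node, the node-level cgroup would get
$memory.min = 2 \times 192Mi = 384Mi$, protecting the sum of all container memory requests
on the node from reclaim.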

### Support for Pod QoS classes

Based on user feedback about the alpha feature in Kubernetes v1.22, some users would like
to opt out of Memory QoS on a per-pod basis to ensure there is no early memory throttling.
Therefore, in Kubernetes v1.27 Memory QoS also supports setting `memory.high` based on
the Quality of Service (QoS) class of the Pod. The following are the different cases for
`memory.high`, per QoS class:

1. **Guaranteed pods** by their QoS definition require memory requests to be equal to memory
limits and are not overcommitted. Hence, the Memory QoS feature is disabled on those pods
by not setting `memory.high`. This ensures that Guaranteed pods can fully use their memory
requests up to their set limit, and not hit any throttling.

2. **Burstable pods** by their QoS definition require at least one container in the Pod with
a CPU or memory request or limit set.

    * When `requests.memory` and `limits.memory` are set, the formula is used as-is:

      {{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high-limit.svg" title="memory.high when requests and limits are set" alt="memory.high when requests and limits are set" >}}

    * When `requests.memory` is set and `limits.memory` is not set, node allocatable memory
      is substituted for `limits.memory` in the formula:

      {{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high-no-limits.svg" title="memory.high when requests and limits are not set" alt="memory.high when requests and limits are not set" >}}

3. **BestEffort pods** by their QoS definition do not require any memory or CPU limits or requests.
   For this case, Kubernetes sets `requests.memory = 0` and substitutes node allocatable memory
   for `limits.memory` in the formula:

   {{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high-best-effort.svg" title="memory.high for BestEffort Pod" alt="memory.high for BestEffort Pod" >}}

**Summary:** Only Pods in the Burstable and BestEffort QoS classes will have `memory.high` set.
Guaranteed QoS pods do not set `memory.high` as their memory is guaranteed. A minimal example
of the difference is sketched below.
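
As an illustration, here is a minimal sketch of two hypothetical Pods. The first is in the
Guaranteed QoS class (requests equal limits for every resource), so with Memory QoS enabled
`memory.high` would not be set for its container; the second is Burstable, so `memory.high`
would be derived from its request and limit using the formula above. The names and values are
made up for this example.

```yaml
# Guaranteed QoS: requests equal limits, so memory.high is not set.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "256Mi"
        cpu: "500m"
      limits:
        memory: "256Mi"
        cpu: "500m"
---
# Burstable QoS: the limit exceeds the request, so memory.high is calculated
# from requests.memory, limits.memory and memoryThrottlingFactor.
apiVersion: v1
kind: Pod
metadata:
  name: burstable-example
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "128Mi"
      limits:
        memory: "512Mi"
```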

## How do I use it?

The prerequisites for enabling the Memory QoS feature on your Linux node are:

1. Verify the [requirements](/docs/concepts/architecture/cgroups/#requirements)
   related to [Kubernetes support for cgroups v2](/docs/concepts/architecture/cgroups)
   are met.
2. Ensure that the CRI runtime supports Memory QoS. At the time of writing, only containerd
   and CRI-O provide support compatible with Memory QoS (alpha). This was implemented
   in the following PRs:
   * containerd: [Feature: containerd-cri support LinuxContainerResources.Unified #5627](https://github.com/containerd/containerd/pull/5627).
   * CRI-O: [implement kube alpha features for 1.22 #5207](https://github.com/cri-o/cri-o/pull/5207).

Memory QoS remains an alpha feature for Kubernetes v1.27. You can enable the feature by setting
the `MemoryQoS` feature gate to `true` in the kubelet configuration file:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
```
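
The `memoryThrottlingFactor` field described earlier is also part of the kubelet configuration.
As a sketch (assuming you want to override the 0.9 default; otherwise the field can be omitted),
it sits alongside the feature gate:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
# Used to calculate the memory.high throttling limit (see the formula above);
# 0.9 is the default.
memoryThrottlingFactor: 0.8
```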

## How do I get involved?

Huge thank you to all the contributors who helped with the design, implementation,
and review of this feature:

* Dixita Narang ([ndixita](https://github.com/ndixita))
* Tim Xu ([xiaoxubeii](https://github.com/xiaoxubeii))
* Paco Xu ([pacoxu](https://github.com/pacoxu))
* David Porter ([bobbypage](https://github.com/bobbypage))
* Mrunal Patel ([mrunalp](https://github.com/mrunalp))

For those interested in getting involved in future discussions on the Memory QoS feature,
you can reach out to SIG Node through several means:
- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)