| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: "Kubernetes 1.28: Beta support for using swap on Linux" |
| 4 | +date: 2023-08-24T10:00:00-08:00 |
| 5 | +slug: swap-linux-beta |
| 6 | +--- |
| 7 | + |
| 8 | +**Author:** Itamar Holder (Red Hat) |
| 9 | + |
| 10 | +The 1.22 release [introduced Alpha support](/blog/2021/08/09/run-nodes-with-swap-alpha/) |
| 11 | +for configuring swap memory usage for Kubernetes workloads running on Linux on a per-node basis. |
| 12 | +Now, in release 1.28, support for swap on Linux nodes has graduated to Beta, along with many |
| 13 | +new improvements. |
| 14 | + |
| 15 | +Prior to version 1.22, Kubernetes did not provide support for swap memory on Linux systems. |
| 16 | +This was due to the inherent difficulty in guaranteeing and accounting for pod memory utilization |
| 17 | +when swap memory was involved. As a result, swap support was deemed out of scope in the initial |
| 18 | +design of Kubernetes, and the default behavior of a kubelet was to fail to start if swap memory |
| 19 | +was detected on a node. |
| 20 | + |
| 21 | +In version 1.22, the swap feature for Linux was initially introduced in its Alpha stage. This represented |
| 22 | +a significant advancement, providing Linux users with the opportunity to experiment with the swap |
| 23 | +feature for the first time. However, as an Alpha version, it was not fully developed and had |
| 24 | +several issues, including inadequate support for cgroup v2, insufficient metrics and summary |
| 25 | +API statistics, inadequate testing, and more. |
| 26 | + |
| 27 | +Swap in Kubernetes has numerous [use cases](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#user-stories) |
| 28 | +for a wide range of users. As a result, the node special interest group within the Kubernetes project |
| 29 | +has invested significant effort into supporting swap on Linux nodes for beta. |
| 30 | +Compared to the alpha, the kubelet's support for running with swap enabled is more stable and |
| 31 | +robust, more user-friendly, and addresses many known shortcomings. This graduation to beta |
| 32 | +represents a crucial step towards achieving the goal of fully supporting swap in Kubernetes. |
| 33 | + |
| 34 | +## How do I use it? |
| 35 | + |
| 35 | +To use swap memory on a node where it has already been provisioned, |
| 36 | +enable the `NodeSwap` feature gate on the kubelet. |
| 37 | +You must also set the `failSwapOn` configuration setting to `false`, or disable the deprecated |
| 38 | +`--fail-swap-on` command line flag. |
| 40 | + |
| 40 | +You can configure the `memorySwap.swapBehavior` option to define how a node uses swap memory. For instance: |
| 42 | + |
| 43 | +```yaml |
| 44 | +# this fragment goes into the kubelet's configuration file |
| 45 | +memorySwap: |
| 46 | + swapBehavior: UnlimitedSwap |
| 47 | +``` |
| 48 | + |
| 49 | +The available configuration options for `swapBehavior` are: |
| 50 | +- `UnlimitedSwap` (default): Kubernetes workloads can use as much swap memory as they |
| 51 | + request, up to the system limit. |
| 52 | +- `LimitedSwap`: Kubernetes workloads are limited in how much swap they can use. |
| 53 | + Only Pods of [Burstable](/docs/concepts/workloads/pods/pod-qos/#burstable) QoS are permitted to use swap. |
| 54 | + |
| 55 | +If configuration for `memorySwap` is not specified and the feature gate is |
| 56 | +enabled, by default the kubelet will apply the same behaviour as the |
| 57 | +`UnlimitedSwap` setting. |
| 58 | + |
| 59 | +Note that `NodeSwap` is supported for **cgroup v2** only. For Kubernetes v1.28, |
| 60 | +using swap along with cgroup v1 is no longer supported. |
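
Since swap support depends on cgroup v2, it can help to confirm which cgroup version a node uses before enabling the feature gate. The following is a minimal sketch, not part of the kubelet: the function name `uses_cgroup_v2` is hypothetical, and it assumes the cgroup hierarchy is mounted at the conventional `/sys/fs/cgroup` location. It relies on the fact that the root of a cgroup v2 (unified) hierarchy contains a `cgroup.controllers` file.

```python
from pathlib import Path


def uses_cgroup_v2(cgroup_root: str = "/sys/fs/cgroup") -> bool:
    """Return True if cgroup_root looks like a cgroup v2 (unified) hierarchy.

    The root of a cgroup v2 hierarchy contains a `cgroup.controllers` file.
    On a cgroup v1 node that file is absent from this location.
    The default mount point is an assumption; pass another path if needed.
    """
    return (Path(cgroup_root) / "cgroup.controllers").is_file()


if __name__ == "__main__":
    print("cgroup v2" if uses_cgroup_v2() else "not cgroup v2")
```

Run it on the node itself; inside a container the result reflects whatever cgroup filesystem is mounted into that container.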
| 61 | + |
| 62 | +## Install a swap-enabled cluster with kubeadm |
| 63 | + |
| 64 | +### Before you begin |
| 65 | + |
| 66 | +This demo requires the kubeadm tool to be installed, following the steps outlined in the |
| 67 | +[kubeadm installation guide](/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm). |
| 68 | +If swap is already enabled on the node, you can proceed to cluster creation. |
| 69 | +If swap is not enabled, follow the instructions in the next section to enable it. |
| 70 | + |
| 71 | +### Create a swap file and turn swap on |
| 72 | + |
| 73 | +I'll demonstrate creating 4GiB of unencrypted swap. |
| 74 | + |
| 75 | +```bash |
| 76 | +dd if=/dev/zero of=/swapfile bs=128M count=32 |
| 77 | +chmod 600 /swapfile |
| 78 | +mkswap /swapfile |
| 79 | +swapon /swapfile # enable the swap file only until this node is rebooted |
| 80 | +swapon -s # show a summary of the active swap devices |
| 81 | +``` |
| 82 | + |
| 83 | +To enable the swap file at boot time, add a line like `/swapfile swap swap defaults 0 0` to the `/etc/fstab` file. |
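
To double-check that the swap file is active, you can inspect `/proc/swaps`, which lists active swap areas with sizes in KiB. The sketch below is only an illustration: the function name `parse_proc_swaps` and the sample text are made up for this example, and the sample shows roughly what a 4GiB swap file might look like.

```python
from typing import Dict, List, Union


def parse_proc_swaps(text: str) -> List[Dict[str, Union[str, int]]]:
    """Parse the contents of /proc/swaps into a list of entries.

    The first line is a header (Filename, Type, Size, Used, Priority).
    Size and Used are reported in KiB and converted to bytes here.
    """
    entries: List[Dict[str, Union[str, int]]] = []
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore blank or malformed lines
        filename, swap_type, size_kib, used_kib, priority = fields[:5]
        entries.append({
            "filename": filename,
            "type": swap_type,
            "size_bytes": int(size_kib) * 1024,
            "used_bytes": int(used_kib) * 1024,
            "priority": int(priority),
        })
    return entries


# Illustrative sample only; real values depend on the node.
SAMPLE = """Filename      Type  Size     Used  Priority
/swapfile     file  4194300  0     -2
"""

if __name__ == "__main__":
    for entry in parse_proc_swaps(SAMPLE):
        print(entry["filename"], entry["size_bytes"], "bytes")
```

On a real node you would pass the contents of `/proc/swaps` instead of `SAMPLE`.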
| 84 | + |
| 85 | +### Set up a Kubernetes cluster that uses swap-enabled nodes |
| 86 | + |
| 87 | +To make things clearer, here is an example kubeadm configuration file, `kubeadm-config.yaml`, for the swap-enabled cluster. |
| 88 | + |
| 89 | +```yaml |
| 90 | +--- |
| 91 | +apiVersion: "kubeadm.k8s.io/v1beta3" |
| 92 | +kind: InitConfiguration |
| 93 | +--- |
| 94 | +apiVersion: kubelet.config.k8s.io/v1beta1 |
| 95 | +kind: KubeletConfiguration |
| 96 | +failSwapOn: false |
| 97 | +featureGates: |
| 98 | + NodeSwap: true |
| 99 | +memorySwap: |
| 100 | + swapBehavior: LimitedSwap |
| 101 | +``` |
| 102 | + |
| 103 | +Then create a single-node cluster using `kubeadm init --config kubeadm-config.yaml`. |
| 104 | +During init, kubeadm shows a warning when swap is enabled on the node, and also when the kubelet's |
| 105 | +`failSwapOn` is set to true. We plan to remove this warning in a future release. |
| 106 | + |
| 107 | +## How is the swap limit being determined with LimitedSwap? |
| 108 | + |
| 109 | +The configuration of swap memory, including its limitations, presents a significant |
| 110 | +challenge. Not only is it prone to misconfiguration, but as a system-level property, any |
| 111 | +misconfiguration could potentially compromise the entire node rather than just a specific |
| 112 | +workload. To mitigate this risk and ensure the health of the node, we have implemented |
| 113 | +Swap in Beta with automatic configuration of limitations. |
| 114 | + |
| 115 | +With `LimitedSwap`, Pods that do not fall under the Burstable QoS classification (i.e. |
| 116 | +`BestEffort`/`Guaranteed` QoS Pods) are prohibited from utilizing swap memory. |
| 117 | +`BestEffort` QoS Pods exhibit unpredictable memory consumption patterns and lack |
| 118 | +information regarding their memory usage, making it difficult to determine a safe |
| 119 | +allocation of swap memory. Conversely, `Guaranteed` QoS Pods are typically employed for |
| 120 | +applications that rely on the precise allocation of resources specified by the workload, |
| 121 | +with memory being immediately available. To maintain the aforementioned security and node |
| 122 | +health guarantees, these Pods are not permitted to use swap memory when `LimitedSwap` is |
| 123 | +in effect. |
| 124 | + |
| 125 | +Prior to detailing the calculation of the swap limit, it is necessary to define the following terms: |
| 126 | +* `nodeTotalMemory`: The total amount of physical memory available on the node. |
| 127 | +* `totalPodsSwapAvailable`: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use). |
| 128 | +* `containerMemoryRequest`: The container's memory request. |
| 129 | + |
| 130 | +Swap limitation is configured as: |
| 131 | +`(containerMemoryRequest / nodeTotalMemory) × totalPodsSwapAvailable` |
| 132 | + |
| 133 | +In other words, the amount of swap that a container is able to use is proportional to its |
| 134 | +memory request relative to the node's total physical memory, scaled by the total amount of |
| 135 | +swap memory on the node that is available for use by Pods. |
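
As a worked illustration of the formula, here is a small sketch. It is not the kubelet's code: the function name is hypothetical, the parameter names simply mirror the terms defined above, and the integer floor division is an assumption about rounding made for this example.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def limited_swap_limit_bytes(
    container_memory_request: int,
    node_total_memory: int,
    total_pods_swap_available: int,
) -> int:
    """Compute (containerMemoryRequest / nodeTotalMemory) * totalPodsSwapAvailable.

    All arguments are in bytes. Integer arithmetic is used so the result is
    exact; rounding down is an assumption of this sketch.
    """
    if node_total_memory <= 0:
        raise ValueError("node_total_memory must be positive")
    return container_memory_request * total_pods_swap_available // node_total_memory


if __name__ == "__main__":
    # A container requesting 2GiB on a node with 16GiB of memory and 4GiB of
    # swap available to Pods asks for 1/8 of the memory, so it may use 1/8 of
    # the swap.
    limit = limited_swap_limit_bytes(2 * GIB, 16 * GIB, 4 * GIB)
    print(limit // MIB, "MiB")
```

A container with no memory request ends up with a swap limit of zero under this formula.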
| 136 | + |
| 137 | +It is important to note that, for containers within Burstable QoS Pods, it is possible to |
| 138 | +opt-out of swap usage by specifying memory requests that are equal to memory limits. |
| 139 | +Containers configured in this manner will not have access to swap memory. |
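
The eligibility rules described in this section can be summarised in a few lines. This is a sketch of the rules as stated in this post, not the kubelet's implementation; the function name and signature are hypothetical.

```python
from typing import Optional


def container_can_use_swap(
    pod_qos_class: str,
    memory_request: int,
    memory_limit: Optional[int],
) -> bool:
    """Decide whether a container may use swap under LimitedSwap.

    pod_qos_class is one of "BestEffort", "Burstable" or "Guaranteed".
    memory_limit is None when the container sets no memory limit.
    """
    # Only Pods of Burstable QoS are permitted to use swap.
    if pod_qos_class != "Burstable":
        return False
    # A container whose memory request equals its memory limit opts out.
    if memory_limit is not None and memory_request == memory_limit:
        return False
    return True


if __name__ == "__main__":
    gib = 1024 ** 3
    print(container_can_use_swap("Burstable", 1 * gib, 2 * gib))  # True
    print(container_can_use_swap("Burstable", 1 * gib, 1 * gib))  # False
```

Being eligible does not guarantee a non-zero limit: the amount still follows the formula above.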
| 140 | + |
| 141 | +## How does it work? |
| 142 | + |
| 143 | +There are a number of possible ways that one could envision swap use on a node. |
| 144 | +When swap is already provisioned and available on a node, |
| 145 | +SIG Node have [proposed](https://github.com/kubernetes/enhancements/blob/9d127347773ad19894ca488ee04f1cd3af5774fc/keps/sig-node/2400-node-swap/README.md#proposal) |
| 146 | +the kubelet should be able to be configured so that: |
| 147 | +- It can start with swap on. |
| 148 | +- It will direct the Container Runtime Interface to allocate zero swap memory |
| 149 | + to Kubernetes workloads by default. |
| 150 | + |
| 151 | +Swap configuration on a node is exposed to a cluster admin via the |
| 152 | +[`memorySwap` in the KubeletConfiguration](/docs/reference/config-api/kubelet-config.v1beta1/). |
| 153 | +As a cluster administrator, you can specify the node's behaviour in the |
| 154 | +presence of swap memory by setting `memorySwap.swapBehavior`. |
| 155 | + |
| 156 | +The kubelet [employs the CRI](https://kubernetes.io/docs/concepts/architecture/cri/) |
| 157 | +(container runtime interface) API to direct the container runtime to |
| 158 | +configure specific cgroup v2 parameters (such as `memory.swap.max`) in a manner that will |
| 159 | +enable the desired swap configuration for a container. The container runtime is then responsible for |
| 160 | +writing these settings to the container-level cgroup. |
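
To see what was actually written for a container, you can read `memory.swap.max` from that container's cgroup directory. In cgroup v2 this file holds either a byte count or the literal string `max`, meaning no limit. The sketch below assumes you already know the container's cgroup directory, which depends on the container runtime and cgroup driver; the function name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_swap_max(cgroup_dir: str) -> Optional[int]:
    """Read memory.swap.max from a cgroup v2 directory.

    Returns the swap limit in bytes, or None if the value is "max"
    (no swap limit). The caller supplies the cgroup directory; its
    location on a real node is runtime-specific.
    """
    raw = (Path(cgroup_dir) / "memory.swap.max").read_text().strip()
    return None if raw == "max" else int(raw)
```

A value of `0` means the container cannot use swap at all, which is what you would expect for containers that are not eligible under `LimitedSwap`.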
| 161 | + |
| 162 | +## How can I monitor swap? |
| 163 | + |
| 164 | +A notable deficiency in the Alpha version was the inability to monitor and introspect swap |
| 165 | +usage. This issue has been addressed in the Beta version introduced in Kubernetes 1.28, which now |
| 166 | +provides the capability to monitor swap usage through several different methods. |
| 167 | + |
| 168 | +The beta version of kubelet now collects |
| 169 | +[node-level metric statistics](/docs/reference/instrumentation/node-metrics/), |
| 170 | +which can be accessed at the `/metrics/resource` and `/stats/summary` kubelet HTTP endpoints. |
| 171 | +This allows clients who can directly interrogate the kubelet to |
| 172 | +monitor swap usage and remaining swap memory when using LimitedSwap. Additionally, a |
| 173 | +`machine_swap_bytes` metric has been added to cAdvisor to show the total physical swap capacity of the |
| 174 | +machine. |
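
If you scrape a kubelet metrics endpoint yourself, the response is in the Prometheus text format, and swap-related samples can be picked out with a small filter. The sketch below is deliberately simplistic (it does not handle `}` inside label values). The sample text is illustrative: apart from `machine_swap_bytes`, the metric names, labels and values in it are assumptions made for this example rather than a record of real kubelet output.

```python
from typing import List, Tuple


def swap_metrics(exposition_text: str) -> List[Tuple[str, float]]:
    """Return (metric name, value) pairs whose name contains "swap".

    Expects Prometheus text exposition format: comment lines start with
    "#", and each sample line is `name{labels} value [timestamp]`.
    """
    results: List[Tuple[str, float]] = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1]
        else:
            name, _, rest = line.partition(" ")
        tokens = rest.split()
        if "swap" not in name or not tokens:
            continue
        results.append((name, float(tokens[0])))  # ignore optional timestamp
    return results


# Illustrative sample; names other than machine_swap_bytes are assumptions.
SAMPLE_METRICS = """# TYPE machine_swap_bytes gauge
machine_swap_bytes{boot_id="example"} 4.294967296e+09
node_memory_working_set_bytes 1.2e+09 1692871200000
node_swap_usage_bytes 1.048576e+06 1692871200000
"""

if __name__ == "__main__":
    for name, value in swap_metrics(SAMPLE_METRICS):
        print(name, int(value))
```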
| 175 | + |
| 176 | +## Caveats |
| 177 | + |
| 178 | +Having swap available on a system reduces predictability. Swap's performance is |
| 179 | +worse than regular memory, sometimes by many orders of magnitude, which can |
| 180 | +cause unexpected performance regressions. Furthermore, swap changes a system's |
| 181 | +behaviour under memory pressure. Since enabling swap permits |
| 182 | +greater memory usage for workloads in Kubernetes that cannot be predictably |
| 183 | +accounted for, it also increases the risk of noisy neighbours and unexpected |
| 184 | +packing configurations, as the scheduler cannot account for swap memory usage. |
| 185 | + |
| 186 | +The performance of a node with swap memory enabled depends on the underlying |
| 187 | +physical storage. When swap memory is in use, performance will be significantly |
| 188 | +worse in an I/O operations per second (IOPS) constrained environment, such as a |
| 189 | +cloud VM with I/O throttling, when compared to faster storage mediums like |
| 190 | +solid-state drives or NVMe. |
| 191 | + |
| 192 | +As such, we do not recommend using swap memory for workloads or |
| 193 | +environments that are subject to performance constraints. Furthermore, we |
| 194 | +recommend using `LimitedSwap`, as this significantly mitigates the risks |
| 195 | +posed to the node. |
| 196 | + |
| 197 | +Cluster administrators and developers should benchmark their nodes and applications |
| 198 | +before using swap in production scenarios, and [we need your help](#how-do-i-get-involved) with that! |
| 199 | + |
| 200 | +### Security risk |
| 201 | + |
| 202 | +Enabling swap on a system without encryption poses a security risk, as critical information, |
| 203 | +such as volumes that represent Kubernetes Secrets, [may be swapped out to the disk](/docs/concepts/configuration/secret/#information-security-for-secrets). |
| 204 | +If an unauthorized individual gains |
| 205 | +access to the disk, they could potentially obtain this confidential data. To mitigate this risk, the |
| 206 | +Kubernetes project strongly recommends that you encrypt your swap space. |
| 207 | +However, handling encrypted swap is not within the scope of |
| 208 | +kubelet; rather, it is a general OS configuration concern and should be addressed at that level. |
| 209 | +It is the administrator's responsibility to provision encrypted swap to mitigate this risk. |
| 210 | + |
| 211 | +Furthermore, as previously mentioned, with `LimitedSwap` the user has the option to completely |
| 212 | +disable swap usage for a container by specifying memory requests that are equal to memory limits. |
| 213 | +This will prevent the corresponding containers from accessing swap memory. |
| 214 | + |
| 215 | +## Looking ahead |
| 216 | + |
| 217 | +The Kubernetes 1.28 release introduced Beta support for swap memory on Linux nodes, |
| 218 | +and we will continue to work towards [general availability](/docs/reference/command-line-tools-reference/feature-gates/#feature-stages) |
| 219 | +for this feature. I hope that this will include: |
| 220 | + |
| 221 | +* Adding the ability to set a system-reserved quantity of swap from what kubelet detects on the host. |
| 222 | +* Adding support for controlling swap consumption at the Pod level via cgroups. |
| 223 | + * This point is still under discussion. |
| 224 | +* Collecting feedback from test user cases. |
| 225 | + * We will consider introducing new configuration modes for swap, such as a |
| 226 | + node-wide swap limit for workloads. |
| 227 | + |
| 228 | +## How can I learn more? |
| 229 | + |
| 230 | +You can review the current [documentation](/docs/concepts/architecture/nodes/#swap-memory) |
| 231 | +for using swap with Kubernetes. |
| 232 | + |
| 233 | +For more information, and to assist with testing and provide feedback, please |
| 234 | +see [KEP-2400](https://github.com/kubernetes/enhancements/issues/4128) and its |
| 235 | +[design proposal](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md). |
| 236 | + |
| 237 | +## How do I get involved? |
| 238 | + |
| 239 | +Your feedback is always welcome! SIG Node [meets regularly](https://github.com/kubernetes/community/tree/master/sig-node#meetings) |
| 240 | +and [can be reached](https://github.com/kubernetes/community/tree/master/sig-node#contact) |
| 241 | +via [Slack](https://slack.k8s.io/) (channel **#sig-node**), or the SIG's |
| 242 | +[mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node). A Slack |
| 243 | +channel dedicated to swap is also available at **#sig-node-swap**. |
| 244 | + |
| 245 | +Feel free to reach out to me, Itamar Holder (**@iholder101** on Slack and GitHub) |
| 246 | +if you'd like to help or ask further questions. |
| 247 | + |
| 248 | + |