|
1 |
| -# KEP-2400: Node system swap support |
| 1 | +# KEP-2400: Node memory swap support |
2 | 2 |
|
3 | 3 | <!-- toc -->
|
4 | 4 | - [Release Signoff Checklist](#release-signoff-checklist)
|
@@ -117,6 +117,31 @@ support to nodes in a controlled, predictable manner so that Kubernetes users
|
117 | 117 | can perform testing and provide data to continue building cluster capabilities
|
118 | 118 | on top of swap.
|
119 | 119 |
|
| 120 | +This KEP aims to |
| 121 | +introduce basic swap enablement and leave further extensions to follow-up KEPs. |
| 122 | +This way, Kubernetes users and vendors can start using swap in a basic manner |
| 123 | +quickly, while extensions are brought to discussion in dedicated KEPs that |
| 124 | +progress in parallel. |
| 125 | + |
| 126 | +For example, to achieve this goal, this KEP does not introduce any APIs |
| 127 | +that allow customizing how the feature behaves, but instead only determines |
| 128 | +whether the feature is enabled or disabled. |
| 129 | +From an API perspective, this is done by exposing the kubelet `swapBehavior` |
| 130 | +configuration field. |
| 131 | +Within the scope of this KEP we will support only two basic behaviors: `NoSwap` and `LimitedSwap`. |
| 132 | +Neither provides any customizability: `NoSwap` disables swap for workloads, and |
| 133 | +`LimitedSwap`'s behavior is automatic and implicit, requiring minimal user |
| 134 | +intervention (see [proposal below](#steps-to-calculate-swap-limit) for more details). |
| 135 | +As mentioned above, follow-up KEPs in the very near future would bring API extensions, |
| 136 | +customizability, zswap support, and many other extensions to discussion. |
| 137 | +These customization capabilities will probably be introduced as additional |
| 138 | +"swap behaviors", which will probably bring some API changes, perhaps at the pod level, |
| 139 | +that extend this feature and make it more suitable for advanced usage. |
| 140 | + |
| 141 | +While this KEP lays the groundwork for extending the API in follow-ups through "swap behaviors", |
| 142 | +changing APIs, especially at the pod level, is highly complex and controversial. |
| 143 | +Therefore, it is out of scope for this KEP. |
| 144 | + |
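As a rough illustration of the scope described above, the following sketch models the two in-scope `swapBehavior` values and rejects anything else. It is not kubelet code: the names `SwapBehavior`, `parse_swap_behavior`, and `workloads_may_swap` are hypothetical, and the actual swap-limit calculation performed under `LimitedSwap` is deliberately not modeled here.

```python
from enum import Enum


class SwapBehavior(Enum):
    """The two behaviors this KEP keeps in scope (illustrative model only)."""
    NO_SWAP = "NoSwap"
    LIMITED_SWAP = "LimitedSwap"


def parse_swap_behavior(value: str) -> SwapBehavior:
    """Map a configured `swapBehavior` string to a known behavior.

    Hypothetical helper: anything other than the two in-scope values is
    rejected, mirroring the idea that no further customization exists yet.
    """
    try:
        return SwapBehavior(value)
    except ValueError:
        allowed = ", ".join(b.value for b in SwapBehavior)
        raise ValueError(
            f"unsupported swapBehavior {value!r}; expected one of: {allowed}"
        ) from None


def workloads_may_swap(behavior: SwapBehavior) -> bool:
    """NoSwap disables swap for workloads; LimitedSwap leaves the limit
    calculation to the kubelet (that calculation is not modeled here)."""
    return behavior is SwapBehavior.LIMITED_SWAP


if __name__ == "__main__":
    for raw in ("NoSwap", "LimitedSwap"):
        print(raw, workloads_may_swap(parse_swap_behavior(raw)))
```

A closed set of values like this keeps the surface small; under the approach described above, follow-up KEPs would add new behaviors rather than new knobs on the existing ones.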
120 | 145 | ## Motivation
|
121 | 146 |
|
122 | 147 | There are two distinct types of user for swap, who may overlap:
|
@@ -161,9 +186,11 @@ will be necessary to implement the third scenario.
|
161 | 186 | - Setting [swappiness]. This can already be set on a system-wide level outside
|
162 | 187 | of Kubernetes.
|
163 | 188 | - Allocating swap on a per-workload basis with accounting (e.g. pod-level
|
164 |
| - specification of swap). If desired, this should be designed and implemented |
165 |
| - as part of a follow-up KEP. This KEP is a prerequisite for that work. Hence, |
166 |
| - swap will be an overcommitted resource in the context of this KEP. |
| 189 | + specification of swap), and/or APIs to customize and control the way kubelet |
| 190 | + calculates swap limits, grants swap access, etc. If desired, this should be |
| 191 | + designed and implemented as part of a follow-up KEP. This KEP is a |
| 192 | + prerequisite for that work. Hence, swap will be an overcommitted resource |
| 193 | + in the context of this KEP. |
167 | 194 | - Supporting zram, zswap, or other memory types like SGX EPC. These could be
|
168 | 195 | addressed in a follow-up KEP, and are out of scope.
|
169 | 196 | - Use of swap for cgroupsv1.
|
@@ -194,7 +221,10 @@ Allocate the swap limit equal to the requested memory for each container and adj
|
194 | 221 |
|
195 | 222 | #### Set Aside Swap for System Critical Daemons
|
196 | 223 |
|
197 |
| -**Note** In Beta2, we found that having system critical daemons swapping memory could cause degration of services. |
| 224 | +**Note** In Beta2, we found that having system-critical daemons swapping memory could cause degradation of services. |
| 225 | +Therefore, the kubelet will not configure this automatically, although the admin can still configure it |
| 226 | +manually. In the near future, when a follow-up KEP on customizability is presented, configuring this |
| 227 | +automatically under a dedicated setting will be considered. |
198 | 228 |
|
199 | 229 | System critical daemons (such as Kubelet) are essential for node health. Usually, an appropriate portion of system resources (e.g., memory, CPU) is reserved as system reserved. However, swap doesn't inherently support reserving a portion out of the total available. For instance, in the case of memory, we set `memory.min` on the node-level cgroup to ensure an adequate amount of memory is set aside, away from the pods, and for system critical daemons. But there is no equivalent for swap; i.e., no `memory.swap.min` is supported in the kernel.
|
200 | 230 |
|
@@ -290,6 +320,10 @@ nodes could improve better resource pressure handling and recovery.
|
290 | 320 |
|
291 | 321 | This user story is addressed by scenario 1 and 2, and could benefit from 3.
|
292 | 322 |
|
| 323 | +Note: critical / high-priority pods do not get swap access automatically, but can |
| 324 | +still be configured manually to gain it. In the future, APIs or additional |
| 325 | +swap behaviors could be used to control swap in a more customized way. |
| 326 | + |
293 | 327 | #### Long-running applications that swap out startup memory
|
294 | 328 |
|
295 | 329 | - Applications such as the Java and Node runtimes rely on swap for optimal
|
|