-[Enabling swap as an end user](#enabling-swap-as-an-end-user)
22
+
-[API Changes](#api-changes)
23
+
-[KubeConfig addition](#kubeconfig-addition)
24
+
-[CRI Changes](#cri-changes)
37
25
-[Test Plan](#test-plan)
38
26
-[Graduation Criteria](#graduation-criteria)
39
27
-[Alpha](#alpha)
@@ -121,20 +109,24 @@ This KEP will be limited in scope to the first two scenarios. The third can be a
121
109
- On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on.
122
110
- Configuration is available for CRI to set swap utilization available to Kubernetes workloads, defaulting to 0 swap.
123
111
- Cluster administrators can enable and configure CRI swap utilization on a per-node basis.
112
+
- Use of swap memory with both cgroupsv1 and cgroupsv2 is supported.
124
113
125
114
### Non-Goals
126
115
127
116
- Provisioning swap. Swap must already be available on the system.
117
+
- Setting [swappiness]. This can already be set on a system-wide level outside of Kubernetes.
128
118
- Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap). If desired, this should be designed and implemented as part of a follow-up KEP. This KEP is a prerequisite for that work.
129
119
- Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope.
I propose that, when swap is provisioned and available on a node, we allow cluster administrators to configure the Kubelet and CRI such that:
125
+
We propose that, when swap is provisioned and available on a node, cluster administrators can configure the Kubelet and CRI such that:
134
126
135
127
- The kubelet can start with swap on.
136
128
- The CRI is updated such that by default, workloads will use 0 swap.
137
-
- The CRI will have configuration available such that swap utilization can be configured for the entire node (e.g. as a percentage of pod memory requests).
129
+
- The CRI will have configuration available such that swap utilization can be configured for the entire node.
138
130
139
131
This proposal enables scenarios 1 and 2 above, but not 3.
140
132
@@ -201,133 +193,121 @@ This user story is addressed by scenario 2, and could benefit from 3.
201
193
202
194
### Notes/Constraints/Caveats (Optional)
203
195
204
-
<!--
205
-
What are the caveats to the proposal?
206
-
What are some important details that didn't come across above?
207
-
Go in to as much detail as necessary here.
208
-
This might be a good place to talk about core concepts and how they relate.
209
-
-->
196
+
In changing the CRI, we must ensure that container runtime downstreams are able to support the new configurations.
210
197
211
-
### Risks and Mitigations
198
+
We considered adding parameters for both per-workload `memory-swap` and `swappiness`. These are documented as part of the Open Containers [runtime specification] for Linux memory configuration. Since `memory-swap` is a per-workload parameter, and `swappiness` is optional and can be set globally, we are choosing to expose only `memory-swap`, which will adjust the swap available to workloads.
212
199
213
-
Having swap available on a system reduces predictability. When swap is available to workloads, and is not accounted for on an individual workload-by-workload basis
200
+
Since we are not currently setting `memory-swap` in the CRI, the default behaviour is to allocate the same amount of swap for a workload as memory requested. We will update the default to not permit the use of swap by setting `memory-swap` equal to `limit`.
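
To illustrate why setting `memory-swap` equal to `limit` disables swap, here is a minimal Go sketch. It assumes the Open Containers runtime specification's meaning of `memory-swap` as the combined memory-plus-swap limit. The helper name `swapAvailable` is hypothetical and is not part of any Kubernetes or container runtime API.

```go
package main

import "fmt"

// swapAvailable returns how many bytes of swap a workload may use, given its
// memory limit and its combined memory+swap limit (the `memory-swap` value).
// Hypothetical helper, for illustration only.
func swapAvailable(memoryLimit, memorySwap int64) int64 {
	if memorySwap <= memoryLimit {
		// A combined limit equal to the memory limit leaves no room for swap.
		return 0
	}
	return memorySwap - memoryLimit
}

func main() {
	const limit = int64(512 << 20) // 512 MiB memory limit

	// Proposed default: memory-swap equal to limit, so no swap is permitted.
	fmt.Println(swapAvailable(limit, limit))

	// A combined limit of twice the memory limit permits as much swap as memory.
	fmt.Println(swapAvailable(limit, 2*limit))
}
```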
214
201
215
-
First, this risk is mitigated by preventing any workloads from using swap by default, even if it is enabled on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization.
Additionally, we mitigate this risk by quantifying system stability and then gathering test and production data to determine if system stability remains the same or is improved when swap is available to the system and/or workloads.
204
+
### Risks and Mitigations
218
205
219
-
Since swap provisioning is out of scope of this proposal, this enhancement poses little risk to Kubernetes clusters that will not enable swap.
206
+
Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.
220
207
221
-
## Design Details
208
+
This risk is mitigated by preventing any workloads from using swap by default, even if swap is enabled and available on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization.
222
209
223
-
### TL;DR
210
+
Additionally, we will mitigate this risk by determining a set of metrics to quantify system stability and then gathering test and production data to determine if system stability changes when swap is available to the system and/or workloads in a number of different scenarios.
224
211
225
-
In a nutshell, the following implementation are planned for Memory Swap Support
226
-
in 1.22 GKE alpha
212
+
Since swap provisioning is out of scope of this proposal, this enhancement poses low risk to Kubernetes clusters that will not enable swap.
227
213
228
-
1. Having a feature gate `SupportNodeMemorySwap` guarding against the memory
229
-
swap support feature
230
-
2. Keep the default value of kubelet flag `--fail-on-swap` to `true` in order
231
-
to minimize the blast radius
232
-
3. Introducing two new kubelet config `MemorySwapLimit` and `Swappiness`
233
-
4. Introducing two new CRI parameter `memory_swap_limit_in_bytes` and `memory_swappiness`
234
-
5. End to end wiring from kubelet config file to CRI
214
+
## Design Details
235
215
236
-
### Expected User Behaviour
216
+
We summarize the implementation plan as follows:
237
217
238
-
For alpha, the feature gate `SupportNodeMemorySwap` is default to disabled, and
239
-
`--fail-on-swap` flag value is the same as 1.21. Therefore, from Kubernetes
240
-
user’s perspective, no behavior changes out of the box.
218
+
1. Add a feature gate `NodeSwapEnabled` to enable swap support.
219
+
1. Leave the default value of the kubelet flag `--fail-swap-on` as `true`, to avoid changing default behaviour.
220
+
1. Introduce a new kubelet config parameter, `MemorySwapLimit`.
221
+
1. Introduce a new CRI parameter, `memory_swap_limit_in_bytes`.
222
+
1. Integrate the new kubelet config and pass its values to the CRI for container creation.
223
+
1. Ensure container runtimes are updated so they can make use of the new CRI configuration.
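
The wiring from kubelet configuration to the CRI could look roughly like the following Go sketch. It is not the actual kubelet or CRI code. The types `kubeletSwapConfig` and `linuxContainerResources` and the function `buildResources` are hypothetical stand-ins. The sketch assumes that `MemorySwapLimit` is a combined memory-plus-swap value in bytes applied to every container on the node, and that an unset value or a disabled feature gate yields zero swap.

```go
package main

import "fmt"

// kubeletSwapConfig is a hypothetical stand-in for the relevant kubelet settings.
type kubeletSwapConfig struct {
	FeatureGateEnabled bool   // whether the swap feature gate is on
	MemorySwapLimit    *int64 // assumed: combined memory+swap limit in bytes; nil when unset
}

// linuxContainerResources models only the two fields relevant to this sketch.
type linuxContainerResources struct {
	MemoryLimitInBytes     int64
	MemorySwapLimitInBytes int64
}

// buildResources computes the values passed to the runtime for one container.
func buildResources(cfg kubeletSwapConfig, memoryLimit int64) linuxContainerResources {
	// Default: the combined limit equals the memory limit, so no swap is usable.
	res := linuxContainerResources{
		MemoryLimitInBytes:     memoryLimit,
		MemorySwapLimitInBytes: memoryLimit,
	}
	// Only raise the combined limit when the gate is on and a larger value is set.
	if cfg.FeatureGateEnabled && cfg.MemorySwapLimit != nil && *cfg.MemorySwapLimit > memoryLimit {
		res.MemorySwapLimitInBytes = *cfg.MemorySwapLimit
	}
	return res
}

func main() {
	limit := int64(256 << 20)     // 256 MiB container memory limit
	swapLimit := int64(384 << 20) // 384 MiB combined memory+swap limit

	fmt.Println(buildResources(kubeletSwapConfig{}, limit))
	fmt.Println(buildResources(kubeletSwapConfig{FeatureGateEnabled: true, MemorySwapLimit: &swapLimit}, limit))
}
```

Keeping the zero-swap default inside a single function makes it straightforward to check that existing clusters see no change unless both the gate and the limit are set.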
241
224
242
-
For users that are ready to explore the Memory Swap feature in 1.22 Alpha, they
243
-
will need to complete the following steps
225
+
### Enabling swap as an end user
244
226
245
-
1. provision swap enable `SupportNodeMemorySwap` flag AND
246
-
2. set `--fail-on-swap` flag to `false`
227
+
Swap can be enabled as follows:
247
228
248
-
Then, the user can start experimenting/fine tuning kubelet configuration
249
-
`MemorySwapLimit` and/or `Swappiness` and observe the changes.
229
+
1. Provision swap on the target worker nodes,
230
+
1. Enable the `NodeMemorySwap` feature gate on the kubelet,
231
+
1. Set the `--fail-swap-on` flag to `false`, and
232
+
1. (Optional) Configure `MemorySwapLimit` in the KubeletConfig for tuning.
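
The checklist above can be read as a pair of conditions, sketched here in Go. This is an illustrative model rather than kubelet code. The type `nodeSwapSettings` and both functions are hypothetical. The only behaviour assumed is that the kubelet refuses to start when swap is on and its swap-failure flag is `true`.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeSwapSettings is a hypothetical summary of the node state and kubelet
// settings named in the steps above; it is not a real kubelet type.
type nodeSwapSettings struct {
	SwapProvisioned bool // swap exists and is turned on for the node
	FeatureGate     bool // the swap feature gate is enabled on the kubelet
	FailSwapOn      bool // the kubelet refuses to start while swap is on
}

var errSwapOn = errors.New("swap is on but the kubelet is configured to fail when swap is on")

// checkStartup reports whether the kubelet would start under these settings.
func checkStartup(s nodeSwapSettings) error {
	if s.SwapProvisioned && s.FailSwapOn {
		return errSwapOn
	}
	return nil
}

// workloadsMayUseSwap is true only when every required step has been completed.
func workloadsMayUseSwap(s nodeSwapSettings) bool {
	return s.SwapProvisioned && s.FeatureGate && !s.FailSwapOn
}

func main() {
	s := nodeSwapSettings{SwapProvisioned: true, FeatureGate: true, FailSwapOn: false}
	fmt.Println(checkStartup(s) == nil, workloadsMayUseSwap(s))
}
```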
250
233
251
-
### New Kubelet Configuration
234
+
### API Changes
252
235
253
-
We will be introducing two new parameters to `KubeletConfiguration struct`
These two configurations, if set, will apply to every container of the Node
257
-
where kubelet is running.
236
+
#### KubeConfig addition
258
237
259
-
|Name|Description|Default Value|Feature Gate|
260
-
|--- |--- |--- |--- |
261
-
|MemorySwapLimit|This parameter sets total memory limit (memory + swap). This limits the total amount of memory this container is allowed to swap to disk.|-2, which enable disable swap|SupportNodeMemorySwap|
262
-
|MemorySwappiness|This configuration sets how aggressively the kernel will swap memory pages. By default, the host kernel can swap out a percentage of anonymous pages used by a container. Users can set value between 0 and 100, to tune this percentage.|Unset, which will use host value|SupportNodeMemorySwap|
238
+
We will add an optional `MemorySwapLimit` value to the `KubeletConfiguration` struct in [pkg/kubelet/apis/config/types.go] as a backward-compatible API change, as follows: