@@ -112,10 +112,11 @@ node.
 1. Swap is enabled on a node's host system, but the kubelet does not permit
    Kubernetes workloads to use swap. (This scenario is a prerequisite for the
    following use cases.)
-1. Swap is enabled at the node level. The CRI can be globally configured to
-   permit user workloads scheduled on the node to use some quantity of swap.
-1. Swap is set on a per-workload basis. The CRI sets permitted swap utilization
-   on each individual workload.
+1. Swap is enabled at the node level. The kubelet can permit Kubernetes
+   workloads scheduled on the node to use some quantity of swap, depending on
+   the configuration.
+1. Swap is set on a per-workload basis. The kubelet sets swap limits for each
+   individual workload.
 
 This KEP will be limited in scope to the first two scenarios. The third can be
 addressed in a follow-up KEP. The enablement work that is in scope for this KEP
@@ -164,9 +165,9 @@ This proposal enables scenarios 1 and 2 above, but not 3.
 
 #### Improved Node Stability
 
-cgroupsv2 improved memory management algos, such as oomd, currently require
-swap. Hence, having a small amount of swap available on nodes could improve
-better resource pressure handling and recovery.
+cgroupsv2 improved memory management algorithms, such as oomd, strongly
+recommend the use of swap. Hence, having a small amount of swap available on
+nodes could improve resource pressure handling and recovery.
 
 - https://man7.org/linux/man-pages/man8/systemd-oomd.service.8.html
 - https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#id1
@@ -253,7 +254,7 @@ This user story is addressed by scenario 2, and could benefit from 3.
 
 ### Notes/Constraints/Caveats (Optional)
 
-In changing the CRI, we must ensure that container runtime downstreams are able
+In updating the CRI, we must ensure that container runtime downstreams are able
 to support the new configurations.
 
 We considered adding parameters for both per-workload `memory-swap` and
@@ -301,26 +302,31 @@ We summarize the implementation plan as following:
 1. Add a feature gate `NodeSwapEnabled` to enable swap support.
 1. Leave the default value of kubelet flag `--fail-on-swap` to `true`, to avoid
    changing default behaviour.
-1. Introduce a new kubelet config parameter, `MemorySwapLimit`.
+1. Introduce a new kubelet config parameter, `MemorySwap`, which configures how
+   much swap Kubernetes workloads can use on the node.
 1. Introduce a new CRI parameter, `memory_swap_limit_in_bytes`.
-1. Integrate new kubelet config and pass values to CRI for container creation.
-1. Ensure container runtimes are updated so they can make use of the new CRI configuration.
+1. Ensure container runtimes are updated so they can make use of the new CRI
+   configuration.
+1. Based on the behaviour set in the kubelet config, the kubelet will instruct
+   the CRI on the amount of swap to allocate to each container. The container
+   runtime will then write the swap settings to the container level cgroup.
 
 ### Enabling swap as an end user
 
 Swap can be enabled as follows:
 
 1. Provision swap on the target worker nodes,
-1. Enable `NodeMemorySwap` flag on the kubelet,
+1. Enable the `NodeMemorySwap` feature flag on the kubelet,
 1. Set `--fail-on-swap` flag to `false`, and
-1. (Optional) Configure `MemorySwapLimit` in the KubeletConfig for tuning.
+1. (Optional) Allow Kubernetes workloads to use swap by setting
+   `MemorySwap.SwapBehavior=UnlimitedSwap` in the kubelet config.
 
 ### API Changes
 
 #### KubeConfig addition
 
-We will add an optional `MemorySwapLimit` value to the `KubeletConfig` struct
-in [pkg/kubelet/apis/config/types.go] for a compatible API change as follows:
+We will add an optional `MemorySwap` value to the `KubeletConfig` struct
+in [pkg/kubelet/apis/config/types.go] as follows:
 
 [pkg/kubelet/apis/config/types.go]: https://github.com/kubernetes/kubernetes/blob/6baad0a1d45435ff5844061aebab624c89d698f8/pkg/kubelet/apis/config/types.go#L81
 
@@ -339,14 +345,7 @@ type MemorySwapConfiguration struct {
 	// Configure swap memory available to container workloads. May be one of
 	// "", "NoSwap": workloads cannot use swap
 	// "UnlimitedSwap": workloads can use unlimited swap, up to the allocatable limit.
-	// "LimitedSwap": workloads can use up to this limit of swap.
 	SwapBehavior string
-
-	LimitedSwap *LimitedSwapConfiguration
-}
-
-type LimitedSwapConfiguration struct {
-	PerWorkloadMemorySwapLimit resource.Quantity
 }
 ```
 
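Note (not part of this change): with `LimitedSwap` removed, the struct accepts only three values for `SwapBehavior`. A minimal sketch of how that could be checked follows. The constant names and the `validateSwapBehavior` helper are hypothetical and do not come from the kubelet source; only the struct field and the accepted strings are taken from the hunk above.

```go
package main

import "fmt"

// MemorySwapConfiguration mirrors the struct shown in the hunk above.
type MemorySwapConfiguration struct {
	SwapBehavior string
}

// Hypothetical constants for the values listed in the struct comment.
const (
	NoSwap        = "NoSwap"
	UnlimitedSwap = "UnlimitedSwap"
)

// validateSwapBehavior is a hypothetical helper. It accepts "", "NoSwap",
// and "UnlimitedSwap", and rejects anything else (including the removed
// "LimitedSwap" mode).
func validateSwapBehavior(c MemorySwapConfiguration) error {
	switch c.SwapBehavior {
	case "", NoSwap, UnlimitedSwap:
		return nil
	default:
		return fmt.Errorf("invalid SwapBehavior %q: must be \"\", %q, or %q",
			c.SwapBehavior, NoSwap, UnlimitedSwap)
	}
}

func main() {
	for _, b := range []string{"", "NoSwap", "UnlimitedSwap", "LimitedSwap"} {
		err := validateSwapBehavior(MemorySwapConfiguration{SwapBehavior: b})
		fmt.Printf("%q -> %v\n", b, err)
	}
}
```

Treating the empty string the same as `NoSwap` follows the struct comment, which keeps the default behaviour unchanged for nodes that never set the field.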
@@ -359,14 +358,8 @@ container specification for the `--memory-swap` flag. Thus, the
   even if it is enabled on a system.
 * If `SwapBehavior` is set to `"UnlimitedSwap"`, the container is allowed to
   use unlimited swap, up to the maximum amount available on the host system.
-* If `SwapBehavior` is set to `"LimitedSwap"`, then the `LimitedSwap`
-  configuration must also be set. `LimitedSwap.PerWorkloadMemorySwapLimit`
-  represents the system-wide maximum limit for swap usage of a container. Note
-  that this limit applies to individual containers, and not at the pod-level,
-  in order to be set via the CRI rather than e.g. a [pod cgroup limit].
 
 [docker]: https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details
-[pod cgroup limit]: #restrict-swap-usage-at-the-cgroup-level
 
 #### CRI Changes
 
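Note (not part of this change): a rough sketch of how the two remaining `SwapBehavior` values might translate into a value for `memory_swap_limit_in_bytes`. It assumes the CRI field follows Docker's `--memory-swap` semantics, where the value is the combined memory plus swap limit, a value equal to the memory limit means no swap, and `-1` means unlimited swap. The `swapLimitInBytes` function and the constants are hypothetical; the actual kubelet logic may differ.

```go
package main

import "fmt"

// Hypothetical constants for the SwapBehavior values described above.
const (
	NoSwap        = "NoSwap"
	UnlimitedSwap = "UnlimitedSwap"
)

// swapLimitInBytes is a hypothetical helper that returns a value for the CRI
// parameter memory_swap_limit_in_bytes, ASSUMING Docker --memory-swap
// semantics (combined memory+swap limit; -1 means unlimited).
func swapLimitInBytes(behavior string, memoryLimitInBytes int64) (int64, error) {
	switch behavior {
	case "", NoSwap:
		// "No swap" is expressed as memory+swap == memory, which only
		// works when the container has a memory limit set.
		if memoryLimitInBytes <= 0 {
			return 0, fmt.Errorf("cannot express %q without a memory limit", NoSwap)
		}
		return memoryLimitInBytes, nil
	case UnlimitedSwap:
		return -1, nil
	default:
		return 0, fmt.Errorf("unsupported SwapBehavior %q", behavior)
	}
}

func main() {
	const memLimit = 512 * 1024 * 1024 // example 512 MiB container memory limit
	for _, b := range []string{"NoSwap", "UnlimitedSwap"} {
		v, err := swapLimitInBytes(b, memLimit)
		fmt.Println(b, v, err)
	}
}
```

Containers without a memory limit are rejected in the `NoSwap` case here only because this sketch has no other way to express "no swap" under the assumed semantics; how the kubelet handles such containers is outside what this diff specifies.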
@@ -406,24 +399,36 @@ phase of graduation.
 
 #### Alpha
 
-- Kubelet can be started with swap enabled.
-- KubeletConfig allows CRI to be configured with a percentage of swap available
-  to workloads. This will default to 0.
+- Kubelet can be started with swap enabled and will support two configurations
+  for Kubernetes workloads: `NoSwap` and `UnlimitedSwap`.
+- Kubelet can configure CRI to allocate swap to Kubernetes workloads. By
+  default, workloads will not be allocated any swap.
 - e2e test jobs are configured for Linux systems with swap enabled.
 
 #### Beta
 
-(Tentative.)
+_(Tentative.)_
 
+- Add support for controlling swap consumption at the pod level [via cgroups].
+- Handle usage of swap during container restart boundaries for writes to tmpfs
+  (which may require pod cgroup change beyond what container runtime will do at
+  container cgroup boundary).
+- Add the ability to set a system-reserved quantity of swap from what kubelet
+  detects on the host.
+- Consider introducing new configuration modes for swap, such as a node-wide
+  swap limit for workloads.
 - Determine a set of metrics for node QoS in order to evaluate the performance
   of nodes with and without swap enabled.
+- Better understand relationship of swap with memory QoS in cgroup v2
+  (particularly `memory.high` usage).
 - Collect feedback from test user cases.
 - Improve coverage for appropriate scenarios in testgrid.
 
+[via cgroups]: #restrict-swap-usage-at-the-cgroup-level
+
 #### GA
 
-- Test a wide variety of scenarios that may be affected by swap support, such
-  as workloads using tmpfs storage.
+- Test a wide variety of scenarios that may be affected by swap support.
 - Remove feature flag.
 
 ### Upgrade / Downgrade Strategy