  - [Enabling swap as an end user](#enabling-swap-as-an-end-user)
  - [API Changes](#api-changes)
  - [KubeConfig addition](#kubeconfig-addition)
- - [CRI Changes](#cri-changes)
+ - [CRI Changes](#cri-changes)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
  - [Alpha](#alpha)
@@ -40,6 +40,7 @@
  - [Drawbacks](#drawbacks)
  - [Alternatives](#alternatives)
  - [Just set <code>--fail-swap-on=false</code>](#just-set-)
+ - [Restrict swap usage at the cgroup level](#restrict-swap-usage-at-the-cgroup-level)
  - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
  <!-- /toc -->

@@ -108,7 +109,7 @@ node.

  ### Scenarios

- 1. Swap is enabled on a node's host system, but the CRI does not permit
+ 1. Swap is enabled on a node's host system, but the kubelet does not permit
  Kubernetes workloads to use swap. (This scenario is a prerequisite for the
  following use cases.)
  1. Swap is enabled at the node level. The CRI can be globally configured to
@@ -125,20 +126,23 @@ will be necessary to implement the third scenario.

  - On Linux systems, when swap is provisioned and available, Kubelet can start
  up with swap on.
- - Configuration is available for CRI to set swap utilization available to
+ - Configuration is available for kubelet to set swap utilization available to
  Kubernetes workloads, defaulting to 0 swap.
- - Cluster administrators can enable and configure CRI swap utilization on a
+ - Cluster administrators can enable and configure kubelet swap utilization on a
  per-node basis.
  - Use of swap memory with both cgroupsv1 and cgroupsv2 is supported.

  ### Non-Goals

+ - Addressing non-Linux operating systems. Swap support will only be available
+ for Linux.
  - Provisioning swap. Swap must already be available on the system.
  - Setting [swappiness]. This can already be set on a system-wide level outside
  of Kubernetes.
  - Allocating swap on a per-workload basis with accounting (e.g. pod-level
  specification of swap). If desired, this should be designed and implemented
- as part of a follow-up KEP. This KEP is a prerequisite for that work.
+ as part of a follow-up KEP. This KEP is a prerequisite for that work. Hence,
+ swap will be an overcommitted resource in the context of this KEP.
  - Supporting zram, zswap, or other memory types like SGX EPC. These could be
  addressed in a follow-up KEP, and are out of scope.

@@ -147,12 +151,12 @@ will be necessary to implement the third scenario.
  ## Proposal

  We propose that, when swap is provisioned and available on a node, cluster
- administrators can configure the Kubelet and CRI such that:
+ administrators can configure the kubelet such that:

- - The kubelet can start with swap on.
- - The CRI is updated such that by default, workloads will use 0 swap.
- - The CRI will have configuration available such that swap utilization can be
- configured for the entire node.
+ - It can start with swap on.
+ - It will direct the CRI to allocate Kubernetes workloads 0 swap by default.
+ - It will have configuration options to configure swap utilization for the
+ entire node.

  This proposal enables scenarios 1 and 2 above, but not 3.

@@ -334,10 +338,8 @@ type KubeletConfiguration struct {
  type MemorySwapConfiguration struct {
      // Configure swap memory available to container workloads. May be one of
      // "", "NoSwap": workloads cannot use swap
-     // "WorkloadSpecifiedSwapLimit": workloads can use as much swap as their memory limit.
-     // "UnlimitedSwap": workloads can use unlimited swap, up to the system limit.
-     // "LimitedSwap": workloads can use a total of memory and swap up to this
-     // limit. When containers request more memory than this limit, they cannot use swap.
+     // "UnlimitedSwap": workloads can use unlimited swap, up to the allocatable limit.
+     // "LimitedSwap": workloads can use up to this limit of swap.
      SwapBehavior string

      LimitedSwap *LimitedSwapConfiguration
@@ -348,33 +350,25 @@ type LimitedSwapConfiguration struct {
  }
  ```

- We want to expose all possible swap settings based on the [Docker] and open
+ We want to expose common swap configurations based on the [Docker] and open
  container specification for the `--memory-swap` flag. Thus, the
  `MemorySwapConfiguration.SwapBehavior` setting will have the following effects:

  * If `SwapBehavior` is not set or set to `"NoSwap"`, containers do not have
  access to swap. This value effectively prevents a container from using swap,
  even if it is enabled on a system.
- * If `SwapBehavior` is set to `"WorkloadSpecifiedSwapLimit"`, then for
- containers with memory limit is set, the container can use as much swap as
- its memory limit setting. For instance, if a container requests 300Mi memory
- and `MemorySwapLimit` is not set, the container can use 600Mi total memory
- and swap.
  * If `SwapBehavior` is set to `"UnlimitedSwap"`, the container is allowed to
  use unlimited swap, up to the maximum amount available on the host system.
  * If `SwapBehavior` is set to `"LimitedSwap"`, then the `LimitedSwap`
  configuration must also be set. `LimitedSwap.PerWorkloadMemorySwapLimit`
- represents the system-wide maximum limit for combined memory and swap usage
- of a container. For example, if the limit is set to `1Gi`:
- * If the container's memory limit is 300Mi, it can use 1Gi combined memory
- and swap (e.g. up to 700Mi swap).
- * If the container's memory limit is 700Mi, it can use 1Gi combined memory
- and swap (e.g. up to 300Mi swap).
- * If the container's memory limit is 1Gi or greater, it cannot use swap.
+ represents the system-wide maximum limit for swap usage of a container. Note
+ that this limit applies to individual containers, and not at the pod-level,
+ in order to be set via the CRI rather than e.g. a [pod cgroup limit].

  [docker]: https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details
+ [pod cgroup limit]: #restrict-swap-usage-at-the-cgroup-level
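
To make the revised `SwapBehavior` semantics concrete, here is a minimal Go sketch of how a per-container swap limit could be derived from the configuration. The type and field names come from the diff. The `int64` byte type for `PerWorkloadMemorySwapLimit`, the `unlimited` sentinel, and the helper `containerSwapLimit` are assumptions made for illustration and are not part of the KEP.

```go
package main

import (
	"errors"
	"fmt"
)

// Behavior names as listed in the SwapBehavior comment of the revised struct.
const (
	NoSwap        = "NoSwap"
	UnlimitedSwap = "UnlimitedSwap"
	LimitedSwap   = "LimitedSwap"
)

// LimitedSwapConfiguration mirrors the KEP type. The field type is an
// assumption: the diff names PerWorkloadMemorySwapLimit but not its type,
// so plain bytes are used here.
type LimitedSwapConfiguration struct {
	PerWorkloadMemorySwapLimit int64
}

// MemorySwapConfiguration mirrors the struct shown in the diff.
type MemorySwapConfiguration struct {
	SwapBehavior string
	LimitedSwap  *LimitedSwapConfiguration
}

// unlimited is a hypothetical sentinel meaning "no per-container cap".
const unlimited int64 = -1

// containerSwapLimit is a hypothetical helper. It returns the swap, in bytes,
// that a single container may use under the given configuration.
func containerSwapLimit(cfg MemorySwapConfiguration) (int64, error) {
	switch cfg.SwapBehavior {
	case "", NoSwap:
		// Unset and "NoSwap" both mean containers get no swap.
		return 0, nil
	case UnlimitedSwap:
		return unlimited, nil
	case LimitedSwap:
		// The KEP text requires LimitedSwap to be set in this mode.
		if cfg.LimitedSwap == nil {
			return 0, errors.New("LimitedSwap must be set when SwapBehavior is LimitedSwap")
		}
		if cfg.LimitedSwap.PerWorkloadMemorySwapLimit < 0 {
			return 0, errors.New("PerWorkloadMemorySwapLimit must not be negative")
		}
		return cfg.LimitedSwap.PerWorkloadMemorySwapLimit, nil
	default:
		return 0, fmt.Errorf("unknown SwapBehavior %q", cfg.SwapBehavior)
	}
}

func main() {
	cfg := MemorySwapConfiguration{
		SwapBehavior: LimitedSwap,
		LimitedSwap:  &LimitedSwapConfiguration{PerWorkloadMemorySwapLimit: 512 << 20}, // 512Mi
	}
	limit, err := containerSwapLimit(cfg)
	fmt.Println(limit, err)
}
```

Unlike the removed wording, the limit here is swap only. It no longer depends on the container's memory limit, so the sketch needs no per-container arithmetic.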

- ### CRI Changes
+ #### CRI Changes

  The CRI requires a corresponding change in order to allow the kubelet to set
  swap usage in container runtimes. We will introduce a parameter
@@ -417,7 +411,6 @@ phase of graduation.
  to workloads. This will default to 0.
  - e2e test jobs are configured for Linux systems with swap enabled.

-
  #### Beta

  (Tentative.)
@@ -825,6 +818,24 @@ This inconsistency makes it difficult or impossible to use swap in production,
  particularly if a user wants to restrict workloads from using swap when using
  the CRI rather than dockershim.

+ ### Restrict swap usage at the cgroup level
+
+ Setting a swap limit at the cgroup level would allow us to restrict the usage
+ of swap on a pod-level, rather than container-level basis.
+
+ For alpha, we are opting for the container-level basis to simplify the
+ implementation (as the container runtimes already support configuration of swap
+ with the `memory-swap-limit` parameter). This will also provide the necessary
+ plumbing for container-level accounting of swap, if that is proposed in the
+ future.
+
+ In beta, we may want to revisit this.
+
+ See the [Pod Resource Management design proposal] for more background on the
+ cgroup limits the kubelet currently sets based on each QoS class.
+
+ [Pod Resource Management design proposal]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/pod-resource-management.md#pod-level-cgroups
+
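
One consequence of the container-level choice can be sketched in a few lines of Go. Because the limit applies to each container separately, a pod's worst-case swap use grows with its container count, whereas a pod-level cgroup limit would cap the sum directly. The helper names below are hypothetical and only illustrate the arithmetic.

```go
package main

import "fmt"

// containerLevelBound is a hypothetical helper: with a per-container limit,
// every container in the pod may use up to that limit on its own.
func containerLevelBound(containers int, perContainerLimit int64) int64 {
	return int64(containers) * perContainerLimit
}

// podLevelBound is a hypothetical helper: a pod-level cgroup limit would cap
// the combined swap of all containers, regardless of how many there are.
func podLevelBound(podLimit int64) int64 {
	return podLimit
}

func main() {
	const limit = int64(256 << 20) // 256Mi, an arbitrary example value
	fmt.Println(containerLevelBound(3, limit), podLevelBound(limit))
}
```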
  ## Infrastructure Needed (Optional)

  <!--