Commit 277c51a

Update based on reviewer feedback
1 parent 64e639a commit 277c51a

1 file changed: +40 -29 lines changed

keps/sig-node/2400-node-swap/README.md

Lines changed: 40 additions & 29 deletions
@@ -21,7 +21,7 @@
2121
- [Enabling swap as an end user](#enabling-swap-as-an-end-user)
2222
- [API Changes](#api-changes)
2323
- [KubeConfig addition](#kubeconfig-addition)
24-
- [CRI Changes](#cri-changes)
24+
- [CRI Changes](#cri-changes)
2525
- [Test Plan](#test-plan)
2626
- [Graduation Criteria](#graduation-criteria)
2727
- [Alpha](#alpha)
@@ -40,6 +40,7 @@
4040
- [Drawbacks](#drawbacks)
4141
- [Alternatives](#alternatives)
4242
- [Just set <code>--fail-swap-on=false</code>](#just-set-)
43+
- [Restrict swap usage at the cgroup level](#restrict-swap-usage-at-the-cgroup-level)
4344
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
4445
<!-- /toc -->
4546

@@ -108,7 +109,7 @@ node.
108109

109110
### Scenarios
110111

111-
1. Swap is enabled on a node's host system, but the CRI does not permit
112+
1. Swap is enabled on a node's host system, but the kubelet does not permit
112113
Kubernetes workloads to use swap. (This scenario is a prerequisite for the
113114
following use cases.)
114115
1. Swap is enabled at the node level. The CRI can be globally configured to
@@ -125,20 +126,23 @@ will be necessary to implement the third scenario.
125126

126127
- On Linux systems, when swap is provisioned and available, Kubelet can start
127128
up with swap on.
128-
- Configuration is available for CRI to set swap utilization available to
129+
- Configuration is available for kubelet to set swap utilization available to
129130
Kubernetes workloads, defaulting to 0 swap.
130-
- Cluster administrators can enable and configure CRI swap utilization on a
131+
- Cluster administrators can enable and configure kubelet swap utilization on a
131132
per-node basis.
132133
- Use of swap memory with both cgroupsv1 and cgroupsv2 is supported.
133134

134135
### Non-Goals
135136

137+
- Addressing non-Linux operating systems. Swap support will only be available
138+
for Linux.
136139
- Provisioning swap. Swap must already be available on the system.
137140
- Setting [swappiness]. This can already be set on a system-wide level outside
138141
of Kubernetes.
139142
- Allocating swap on a per-workload basis with accounting (e.g. pod-level
140143
specification of swap). If desired, this should be designed and implemented
141-
as part of a follow-up KEP. This KEP is a prerequisite for that work.
144+
as part of a follow-up KEP. This KEP is a prerequisite for that work. Hence,
145+
swap will be an overcommitted resource in the context of this KEP.
142146
- Supporting zram, zswap, or other memory types like SGX EPC. These could be
143147
addressed in a follow-up KEP, and are out of scope.
144148

@@ -147,12 +151,12 @@ will be necessary to implement the third scenario.
147151
## Proposal
148152

149153
We propose that, when swap is provisioned and available on a node, cluster
150-
administrators can configure the Kubelet and CRI such that:
154+
administrators can configure the kubelet such that:
151155

152-
- The kubelet can start with swap on.
153-
- The CRI is updated such that by default, workloads will use 0 swap.
154-
- The CRI will have configuration available such that swap utilization can be
155-
configured for the entire node.
156+
- It can start with swap on.
157+
- It will direct the CRI to allocate Kubernetes workloads 0 swap by default.
158+
- It will have configuration options to configure swap utilization for the
159+
entire node.
156160

157161
This proposal enables scenarios 1 and 2 above, but not 3.
158162

@@ -334,10 +338,8 @@ type KubeletConfiguration struct {
334338
type MemorySwapConfiguration struct {
335339
// Configure swap memory available to container workloads. May be one of
336340
// "", "NoSwap": workloads cannot use swap
337-
// "WorkloadSpecifiedSwapLimit": workloads can use as much swap as their memory limit.
338-
// "UnlimitedSwap": workloads can use unlimited swap, up to the system limit.
339-
// "LimitedSwap": workloads can use a total of memory and swap up to this
340-
// limit. When containers request more memory than this limit, they cannot use swap.
341+
// "UnlimitedSwap": workloads can use unlimited swap, up to the allocatable limit.
342+
// "LimitedSwap": workloads can use up to this limit of swap.
341343
SwapBehavior string
342344

343345
LimitedSwap *LimitedSwapConfiguration
@@ -348,33 +350,25 @@ type LimitedSwapConfiguration struct {
348350
}
349351
```
350352

351-
We want to expose all possible swap settings based on the [Docker] and open
353+
We want to expose common swap configurations based on the [Docker] and open
352354
container specification for the `--memory-swap` flag. Thus, the
353355
`MemorySwapConfiguration.SwapBehavior` setting will have the following effects:
354356

355357
* If `SwapBehavior` is not set or set to `"NoSwap"`, containers do not have
356358
access to swap. This value effectively prevents a container from using swap,
357359
even if it is enabled on a system.
358-
* If `SwapBehavior` is set to `"WorkloadSpecifiedSwapLimit"`, then for
359-
containers with memory limit is set, the container can use as much swap as
360-
its memory limit setting. For instance, if a container requests 300Mi memory
361-
and `MemorySwapLimit` is not set, the container can use 600Mi total memory
362-
and swap.
363360
* If `SwapBehavior` is set to `"UnlimitedSwap"`, the container is allowed to
364361
use unlimited swap, up to the maximum amount available on the host system.
365362
* If `SwapBehavior` is set to `"LimitedSwap"`, then the `LimitedSwap`
366363
configuration must also be set. `LimitedSwap.PerWorkloadMemorySwapLimit`
367-
represents the system-wide maximum limit for combined memory and swap usage
368-
of a container. For example, if the limit is set to `1Gi`:
369-
* If the container's memory limit is 300Mi, it can use 1Gi combined memory
370-
and swap (e.g. up to 700Mi swap).
371-
* If the container's memory limit is 700Mi, it can use 1Gi combined memory
372-
and swap (e.g. up to 300Mi swap).
373-
* If the container's memory limit is 1Gi or greater, it cannot use swap.
364+
represents the system-wide maximum limit for swap usage of a container. Note
365+
that this limit applies to individual containers, and not at the pod-level,
366+
in order to be set via the CRI rather than e.g. a [pod cgroup limit].
374367

375368
[docker]: https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details
369+
[pod cgroup limit]: #restrict-swap-usage-at-the-cgroup-level
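
The revised `SwapBehavior` semantics in this hunk can be illustrated with a small Go sketch. It is not taken from the KEP or the kubelet: `containerSwapLimit` is a hypothetical helper, the `int64` byte type for `PerWorkloadMemorySwapLimit` is an assumption (the diff does not show the field's type), and treating `"UnlimitedSwap"` as "all swap available on the host" follows the prose description rather than any confirmed implementation.

```go
package main

import "fmt"

// LimitedSwapConfiguration mirrors the type named in the KEP.
// ASSUMPTION: the limit is modeled as a byte count; the diff elides the
// real field definition.
type LimitedSwapConfiguration struct {
	PerWorkloadMemorySwapLimit int64
}

// MemorySwapConfiguration mirrors the struct shown in the diff.
type MemorySwapConfiguration struct {
	SwapBehavior string
	LimitedSwap  *LimitedSwapConfiguration
}

// containerSwapLimit is a hypothetical helper that returns the amount of
// swap (in bytes) a single container may use under the given configuration.
// hostSwapBytes stands in for the swap available on the node.
func containerSwapLimit(cfg MemorySwapConfiguration, hostSwapBytes int64) (int64, error) {
	switch cfg.SwapBehavior {
	case "", "NoSwap":
		// Unset or "NoSwap": the container gets no swap, even if the host has some.
		return 0, nil
	case "UnlimitedSwap":
		// No per-container cap beyond what the host can provide.
		return hostSwapBytes, nil
	case "LimitedSwap":
		// The LimitedSwap block must accompany this behavior.
		if cfg.LimitedSwap == nil {
			return 0, fmt.Errorf("SwapBehavior %q requires LimitedSwap to be set", cfg.SwapBehavior)
		}
		// The limit covers swap only and applies per container, not per pod.
		return cfg.LimitedSwap.PerWorkloadMemorySwapLimit, nil
	default:
		return 0, fmt.Errorf("unknown SwapBehavior %q", cfg.SwapBehavior)
	}
}
```

Under these assumptions, a pod with several containers could use up to the per-container limit in each of them, which is the trade-off the "Restrict swap usage at the cgroup level" alternative discusses.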
376370

377-
### CRI Changes
371+
#### CRI Changes
378372

379373
The CRI requires a corresponding change in order to allow the kubelet to set
380374
swap usage in container runtimes. We will introduce a parameter
@@ -417,7 +411,6 @@ phase of graduation.
417411
to workloads. This will default to 0.
418412
- e2e test jobs are configured for Linux systems with swap enabled.
419413

420-
421414
#### Beta
422415

423416
(Tentative.)
@@ -825,6 +818,24 @@ This inconsistency makes it difficult or impossible to use swap in production,
825818
particularly if a user wants to restrict workloads from using swap when using
826819
the CRI rather than dockershim.
827820

821+
### Restrict swap usage at the cgroup level
822+
823+
Setting a swap limit at the cgroup level would allow us to restrict the usage
824+
of swap on a pod-level, rather than container-level basis.
825+
826+
For alpha, we are opting for the container-level basis to simplify the
827+
implementation (as the container runtimes already support configuration of swap
828+
with the `memory-swap-limit` parameter). This will also provide the necessary
829+
plumbing for container-level accounting of swap, if that is proposed in the
830+
future.
831+
832+
In beta, we may want to revisit this.
833+
834+
See the [Pod Resource Management design proposal] for more background on the
835+
cgroup limits the kubelet currently sets based on each QoS class.
836+
837+
[Pod Resource Management design proposal]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/pod-resource-management.md#pod-level-cgroups
838+
828839
## Infrastructure Needed (Optional)
829840

830841
<!--
