Commit b5c0dae

Address next round of reviewer feedback
1 parent 277c51a commit b5c0dae

1 file changed: +39 -34 lines changed

keps/sig-node/2400-node-swap/README.md

Lines changed: 39 additions & 34 deletions
@@ -112,10 +112,11 @@ node.
 1. Swap is enabled on a node's host system, but the kubelet does not permit
 Kubernetes workloads to use swap. (This scenario is a prerequisite for the
 following use cases.)
-1. Swap is enabled at the node level. The CRI can be globally configured to
-permit user workloads scheduled on the node to use some quantity of swap.
-1. Swap is set on a per-workload basis. The CRI sets permitted swap utilization
-on each individual workload.
+1. Swap is enabled at the node level. The kubelet can permit Kubernetes
+workloads scheduled on the node to use some quantity of swap, depending on
+the configuration.
+1. Swap is set on a per-workload basis. The kubelet sets swap limits for each
+individual workload.

 This KEP will be limited in scope to the first two scenarios. The third can be
 addressed in a follow-up KEP. The enablement work that is in scope for this KEP
@@ -164,9 +165,9 @@ This proposal enables scenarios 1 and 2 above, but not 3.

 #### Improved Node Stability

-cgroupsv2 improved memory management algos, such as oomd, currently require
-swap. Hence, having a small amount of swap available on nodes could improve
-better resource pressure handling and recovery.
+cgroupsv2 improved memory management algorithms, such as oomd, strongly
+recommend the use of swap. Hence, having a small amount of swap available on
+nodes could improve better resource pressure handling and recovery.

 - https://man7.org/linux/man-pages/man8/systemd-oomd.service.8.html
 - https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#id1
@@ -253,7 +254,7 @@ This user story is addressed by scenario 2, and could benefit from 3.

 ### Notes/Constraints/Caveats (Optional)

-In changing the CRI, we must ensure that container runtime downstreams are able
+In updating the CRI, we must ensure that container runtime downstreams are able
 to support the new configurations.

 We considered adding parameters for both per-workload `memory-swap` and
@@ -301,26 +302,31 @@ We summarize the implementation plan as following:
 1. Add a feature gate `NodeSwapEnabled` to enable swap support.
 1. Leave the default value of kubelet flag `--fail-on-swap` to `true`, to avoid
 changing default behaviour.
-1. Introduce a new kubelet config parameter, `MemorySwapLimit`.
+1. Introduce a new kubelet config parameter, `MemorySwap`, which configures how
+much swap Kubernetes workloads can use on the node.
 1. Introduce a new CRI parameter, `memory_swap_limit_in_bytes`.
-1. Integrate new kubelet config and pass values to CRI for container creation.
-1. Ensure container runtimes are updated so they can make use of the new CRI configuration.
+1. Ensure container runtimes are updated so they can make use of the new CRI
+configuration.
+1. Based on the behaviour set in the kubelet config, the kubelet will instruct
+the CRI on the amount of swap to allocate to each container. The container
+runtime will then write the swap settings to the container level cgroup.

 ### Enabling swap as an end user

 Swap can be enabled as follows:

 1. Provision swap on the target worker nodes,
-1. Enable `NodeMemorySwap` flag on the kubelet,
+1. Enable the `NodeMemorySwap` feature flag on the kubelet,
 1. Set `--fail-on-swap` flag to `false`, and
-1. (Optional) Configure `MemorySwapLimit` in the KubeletConfig for tuning.
+1. (Optional) Allow Kubernetes workloads to use swap by setting
+`MemorySwap.SwapBehavior=UnlimitedSwap` in the kubelet config.

 ### API Changes

 #### KubeConfig addition

-We will add an optional `MemorySwapLimit` value to the `KubeletConfig` struct
-in [pkg/kubelet/apis/config/types.go] for a compatible API change as follows:
+We will add an optional `MemorySwap` value to the `KubeletConfig` struct
+in [pkg/kubelet/apis/config/types.go] as follows:

 [pkg/kubelet/apis/config/types.go]: https://github.com/kubernetes/kubernetes/blob/6baad0a1d45435ff5844061aebab624c89d698f8/pkg/kubelet/apis/config/types.go#L81

@@ -339,14 +345,7 @@ type MemorySwapConfiguration struct {
 // Configure swap memory available to container workloads. May be one of
 // "", "NoSwap": workloads cannot use swap
 // "UnlimitedSwap": workloads can use unlimited swap, up to the allocatable limit.
-// "LimitedSwap": workloads can use up to this limit of swap.
 SwapBehavior string
-
-LimitedSwap *LimitedSwapConfiguration
-}
-
-type LimitedSwapConfiguration struct {
-PerWorkloadMemorySwapLimit resource.Quantity
 }
 ```

@@ -359,14 +358,8 @@ container specification for the `--memory-swap` flag. Thus, the
 even if it is enabled on a system.
 * If `SwapBehavior` is set to `"UnlimitedSwap"`, the container is allowed to
 use unlimited swap, up to the maximum amount available on the host system.
-* If `SwapBehavior` is set to `"LimitedSwap"`, then the `LimitedSwap`
-configuration must also be set. `LimitedSwap.PerWorkloadMemorySwapLimit`
-represents the system-wide maximum limit for swap usage of a container. Note
-that this limit applies to individual containers, and not at the pod-level,
-in order to be set via the CRI rather than e.g. a [pod cgroup limit].

 [docker]: https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details
-[pod cgroup limit]: #restrict-swap-usage-at-the-cgroup-level

 #### CRI Changes

@@ -406,24 +399,36 @@ phase of graduation.

 #### Alpha

-- Kubelet can be started with swap enabled.
-- KubeletConfig allows CRI to be configured with a percentage of swap available
-to workloads. This will default to 0.
+- Kubelet can be started with swap enabled and will support two configurations
+for Kubernetes workloads: `NoSwap` and `UnlimitedSwap`.
+- Kubelet can configure CRI to allocate swap to Kubernetes workloads. By
+default, workloads will not be allocated any swap.
 - e2e test jobs are configured for Linux systems with swap enabled.

 #### Beta

-(Tentative.)
+_(Tentative.)_

+- Add support for controlling swap consumption at the pod level [via cgroups].
+- Handle usage of swap during container restart boundaries for writes to tmpfs
+(which may require pod cgroup change beyond what container runtime will do at
+container cgroup boundary).
+- Add the ability to set a system-reserved quantity of swap from what kubelet
+detects on the host.
+- Consider introducing new configuration modes for swap, such as a node-wide
+swap limit for workloads.
 - Determine a set of metrics for node QoS in order to evaluate the performance
 of nodes with and without swap enabled.
+- Better understand relationship of swap with memory QoS in cgroup v2
+(particularly `memory.high` usage).
 - Collect feedback from test user cases.
 - Improve coverage for appropriate scenarios in testgrid.

+[via cgroups]: #restrict-swap-usage-at-the-cgroup-level
+
 #### GA

-- Test a wide variety of scenarios that may be affected by swap support, such
-as workloads using tmpfs storage.
+- Test a wide variety of scenarios that may be affected by swap support.
 - Remove feature flag.

 ### Upgrade / Downgrade Strategy
