Commit 27c3038

Rewrite implementation details, address feedback
1 parent 3577d17 commit 27c3038

2 files changed

+97
-116
lines changed

keps/sig-node/2400-node-swap/README.md

Lines changed: 96 additions & 116 deletions
@@ -1,21 +1,5 @@
11
# KEP-2400: Node system swap support
22

3-
<!--
4-
This is the title of your KEP. Keep it short, simple, and descriptive. A good
5-
title can help communicate what the KEP is and should be considered as part of
6-
any review.
7-
-->
8-
9-
<!--
10-
A table of contents is helpful for quickly jumping to sections of a KEP and for
11-
highlighting any additional information provided beyond the standard KEP
12-
template.
13-
14-
Ensure the TOC is wrapped with
15-
<code>&lt;!-- toc --&rt;&lt;!-- /toc --&rt;</code>
16-
tags, and then generate with `hack/update-toc.sh`.
17-
-->
18-
193
<!-- toc -->
204
- [Release Signoff Checklist](#release-signoff-checklist)
215
- [Summary](#summary)
@@ -34,6 +18,10 @@ tags, and then generate with `hack/update-toc.sh`.
3418
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
3519
- [Risks and Mitigations](#risks-and-mitigations)
3620
- [Design Details](#design-details)
21+
- [Enabling swap as an end user](#enabling-swap-as-an-end-user)
22+
- [API Changes](#api-changes)
23+
- [KubeConfig addition](#kubeconfig-addition)
24+
- [CRI Changes](#cri-changes)
3725
- [Test Plan](#test-plan)
3826
- [Graduation Criteria](#graduation-criteria)
3927
- [Alpha](#alpha)
@@ -121,20 +109,24 @@ This KEP will be limited in scope to the first two scenarios. The third can be a
121109
- On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on.
122110
- Configuration is available for CRI to set swap utilization available to Kubernetes workloads, defaulting to 0 swap.
123111
- Cluster administrators can enable and configure CRI swap utilization on a per-node basis.
112+
- Use of swap memory with both cgroupsv1 and cgroupsv2 is supported.
124113

125114
### Non-Goals
126115

127116
- Provisioning swap. Swap must already be available on the system.
117+
- Setting [swappiness]. This can already be set on a system-wide level outside of Kubernetes.
128118
- Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap). If desired, this should be designed and implemented as part of a follow-up KEP. This KEP is a prerequisite for that work.
129119
- Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope.
130120

121+
[swappiness]: https://en.wikipedia.org/wiki/Memory_paging#Swappiness
122+
131123
## Proposal
132124

133-
I propose that, when swap is provisioned and available on a node, we allow cluster administrators to configure the Kubelet and CRI such that:
125+
We propose that, when swap is provisioned and available on a node, cluster administrators can configure the Kubelet and CRI such that:
134126

135127
- The kubelet can start with swap on.
136128
- The CRI is updated such that by default, workloads will use 0 swap.
137-
- The CRI will have configuration available such that swap utilization can be configured for the entire node (e.g. as a percentage of pod memory requests).
129+
- The CRI will have configuration available such that swap utilization can be configured for the entire node.
138130

139131
This proposal enables scenarios 1 and 2 above, but not 3.
140132

@@ -201,133 +193,121 @@ This user story is addressed by scenario 2, and could benefit from 3.
201193

202194
### Notes/Constraints/Caveats (Optional)
203195

204-
<!--
205-
What are the caveats to the proposal?
206-
What are some important details that didn't come across above?
207-
Go in to as much detail as necessary here.
208-
This might be a good place to talk about core concepts and how they relate.
209-
-->
196+
In changing the CRI, we must ensure that container runtime downstreams are able to support the new configurations.
210197

211-
### Risks and Mitigations
198+
We considered adding parameters for both per-workload `memory-swap` and `swappiness`. These are documented as part of the Open Containers [runtime specification] for Linux memory configuration. Since `memory-swap` is a per-workload parameter, and `swappiness` is optional and can be set globally, we are choosing to only expose `memory-swap` which will adjust swap available to workloads.
212199

213-
Having swap available on a system reduces predictability. When swap is available to workloads, and is not accounted for on an individual workload-by-workload basis
200+
Since we are not currently setting `memory-swap` in the CRI, the default behaviour is to allocate the same amount of swap for a workload as memory requested. We will update the default to not permit the use of swap by setting `memory-swap` equal to `limit`.
214201

215-
First, this risk is mitigated by preventing any workloads from using swap by default, even if it is enabled on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization.
202+
[runtime specification]: https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
216203

217-
Additionally, we mitigate this risk by quantifying system stability and then gathering test and production data to determine if system stability remains the same or is improved when swap is available to the system and/or workloads.
204+
### Risks and Mitigations
218205

219-
Since swap provisioning is out of scope of this proposal, this enhancement poses little risk to Kubernetes clusters that will not enable swap.
206+
Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.
220207

221-
## Design Details
208+
This risk is mitigated by preventing any workloads from using swap by default, even if swap is enabled and available on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization.
222209

223-
### TL;DR
210+
Additionally, we will mitigate this risk by determining a set of metrics to quantify system stability and then gathering test and production data to determine if system stability changes when swap is available to the system and/or workloads in a number of different scenarios.
224211

225-
In a nutshell, the following implementation are planned for Memory Swap Support
226-
in 1.22 GKE alpha
212+
Since swap provisioning is out of scope of this proposal, this enhancement poses low risk to Kubernetes clusters that will not enable swap.
227213

228-
1. Having a feature gate `SupportNodeMemorySwap` guarding against the memory
229-
swap support feature
230-
2. Keep the default value of kubelet flag `--fail-on-swap` to `true` in order
231-
to minimize the blast radius
232-
3. Introducing two new kubelet config `MemorySwapLimit` and `Swappiness`
233-
4. Introducing two new CRI parameter `memory_swap_limit_in_bytes` and `memory_swappiness`
234-
5. End to end wiring from kubelet config file to CRI
214+
## Design Details
235215

236-
### Expected User Behaviour
216+
We summarize the implementation plan as follows:
237217

238-
For alpha, the feature gate `SupportNodeMemorySwap` is default to disabled, and
239-
`--fail-on-swap` flag value is the same as 1.21. Therefore, from Kubernetes
240-
user’s perspective, no behavior changes out of the box.
218+
1. Add a feature gate `NodeSwapEnabled` to enable swap support.
219+
1. Leave the default value of the kubelet flag `--fail-on-swap` as `true`, to avoid changing default behaviour.
220+
1. Introduce a new kubelet config parameter, `MemorySwapLimit`.
221+
1. Introduce a new CRI parameter, `memory_swap_limit_in_bytes`.
222+
1. Integrate new kubelet config and pass values to CRI for container creation.
223+
1. Ensure container runtimes are updated so they can make use of the new CRI configuration.
241224

242-
For users that are ready to explore the Memory Swap feature in 1.22 Alpha, they
243-
will need to complete the following steps
225+
### Enabling swap as an end user
244226

245-
1. provision swap enable `SupportNodeMemorySwap` flag AND
246-
2. set `--fail-on-swap` flag to `false`
227+
Swap can be enabled as follows:
247228

248-
Then, the user can start experimenting/fine tuning kubelet configuration
249-
`MemorySwapLimit` and/or `Swappiness` and observe the changes.
229+
1. Provision swap on the target worker nodes,
230+
1. Enable the `NodeSwapEnabled` feature gate on the kubelet,
231+
1. Set `--fail-on-swap` flag to `false`, and
232+
1. (Optional) Configure `MemorySwapLimit` in the KubeletConfig for tuning.
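
The interaction between these steps can be sketched in Go. This is an illustrative sketch only, not the kubelet's implementation: the `nodeSwapSettings` type and both functions are hypothetical names introduced here, and the rule that workloads get swap only when `MemorySwapLimit` is set follows the field documentation in this KEP.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeSwapSettings is a hypothetical grouping of the inputs named in the
// steps above. It is not a real kubelet type.
type nodeSwapSettings struct {
	SwapProvisioned bool   // swap is on at the operating-system level
	FeatureGateOn   bool   // the NodeSwapEnabled feature gate
	FailOnSwap      bool   // value of --fail-on-swap (defaults to true)
	MemorySwapLimit *int64 // optional KubeletConfiguration value
}

var errSwapOn = errors.New("swap is on and --fail-on-swap is true")

// checkStartup models the unchanged default: with swap on and
// --fail-on-swap left at true, the kubelet does not start.
func checkStartup(s nodeSwapSettings) error {
	if s.SwapProvisioned && s.FailOnSwap {
		return errSwapOn
	}
	return nil
}

// workloadsMayUseSwap reports whether containers could be granted swap.
// Assumption: an unset MemorySwapLimit means workloads get no swap, even
// when the node itself runs with swap on.
func workloadsMayUseSwap(s nodeSwapSettings) bool {
	return s.SwapProvisioned && s.FeatureGateOn && !s.FailOnSwap && s.MemorySwapLimit != nil
}

func main() {
	limit := int64(0)
	tuned := nodeSwapSettings{SwapProvisioned: true, FeatureGateOn: true, MemorySwapLimit: &limit}
	defaults := nodeSwapSettings{SwapProvisioned: true, FailOnSwap: true}

	fmt.Println("tuned node starts:", checkStartup(tuned) == nil)
	fmt.Println("tuned node workloads may swap:", workloadsMayUseSwap(tuned))
	fmt.Println("default node start error:", checkStartup(defaults))
}
```

Skipping the optional last step leaves the node in the system-level-only mode: the kubelet starts with swap on, but no workload is granted swap.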
250233

251-
### New Kubelet Configuration
234+
### API Changes
252235

253-
We will be introducing two new parameters to `KubeletConfiguration struct`
254-
defined in
255-
[https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go).
256-
These two configurations, if set, will apply to every container of the Node
257-
where kubelet is running.
236+
#### KubeConfig addition
258237

259-
|Name|Description|Default Value|Feature Gate|
260-
|--- |--- |--- |--- |
261-
|MemorySwapLimit|This parameter sets total memory limit (memory + swap). This limits the total amount of memory this container is allowed to swap to disk.|-2, which enable disable swap|SupportNodeMemorySwap|
262-
|MemorySwappiness|This configuration sets how aggressively the kernel will swap memory pages. By default, the host kernel can swap out a percentage of anonymous pages used by a container. Users can set value between 0 and 100, to tune this percentage.|Unset, which will use host value|SupportNodeMemorySwap|
238+
We will add an optional `MemorySwapLimit` value to the `KubeletConfiguration` struct in [pkg/kubelet/apis/config/types.go] for a compatible API change as follows:
263239

264-
#### MemorySwapLimit details
240+
[pkg/kubelet/apis/config/types.go]: https://github.com/kubernetes/kubernetes/blob/6baad0a1d45435ff5844061aebab624c89d698f8/pkg/kubelet/apis/config/types.go#L81
265241

266-
MemorySwapLimit configuration is a kubelet flag that only takes effect on a
267-
container that has a memory limit set, either explicitly from
268-
[PodSpec]([https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits)
269-
) or implicitly from [Resource
270-
Quota]([https://kubernetes.io/docs/concepts/policy/resource-quotas/](https://kubernetes.io/docs/concepts/policy/resource-quotas/)
271-
).
242+
```go
243+
// KubeletConfiguration contains the configuration for the Kubelet
244+
type KubeletConfiguration struct {
245+
metav1.TypeMeta
246+
...
247+
// Configure swap memory available to container workloads.
248+
// If not set, workloads cannot use swap.
249+
// If set to 0, workloads can use as much swap as their memory limit.
250+
// If set to -1, workloads can use unlimited swap, up to the system limit.
251+
// If set to a positive integer, workloads can use a total of memory and swap up to this
252+
// limit. Containers whose memory limit meets or exceeds this value cannot use swap.
253+
// +featureGate=NodeSwapEnabled
254+
// +optional
255+
MemorySwapLimit *int64
256+
}
257+
```
272258

273259
For containers with a memory limit set, the `MemorySwapLimit` setting will have the
274-
following effects, [similar to
275-
docker](https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details)
276-
277-
* If MemorySwapLimit is set to a positive integer,
278-
* If the memory limit of the container is greater or equal to
279-
MemorySwapLimit, then no swap is allowed, the container does not have
280-
access to swap.
281-
* If the memory limit of the container is less than MemorySwapLimit, then
282-
MemorySwapLimit represents the total amount of memory and swap that can be
283-
used. For example, for a container with memory limit set to 300m, and
284-
`MemorySwapLimit` set to 1g, the container can use 300m of memory and 700m (1g
285-
- 300m) swap.
286-
* If MemorySwapLimit is set to 0, for containers with memory limit is set, the
287-
container can use as much swap as the Memory limit setting, if the host
288-
container has swap memory configured. For instance, if a container requests
289-
memory="300m" and MemorySwapLimit is not set, the container can use 600m in
290-
total of memory and swap.
291-
* If MemorySwapLimit is explicitly set to -1, the container is allowed to use
292-
unlimited swap, up to the amount available on the host system.
293-
* If MemorySwapLimit is explicitly set to -2, the container does not have
294-
access to swap. This value effectively prevents a container from using swap.
295-
296-
In summary, for users experimenting with this feature
297-
298-
|MemorySwapLimit|container memory limit (explicit or implicit)|Expected Behavior|Comment|
299-
|--- |--- |--- |--- |
300-
|Any|not set|N/A|Same as docker|
301-
|-2|N|no swap allowed, this is the default value||
302-
|-1|N|unlimited swap|Same as docker|
303-
|0|N|container can use up to N swap (ie: 2N memory+swap)|Same as docker|
304-
|X where X > 0|N where N < X|container can use up to X-N swap (ie: 2N memory+swap)|Same as docker|
305-
|X where X > 0|N where N >= X|no swap allowed (ie: N memory only)|Same as docker|
306-
307-
#### MemorySwappiness details
308-
309-
* A value of 0 turns off anonymous page swapping.
310-
* A value of 100 sets all anonymous pages as swappable.
311-
* By default, if you do not set MemorySwappiness, the value is inherited from
312-
the host machine.
260+
following effects, consistent with [Docker] and the Open Containers runtime specification:
261+
262+
* If `MemorySwapLimit` is not set, containers do not have access to swap. This
263+
value effectively prevents a container from using swap, even if it is enabled
264+
on a system.
265+
* If `MemorySwapLimit` is set to 0, for containers with a memory limit set, the
266+
container can use as much swap as its memory limit setting. For instance, if
267+
a container requests 300Mi memory and `MemorySwapLimit` is set to 0, the
268+
container can use 600Mi total memory and swap.
269+
* If `MemorySwapLimit` is set to -1, the container is allowed to use
270+
unlimited swap, up to the maximum amount available on the host system.
271+
* If `MemorySwapLimit` is set to a positive integer, then for containers with a
272+
memory limit set, that value represents the system-wide maximum limit for
273+
combined memory and swap usage of a container. For example, if
274+
`MemorySwapLimit` is set to 1073741824 (1Gi):
275+
* If the container's memory limit is 300Mi, it can use 1Gi combined memory
276+
and swap (e.g. up to 724Mi swap).
277+
* If the container's memory limit is 700Mi, it can use 1Gi combined memory
278+
and swap (e.g. up to 324Mi swap).
279+
* If the container's memory limit is 1Gi or greater, it cannot use swap.
280+
281+
[docker]: https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details
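
The rules above can be read as a single function from the node-wide `MemorySwapLimit` and a container's memory limit to the combined memory and swap limit for that container. The Go sketch below is one possible reading, not the kubelet's implementation. The helper name `effectiveMemorySwapLimit` is hypothetical, and two behaviours are assumptions because the text does not specify them: containers without a memory limit get 0 (unspecified), and negative values other than -1 are treated as "no swap".

```go
package main

import "fmt"

// effectiveMemorySwapLimit is a hypothetical helper. It returns the combined
// memory+swap limit in bytes for one container, or -1 for unlimited swap.
func effectiveMemorySwapLimit(memorySwapLimit *int64, memoryLimit int64) int64 {
	if memoryLimit <= 0 {
		// Assumption: without a memory limit the value stays unspecified.
		return 0
	}
	if memorySwapLimit == nil {
		// Not set: memory+swap equals the memory limit, so no swap.
		return memoryLimit
	}
	limit := *memorySwapLimit
	switch {
	case limit == -1:
		// Unlimited swap, up to what the host has available.
		return -1
	case limit == 0:
		// As much swap as the memory limit.
		return 2 * memoryLimit
	case limit > memoryLimit:
		// The node-wide value caps combined memory and swap.
		return limit
	default:
		// Memory limit meets or exceeds the node-wide value: no swap.
		// Assumption: other negative values also end up here.
		return memoryLimit
	}
}

func main() {
	const mi = int64(1) << 20
	const gi = int64(1) << 30
	nodeLimit := gi

	for _, memLimit := range []int64{300 * mi, 700 * mi, gi} {
		total := effectiveMemorySwapLimit(&nodeLimit, memLimit)
		fmt.Printf("memory=%dMi swap=%dMi\n", memLimit/mi, (total-memLimit)/mi)
	}
}
```

Returning the combined limit rather than a swap-only amount matches the `memory-swap` convention referenced above, where the value covers memory plus swap.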
313282

314283
### CRI Changes
315284

316-
We will be introducing the following two parameters
317-
`memory_swap_limit_in_bytes` and `memory_swappiness` to `message
318-
LinuxContainerResources` defined in
319-
[https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580)
320-
321-
|Name|Type|Description|Default Value|Feature Gate|
322-
|--- |--- |--- |--- |--- |
323-
|`memory_swap_limit_in_bytes`|int64|set/show limit of memory+swap usage|Default 0, which is unspecified.|SupportNodeMemorySwap|
324-
|`memory_swappiness`|int64|set/show swappiness parameter|Default 0, which is unspecified.|SupportNodeMemorySwap|
285+
The CRI requires a corresponding change in order to allow the kubelet to set swap usage in container runtimes.
286+
We will introduce a parameter `memory_swap_limit_in_bytes` to the CRI API (found in [k8s.io/cri-api/pkg/apis/runtime/v1/api.proto]):
287+
288+
[k8s.io/cri-api/pkg/apis/runtime/v1/api.proto]: https://github.com/kubernetes/kubernetes/blob/6baad0a1d45435ff5844061aebab624c89d698f8/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580
289+
290+
```protobuf
291+
// LinuxContainerResources specifies Linux specific configuration for
292+
// resources.
293+
message LinuxContainerResources {
294+
...
295+
// Memory limit in bytes. Default: 0 (not specified).
296+
int64 memory_limit_in_bytes = 4;
297+
// Memory + swap limit in bytes. Default: 0 (not specified).
298+
int64 memory_swap_limit_in_bytes = 9;
299+
...
300+
// List of HugepageLimits to limit the HugeTLB usage of container per page size. Default: nil (not specified).
301+
repeated HugepageLimit hugepage_limits = 8;
302+
}
303+
```
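
Because `memory_swap_limit_in_bytes` is a combined memory and swap value, a runtime targeting cgroups v2 would need to derive a swap-only figure, since the v2 `memory.swap.max` file limits swap alone while the v1 `memory.memsw.limit_in_bytes` file limits memory plus swap. The Go sketch below illustrates one way such a conversion might look. The function `cgroupV2SwapMax` is a hypothetical name, and its error handling is an assumption rather than the behaviour of any particular runtime.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// cgroupV2SwapMax is a hypothetical helper that turns a combined memory+swap
// limit into the string a runtime could write to memory.swap.max.
// An empty string means "leave the file untouched".
func cgroupV2SwapMax(memorySwapLimit, memoryLimit int64) (string, error) {
	switch {
	case memorySwapLimit == 0:
		// CRI default: not specified.
		return "", nil
	case memorySwapLimit == -1:
		// Unlimited swap.
		return "max", nil
	case memoryLimit <= 0:
		return "", errors.New("a memory limit is required to derive a swap-only limit")
	case memorySwapLimit < memoryLimit:
		return "", errors.New("memory+swap limit is smaller than the memory limit")
	}
	return strconv.FormatInt(memorySwapLimit-memoryLimit, 10), nil
}

func main() {
	const mi = int64(1) << 20
	const gi = int64(1) << 30

	swapMax, err := cgroupV2SwapMax(gi, 300*mi)
	fmt.Println(swapMax, err)

	swapMax, err = cgroupV2SwapMax(300*mi, 300*mi)
	fmt.Println(swapMax, err)
}
```

When the combined limit equals the memory limit, the derived value is `0`, which corresponds to the proposed default of no swap for workloads.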
325304

326305
### Test Plan
327306

328307
For alpha:
329308

330309
- Swap scenarios are enabled in test-infra for at least two Linux distributions. e2e suites will be run against them.
310+
- Container runtimes must be bumped in CI to use the new CRI.
331311
- Data should be gathered from a number of use cases to guide beta graduation and further development efforts.
332312

333313
Once this data is available, additional test plans should be added for the next phase of graduation.
@@ -426,7 +406,7 @@ Pick one of these and delete the rest.
426406

427407
- [x] Feature gate (also fill in values in `kep.yaml`)
428408
- Feature gate name: NodeSwapEnabled
429-
- Components depending on the feature gate: Kubelet
409+
- Components depending on the feature gate: API Server, Kubelet
430410
- [ ] Other
431411
- Describe the mechanism:
432412
- Will enabling / disabling the feature require downtime of the control

keps/sig-node/2400-node-swap/kep.yaml

Lines changed: 1 addition & 0 deletions
@@ -35,5 +35,6 @@ milestone:
3535
feature-gates:
3636
- name: NodeSwapEnabled
3737
components:
38+
- kube-apiserver
3839
- kubelet
3940
disable-supported: false
