Commit cdcf69e

Merge pull request #5434 from iholder101/swap/future-extensions
[KEP-2400] GA KEP update
2 parents 7f0748f + 6f62c84 commit cdcf69e

2 files changed: +13 -77 lines changed

keps/sig-node/2400-node-swap/README.md

Lines changed: 12 additions & 76 deletions
@@ -18,10 +18,6 @@
1818
- [Swap as the default](#swap-as-the-default)
1919
- [Steps to Calculate Swap Limit](#steps-to-calculate-swap-limit)
2020
- [Example](#example)
21-
- [Swap-aware Evictions](#swap-aware-evictions)
22-
- [Background](#background)
23-
- [Defining "accessible swap"](#defining-accessible-swap)
24-
- [Changes to the eviction manager's memory pressure handling](#changes-to-the-eviction-managers-memory-pressure-handling)
2521
- [User Stories](#user-stories)
2622
- [Improved Node Stability](#improved-node-stability)
2723
- [Long-running applications that swap out startup memory](#long-running-applications-that-swap-out-startup-memory)
@@ -187,7 +183,6 @@ will be necessary to implement the third scenario.
187183
- Cluster administrators can enable and configure kubelet swap utilization on a
188184
per-node basis.
189185
- Use of swap memory for cgroupsv2.
190-
- Swap-aware eviction manager.
191186

192187
### Non-Goals
193188

@@ -205,6 +200,11 @@ will be necessary to implement the third scenario.
205200
- Supporting zram, zswap, or other memory types like SGX EPC. These could be
206201
addressed in a follow-up KEP, and are out of scope.
207202
- Use of swap for cgroupsv1.
203+
- Add pod-level APIs to control swap configuration on a per-pod basis.
204+
- Add scheduling mechanisms to target nodes with swap enabled/disabled or a certain swap configuration.
205+
- Make the eviction manager swap-aware.
206+
The existing eviction hierarchy based on QoS tiers is preserved to ensure introduction of swap does not fundamentally
207+
change how Kubernetes handles node-pressure eviction.
208208

209209
[swappiness]: https://en.wikipedia.org/wiki/Memory_paging#Swappiness
210210

@@ -315,67 +315,6 @@ In this example, Container A would have a swap limit of 19 GB, and Container B w
315315

316316
This approach allocates swap limits based on each container's memory request and adjusts the proportion based on the total swap memory available in the system. It ensures that each container gets a fair share of the swap space and helps maintain resource allocation efficiency.
317317

318-
### Swap-aware Evictions
319-
320-
As part of this KEP, the eviction manager will be enhanced to be swap-aware.
321-
This update will enable the eviction manager to account for swap usage in its decision-making process.
322-
By doing so, it will help prevent the system from exhausting swap space, thereby maintaining system stability and responsiveness.
323-
324-
#### Background
325-
326-
Before this KEP, kubelet's eviction manager completely overlooked swap memory, leading to several issues:
327-
* Inaccessible Swap: The memory eviction threshold is configured in such a way that swap is never triggered during node-level pressure,
328-
as eviction occurs before the node starts swapping memory.
329-
* Unfairness & Instability: The eviction manager may evict the "wrong" or innocent pods, failing to address the actual memory pressure.
330-
* Unexpected Behavior: Pods that exceed their memory limits (with regular and swap memory) are not evicted first,
331-
even though they would immediately get killed if swap were not used.
332-
333-
Here we present an extension to the eviction manager that will address these issues by becoming swap-aware.
334-
The proposed logic is fully backward compatible and requires no additional configuration, making it completely transparent to the user.
335-
336-
To achieve this, we recommend enhancing the eviction manager's memory pressure handling to account for swap memory, rather than adding a distinct swap signal.
337-
Memory and swap are inherently connected and should be addressed as a single issue.
338-
By integrating swap memory into the eviction manager's logic, we ensure a more accurate and efficient handling of system resources.
339-
For example, separating memory and swap memory is problematic because swap is not used until memory is full.
340-
However, with the approach suggested in this KEP memory will not be considered full until the accessible swap is also full.
341-
342-
#### Defining "accessible swap"
343-
344-
Let `accessible swap` be the amount of swap that is accessible by pods according to the [LimitedSwap swap behavior](#steps-to-calculate-swap-limit).
345-
Note that the amount of accessible swap changes in time according to the pods running on the node.
346-
347-
In addition, note that since only some of the Burstable QoS pods will have access to swap, the swap space will
348-
almost never be used in its entirety by workloads. In other words, this approach will effectively leave some of
349-
the swap space inaccessible for pods and reserved for system daemons and other system processes.
350-
351-
#### Changes to the eviction manager's memory pressure handling
352-
353-
When dealing with evictions, there are two main questions needed to be answered:
354-
how to identify when the node is under pressure and how to rank pods for eviction.
355-
356-
The eviction manager will become swap aware by making the following changes to its memory pressure handling:
357-
- **How to identify pressure**: The eviction manager will consider the total sum of all running pods' accessible swap as additional memory capacity.
358-
- **How to rank pods for eviction**: In the context of ranking pods for evictions, swap memory is considered as additional "regular" memory
359-
and accessible swap is considered as additional memory request.
360-
This is relevant for checking whether memory requests are exceeded [1] or for identifying which pods uses more memory [2].
361-
362-
In other words, the order of evictions documented [3] will have to change to the following:
363-
> ```
364-
> The kubelet uses the following parameters to determine the pod eviction order:
365-
> 1. Whether the pod's resource usage with swap (memory usage + swap usage) exceeds requests with swap (memory requests + swap requests).
366-
> 2. [Pod Priority](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/).
367-
> 3. The pod's resource usage (memory usage + swap usage) relative to requests (memory requests + swap requests).
368-
> ```
369-
370-
That is, in (1) and (3) swap is considered as an additional resource usage and memory request. Step (2) is unchanged.
371-
372-
On nodes with swap disabled, the accessible swap will equal to zero and pods won't be able to use swap,
373-
hence the eviction manager will behave the same as before.
374-
375-
[1] https://github.com/kubernetes/kubernetes/blob/d8093cc40394b8e25a864576fe6a38306730d3cb/pkg/kubelet/eviction/helpers.go#L684
376-
[2] https://github.com/kubernetes/kubernetes/blob/d8093cc40394b8e25a864576fe6a38306730d3cb/pkg/kubelet/eviction/helpers.go#L703
377-
[3] https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#pod-selection-for-kubelet-eviction
378-
379318
### User Stories
380319

381320
#### Improved Node Stability
@@ -495,13 +434,14 @@ to not permit the use of swap by setting `memory-swap` equal to `limit`.
495434

496435
This feature was created so that we iterate on adding swap to Kubernetes.
497436
Due to this, pods will not be able to request swap memory directly nor explicitly for the current implementation.
498-
To make swap more useful for workloads, we acknowledge the need for proper APIs for swap to make it customizable and flexible.
437+
In addition, as stated in the KEP's summary, issues like evictions and scheduling won't be addressed in this KEP.
438+
To make swap more useful for workloads, we acknowledge the need for proper APIs for swap to make it customizable and flexible,
439+
eviction and scheduling aware.
499440

500441
For example, we're considering the following features for future KEPs:
501-
- Swap should be opt-in and opt-out at the workload level.
502-
- Customization of swap limit calculation for workloads.
503-
- Eviction Manager to be more flexible in regards to swap limits.
504-
- Eviction Manager should look at more advanced ways of determining swap pressure (PSI for example).
442+
- [KEP-5359] Pod-Level Swap Control: https://github.com/kubernetes/enhancements/issues/5359.
443+
- [KEP-5424] Swap-Aware Scheduling: https://github.com/kubernetes/enhancements/issues/5424.
444+
- [KEP-5433] Swap-Aware Evictions: https://github.com/kubernetes/enhancements/issues/5433.
505445

506446
### Risks and Mitigations
507447

@@ -548,9 +488,6 @@ This can cause problems where workloads can use up all swap.
548488
If all swap is used up on a node, it can make the node go unhealthy.
549489
To avoid exhausting swap on a node, `UnlimitedSwap` was dropped from the API in beta2.
550490

551-
It was determined that the eviction manager should still be able to protect the node in case of swap memory pressure.
552-
In this case, we will teach the eviction manager to be aware of swap as a resource to avoid exhausting swap resource.
553-
554491
#### Security risk
555492

556493
Enabling swap on a system without encryption poses a security risk, as critical information, such as Kubernetes secrets, may be swapped out to the disk. If an unauthorized individual gains access to the disk, they could potentially obtain these secrets. To mitigate this risk, it is recommended to use encrypted swap. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general OS configuration concern and should be addressed at that level. Nevertheless, it is essential to provide documentation that warns users of this potential issue, ensuring they are aware of the potential security implications and can take appropriate steps to safeguard their system.
@@ -600,7 +537,6 @@ We summarize the implementation plan as following:
600537
the CRI on the amount of swap to allocate to each container. The container
601538
runtime will then write the swap settings to the container level cgroup.
602539
1. Add node stats to report swap usage.
603-
1. Enhance eviction manager to protect against swap memory running out.
604540

605541
### Enabling swap as an end user
606542

@@ -877,14 +813,14 @@ cgroup knobs are validated to be defined as expected with no real memory stress
877813

878814
For beta 3:
879815

880-
- We want e2e tests that can confirm that eviction will take in account swap usage
881816
- Add a lane dedicated for swap testing, including stress tests and other tests that might be disruptive and intensive.
882817
These lanes are called "swap-conformance", and are (and should remain) consistently green:
883818
- [kubelet-swap-conformance-fedora-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-swap-conformance-fedora-serial): Green.
884819
- [kubelet-swap-conformance-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-swap-conformance-ubuntu-serial): Green.
885820

886821
For GA:
887822

823+
- Run memory eviction tests as part of the swap-conformance lanes.
888824
- Ensure that all e2e tests, especially swap-conformance tests, are consistently green.
889825

890826
### Graduation Criteria

keps/sig-node/2400-node-swap/kep.yaml

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ authors:
1010
owning-sig: sig-node
1111
participating-sigs:
1212
- sig-node
13-
status: implementable
13+
status: implemented
1414
creation-date: 2021-04-06
1515
reviewers:
1616
- "@anguslees"
