Commit c103abd

Complete PRR questionaire
Signed-off-by: Itamar Holder <[email protected]>
1 parent 3211347 commit c103abd

keps/sig-node/2400-node-swap/README.md

Lines changed: 68 additions & 54 deletions
Original file line number | Diff line number | Diff line change
@@ -68,7 +68,6 @@
6868
- [Drawbacks](#drawbacks)
6969
- [Alternatives](#alternatives)
7070
- [Just set <code>--fail-swap-on=false</code>](#just-set---fail-swap-onfalse)
71-
- [Restrict swap usage at the cgroup level](#restrict-swap-usage-at-the-cgroup-level)
7271
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
7372
<!-- /toc -->
7473

@@ -1045,6 +1044,10 @@ automations, so be extremely careful here.
10451044

10461045
No. If the feature flag is enabled, the user must still set
10471046
`--fail-swap-on=false` to adjust the default behaviour.
1047+
In addition, since the default "swap behavior" is "NoSwap",
1048+
by default containers would not be able to access swap. Instead,
1049+
the administrator would need to set a non-default behavior in order
1050+
for swap to be accessible.
10481051

10491052
A node must have swap provisioned and available for this feature to work. If
10501053
there is no swap available, but the feature flag is set to true, there will
@@ -1077,7 +1080,8 @@ for workloads.
10771080

10781081
###### What happens if we reenable the feature if it was previously rolled back?
10791082

1080-
N/A
1083+
As described above, swap can be turned on and off, although kubelet would need to be
1084+
restarted.
10811085

10821086
###### Are there any tests for feature enablement/disablement?
10831087

@@ -1088,8 +1092,18 @@ with and without the feature, are necessary. At the very least, think about
10881092
conversion tests if API types are being modified.
10891093
-->
10901094

1091-
N/A. This should be tested separately for scenarios with the flag enabled and
1092-
disabled.
1095+
There are extensive tests to ensure that the swap feature works as expected.
1096+
1097+
Unit tests are in place to test that this feature operates as expected with
1098+
cgroup v1/v2, the feature gate being on/off, and different swap behaviors defined.
1099+
1100+
In addition, node e2e tests are added and run as part of the node-conformance
1101+
suite. These tests ensure that the underlying cgroup knobs are being configured
1102+
as expected.
1103+
1104+
Furthermore, "swap-conformance" periodic lanes have been introduced for the purpose of
1105+
testing swap in a stressed environment. These tests ensure that swap kicks in when
1106+
expected, under stress at both the node level and the container level.
10931107

10941108
### Rollout, Upgrade and Rollback Planning
10951109

@@ -1155,9 +1169,8 @@ This section must be completed when targeting beta to a release.
11551169

11561170
###### How can someone using this feature know that it is working for their instance?
11571171

1158-
See #swap-metrics
1159-
1160-
1. Kubelet stats API will be extended to show swap usage details.
1172+
See #swap-metrics: swap metrics are available through both the Summary API (/stats/summary) and Prometheus (/metrics/resource),
1173+
which report whether and how much swap is utilized at the node, pod and container level.
11611174

11621175
###### How can an operator determine if the feature is in use by workloads?
11631176

@@ -1167,6 +1180,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
11671180
logs or events for this purpose.
11681181
-->
11691182

1183+
See #swap-metrics: swap metrics are available through both the Summary API (/stats/summary) and Prometheus (/metrics/resource),
1184+
which report whether and how much swap is utilized at the node, pod and container level.
1185+
11701186
KubeletConfiguration has set `failSwapOn: false`.
11711187

11721188
The prometheus `node_exporter` will also export stats on swap memory
@@ -1178,19 +1194,22 @@ utilization.
11781194
Pick one more of these and delete the rest.
11791195
-->
11801196

1181-
TBD. We will determine a set of metrics as a requirement for beta graduation.
1182-
We will need more production data; there is not a single metric or set of
1183-
metrics that can be used to generally quantify node performance.
1184-
1185-
This section to be updated before the feature can be marked as graduated, and
1186-
to be worked on during 1.23 development.
1187-
1188-
We will also add swap memory utilization to the Kubelet stats API, to provide a means of monitoring this beyond cadvisor Prometheus stats.
1189-
1190-
- [ ] Metrics
1191-
- Metric name:
1192-
- [Optional] Aggregation method:
1193-
- Components exposing the metric:
1197+
See #swap-metrics: swap metrics are available through both the Summary API (/stats/summary) and Prometheus (/metrics/resource),
1198+
which report whether and how much swap is utilized at the node, pod and container level.
1199+
1200+
- [X] Metrics
1201+
- Metric names:
1202+
- `container_swap_usage_bytes`
1203+
- `pod_swap_usage_bytes`
1204+
- `node_swap_usage_bytes`
1205+
Components exposing the metric: `/metrics/resource` endpoint
1206+
- Metric names:
1207+
- `node.swap.swapUsageBytes`
1208+
- `node.swap.swapAvailableBytes`
1209+
- `node.systemContainers.swap.swapUsageBytes`
1210+
- `pods[i].swap.swapUsageBytes`
1211+
- `pods[i].containers[i].swap.swapUsageBytes`
1212+
Components exposing the metric: `/stats/summary` endpoint
11941213
- [ ] Other (treat as last resort)
11951214
- Details:
11961215

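As a rough illustration of consuming the `/metrics/resource` metrics listed in this hunk, the Python sketch below filters the three swap metric names out of Prometheus text exposition output. The sample payload, its label names and values, and the `parse_swap_samples` helper are assumptions made for this example; they are not output captured from a real kubelet.

```python
import re

# Metric names taken from the KEP text above.
SWAP_METRICS = {
    "container_swap_usage_bytes",
    "pod_swap_usage_bytes",
    "node_swap_usage_bytes",
}

# name, optional {labels}, value, optional timestamp
_SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_swap_samples(text):
    """Return (metric_name, labels, value) tuples for swap metrics only."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None or match.group(1) not in SWAP_METRICS:
            continue
        labels = dict(_LABEL_RE.findall(match.group(2) or ""))
        samples.append((match.group(1), labels, float(match.group(3))))
    return samples


# Hypothetical payload: label names and numbers are illustrative only.
SAMPLE = """\
# TYPE node_swap_usage_bytes gauge
node_swap_usage_bytes 1048576 1700000000000
pod_swap_usage_bytes{namespace="default",pod="web-0"} 4096 1700000000000
container_swap_usage_bytes{container="app",namespace="default",pod="web-0"} 4096 1700000000000
node_memory_working_set_bytes 123456 1700000000000
"""

if __name__ == "__main__":
    for name, labels, value in parse_swap_samples(SAMPLE):
        print(name, labels, value)
```

In practice a Prometheus server would scrape and store these series; the manual parser is only meant to show which series an operator would look for.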
@@ -1206,7 +1225,14 @@ high level (needs more precise definitions) those may be things like:
12061225
- 99,9% of /health requests per day finish with 200 code
12071226
-->
12081227

1209-
N/A
1228+
Swap is managed by the kernel and depends on many factors and configurations
1229+
that are outside of kubelet's reach, such as the nature of the workloads running on the node,
1230+
swap capacity, memory capacity and other distro-specific configurations. However, generally:
1231+
1232+
- Nodes with swap enabled -> `node.swap.swapAvailableBytes` should be non-zero.
1233+
- Nodes with memory pressure -> `node.swap.swapUsageBytes` should be non-zero.
1234+
- Containers that reach their memory limit threshold -> `pods[i].containers[i].swap.swapUsageBytes` should be non-zero.
1235+
- Pods with containers that reach their memory limit threshold -> `pods[i].swap.swapUsageBytes` should be non-zero.
12101236

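The expectations in the bullets above could be checked against a `/stats/summary` response roughly as follows. This is a minimal Python sketch: the swap field paths come from the KEP text, while `podRef.name`, the container `name` field, the sample document and the `swap_observations` helper are assumptions introduced for illustration.

```python
import json


def swap_observations(summary):
    """Summarize swap availability and usage from a Summary API document."""
    node_swap = summary.get("node", {}).get("swap") or {}
    report = {
        "node_swap_available": node_swap.get("swapAvailableBytes", 0) > 0,
        "node_swap_in_use": node_swap.get("swapUsageBytes", 0) > 0,
        "pods_using_swap": [],
        "containers_using_swap": [],
    }
    for pod in summary.get("pods", []):
        # podRef.name is assumed here; fall back to a placeholder if absent.
        pod_name = pod.get("podRef", {}).get("name", "<unknown>")
        if (pod.get("swap") or {}).get("swapUsageBytes", 0) > 0:
            report["pods_using_swap"].append(pod_name)
        for container in pod.get("containers", []):
            if (container.get("swap") or {}).get("swapUsageBytes", 0) > 0:
                report["containers_using_swap"].append(
                    (pod_name, container.get("name", "<unknown>"))
                )
    return report


# Hypothetical Summary API excerpt; all values are illustrative.
SAMPLE_SUMMARY = """
{
  "node": {"swap": {"swapAvailableBytes": 2147483648, "swapUsageBytes": 8192}},
  "pods": [
    {"podRef": {"name": "web-0"},
     "swap": {"swapUsageBytes": 8192},
     "containers": [
       {"name": "app", "swap": {"swapUsageBytes": 8192}},
       {"name": "sidecar", "swap": {"swapUsageBytes": 0}}
     ]},
    {"podRef": {"name": "idle-0"},
     "swap": {"swapUsageBytes": 0},
     "containers": [{"name": "main", "swap": {"swapUsageBytes": 0}}]}
  ]
}
"""

if __name__ == "__main__":
    print(swap_observations(json.loads(SAMPLE_SUMMARY)))
```

A non-zero value only shows that swap was used at sampling time; as the text notes, whether it should be in use depends on factors outside the kubelet's control.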
12111237
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
12121238

@@ -1321,9 +1347,11 @@ Think about adding additional work or introducing new steps in between
13211347
-->
13221348

13231349
Yes, enabling swap can affect performance of other critical daemons on the system.
1324-
Any scenario where swap memory gets utilized is a result of system running out of physical RAM.
1350+
Any scenario where swap memory gets utilized is a result of the system running out of physical RAM,
1351+
or a container reaching its memory limit threshold.
13251352
Hence, to maintain the SLIs/SLOs of critical daemons on the node we highly recommend disabling swap for the system.slice
1326-
along with reserving adequate enough system reserved memory.
1353+
along with reserving adequate system-reserved memory, giving I/O latency precedence to the system.slice, and more.
1354+
See #best practices for more info.
13271355

13281356
The SLI that could potentially be impacted is [pod startup latency](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md).
13291357
If the container runtime or kubelet are performing slower than expected, pod startup latency would be impacted.
@@ -1412,41 +1440,24 @@ When swap is enabled, particularly for workloads, the kubelet’s resource
14121440
accounting may become much less accurate. This may make cluster administration
14131441
more difficult and less predictable.
14141442

1415-
Currently, there exists an unsupported workaround, which is setting the kubelet
1416-
flag `--fail-swap-on` to false.
1443+
In general, swap is less predictable and might cause performance degradation.
1444+
It also might be hard in certain scenarios to understand why certain workloads
1445+
are the chosen candidates for swapping, which could occur for reasons external
1446+
to the workload.
1447+
1448+
In addition, containers with memory limits would be killed less frequently
1449+
since with swap enabled the kernel can usually reclaim a lot more memory.
1450+
While this can help to avoid crashes, it could also "hide a problem" of a container
1451+
reaching its memory limits.
14171452

14181453
## Alternatives
14191454

14201455
### Just set `--fail-swap-on=false`
14211456

1422-
This is insufficient for most use cases because there is inconsistent control
1423-
over how swap will be used by various container runtimes. Dockershim currently
1424-
sets swap available for workloads to 0. The CRI does not restrict it at all.
1425-
This inconsistency makes it difficult or impossible to use swap in production,
1426-
particularly if a user wants to restrict workloads from using swap when using
1427-
the CRI rather than dockershim.
1428-
1429-
This is also a breaking change.
1430-
Users have used --fail-swap-on=false to allow for kubernetes to run
1431-
on a swap enabled node.
1432-
1433-
### Restrict swap usage at the cgroup level
1434-
1435-
Setting a swap limit at the cgroup level would allow us to restrict the usage
1436-
of swap on a pod-level, rather than container-level basis.
1437-
1438-
For alpha, we are opting for the container-level basis to simplify the
1439-
implementation (as the container runtimes already support configuration of swap
1440-
with the `memory-swap-limit` parameter). This will also provide the necessary
1441-
plumbing for container-level accounting of swap, if that is proposed in the
1442-
future.
1443-
1444-
In beta, we may want to revisit this.
1445-
1446-
See the [Pod Resource Management design proposal] for more background on the
1447-
cgroup limits the kubelet currently sets based on each QoS class.
1448-
1449-
[Pod Resource Management design proposal]: https://github.com/kubernetes/design-proposals-archive/blob/master/node/pod-resource-management.md#pod-level-cgroups
1457+
When `--fail-swap-on=false` is provided to the kubelet but swap is not otherwise configured,
1458+
it is guaranteed that, by default, no Kubernetes workloads would
1459+
be able to utilize swap. However, everything outside of kubelet's reach
1460+
(e.g. system daemons, kubelet, etc.) would be able to use swap.
14501461

14511462
## Infrastructure Needed (Optional)
14521463

@@ -1456,4 +1467,7 @@ new subproject, repos requested, or GitHub details. Listing these here allows a
14561467
SIG to get the process for these resources started right away.
14571468
-->
14581469

1459-
We may need Linux VM images built with swap partitions for e2e testing in CI.
1470+
Added the "swap-conformance" lane for extensive swap testing under node pressure: [kubelet-swap-conformance-fedora-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-swap-conformance-fedora-serial),
1471+
[kubelet-swap-conformance-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-swap-conformance-ubuntu-serial).
1472+
1473+
See #e2e tests above for more information
