@@ -1045,6 +1044,10 @@ automations, so be extremely careful here.
 No. If the feature flag is enabled, the user must still set
 `--fail-swap-on=false` to adjust the default behaviour.
+In addition, since the default "swap behavior" is "NoSwap",
+by default containers would not be able to access swap. Instead,
+the administrator would need to set a non-default behavior in order
+for swap to be accessible.
 
 A node must have swap provisioned and available for this feature to work. If
 there is no swap available, but the feature flag is set to true, there will
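The non-default swap behavior referred to above is `LimitedSwap` (the only behavior besides `NoSwap` in the beta design), under which a container's swap limit is proportional to its memory request. A minimal sketch of that proportional calculation, assuming the formula from the full KEP; the helper name is illustrative:

```python
def limited_swap_bytes(container_memory_request: int,
                       node_memory_capacity: int,
                       node_swap_capacity: int) -> int:
    """Sketch of the LimitedSwap calculation: a container may use swap
    in proportion to its share of the node's physical memory."""
    return int(container_memory_request / node_memory_capacity * node_swap_capacity)

# A container requesting 1 GiB on a 4 GiB node with 2 GiB of swap:
print(limited_swap_bytes(1 << 30, 4 << 30, 2 << 30))  # 536870912 (512 MiB)
```

With cgroup v2, a value computed this way would end up as the container's swap limit knob.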
@@ -1077,7 +1080,8 @@ for workloads.
 
 ###### What happens if we reenable the feature if it was previously rolled back?
 
-N/A
+As described above, swap can be turned on and off, although kubelet would need to be
+restarted.
 
 ###### Are there any tests for feature enablement/disablement?
@@ -1088,8 +1092,18 @@ with and without the feature, are necessary. At the very least, think about
 conversion tests if API types are being modified.
 -->
 
-N/A. This should be tested separately for scenarios with the flag enabled and
-disabled.
+There are extensive tests to ensure that the swap feature works as expected.
+
+Unit tests are in place to test that this feature operates as expected with
+cgroup v1/v2, the feature gate being on/off, and different swap behaviors defined.
+
+In addition, node e2e tests are added and run as part of the node-conformance
+suite. These tests ensure that the underlying cgroup knobs are being configured
+as expected.
+
+Furthermore, "swap-conformance" periodic lanes have been introduced for the purpose
+of testing swap in a stressed environment. These tests ensure that swap kicks in
+when expected, while stressing at both the node and container level.
 
 ### Rollout, Upgrade and Rollback Planning
@@ -1155,9 +1169,8 @@ This section must be completed when targeting beta to a release.
 
 ###### How can someone using this feature know that it is working for their instance?
 
-See #swap-metrics
-
-1. Kubelet stats API will be extended to show swap usage details.
+See #swap-metrics: available via both the Summary API (/stats/summary) and Prometheus (/metrics/resource),
+which show whether and how swap is utilized at the node, pod and container level.
 
 ###### How can an operator determine if the feature is in use by workloads?
@@ -1167,6 +1180,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
 
+See #swap-metrics: available via both the Summary API (/stats/summary) and Prometheus (/metrics/resource),
+which show whether and how swap is utilized at the node, pod and container level.
+
 KubeletConfiguration has set `failOnSwap: false`.
 
 The prometheus `node_exporter` will also export stats on swap memory
@@ -1178,19 +1194,22 @@ utilization.
 Pick one more of these and delete the rest.
 -->
 
-TBD. We will determine a set of metrics as a requirement for beta graduation.
-We will need more production data; there is not a single metric or set of
-metrics that can be used to generally quantify node performance.
-
-This section to be updated before the feature can be marked as graduated, and
-to be worked on during 1.23 development.
-
-We will also add swap memory utilization to the Kubelet stats API, to provide a means of monitoring this beyond cadvisor Prometheus stats.
-
-- [ ] Metrics
-  - Metric name:
-  - [Optional] Aggregation method:
-  - Components exposing the metric:
+See #swap-metrics: available via both the Summary API (/stats/summary) and Prometheus (/metrics/resource),
+which show whether and how swap is utilized at the node, pod and container level.
+
+- [x] Metrics
+  - Metric names:
+    - `container_swap_usage_bytes`
+    - `pod_swap_usage_bytes`
+    - `node_swap_usage_bytes`
+  - Components exposing the metric: `/metrics/resource` endpoint
+  - Metric names:
+    - `node.swap.swapUsageBytes`
+    - `node.swap.swapAvailableBytes`
+    - `node.systemContainers.swap.swapUsageBytes`
+    - `pods[i].swap.swapUsageBytes`
+    - `pods[i].containers[i].swap.swapUsageBytes`
+  - Components exposing the metric: `/stats/summary` endpoint
 - [ ] Other (treat as last resort)
   - Details:
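As an illustration of how the `/stats/summary` fields above might be consumed, the following parses a hand-written sample payload. The field layout mirrors the metric names listed in the hunk, but the pod names and values in the sample are invented:

```python
import json

# Hand-written sample of a /stats/summary response (values are invented;
# the field layout follows the summary API metric names listed above).
summary = json.loads("""
{
  "node": {"swap": {"swapUsageBytes": 1048576, "swapAvailableBytes": 2147483648}},
  "pods": [
    {"podRef": {"name": "web"},
     "swap": {"swapUsageBytes": 524288},
     "containers": [{"name": "app", "swap": {"swapUsageBytes": 524288}}]},
    {"podRef": {"name": "idle"},
     "swap": {"swapUsageBytes": 0},
     "containers": [{"name": "app", "swap": {"swapUsageBytes": 0}}]}
  ]
}
""")

def pods_using_swap(s: dict) -> list[str]:
    """Names of pods whose swapUsageBytes is non-zero."""
    return [p["podRef"]["name"] for p in s.get("pods", [])
            if p.get("swap", {}).get("swapUsageBytes", 0) > 0]

print(pods_using_swap(summary))  # ['web']
```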
@@ -1206,7 +1225,14 @@ high level (needs more precise definitions) those may be things like:
 - 99,9% of /health requests per day finish with 200 code
 -->
 
-N/A
+Swap is managed by the kernel and depends on many factors and configurations
+that are outside of kubelet's reach, like the nature of the workloads running on the node,
+swap capacity, memory capacity and other distro-specific configurations. However, generally:
+
+- Nodes with swap enabled -> `node.swap.swapAvailableBytes` should be non-zero.
+- Nodes with memory pressure -> `node.swap.swapUsageBytes` should be non-zero.
+- Containers that reach their memory limit threshold -> `pods[i].containers[i].swap.swapUsageBytes` should be non-zero.
+- Pods with containers that reach their memory limit threshold -> `pods[i].swap.swapUsageBytes` should be non-zero.
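The heuristics above amount to simple non-zero checks on those metric values; a minimal sketch with invented sample numbers:

```python
def swap_health_signals(swap_available_bytes: int, swap_usage_bytes: int) -> dict[str, bool]:
    """Evaluate the rough, best-effort node-level signals described above."""
    return {
        "swap_provisioned": swap_available_bytes > 0,  # swap enabled on the node
        "swap_in_use": swap_usage_bytes > 0,           # expected under memory pressure
    }

# A node with 2 GiB of swap provisioned but no current usage:
signals = swap_health_signals(swap_available_bytes=2 << 30, swap_usage_bytes=0)
print(signals)  # {'swap_provisioned': True, 'swap_in_use': False}
```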
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -1321,9 +1347,11 @@ Think about adding additional work or introducing new steps in between
 -->
 
 Yes, enabling swap can affect performance of other critical daemons on the system.
-Any scenario where swap memory gets utilized is a result of system running out of physical RAM.
+Any scenario where swap memory gets utilized is a result of the system running out of physical RAM,
+or of a container reaching its memory limit threshold.
 Hence, to maintain the SLIs/SLOs of critical daemons on the node, we highly recommend disabling swap for the system.slice,
-along with reserving adequate enough system reserved memory.
+along with reserving adequate system-reserved memory, giving I/O latency precedence to the system.slice, and more.
+See #best-practices for more info.
 
 The SLI that could potentially be impacted is [pod startup latency](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md).
 If the container runtime or kubelet are performing slower than expected, pod startup latency would be impacted.
@@ -1412,41 +1440,24 @@ When swap is enabled, particularly for workloads, the kubelet’s resource
 accounting may become much less accurate. This may make cluster administration
 more difficult and less predictable.
 
-Currently, there exists an unsupported workaround, which is setting the kubelet
-flag `--fail-swap-on` to false.
+In general, swap is less predictable and might cause performance degradation.
+It also might be hard in certain scenarios to understand why certain workloads
+are the chosen candidates for swapping, which could occur for reasons external
+to the workload.
+
+In addition, containers with memory limits would be killed less frequently,
+since with swap enabled the kernel can usually reclaim a lot more memory.
+While this can help to avoid crashes, it could also "hide a problem" of a container
+reaching its memory limits.
 
 ## Alternatives
 
 ### Just set `--fail-swap-on=false`
-This is insufficient for most use cases because there is inconsistent control
-over how swap will be used by various container runtimes. Dockershim currently
-sets swap available for workloads to 0. The CRI does not restrict it at all.
-This inconsistency makes it difficult or impossible to use swap in production,
-particularly if a user wants to restrict workloads from using swap when using
-the CRI rather than dockershim.
-
-This is also a breaking change.
-Users have used `--fail-swap-on=false` to allow for kubernetes to run
-on a swap-enabled node.
-
-### Restrict swap usage at the cgroup level
-
-Setting a swap limit at the cgroup level would allow us to restrict the usage
-of swap on a pod-level, rather than container-level, basis.
-
-For alpha, we are opting for the container-level basis to simplify the
-implementation (as the container runtimes already support configuration of swap
-with the `memory-swap-limit` parameter). This will also provide the necessary
-plumbing for container-level accounting of swap, if that is proposed in the
-future.
-
-In beta, we may want to revisit this.
-
-See the [Pod Resource Management design proposal] for more background on the
-cgroup limits the kubelet currently sets based on each QoS class.
+When `--fail-swap-on=false` is provided to Kubelet but swap is not configured
+otherwise, it is guaranteed that, by default, no Kubernetes workloads would
+be able to utilize swap. However, everything outside of kubelet's reach
+(e.g. system daemons, the kubelet itself, etc.) would be able to use swap.
 
 ## Infrastructure Needed (Optional)
@@ -1456,4 +1467,7 @@ new subproject, repos requested, or GitHub details. Listing these here allows a
 SIG to get the process for these resources started right away.
 -->
 
-We may need Linux VM images built with swap partitions for e2e testing in CI.
+Added the "swap-conformance" lane for extensive swap testing under node pressure: [kubelet-swap-conformance-fedora-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-swap-conformance-fedora-serial),