Commit c64eaed

Merge pull request #5413 from haircommander/userns-beta3
KEP 127: add a metric, describe an error kubelet will return, and target one more beta
2 parents 3de80fb + 9eead1a commit c64eaed

File tree

2 files changed: +30 -146 lines changed

keps/sig-node/127-user-namespaces/README.md

Lines changed: 23 additions & 144 deletions
Original file line number | Diff line number | Diff line change
@@ -25,6 +25,7 @@
2525
- [Example without idmap mounts](#example-without-idmap-mounts)
2626
- [Example with idmap mounts](#example-with-idmap-mounts)
2727
- [Regarding the previous implementation for volumes](#regarding-the-previous-implementation-for-volumes)
28+
- [Non-conformant volume types](#non-conformant-volume-types)
2829
- [Pod Security Standards (PSS) integration](#pod-security-standards-pss-integration)
2930
- [Unresolved](#unresolved)
3031
- [Test Plan](#test-plan)
@@ -476,6 +477,11 @@ components that implement the interface.
476477

477478
[kubeletVolumeHost-interface]: https://github.com/kubernetes/kubernetes/blob/36450ee422d57d53a3edaf960f86b356578fe996/pkg/volume/plugins.go#L322
478479

480+
#### Non-conformant volume types
481+
482+
Some volume types, such as [raw block devices](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#volumedevice-v1-core), do not support idmapped mounts.
483+
If a pod runs with such a volume type and a user namespace, the kubelet will fail to create the pod.
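
The condition above can be illustrated with a small Python sketch. It is not the kubelet's implementation: it only inspects a pod spec (as a plain dict) for `hostUsers: false` combined with containers that declare `volumeDevices`, the field used to attach raw block devices. The helper names are hypothetical, and treating raw block devices as the only non-conformant type is an assumption made for the example.

```python
def uses_user_namespace(pod_spec: dict) -> bool:
    """True when the pod opts out of the host user namespace (hostUsers: false)."""
    return pod_spec.get("hostUsers", True) is False


def raw_block_device_names(pod_spec: dict) -> list:
    """Collect volumeDevices names from init containers and regular containers."""
    names = []
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key, []):
            for device in container.get("volumeDevices", []):
                names.append(device["name"])
    return names


def expected_to_fail(pod_spec: dict) -> bool:
    """Assumption: only raw block devices count as non-conformant here."""
    return uses_user_namespace(pod_spec) and bool(raw_block_device_names(pod_spec))


# Hypothetical pod spec: a user namespace plus one raw block device.
pod = {
    "hostUsers": False,
    "containers": [
        {"name": "app", "volumeDevices": [{"name": "data", "devicePath": "/dev/xvda"}]}
    ],
}
print(expected_to_fail(pod))
```

A check like this could run in a linter or admission-time tool to warn users before the kubelet rejects the pod; the kubelet's error remains the authoritative signal.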
484+
479485
### Pod Security Standards (PSS) integration
480486

481487
[Pod Security Standards](https://k8s.io/docs/concepts/security/pod-security-standards)
@@ -493,6 +499,8 @@ For `baseline` and `restricted` namespaces, if a pod has `hostUsers` set to fals
493499
For `baseline` namespaces, pods with `hostUsers` set to false can set any value for the `capabilities.add` field,
494500
whereas normally in a `baseline` namespace a pod is restricted to adding certain capabilities.
495501

502+
Finally, for `restricted` namespaces, `hostUsers` will be required to be set to `false`.
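
As a rough illustration of the two rules stated above, the following Python sketch evaluates a pod spec (as a plain dict) against them. It is not the Pod Security admission implementation: the function name is hypothetical, the set of capabilities normally allowed in `baseline` is passed in by the caller rather than asserted here, and every other PSS check is deliberately left out.

```python
def userns_pss_violations(pod_spec: dict, level: str, baseline_allowed_caps: set) -> list:
    """Return human-readable violations for the user-namespace-related rules only."""
    in_userns = pod_spec.get("hostUsers", True) is False
    problems = []

    # Rule from the text: restricted namespaces require hostUsers to be false.
    if level == "restricted" and not in_userns:
        problems.append("hostUsers must be set to false")

    # Rule from the text: in baseline, capabilities.add is unrestricted only
    # when the pod runs in a user namespace.
    if level == "baseline" and not in_userns:
        for container in pod_spec.get("containers", []):
            caps = container.get("securityContext", {}).get("capabilities", {})
            for cap in caps.get("add", []):
                if cap not in baseline_allowed_caps:
                    problems.append(f"{container['name']}: capability {cap} is not allowed")
    return problems


# Hypothetical inputs for illustration.
allowed = {"CHOWN", "NET_BIND_SERVICE"}
container = {"name": "app", "securityContext": {"capabilities": {"add": ["SYS_ADMIN"]}}}
print(userns_pss_violations({"containers": [container]}, "baseline", allowed))
print(userns_pss_violations({"hostUsers": False, "containers": [container]}, "baseline", allowed))
```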
503+
496504
The validation for capabilities can be relaxed in a `baseline` pod because capabilities
497505
are user namespaced in the linux kernel, and any pod does not have a seccomp profile (as baseline
498506
pods may not be required to, depending on the kubelet's `seccompDefault` configuration field)
@@ -892,14 +900,13 @@ When a pod hits this error returned by the kubelet, the status in `kubectl` is s
892900
Warning FailedCreatePodSandBox 12s (x23 over 5m6s) kubelet Failed to create pod sandbox: user namespaces is not supported by the runtime
893901
```
894902

895-
The following kubelet metrics are useful to check:
896-
- `kubelet_running_pods`: Shows the actual number of pods running
897-
- `kubelet_desired_pods`: The number of pods the kubelet is _trying_ to run
903+
The following kubelet metrics will be added:
904+
- `started_user_namespaced_pods_total`: The number of pods the kubelet has attempted to create with a user namespace.
905+
- `started_user_namespaced_pods_errors_total`: The number of pods with a user namespace that failed to be created.
898906

899-
If these metrics are very different, it means there are desired pods that can't be set to running.
900-
If that is the case, checking the pod events to see if they are failing for user namespaces reasons
901-
(like the errors shown in this KEP) is advised, in which case it is recommended to rollback or
902-
disable the feature gate.
907+
If the kubelet metric `started_user_namespaced_pods_errors_total` has a value close to `started_user_namespaced_pods_total`,
908+
it means most pods started with a user namespace are failing. If that is the case, check the pod events to see whether they are failing for user namespace reasons
909+
(like the errors shown in this KEP); if so, it is recommended to roll back or disable the feature gate.
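
The comparison of the two counters can be sketched in Python as follows. This is a minimal example under stated assumptions: it reads Prometheus text exposition output that has already been fetched from the kubelet, it uses the metric names exactly as written in this KEP (the exposed names may carry a component prefix, so a hypothetical `prefix` parameter is provided), and it assumes samples have no trailing timestamp.

```python
TOTAL = "started_user_namespaced_pods_total"
ERRORS = "started_user_namespaced_pods_errors_total"


def read_counter(metrics_text: str, name: str):
    """Sum every sample of `name`; return None when the metric is absent.

    Assumes no trailing timestamp, so the value is the last space-separated field.
    """
    value = None
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        head, _, raw = line.rpartition(" ")
        if head.split("{", 1)[0] == name:  # drop any label set before comparing
            value = (value or 0.0) + float(raw)
    return value


def error_ratio(metrics_text: str, prefix: str = ""):
    """Fraction of user-namespaced pod starts that failed, or None if none started."""
    total = read_counter(metrics_text, prefix + TOTAL)
    errors = read_counter(metrics_text, prefix + ERRORS)
    if not total:
        return None
    return (errors or 0.0) / total


# Hypothetical scrape output for illustration.
sample = """\
# TYPE started_user_namespaced_pods_total counter
started_user_namespaced_pods_total 10
# TYPE started_user_namespaced_pods_errors_total counter
started_user_namespaced_pods_errors_total 9
"""
print(error_ratio(sample))
```

A ratio near 1 corresponds to the "close to" condition in the text; the threshold at which to alert is left to the operator.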
903910

904911
<!--
905912
What signals should users be paying attention to when the feature is young
@@ -975,9 +982,7 @@ Recall that end users cannot usually observe component logs or access metrics.
975982
- Condition name:
976983
- Other field:
977984
- [x] Other (treat as last resort)
978-
- Details: check pods with pod.spec.hostUsers field set to false, and see if they are in RUNNING
979-
state. Exec into a container and run `cat /proc/self/uid_map` to verify that the mappings are different
980-
than the mappings on the host.
985+
- Details: the `started_user_namespaced_pods_total` metric is greater than `started_user_namespaced_pods_errors_total` for a given node.
981986

982987
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
983988

@@ -1018,16 +1023,8 @@ Pick one more of these and delete the rest.
10181023

10191024
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
10201025

1021-
No.
1022-
1023-
This feature is using yet another namespace when creating a pod. If the pod creation fails (by
1024-
an error on the kubelet or returned by the container runtime), a clear error is returned to the
1025-
user. The feedback on this is very direct to the user actions.
1026-
1027-
A metric like "errors returned in pods with user namespaces enabled" can be very noisy, as the error
1028-
can be completely unrelated (image pull secret errors, configmap referenced and not defined, any
1029-
other container runtime error, etc.). We can't see any metric that can be helpful, as the user has a
1030-
very direct feedback already.
1026+
Yes, two metrics will be added: `started_user_namespaced_pods_total` and `started_user_namespaced_pods_errors_total`.
1027+
If `started_user_namespaced_pods_errors_total` equals `started_user_namespaced_pods_total` for a given node, then user namespace creation is failing on that node.
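
The per-node rule can be sketched in Python like this. The function name and input shape are hypothetical: the example assumes the two counter values have already been collected per node. The `total > 0` guard is an addition for the example, so that nodes that never started a user-namespaced pod are not flagged.

```python
def nodes_failing_userns(counters_by_node: dict) -> list:
    """Return nodes where every user-namespaced pod start has failed.

    `counters_by_node` maps a node name to a (total, errors) tuple.
    """
    failing = []
    for node, (total, errors) in sorted(counters_by_node.items()):
        if total > 0 and errors == total:
            failing.append(node)
    return failing


# Hypothetical counter values for three nodes.
counters = {
    "node-a": (12, 0),  # healthy
    "node-b": (7, 7),   # every attempt failed
    "node-c": (0, 0),   # no user-namespaced pods started yet
}
print(nodes_failing_userns(counters))
```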
10311028

10321029
<!--
10331030
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
@@ -1072,64 +1069,22 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
10721069

10731070
### Scalability
10741071

1075-
<!--
1076-
For alpha, this section is encouraged: reviewers should consider these questions
1077-
and attempt to answer them.
1078-
1079-
For beta, this section is required: reviewers must answer these questions.
1080-
1081-
For GA, this section is required: approvers should be able to confirm the
1082-
previous answers based on experience in the field.
1083-
-->
1084-
10851072
###### Will enabling / using this feature result in any new API calls?
10861073

10871074
No.
10881075

1089-
<!--
1090-
Describe them, providing:
1091-
- API call type (e.g. PATCH pods)
1092-
- estimated throughput
1093-
- originating component(s) (e.g. Kubelet, Feature-X-controller)
1094-
Focusing mostly on:
1095-
- components listing and/or watching resources they didn't before
1096-
- API calls that may be triggered by changes of some Kubernetes resources
1097-
(e.g. update of object X triggers new updates of object Y)
1098-
- periodic API calls to reconcile state (e.g. periodic fetching state,
1099-
heartbeats, leader election, etc.)
1100-
-->
1101-
11021076
###### Will enabling / using this feature result in introducing new API types?
11031077

11041078
No.
11051079

1106-
<!--
1107-
Describe them, providing:
1108-
- API type
1109-
- Supported number of objects per cluster
1110-
- Supported number of objects per namespace (for namespace-scoped objects)
1111-
-->
1112-
11131080
###### Will enabling / using this feature result in any new calls to the cloud provider?
11141081

11151082
No.
1116-
<!--
1117-
Describe them, providing:
1118-
- Which API(s):
1119-
- Estimated increase:
1120-
-->
11211083

11221084
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
11231085

11241086
Yes. The `pod.Spec.HostUsers` field is a bool, so the increase in size should be small.
11251087

1126-
<!--
1127-
Describe them, providing:
1128-
- API type(s):
1129-
- Estimated increase in size: (e.g., new annotation of size 32B)
1130-
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
1131-
-->
1132-
11331088
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
11341089

11351090
Not in any Kubernetes component, it might take more time for the container
@@ -1164,15 +1119,6 @@ The options we have for this plumbing to setup the rootfs:
11641119

11651120
In any case, the kubernetes components do not need any change.
11661121

1167-
<!--
1168-
Look at the [existing SLIs/SLOs].
1169-
1170-
Think about adding additional work or introducing new steps in between
1171-
(e.g. need to do X to start a container), etc. Please describe the details.
1172-
1173-
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
1174-
-->
1175-
11761122
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
11771123

11781124
Not in any kubernetes component.
@@ -1184,27 +1130,8 @@ previous question for more details).
11841130
This is not needed on newer kernels, as they can rely on idmapped mounts for the
11851131
UID/GID shifting (it is just a bind mount).
11861132

1187-
<!--
1188-
Things to keep in mind include: additional in-memory state, additional
1189-
non-trivial computations, excessive access to disks (including increased log
1190-
volume), significant amount of data sent and/or received over network, etc.
1191-
This through this both in small and large cases, again with respect to the
1192-
[supported limits].
1193-
1194-
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
1195-
-->
1196-
11971133
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
11981134

1199-
<!--
1200-
Focus not just on happy cases, but primarily on more pathological cases
1201-
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
1202-
If any of the resources can be exhausted, how this is mitigated with the existing limits
1203-
(e.g. pods per node) or new limits added by this KEP?
1204-
Are there any tests that were run/should be run to understand performance characteristics better
1205-
and validate the declared limits?
1206-
-->
1207-
12081135
The kubelet is splitting the host UID/GID space for different pods, to use for
12091136
their user namespace mapping. The design allows for 65k pods per node, and the
12101137
resource is limited to maxPods per node (currently maxPods defaults to 110, it
@@ -1222,23 +1149,16 @@ appropiately.
12221149

12231150
### Troubleshooting
12241151

1225-
<!--
1226-
This section must be completed when targeting beta to a release.
1227-
1228-
For GA, this section is required: approvers should be able to confirm the
1229-
previous answers based on experience in the field.
1230-
1231-
The Troubleshooting section currently serves the `Playbook` role. We may consider
1232-
splitting it into a dedicated `Playbook` document (potentially with some monitoring
1233-
details). For now, we leave it here.
1234-
-->
1235-
12361152
###### How does this feature react if the API server and/or etcd is unavailable?
12371153

12381154
No changes to current kubelet behaviors. The feature only uses kubelet-local information.
12391155

12401156
###### What are other known failure modes?
12411157

1158+
For all of the following failure modes, two metrics will be added: `started_user_namespaced_pods_total` and `started_user_namespaced_pods_errors_total`.
1159+
For a given node, if the two metrics are equal, then there is a problem creating user namespaces on that node.
1160+
If the admin wants to dig deeper into the reason, they will have to check specific pod statuses.
1161+
12421162
- Some filesystem used by the pod doesn't support idmap mounts on the kernel used.
12431163
- Detection: How can it be detected via metrics? Stated another way:
12441164
how can an operator troubleshoot without logging into a master or worker node?
@@ -1289,8 +1209,7 @@ No changes to current kubelet behaviors. The feature only uses kubelet-local inf
12891209
- Detection: How can it be detected via metrics? Stated another way:
12901210
how can an operator troubleshoot without logging into a master or worker node?
12911211

1292-
Errors are returned on pod creation, directly to the user (visible on the pod events). No
1293-
need to use metrics.
1212+
Errors are returned on pod creation, directly to the user (visible on the pod events).
12941213

12951214
See the pod events, it should contain something like:
12961215

@@ -1322,7 +1241,6 @@ writing to this file.
13221241
how can an operator troubleshoot without logging into a master or worker node?
13231242

13241243
Errors are returned to the operation that failed (like pod creation, visible on the pod events),
1325-
no need to see metrics nor logs.
13261244

13271245
Errors are returned either on:
13281246
* Kubelet initialization: the initialization fails if the feature gate is active and there is a
@@ -1374,19 +1292,6 @@ writing to this file.
13741292

13751293
It is part of the system configuration.
13761294

1377-
<!--
1378-
For each of them, fill in the following information by copying the below template:
1379-
- [Failure mode brief description]
1380-
- Detection: How can it be detected via metrics? Stated another way:
1381-
how can an operator troubleshoot without logging into a master or worker node?
1382-
- Mitigations: What can be done to stop the bleeding, especially for already
1383-
running user workloads?
1384-
- Diagnostics: What are the useful log messages and their required logging
1385-
levels that could help debug the issue?
1386-
Not required until feature graduated to beta.
1387-
- Testing: Are there any tests for failure mode? If not, describe why.
1388-
-->
1389-
13901295
###### What steps should be taken if SLOs are not being met to determine the problem?
13911296

13921297
This KEP doesn't introduce new SLOs and doesn't result in increasing time taken
@@ -1416,24 +1321,10 @@ be the cause of the problem.
14161321
- Kubernetes 1.28: Support for stateful pods, renamed feature gate (alpha)
14171322
- Kubernetes 1.30: Feature went off-by-default beta
14181323
- Kubernetes 1.33: Feature goes on-by-default beta
1419-
1420-
<!--
1421-
Major milestones in the lifecycle of a KEP should be tracked in this section.
1422-
Major milestones might include:
1423-
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
1424-
- the `Proposal` section being merged, signaling agreement on a proposed design
1425-
- the date implementation started
1426-
- the first Kubernetes release where an initial version of the KEP was available
1427-
- the version of Kubernetes where the KEP graduated to general availability
1428-
- when the KEP was retired or superseded
1429-
-->
1324+
- Kubernetes 1.34: Feature adds metrics
14301325

14311326
## Drawbacks
14321327

1433-
<!--
1434-
Why should this KEP _not_ be implemented?
1435-
-->
1436-
14371328
## Alternatives
14381329

14391330
Here is a list of considerations that were raised in PR discussions.
@@ -1530,16 +1421,4 @@ range needs to be used by the kubelet, that can be configured per-node.
15301421

15311422
Therefore, this old concern is now resolved.
15321423

1533-
<!--
1534-
What other approaches did you consider, and why did you rule them out? These do
1535-
not need to be as detailed as the proposal, but should include enough
1536-
information to express the idea and why it was not acceptable.
1537-
-->
1538-
15391424
## Infrastructure Needed (Optional)
1540-
1541-
<!--
1542-
Use this section if you need things from the project/SIG. Examples include a
1543-
new subproject, repos requested, or GitHub details. Listing these here allows a
1544-
SIG to get the process for these resources started right away.
1545-
-->

keps/sig-node/127-user-namespaces/kep.yaml

Lines changed: 7 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -16,14 +16,19 @@ approvers:
1616
- "@derekwaynecarr"
1717

1818
stage: beta
19-
latest-milestone: "v1.33"
19+
latest-milestone: "v1.34"
2020
milestone:
2121
alpha: "v1.25"
22-
beta: "v1.33"
22+
beta: "v1.34"
23+
stable: "v1.35"
2324

2425
feature-gates:
2526
- name: UserNamespacesSupport
2627
components:
2728
- kubelet
2829
- kube-apiserver
2930
disable-supported: true
31+
32+
metrics:
33+
- started_user_namespaced_pods_total (exposed by kubelet)
34+
- started_user_namespaced_pods_errors_total (exposed by kubelet)
