Commit a82ca52

Merge pull request #49983 from rata/rata/blog-userns-1.33
Feature blog 1.33: Userns enabled by default
2 parents de15741 + 6d4266b commit a82ca52

2 files changed

+316
-0
lines changed

@@ -0,0 +1,316 @@
1+
---
2+
layout: blog
3+
title: "Kubernetes v1.33: User Namespaces enabled by default!"
4+
date: 2025-04-23
5+
draft: true
6+
slug: userns-enabled-by-default
7+
author: >
8+
Rodrigo Campos Catelin (Microsoft),
9+
Giuseppe Scrivano (Red Hat),
10+
Sascha Grunert (Red Hat)
11+
---
12+
13+
In Kubernetes v1.33 support for user namespaces is enabled by default. This means
14+
that, when the stack requirements are met, pods can opt in to use user
15+
namespaces. To use the feature there is no need to enable any Kubernetes feature
16+
gate anymore!
17+
18+
In this blog post we answer some common questions about user namespaces. But,
19+
before we dive into that, let's recap what user namespaces are and why they are
20+
important.
21+
22+
## What is a user namespace?
23+
24+
Note: Linux user namespaces are a different concept from [Kubernetes
25+
namespaces](/docs/concepts/overview/working-with-objects/namespaces/).
26+
The former is a Linux kernel feature; the latter is a Kubernetes feature.
27+
28+
Linux provides different namespaces to isolate processes from each other. For
29+
example, a typical Kubernetes pod runs within a network namespace to isolate the
30+
network identity and a PID namespace to isolate the processes.
31+
32+
One Linux namespace that was left behind is the [user
33+
namespace](https://man7.org/linux/man-pages/man7/user_namespaces.7.html). It
34+
isolates the UIDs and GIDs of the containers from the ones on the host. The
35+
identifiers in a container can be mapped to identifiers on the host in a way
36+
where host and container(s) never end up with overlapping UIDs/GIDs. Furthermore,
37+
the identifiers can be mapped to unprivileged, non-overlapping UIDs and GIDs on
38+
the host. This brings three key benefits:
39+
40+
* _Prevention of lateral movement_: As the UIDs and GIDs for different
41+
containers are mapped to different UIDs and GIDs on the host, containers have a
42+
harder time attacking each other, even if they escape the container boundaries.
43+
For example, suppose container A runs with different UIDs and GIDs on the host
44+
than container B. In that case, the operations it can do on container B's files and processes
45+
are limited: only read/write what a file allows to others, as it will never
46+
have owner or group permissions (the UIDs/GIDs on the host are
47+
guaranteed to be different for different containers).
48+
49+
* _Increased host isolation_: As the UIDs and GIDs are mapped to unprivileged
50+
users on the host, if a container escapes the container boundaries, even if it
51+
runs as root inside the container, it has no privileges on the host. This
52+
greatly limits which host files it can read/write, which processes it can send
53+
signals to, etc. Furthermore, capabilities granted are only valid inside the
54+
user namespace and not on the host, limiting the impact a container
55+
escape can have.
56+
57+
* _Enablement of new use cases_: User namespaces allow containers to gain
58+
certain capabilities inside their own user namespace without affecting the host.
59+
This unlocks new possibilities, such as running applications that require
60+
privileged operations without granting full root access on the host. This is
61+
particularly useful for running nested containers.
62+
63+
{{< figure src="/images/blog/2024-04-22-userns-beta/image.svg" alt="Image showing IDs 0-65535 are reserved to the host, pods use higher IDs" title="User namespace IDs allocation" class="diagram-medium" >}}
64+
65+
If a pod running as the root user without a user namespace manages to break out,
66+
it has root privileges on the node. If some capabilities were granted to the
67+
container, the capabilities are valid on the host too. None of this is true when
68+
using user namespaces (modulo bugs, of course 🙂).
69+
70+
## Demos
71+
72+
Rodrigo created demos to show how some CVEs are mitigated when user
73+
namespaces are used. We showed them here before (see [here][userns-alpha] and
74+
[here][userns-beta]), but take a look if you haven't:
75+
76+
Mitigation of CVE-2024-21626 with user namespaces:
77+
78+
{{< youtube id="07y5bl5UDdA" title="Mitigation of CVE-2024-21626 on Kubernetes by enabling User Namespace support" class="youtube-quote-sm" >}}
79+
80+
Mitigation of CVE-2022-0492 with user namespaces:
81+
82+
{{< youtube id="M4a2b4KkXN8" title="Mitigation of CVE-2022-0492 on Kubernetes by enabling User Namespace support" class="youtube-quote-sm" >}}
83+
84+
[userns-alpha]: https://kubernetes.io/blog/2023/09/13/userns-alpha/
85+
[userns-beta]: https://kubernetes.io/blog/2024/04/22/userns-beta/
86+
87+
## Everything you wanted to know about user namespaces in Kubernetes
88+
89+
Here we try to answer some of the questions we have been asked about user
90+
namespaces support in Kubernetes.
91+
92+
**1. What are the requirements to use it?**
93+
94+
The requirements are documented [here][userns-req]. But we will elaborate a bit
95+
more in the following questions.
96+
97+
Note this is a Linux-only feature.
98+
99+
[userns-req]: /docs/concepts/workloads/pods/user-namespaces/#before-you-begin
100+
101+
**2. How do I configure a pod to opt-in?**
102+
103+
A complete step-by-step guide is available [here][task-userns]. But the short
104+
version is you need to set the `hostUsers: false` field in the pod spec. For
105+
example:
106+
107+
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns
spec:
  hostUsers: false
  containers:
  - name: shell
    command: ["sleep", "infinity"]
    image: debian
```
119+
120+
Yes, it is that simple. Applications will run just fine, without any other
121+
changes needed (unless your application needs privileges on the host).
122+
123+
User namespaces allow you to run as root inside the container, but not have
124+
privileges on the host. However, if your application needs privileges on the
125+
host, for example an app that needs to load a kernel module, then you can't use
126+
user namespaces.
127+
128+
**3. What are idmap mounts and why do the file-systems used need to support them?**
129+
130+
Idmap mounts are a Linux kernel feature that uses a mapping of UIDs/GIDs when
131+
accessing a mount. When combined with user namespaces, it greatly simplifies the
132+
support for volumes, as you can forget about the host UIDs/GIDs the user
133+
namespace is using.
134+
135+
In particular, thanks to idmap mounts we can:
136+
* Run each pod with different UIDs/GIDs on the host. This is key for the
137+
lateral movement prevention we mentioned earlier.
138+
* Share volumes with pods that don't use user namespaces.
139+
* Enable/disable user namespaces without needing to chown the pod's volumes.
140+
141+
Support for idmap mounts in the kernel is per file-system and different kernel
142+
releases added support for idmap mounts on different file-systems.
143+
144+
To find which kernel version added support for each file-system, you can check
145+
out the `mount_setattr` man page, or the online version of it
146+
[here][mount_setattr].
147+
148+
Most popular file-systems are supported; the notable exception that isn't
149+
supported yet is NFS.
150+
151+
[mount_setattr]: https://man7.org/linux/man-pages/man2/mount_setattr.2.html#NOTES
152+
153+
**4. Can you clarify exactly which file-systems need to support idmap mounts?**
154+
155+
The file-systems that need to support idmap mounts are all the file-systems used
156+
by a pod in the `pod.spec.volumes` field.
157+
158+
This means: for PV/PVC volumes, the file-system used in the PV needs to support
159+
idmap mounts; for hostPath volumes, the file-system used in the hostPath
160+
needs to support idmap mounts.
161+
162+
What does this mean for secrets/configmaps/projected/downwardAPI volumes? For
163+
these volumes, the kubelet creates a `tmpfs` file-system. So, you will need a
164+
6.3 or newer kernel to use these volumes (note that if you use them as env variables it
165+
is fine).
166+
167+
And what about emptyDir volumes? Those volumes are created by the kubelet by
168+
default in `/var/lib/kubelet/pods/`. You can also use a custom directory for
169+
this. But what needs to support idmap mounts is the file-system used in that
170+
directory.
171+
172+
The kubelet creates some more files for the container, like `/etc/hostname`,
173+
`/etc/resolv.conf`, `/dev/termination-log`, `/etc/hosts`, etc. These files are
174+
also created in `/var/lib/kubelet/pods/` by default, so it's important for the
175+
file-system used in that directory to support idmap mounts.
176+
177+
Also, some container runtimes may put some of these ephemeral volumes inside a
178+
`tmpfs` file-system, in which case you will need support for idmap mounts in
179+
`tmpfs`.
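
To make the mapping from volume type to file-system concrete, here is a minimal sketch of a pod that opts in to user namespaces and uses three kinds of volumes. The pod name, the `app-config` ConfigMap and the `data-claim` PVC are hypothetical names used only for illustration; the comments restate which file-system needs idmap mount support in each case, as described above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-volumes          # hypothetical name
spec:
  hostUsers: false
  containers:
  - name: shell
    image: debian
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: config
      mountPath: /etc/app
    - name: scratch
      mountPath: /scratch
    - name: data
      mountPath: /data
  volumes:
  - name: config
    configMap:
      name: app-config          # hypothetical; the kubelet backs this with tmpfs
  - name: scratch
    emptyDir: {}                # file-system under /var/lib/kubelet/pods/ by default
  - name: data
    persistentVolumeClaim:
      claimName: data-claim     # hypothetical; file-system used by the bound PV
```

If any one of these file-systems lacks idmap mount support on your kernel, that volume is the one to look at first when the pod fails to start.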
180+
181+
**5. Can I use a kernel older than 6.3?**
182+
183+
Yes, but you will need to make sure you are not using a `tmpfs` file-system. If
184+
you avoid that, you can easily use 5.19 (if all the other file-systems you use
185+
support idmap mounts in that kernel).
186+
187+
It can be tricky to avoid using `tmpfs`, though, as we just described above.
188+
Besides having to avoid those volume types, you will also have to avoid mounting the
189+
service account token. Every pod has it mounted by default, and it uses a
190+
projected volume that, as we mentioned, uses a `tmpfs` file-system.
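
As a sketch of what this could look like on a kernel older than 6.3, the pod below opts in to user namespaces and disables the automatic service account token mount with the `automountServiceAccountToken` field. The pod name is hypothetical, and this only removes the token's projected volume: it assumes you also avoid the other `tmpfs`-backed volume types and that your container runtime does not place the pod's ephemeral files on `tmpfs`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-no-token                 # hypothetical name
spec:
  hostUsers: false
  # Do not mount the service account token, which would otherwise
  # add a tmpfs-backed projected volume to the pod.
  automountServiceAccountToken: false
  containers:
  - name: shell
    image: debian
    command: ["sleep", "infinity"]
```

Keep in mind that a pod without the token cannot authenticate to the Kubernetes API with its service account, so this only suits workloads that don't need it.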
191+
192+
You could even go lower than 5.19, all the way to 5.12. However, your container
193+
rootfs probably uses an overlayfs file-system, and idmap mount support for overlayfs was
194+
added in 5.19. We wouldn't recommend using a kernel older than 5.19, as not
195+
being able to use idmap mounts for the rootfs is a big limitation. If you
196+
absolutely need to, you can check [this blog post][userns-tricks] Rodrigo wrote
197+
some years ago, about tricks to use user namespaces when you can't support
198+
idmap mounts on the rootfs.
199+
200+
[userns-tricks]: https://kinvolk.io/blog/2023/11/tips-and-tricks-for-user-namespaces-with-kubernetes-and-containerd
201+
202+
**6. If my stack supports user namespaces, do I need to configure anything else?**
203+
204+
No, if your stack supports it and you are using Kubernetes v1.33, there is
205+
nothing you _need_ to configure. You should be able to follow the task: [Use a
206+
user namespace with a pod][task-userns].
207+
208+
However, in case you have specific requirements, you may configure various
209+
options. You can find more information [here][userns-k8s-conf]. You can also
210+
enable a [feature gate to relax the PSS rules][userns-pss].
211+
212+
[userns-k8s-conf]: /docs/concepts/workloads/pods/user-namespaces/#set-up-a-node-to-support-user-namespaces
213+
[task-userns]: /docs/tasks/configure-pod-container/user-namespaces/
214+
[userns-pss]: /docs/concepts/workloads/pods/user-namespaces/#integration-with-pod-security-admission-checks
215+
216+
**7. The demos are nice, but are there more CVEs that this mitigates?**
217+
218+
Yes, quite a lot, actually! Besides the ones in the demo, the KEP has [more CVEs
219+
you can check][kep-cve]. That list is not exhaustive; there are many more.
220+
221+
[kep-cve]: https://github.com/kubernetes/enhancements/blob/b8013bfbceb16843686aebbb2ccffce81a6e772d/keps/sig-node/127-user-namespaces/README.md#motivation
222+
223+
**8. Can you sum up why user namespaces are important?**
224+
225+
Think about running a process as root, maybe even an untrusted process. Do you
226+
think that is secure? What if we limit it by adding seccomp and apparmor, mask
227+
some files in /proc (so it can't crash the node, etc.) and some more tweaks?
228+
229+
Wouldn't it be better if we don't give it privileges in the first place, instead
230+
of trying to play whack-a-mole with all the possible ways root can escape?
231+
232+
This is what user namespaces do, plus some other goodies:
233+
234+
* **Run as an unprivileged user on the host without making changes to your application**.
235+
Greg and Vinayak gave a great talk on the pains you can face when trying to run
236+
unprivileged without user namespaces. The pains part [starts in this minute][kubecon-nonroot-pains].
237+
238+
* **All pods run with different UIDs/GIDs, so we significantly improve the lateral
239+
movement prevention**. This is guaranteed with user namespaces (the kubelet chooses them for
240+
you). In the same talk, Greg and Vinayak show that to achieve the same without
241+
user namespaces, they went through a quite complex custom solution. This part
242+
[starts in this minute][kubecon-nonroot-uids].
243+
244+
* **The capabilities granted are only granted inside the user namespace**. That
245+
means that if a pod breaks out of the container, the capabilities are not valid on the
246+
host. We can't provide that without user namespaces.
247+
248+
* **It enables new use-cases in a _secure_ way**. You can run docker in docker,
249+
unprivileged container builds, Kubernetes inside Kubernetes, etc., all **in a secure
250+
way**. Most of the previous solutions to do this required privileged containers or
251+
putting the node at a high risk of compromise.
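
To illustrate the point about capabilities, here is a minimal sketch of a pod that adds a capability while using a user namespace. The pod name is hypothetical and `NET_ADMIN` is only an example capability; the idea, as described above, is that the added capability applies inside the pod's user namespace rather than on the host.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-netadmin       # hypothetical name
spec:
  hostUsers: false            # run the pod in its own user namespace
  containers:
  - name: shell
    image: debian
    command: ["sleep", "infinity"]
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]    # example capability; scoped to the user namespace
```

The same `securityContext` without `hostUsers: false` would grant the capability in the host's user namespace, which is the situation user namespaces are meant to avoid.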
252+
253+
[kubecon-nonroot-pains]: https://youtu.be/uouH9fsWVIE?feature=shared&t=351
254+
[kubecon-nonroot-uids]: https://youtu.be/uouH9fsWVIE?feature=shared&t=793
255+
256+
**9. Is there container runtime documentation for user namespaces?**
257+
258+
Yes, we have [containerd
259+
documentation](https://github.com/containerd/containerd/tree/b22a302a75d9a7d7955780e54cc5b32de6c8525d/docs/user-namespaces).
260+
This explains different limitations of containerd 1.7 and how to use
261+
user namespaces in containerd without Kubernetes pods (using `ctr`). Note that
262+
if you use containerd, you need containerd 2.0 or higher to use user namespaces
263+
with Kubernetes.
264+
265+
CRI-O doesn't have special documentation for user namespaces; it works out of
266+
the box.
267+
268+
**10. What about the other container runtimes?**
269+
270+
No other container runtime that we are aware of supports user namespaces with
271+
Kubernetes. That sadly includes [cri-dockerd][cri-dockerd] too.
272+
273+
[cri-dockerd]: https://github.com/Mirantis/cri-dockerd/issues/74
274+
275+
**11. I'd like to learn more about it, what would you recommend?**
276+
277+
Rodrigo did an introduction to user namespaces at KubeCon 2022:
278+
* [Run As “Root”, Not Root: User Namespaces In K8s- Marga Manterola, Isovalent & Rodrigo Campos Catelin](https://sched.co/182K0)
279+
280+
Also, this aforementioned presentation at KubeCon 2023 can be
281+
useful as a motivation for user namespaces:
282+
* [Least Privilege Containers: Keeping a Bad Day from Getting Worse - Greg Castle & Vinayak Goyal](https://sched.co/1HyX4)
283+
284+
Bear in mind that the presentations are some years old, and some things have changed since
285+
then. Use the Kubernetes documentation as the source of truth.
286+
287+
If you would like to learn more about the low-level details of user namespaces,
288+
you can check `man 7 user_namespaces` and `man 1 unshare`. You can easily create
289+
namespaces and experiment with how they behave. Be aware that the `unshare` tool
290+
has a lot of flexibility and, with that, options to create incomplete setups.
291+
292+
If you would like to know more about idmap mounts, you can check [its Linux
293+
kernel documentation](https://docs.kernel.org/filesystems/idmappings.html).
294+
295+
## Conclusions
296+
297+
Running pods as root is not ideal and running them as non-root is also hard
298+
with containers, as it can require a lot of changes to the applications.
299+
User namespaces are a unique feature to let you have the best of both worlds: run
300+
as non-root, without any changes to your application.
301+
302+
This post covered: what user namespaces are, why they are important, some real
303+
world examples of CVEs mitigated by user namespaces, and some common questions.
304+
Hopefully, this post helped you to eliminate the last doubts you had and you
305+
will now try user namespaces (if you didn't already!).
306+
307+
## How do I get involved?
308+
309+
You can reach SIG Node by several means:
310+
- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
311+
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
312+
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)
313+
314+
You can also contact us directly:
315+
- GitHub: @rata @giuseppe @saschagrunert
316+
- Slack: @rata @giuseppe @sascha
Binary file not shown.