| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: "Kubernetes v1.33: User Namespaces enabled by default!" |
| 4 | +date: 2025-04-23 |
| 5 | +draft: true |
| 6 | +slug: userns-enabled-by-default |
| 7 | +author: > |
| 8 | + Rodrigo Campos Catelin (Microsoft), |
| 9 | + Giuseppe Scrivano (Red Hat), |
| 10 | + Sascha Grunert (Red Hat) |
| 11 | +--- |
| 12 | + |
| 13 | +In Kubernetes v1.33, support for user namespaces is enabled by default. This means |
| 14 | +that, when the stack requirements are met, pods can opt in to use user |
| 15 | +namespaces. To use the feature there is no need to enable any Kubernetes feature |
| 16 | +gate anymore! |
| 17 | + |
| 18 | +In this blog post we answer some common questions about user namespaces. But, |
| 19 | +before we dive into that, let's recap what user namespaces are and why they are |
| 20 | +important. |
| 21 | + |
| 22 | +## What is a user namespace? |
| 23 | + |
| 24 | +Note: Linux user namespaces are a different concept from [Kubernetes |
| 25 | +namespaces](/docs/concepts/overview/working-with-objects/namespaces/). |
| 26 | +The former is a Linux kernel feature; the latter is a Kubernetes feature. |
| 27 | + |
| 28 | +Linux provides different namespaces to isolate processes from each other. For |
| 29 | +example, a typical Kubernetes pod runs within a network namespace to isolate the |
| 30 | +network identity and a PID namespace to isolate the processes. |
| 31 | + |
| 32 | +One Linux namespace that was left behind is the [user |
| 33 | +namespace](https://man7.org/linux/man-pages/man7/user_namespaces.7.html). It |
| 34 | +isolates the UIDs and GIDs of the containers from the ones on the host. The |
| 35 | +identifiers in a container can be mapped to identifiers on the host in a way |
| 36 | +where host and container(s) never end up in overlapping UID/GIDs. Furthermore, |
| 37 | +the identifiers can be mapped to unprivileged, non-overlapping UIDs and GIDs on |
| 38 | +the host. This brings three key benefits: |
| 39 | + |
| 40 | + * _Prevention of lateral movement_: As the UIDs and GIDs for different |
| 41 | +containers are mapped to different UIDs and GIDs on the host, containers have a |
| 42 | +harder time attacking each other, even if they escape the container boundaries. |
| 43 | +For example, suppose container A runs with different UIDs and GIDs on the host |
| 44 | +than container B. In that case, the operations it can do on container B's files and processes |
| 45 | +are limited: it can only read/write what a file allows to others, as it will never |
| 46 | +have owner or group permission (the UIDs/GIDs on the host are |
| 47 | +guaranteed to be different for different containers). |
| 48 | + |
| 49 | + * _Increased host isolation_: As the UIDs and GIDs are mapped to unprivileged |
| 50 | +users on the host, if a container escapes the container boundaries, even if it |
| 51 | +runs as root inside the container, it has no privileges on the host. This |
| 52 | +greatly limits which host files it can read/write, which processes it can send |
| 53 | +signals to, etc. Furthermore, capabilities granted are only valid inside the |
| 54 | +user namespace and not on the host, limiting the impact a container |
| 55 | +escape can have. |
| 56 | + |
| 57 | + * _Enablement of new use cases_: User namespaces allow containers to gain |
| 58 | +certain capabilities inside their own user namespace without affecting the host. |
| 59 | +This unlocks new possibilities, such as running applications that require |
| 60 | +privileged operations without granting full root access on the host. This is |
| 61 | +particularly useful for running nested containers. |
| 62 | + |
| 63 | +{{< figure src="/images/blog/2024-04-22-userns-beta/image.svg" alt="Image showing IDs 0-65535 are reserved to the host, pods use higher IDs" title="User namespace IDs allocation" class="diagram-medium" >}} |
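To make the mapping concrete, here is a minimal Python sketch (not Kubernetes or kubelet code) of how an ID inside a user namespace translates to an ID on the host. It assumes the `inside outside length` line format of `/proc/<pid>/uid_map` described in `user_namespaces(7)`. The two pod ranges and the helper names `parse_id_map` and `to_host_id` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

IdMap = List[Tuple[int, int, int]]


def parse_id_map(text: str) -> IdMap:
    """Parse lines in the uid_map/gid_map format: 'inside outside length'."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def to_host_id(container_id: int, id_map: IdMap) -> Optional[int]:
    """Translate an ID seen inside the namespace to the ID used on the host.

    Returns None when the ID is not covered by any mapping entry.
    """
    for inside, outside, length in id_map:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None


# Hypothetical, non-overlapping ranges for two pods (assumed values).
pod_a = parse_id_map("0 65536 65536")
pod_b = parse_id_map("0 131072 65536")

# Root inside each pod is a different, unprivileged ID on the host.
print(to_host_id(0, pod_a), to_host_id(0, pod_b))
```

With these assumed ranges, UID 0 in each pod resolves to a different host UID above 65535, which is why neither pod gets owner or group access to the other's files.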
| 64 | + |
| 65 | +If a pod running as the root user without a user namespace manages to break out, |
| 66 | +it has root privileges on the node. If some capabilities were granted to the |
| 67 | +container, the capabilities are valid on the host too. None of this is true when |
| 68 | +using user namespaces (modulo bugs, of course 🙂). |
| 69 | + |
| 70 | +## Demos |
| 71 | + |
| 72 | +Rodrigo created demos that show how some CVEs are mitigated when user |
| 73 | +namespaces are used. We showed them here before (see [here][userns-alpha] and |
| 74 | +[here][userns-beta]), but take a look if you haven't: |
| 75 | + |
| 76 | +Mitigation of CVE-2024-21626 with user namespaces: |
| 77 | + |
| 78 | +{{< youtube id="07y5bl5UDdA" title="Mitigation of CVE-2024-21626 on Kubernetes by enabling User Namespace support" class="youtube-quote-sm" >}} |
| 79 | + |
| 80 | +Mitigation of CVE-2022-0492 with user namespaces: |
| 81 | + |
| 82 | +{{< youtube id="M4a2b4KkXN8" title="Mitigation of CVE-2022-0492 on Kubernetes by enabling User Namespace support" class="youtube-quote-sm" >}} |
| 83 | + |
| 84 | +[userns-alpha]: https://kubernetes.io/blog/2023/09/13/userns-alpha/ |
| 85 | +[userns-beta]: https://kubernetes.io/blog/2024/04/22/userns-beta/ |
| 86 | + |
| 87 | +## Everything you wanted to know about user namespaces in Kubernetes |
| 88 | + |
| 89 | +Here we try to answer some of the questions we have been asked about user |
| 90 | +namespaces support in Kubernetes. |
| 91 | + |
| 92 | +**1. What are the requirements to use it?** |
| 93 | + |
| 94 | +The requirements are documented [here][userns-req], and we elaborate on them |
| 95 | +in the following questions. |
| 96 | + |
| 97 | +Note this is a Linux-only feature. |
| 98 | + |
| 99 | +[userns-req]: /docs/concepts/workloads/pods/user-namespaces/#before-you-begin |
| 100 | + |
| 101 | +**2. How do I configure a pod to opt in?** |
| 102 | + |
| 103 | +A complete step-by-step guide is available [here][task-userns]. The short |
| 104 | +version is that you need to set the `hostUsers: false` field in the pod spec. For |
| 105 | +example: |
| 106 | + |
| 107 | +```yaml |
| 108 | +apiVersion: v1 |
| 109 | +kind: Pod |
| 110 | +metadata: |
| 111 | +  name: userns |
| 112 | +spec: |
| 113 | +  hostUsers: false |
| 114 | +  containers: |
| 115 | +  - name: shell |
| 116 | +    command: ["sleep", "infinity"] |
| 117 | +    image: debian |
| 118 | +``` |
| 119 | + |
| 120 | +Yes, it is that simple. Applications will run just fine, without any other |
| 121 | +changes needed (unless your application needs privileges on the host). |
| 122 | + |
| 123 | +User namespaces allow you to run as root inside the container without having |
| 124 | +privileges on the host. However, if your application needs privileges on the |
| 125 | +host, for example an app that needs to load a kernel module, then you can't use |
| 126 | +user namespaces. |
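One way to check that a pod really got its own user namespace is to look at `/proc/self/uid_map` from inside the container. The Python sketch below assumes that the initial (host) user namespace reports the identity mapping `0 0 4294967295`, while a pod with `hostUsers: false` reports a shorter range that starts at a non-zero host ID. The function names and the sample mapping in the comments are illustrative assumptions, not part of any Kubernetes API.

```python
def uses_host_users(uid_map_text: str) -> bool:
    """Return True when the mapping is the identity over the full 32-bit range.

    That is what a process in the initial (host) user namespace sees in
    /proc/self/uid_map. Any other mapping means a separate user namespace.
    """
    entries = [
        tuple(int(field) for field in line.split())
        for line in uid_map_text.splitlines()
        if line.strip()
    ]
    return entries == [(0, 0, 4294967295)]


def current_process_uses_host_users() -> bool:
    """Read the mapping of the current process (Linux only)."""
    with open("/proc/self/uid_map") as uid_map:
        return uses_host_users(uid_map.read())


# Sample inputs: the host mapping, and a made-up mapping such as a pod with
# hostUsers: false might show (the host ID 833617920 is just an example).
print(uses_host_users("         0          0 4294967295\n"))
print(uses_host_users("0 833617920 65536\n"))
```

Running `current_process_uses_host_users()` inside the `userns` pod from the example should return `False` if the pod was created with a user namespace.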
| 127 | + |
| 128 | +**3. What are idmap mounts and why do the file-systems used need to support them?** |
| 129 | + |
| 130 | +Idmap mounts are a Linux kernel feature that uses a mapping of UIDs/GIDs when |
| 131 | +accessing a mount. When combined with user namespaces, it greatly simplifies the |
| 132 | +support for volumes, as you can forget about the host UIDs/GIDs the user |
| 133 | +namespace is using. |
| 134 | + |
| 135 | +In particular, thanks to idmap mounts we can: |
| 136 | + * Run each pod with different UIDs/GIDs on the host. This is key for the |
| 137 | + lateral movement prevention we mentioned earlier. |
| 138 | + * Share volumes with pods that don't use user namespaces. |
| 139 | + * Enable/disable user namespaces without needing to chown the pod's volumes. |
| 140 | + |
| 141 | +Support for idmap mounts in the kernel is per file-system and different kernel |
| 142 | +releases added support for idmap mounts on different file-systems. |
| 143 | + |
| 144 | +To find which kernel version added support for each file-system, you can check |
| 145 | +out the `mount_setattr` man page, or the online version of it |
| 146 | +[here][mount_setattr]. |
| 147 | + |
| 148 | +Most popular file-systems are supported. The notable exception is NFS, which |
| 149 | +isn't supported yet. |
| 150 | + |
| 151 | +[mount_setattr]: https://man7.org/linux/man-pages/man2/mount_setattr.2.html#NOTES |
| 152 | + |
| 153 | +**4. Can you clarify exactly which file-systems need to support idmap mounts?** |
| 154 | + |
| 155 | +The file-systems that need to support idmap mounts are all the file-systems used |
| 156 | +by a pod in the `pod.spec.volumes` field. |
| 157 | + |
| 158 | +This means: for PV/PVC volumes, the file-system used in the PV needs to support |
| 159 | +idmap mounts; for hostPath volumes, the file-system used in the hostPath |
| 160 | +needs to support idmap mounts. |
| 161 | + |
| 162 | +What does this mean for secrets/configmaps/projected/downwardAPI volumes? For |
| 163 | +these volumes, the kubelet creates a `tmpfs` file-system. So, you will need a |
| 164 | +6.3 kernel or newer to use these volumes (note that if you use them as env variables it |
| 165 | +is fine). |
| 166 | + |
| 167 | +And what about emptyDir volumes? Those volumes are created by the kubelet by |
| 168 | +default in `/var/lib/kubelet/pods/`. You can also use a custom directory for |
| 169 | +this. But what needs to support idmap mounts is the file-system used in that |
| 170 | +directory. |
| 171 | + |
| 172 | +The kubelet creates some more files for the container, like `/etc/hostname`, |
| 173 | +`/etc/resolv.conf`, `/dev/termination-log`, `/etc/hosts`, etc. These files are |
| 174 | +also created in `/var/lib/kubelet/pods/` by default, so it's important for the |
| 175 | +file-system used in that directory to support idmap mounts. |
| 176 | + |
| 177 | +Also, some container runtimes may put some of these ephemeral volumes inside a |
| 178 | +`tmpfs` file-system, in which case you will need support for idmap mounts in |
| 179 | +`tmpfs`. |
| 180 | + |
| 181 | +**5. Can I use a kernel older than 6.3?** |
| 182 | + |
| 183 | +Yes, but you will need to make sure you are not using a `tmpfs` file-system. If |
| 184 | +you avoid that, you can easily use 5.19 (if all the other file-systems you use |
| 185 | +support idmap mounts in that kernel). |
| 186 | + |
| 187 | +It can be tricky to avoid using `tmpfs`, though, as we just described above. |
| 188 | +Besides having to avoid those volume types, you will also have to avoid mounting the |
| 189 | +service account token. Every pod has it mounted by default, and it uses a |
| 190 | +projected volume that, as we mentioned, uses a `tmpfs` file-system. |
| 191 | + |
| 192 | +You could even go lower than 5.19, all the way to 5.12. However, your container |
| 193 | +rootfs probably uses an overlayfs file-system, and support for overlayfs was |
| 194 | +added in 5.19. We wouldn't recommend using a kernel older than 5.19, as not |
| 195 | +being able to use idmap mounts for the rootfs is a big limitation. If you |
| 196 | +absolutely need to, you can check [this blog post][userns-tricks] Rodrigo wrote |
| 197 | +some years ago, about tricks to use user namespaces when you can't support |
| 198 | +idmap mounts on the rootfs. |
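The kernel versions mentioned in the last questions can be summarized in a small check. This Python sketch only encodes the two versions stated in this post (6.3 for `tmpfs`, 5.19 for overlayfs). The function names are hypothetical, and for any other file-system you should consult the `mount_setattr` man page rather than this table.

```python
import re
from typing import Iterable, List, Tuple

# Minimum kernel versions for idmap mounts, taken from the text above.
# Other file-systems are deliberately left out: check mount_setattr(2).
MIN_KERNEL = {
    "tmpfs": (6, 3),
    "overlayfs": (5, 19),
}


def parse_kernel(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.19.0-46-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def too_old_for(release: str, filesystems: Iterable[str]) -> List[str]:
    """Return the file-systems whose idmap mount support needs a newer kernel.

    Raises KeyError for a file-system that is not in MIN_KERNEL, so that an
    unknown file-system is never silently reported as supported.
    """
    kernel = parse_kernel(release)
    return sorted(fs for fs in filesystems if kernel < MIN_KERNEL[fs])


# A 5.19 kernel handles an overlayfs rootfs, but not tmpfs-backed volumes
# such as the projected service account token.
print(too_old_for("5.19.0-46-generic", ["tmpfs", "overlayfs"]))
```

On a real node you could pass `platform.release()` as the first argument.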
| 199 | + |
| 200 | +[userns-tricks]: https://kinvolk.io/blog/2023/11/tips-and-tricks-for-user-namespaces-with-kubernetes-and-containerd |
| 201 | + |
| 202 | +**6. If my stack supports user namespaces, do I need to configure anything else?** |
| 203 | + |
| 204 | +No, if your stack supports it and you are using Kubernetes v1.33, there is |
| 205 | +nothing you _need_ to configure. You should be able to follow the task: [Use a |
| 206 | +user namespace with a pod][task-userns]. |
| 207 | + |
| 208 | +However, in case you have specific requirements, you may configure various |
| 209 | +options. You can find more information [here][userns-k8s-conf]. You can also |
| 210 | +enable a [feature gate to relax the PSS rules][userns-pss]. |
| 211 | + |
| 212 | +[userns-k8s-conf]: /docs/concepts/workloads/pods/user-namespaces/#set-up-a-node-to-support-user-namespaces |
| 213 | +[task-userns]: /docs/tasks/configure-pod-container/user-namespaces/ |
| 214 | +[userns-pss]: /docs/concepts/workloads/pods/user-namespaces/#integration-with-pod-security-admission-checks |
| 215 | + |
| 216 | +**7. The demos are nice, but are there more CVEs that this mitigates?** |
| 217 | + |
| 218 | +Yes, quite a lot, actually! Besides the ones in the demo, the KEP has [more CVEs |
| 219 | +you can check][kep-cve]. That list is not exhaustive; there are many more. |
| 220 | + |
| 221 | +[kep-cve]: https://github.com/kubernetes/enhancements/blob/b8013bfbceb16843686aebbb2ccffce81a6e772d/keps/sig-node/127-user-namespaces/README.md#motivation |
| 222 | + |
| 223 | +**8. Can you sum up why user namespaces are important?** |
| 224 | + |
| 225 | +Think about running a process as root, maybe even an untrusted process. Do you |
| 226 | +think that is secure? What if we limit it by adding seccomp and AppArmor, mask |
| 227 | +some files in /proc (so it can't crash the node, etc.) and some more tweaks? |
| 228 | + |
| 229 | +Wouldn't it be better if we don't give it privileges in the first place, instead |
| 230 | +of trying to play whack-a-mole with all the possible ways root can escape? |
| 231 | + |
| 232 | +This is what user namespaces do, plus some other goodies: |
| 233 | + |
| 234 | + * **Run as an unprivileged user on the host without making changes to your application**. |
| 235 | +Greg and Vinayak did a great talk on the pains you can face when trying to run |
| 236 | +unprivileged without user namespaces. The pains part [starts in this minute][kubecon-nonroot-pains]. |
| 237 | + |
| 238 | + * **All pods run with different UIDs/GIDs, which significantly improves lateral |
| 239 | +movement prevention**. This is guaranteed with user namespaces (the kubelet chooses them for |
| 240 | +you). In the same talk, Greg and Vinayak show that to achieve the same without |
| 241 | +user namespaces, they went through a quite complex custom solution. This part |
| 242 | +[starts in this minute][kubecon-nonroot-uids]. |
| 243 | + |
| 244 | + * **The capabilities granted are only granted inside the user namespace**. That |
| 245 | + means that if a pod breaks out of the container, they are not valid on the |
| 246 | +host. We can't provide that without user namespaces. |
| 247 | + |
| 248 | + * **It enables new use-cases in a _secure_ way**. You can run docker in docker, |
| 249 | +unprivileged container builds, Kubernetes inside Kubernetes, etc., all **in a secure |
| 250 | +way**. Most of the previous solutions to do this required privileged containers or |
| 251 | +putting the node at a high risk of compromise. |
| 252 | + |
| 253 | +[kubecon-nonroot-pains]: https://youtu.be/uouH9fsWVIE?feature=shared&t=351 |
| 254 | +[kubecon-nonroot-uids]: https://youtu.be/uouH9fsWVIE?feature=shared&t=793 |
| 255 | + |
| 256 | +**9. Is there container runtime documentation for user namespaces?** |
| 257 | + |
| 258 | +Yes, we have [containerd |
| 259 | +documentation](https://github.com/containerd/containerd/tree/b22a302a75d9a7d7955780e54cc5b32de6c8525d/docs/user-namespaces). |
| 260 | +This explains different limitations of containerd 1.7 and how to use |
| 261 | +user namespaces in containerd without Kubernetes pods (using `ctr`). Note that |
| 262 | +if you use containerd, you need containerd 2.0 or higher to use user namespaces |
| 263 | +with Kubernetes. |
| 264 | + |
| 265 | +CRI-O doesn't have special documentation for user namespaces; it works out of |
| 266 | +the box. |
| 267 | + |
| 268 | +**10. What about the other container runtimes?** |
| 269 | + |
| 270 | +No other container runtime that we are aware of supports user namespaces with |
| 271 | +Kubernetes. That sadly includes [cri-dockerd][cri-dockerd] too. |
| 272 | + |
| 273 | +[cri-dockerd]: https://github.com/Mirantis/cri-dockerd/issues/74 |
| 274 | + |
| 275 | +**11. I'd like to learn more about it, what would you recommend?** |
| 276 | + |
| 277 | +Rodrigo did an introduction to user namespaces at KubeCon 2022: |
| 278 | + * [Run As “Root”, Not Root: User Namespaces In K8s- Marga Manterola, Isovalent & Rodrigo Campos Catelin](https://sched.co/182K0) |
| 279 | + |
| 280 | +Also, the aforementioned presentation at KubeCon 2023 can be |
| 281 | +useful as a motivation for user namespaces: |
| 282 | + * [Least Privilege Containers: Keeping a Bad Day from Getting Worse - Greg Castle & Vinayak Goyal](https://sched.co/1HyX4) |
| 283 | + |
| 284 | +Bear in mind the presentations are some years old, and some things have changed since |
| 285 | +then. Use the Kubernetes documentation as the source of truth. |
| 286 | + |
| 287 | +If you would like to learn more about the low-level details of user namespaces, |
| 288 | +you can check `man 7 user_namespaces` and `man 1 unshare`. You can easily create |
| 289 | +namespaces and experiment with how they behave. Be aware that the `unshare` tool |
| 290 | +has a lot of flexibility, and with that come options to create incomplete setups. |
| 291 | + |
| 292 | +If you would like to know more about idmap mounts, you can check [its Linux |
| 293 | +kernel documentation](https://docs.kernel.org/filesystems/idmappings.html). |
| 294 | + |
| 295 | +## Conclusions |
| 296 | + |
| 297 | +Running pods as root is not ideal and running them as non-root is also hard |
| 298 | +with containers, as it can require a lot of changes to the applications. |
| 299 | +User namespaces are a unique feature that lets you have the best of both worlds: run |
| 300 | +as non-root, without any changes to your application. |
| 301 | + |
| 302 | +This post covered: what user namespaces are, why they are important, some real |
| 303 | +world examples of CVEs mitigated by user namespaces, and some common questions. |
| 304 | +Hopefully, this post helped you to eliminate the last doubts you had and you |
| 305 | +will now try user namespaces (if you didn't already!). |
| 306 | + |
| 307 | +## How do I get involved? |
| 308 | + |
| 309 | +You can reach SIG Node by several means: |
| 310 | +- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node) |
| 311 | +- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node) |
| 312 | +- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode) |
| 313 | + |
| 314 | +You can also contact us directly: |
| 315 | +- GitHub: @rata @giuseppe @saschagrunert |
| 316 | +- Slack: @rata @giuseppe @sascha |