| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: "Kubernetes 1.30: Beta Support For Pods With User Namespaces" |
| 4 | +date: 2024-04-22 |
| 5 | +slug: userns-beta |
| 6 | +--- |
| 7 | + |
| 8 | +**Authors:** Rodrigo Campos Catelin (Microsoft), Giuseppe Scrivano (Red Hat), Sascha Grunert (Red Hat) |
| 9 | + |
| 10 | +Linux provides different namespaces to isolate processes from each other. For |
| 11 | +example, a typical Kubernetes pod runs within a network namespace to isolate the |
| 12 | +network identity and a PID namespace to isolate the processes. |
| 13 | + |
| 14 | +One Linux namespace that was left behind is the [user |
| 15 | +namespace](https://man7.org/linux/man-pages/man7/user_namespaces.7.html). This |
| 16 | +namespace allows us to isolate the user and group identifiers (UIDs and GIDs) we |
| 17 | +use inside the container from the ones on the host. |
| 18 | + |
| 19 | +This is a powerful abstraction that allows us to run containers as "root": we |
| 20 | +are root inside the container and can do everything root can inside the pod, |
| 21 | +but our interactions with the host are limited to what a non-privileged user can |
| 22 | +do. This is great for limiting the impact of a container breakout. |
| 23 | + |
| 24 | +A container breakout is when a process inside a container can break out |
| 25 | +onto the host using some unpatched vulnerability in the container runtime or the |
| 26 | +kernel and can access/modify files on the host or other containers. If we |
| 27 | +run our pods with user namespaces, the privileges the container has over the |
| 28 | +rest of the host are reduced, and the files outside the container it can access |
| 29 | +are limited too. |
| 30 | + |
| 31 | +In Kubernetes v1.25, we introduced support for user namespaces only for stateless |
| 32 | +pods. Kubernetes 1.28 lifted that restriction, and now, with Kubernetes 1.30, we |
| 33 | +are moving to beta! |
| 34 | + |
| 35 | +## What is a user namespace? |
| 36 | + |
| 37 | +Note: Linux user namespaces are a different concept from [Kubernetes |
| 38 | +namespaces](/docs/concepts/overview/working-with-objects/namespaces/). |
| 39 | +The former is a Linux kernel feature; the latter is a Kubernetes feature. |
| 40 | + |
| 41 | +User namespaces are a Linux feature that isolates the UIDs and GIDs of the |
| 42 | +containers from the ones on the host. The identifiers in the container can be |
| 43 | +mapped to identifiers on the host in a way where the host UID/GIDs used for |
| 44 | +different containers never overlap. Furthermore, the identifiers can be mapped |
| 45 | +to unprivileged, non-overlapping UIDs and GIDs on the host. This brings two key |
| 46 | +benefits: |
| 47 | + |
| 48 | + * _Prevention of lateral movement_: As the UIDs and GIDs for different |
| 49 | +containers are mapped to different UIDs and GIDs on the host, containers have a |
| 50 | +harder time attacking each other, even if they escape the container boundaries. |
| 51 | +For example, suppose container A runs with different UIDs and GIDs on the host |
| 52 | +than container B. In that case, the operations it can do on container B's files and processes |
| 53 | +are limited: it can only read or write what a file allows to "others", as it will never
| 54 | +have owner or group permission (the UIDs/GIDs on the host are
| 55 | +guaranteed to be different for different containers). |
| 56 | + |
| 57 | + * _Increased host isolation_: As the UIDs and GIDs are mapped to unprivileged |
| 58 | +users on the host, if a container escapes the container boundaries, even if it |
| 59 | +runs as root inside the container, it has no privileges on the host. This |
| 60 | +greatly limits which host files it can read/write, which processes it can send
| 61 | +signals to, etc. Furthermore, capabilities granted are only valid inside the |
| 62 | +user namespace and not on the host, limiting the impact a container |
| 63 | +escape can have. |
| 64 | + |
| 65 | +{{< figure src="/images/blog/2024-04-22-userns-beta/userns-ids.png" alt="Image showing IDs 0-65535 are reserved to the host, pods use higher IDs" title="User namespace IDs allocation" >}} |
| 66 | + |
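
To make the mapping idea concrete, here is a minimal sketch in Python. It is
not how the kubelet allocates IDs: the allocator `mapping_for_pod`, the
`IdMapping` type, and the 65536-ID range per pod are assumptions chosen to
mirror the figure above, where host IDs 0-65535 stay reserved and pods use
higher IDs.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed range size per pod, mirroring the 0-65535 host range in the figure.
ID_COUNT = 65536


@dataclass(frozen=True)
class IdMapping:
    """One contiguous ID mapping, similar in spirit to a /proc/<pid>/uid_map line."""

    container_start: int
    host_start: int
    length: int

    def to_host(self, container_id: int) -> Optional[int]:
        """Translate a container ID to a host ID, or None if it is unmapped."""
        offset = container_id - self.container_start
        if 0 <= offset < self.length:
            return self.host_start + offset
        return None


def mapping_for_pod(pod_index: int) -> IdMapping:
    """Hypothetical allocator: pod N gets the host range starting at (N + 1) * 65536."""
    return IdMapping(container_start=0, host_start=(pod_index + 1) * ID_COUNT, length=ID_COUNT)


def ranges_overlap(a: IdMapping, b: IdMapping) -> bool:
    """Return True if the two host ID ranges share at least one ID."""
    return a.host_start < b.host_start + b.length and b.host_start < a.host_start + a.length


pod_a = mapping_for_pod(0)
pod_b = mapping_for_pod(1)

print(pod_a.to_host(0))              # root in pod A -> host UID 65536
print(pod_b.to_host(0))              # root in pod B -> host UID 131072
print(ranges_overlap(pod_a, pod_b))  # False
```

In this sketch, root (UID 0) inside each pod lands on a different, unprivileged
host UID, which is the property both benefits above rely on.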
| 67 | + |
| 68 | +Without a user namespace, a container running as root has root privileges
| 69 | +on the node in the case of a container breakout. If some capabilities
| 70 | +were granted to the container, the capabilities are valid on the host too. None |
| 71 | +of this is true when using user namespaces (modulo bugs, of course 🙂). |
| 72 | + |
| 73 | +## Changes in 1.30 |
| 74 | + |
| 75 | +In Kubernetes 1.30, besides moving user namespaces to beta, the contributors |
| 76 | +working on this feature: |
| 77 | + |
| 78 | + * Introduced a way for the kubelet to use custom ranges for the UIDs/GIDs mapping |
| 79 | + * Added a way for Kubernetes to enforce that the runtime supports all the features
| 80 | + needed for user namespaces. If they are not supported, Kubernetes will show a |
| 81 | + clear error when trying to create a pod with user namespaces. Before 1.30, if |
| 82 | + the container runtime didn't support user namespaces, the pod could be created |
| 83 | + without a user namespace. |
| 84 | + * Added more tests, including [tests in the |
| 85 | + cri-tools](https://github.com/kubernetes-sigs/cri-tools/pull/1354) |
| 86 | + repository. |
| 87 | + |
| 88 | +You can check the |
| 89 | +[documentation](/docs/concepts/workloads/pods/user-namespaces/#set-up-a-node-to-support-user-namespaces) |
| 90 | +on user namespaces for how to configure custom ranges for the mapping. |
| 91 | + |
| 92 | +## Demo |
| 93 | + |
| 94 | +A few months ago, [CVE-2024-21626][runc-cve] was disclosed. This vulnerability
| 95 | +has a **score of 8.6 (HIGH)**. It allows an attacker to escape a container and
| 96 | +**read/write to any path on the node and other pods hosted on the same node**. |
| 97 | + |
| 98 | +Rodrigo created a demo that exploits [CVE-2024-21626][runc-cve] and shows how
| 99 | +the exploit, which works without user namespaces, **is mitigated when user |
| 100 | +namespaces are in use.** |
| 101 | + |
| 102 | +{{< youtube id="07y5bl5UDdA" title="Mitigation of CVE-2024-21626 on Kubernetes by enabling User Namespace support" class="youtube-quote-sm" >}} |
| 103 | + |
| 104 | +Please note that with user namespaces, an attacker can still do on the host file system
| 105 | +whatever the permission bits for "others" allow. Therefore, the CVE is not
| 106 | +completely prevented, but the impact is greatly reduced. |
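
As a rough illustration of what "the permission bits for others" means, the
sketch below extracts that part of a file mode using Python's standard `stat`
module. The helper name `others_access` is hypothetical, and the sketch is a
simplification: it ignores ACLs, directory traversal permissions, and other
access controls that may also apply.

```python
import stat


def others_access(mode: int) -> str:
    """Return the rwx string granted by the "others" permission bits of a file mode."""
    return "".join(
        flag if mode & bit else "-"
        for flag, bit in (
            ("r", stat.S_IROTH),
            ("w", stat.S_IWOTH),
            ("x", stat.S_IXOTH),
        )
    )


print(others_access(0o644))  # r--  (readable, not writable, by others)
print(others_access(0o600))  # ---  (no access for others)
print(others_access(0o666))  # rw-  (readable and writable by others)
```

A host file with mode `0600` would therefore stay out of reach for an escaped
process running with a pod's unprivileged host UID, while a world-writable file
would not.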
| 107 | + |
| 108 | +[runc-cve]: https://github.com/opencontainers/runc/security/advisories/GHSA-xr7r-f8xq-vfvv |
| 109 | + |
| 110 | +## Node system requirements |
| 111 | + |
| 112 | +There are requirements on the Linux kernel version and the container |
| 113 | +runtime to use this feature. |
| 114 | + |
| 115 | +You need Linux 6.3 or greater. This is because the feature relies on a
| 116 | +kernel feature named idmap mounts, and support for using idmap mounts with tmpfs |
| 117 | +was merged in Linux 6.3. |
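
A quick way to check the kernel requirement on a node is to compare the running
kernel release against 6.3. The following is a minimal sketch; the function
names are hypothetical, and it only inspects the version string, so it says
nothing about vendor backports or about container runtime support.

```python
import platform
import re
from typing import Tuple

# Minimum kernel version stated in this post.
MIN_KERNEL = (6, 3)


def parse_kernel_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.5.0-27-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_meets_minimum(release: str) -> bool:
    """Return True if the release is at least Linux 6.3."""
    return parse_kernel_release(release) >= MIN_KERNEL


if __name__ == "__main__":
    release = platform.release()
    status = "meets" if kernel_meets_minimum(release) else "does not meet"
    print(f"kernel {release} {status} the Linux 6.3 requirement")
```

Comparing `(major, minor)` tuples rather than strings avoids mistakes such as
treating 6.10 as older than 6.3.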
| 118 | + |
| 119 | +If you are using [CRI-O][crio] with crun, as always, you can expect support for
| 120 | +Kubernetes 1.30 with CRI-O 1.30. Please note you also need [crun][crun] 1.9 or |
| 121 | +greater. If you are using CRI-O with [runc][runc], this is still not supported. |
| 122 | + |
| 123 | +Containerd support is currently targeted for [containerd][containerd] 2.0, and |
| 124 | +the same crun version requirements apply. If you are using containerd with runc, |
| 125 | +this is still not supported. |
| 126 | + |
| 127 | +Please note that containerd 1.7 added _experimental_ support for user |
| 128 | +namespaces, as implemented in Kubernetes 1.25 and 1.26. We did a redesign in |
| 129 | +Kubernetes 1.27, which requires changes in the container runtime. Those changes |
| 130 | +are not present in containerd 1.7, so it only works with user namespaces |
| 131 | +support in Kubernetes 1.25 and 1.26. |
| 132 | + |
| 133 | +Another limitation of containerd 1.7 is that it needs to change the |
| 134 | +ownership of every file and directory inside the container image during Pod |
| 135 | +startup. This has a storage overhead and can significantly impact the |
| 136 | +container startup latency. Containerd 2.0 will probably include an implementation |
| 137 | +that will eliminate the added startup latency and storage overhead. Consider |
| 138 | +this if you plan to use containerd 1.7 with user namespaces in |
| 139 | +production. |
| 140 | + |
| 141 | +None of these containerd 1.7 limitations apply to CRI-O. |
| 142 | + |
| 143 | +[crio]: https://cri-o.io/ |
| 144 | +[crun]: https://github.com/containers/crun |
| 145 | +[runc]: https://github.com/opencontainers/runc/ |
| 146 | +[containerd]: https://containerd.io/ |
| 147 | + |
| 148 | +## How do I get involved? |
| 149 | + |
| 150 | +You can reach SIG Node by several means: |
| 151 | +- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node) |
| 152 | +- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node) |
| 153 | +- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode) |
| 154 | + |
| 155 | +You can also contact us directly: |
| 156 | +- GitHub: @rata @giuseppe @saschagrunert |
| 157 | +- Slack: @rata @giuseppe @sascha |