kubernetes
diff --git a/‎content/en/blog/_posts/2024-04-22-userns-beta/index.md
Lines changed: 157 additions & 0 deletions b/‎content/en/blog/_posts/2024-04-22-userns-beta/index.md
Lines changed: 157 additions & 0 deletions
diff --git a/‎content/en/blog/_posts/2024-04-22-userns-beta/userns-ids.xcf
94.3 KB b/‎content/en/blog/_posts/2024-04-22-userns-beta/userns-ids.xcf
94.3 KB
diff --git a/‎static/images/blog/2024-04-22-userns-beta/userns-ids.png
43.2 KB b/‎static/images/blog/2024-04-22-userns-beta/userns-ids.png
43.2 KB
@@ -0,0 +1,157 @@
+---
+layout: blog
+title: "Kubernetes 1.30: Beta Support For Pods With User Namespaces"
+date: 2024-04-22
+slug: userns-beta
+---
+
+**Authors:** Rodrigo Campos Catelin (Microsoft), Giuseppe Scrivano (Red Hat), Sascha Grunert (Red Hat)
+
+Linux provides different namespaces to isolate processes from each other. For
+example, a typical Kubernetes pod runs within a network namespace to isolate the
+network identity and a PID namespace to isolate the processes.
+
+One Linux namespace that was left behind is the [user
+namespace](https://man7.org/linux/man-pages/man7/user_namespaces.7.html). This
+namespace allows us to isolate the user and group identifiers (UIDs and GIDs) we
+use inside the container from the ones on the host.
+
+This is a powerful abstraction that allows us to run containers as "root": we
+are root inside the container and can do everything root can inside the pod,
+but our interactions with the host are limited to what a non-privileged user can
+do. This is great for limiting the impact of a container breakout.
+
+A container breakout is when a process inside a container can break out
+onto the host using some unpatched vulnerability in the container runtime or the
+kernel and can access/modify files on the host or other containers. If we
+run our pods with user namespaces, the privileges the container has over the
+rest of the host are reduced, and the files outside the container it can access
+are limited too.
+
+In Kubernetes v1.25, we introduced support for user namespaces only for stateless
+pods. Kubernetes 1.28 lifted that restriction, and now, with Kubernetes 1.30, we
+are moving to beta!
+
+## What is a user namespace?
+
+Note: Linux user namespaces are a different concept from [Kubernetes
+namespaces](/docs/concepts/overview/working-with-objects/namespaces/).
+The former is a Linux kernel feature; the latter is a Kubernetes feature.
+
+User namespaces are a Linux feature that isolates the UIDs and GIDs of the
+containers from the ones on the host. The identifiers in the container can be
+mapped to identifiers on the host in a way where the host UID/GIDs used for
+different containers never overlap. Furthermore, the identifiers can be mapped
+to unprivileged, non-overlapping UIDs and GIDs on the host. This brings two key
+benefits:
+
+ * _Prevention of lateral movement_: As the UIDs and GIDs for different
+containers are mapped to different UIDs and GIDs on the host, containers have a
+harder time attacking each other, even if they escape the container boundaries.
+For example, suppose container A runs with different UIDs and GIDs on the host
+than container B. In that case, the operations it can do on container B's files and processes
+are limited: only read/write what a file allows to others, as it will never
+have permission owner or group permission (the UIDs/GIDs on the host are
+guaranteed to be different for different containers).
+
+ * _Increased host isolation_: As the UIDs and GIDs are mapped to unprivileged
+users on the host, if a container escapes the container boundaries, even if it
+runs as root inside the container, it has no privileges on the host. This
+greatly protects what host files it can read/write, which process it can send
+signals to, etc. Furthermore, capabilities granted are only valid inside the
+user namespace and not on the host, limiting the impact a container
+escape can have.
+
+{{< figure src="/images/blog/2024-04-22-userns-beta/userns-ids.png" alt="Image showing IDs 0-65535 are reserved to the host, pods use higher IDs" title="User namespace IDs allocation" >}}
+
+
+Without using a user namespace, a container running as root in the case of a
+container breakout has root privileges on the node. If some capabilities
+were granted to the container, the capabilities are valid on the host too. None
+of this is true when using user namespaces (modulo bugs, of course 🙂).
+
+## Changes in 1.30
+
+In Kubernetes 1.30, besides moving user namespaces to beta, the contributors
+working on this feature:
+
+ * Introduced a way for the kubelet to use custom ranges for the UIDs/GIDs mapping 
+ * Have added a way for Kubernetes to enforce that the runtime supports all the features
+   needed for user namespaces. If they are not supported, Kubernetes will show a
+   clear error when trying to create a pod with user namespaces. Before 1.30, if
+   the container runtime didn't support user namespaces, the pod could be created
+   without a user namespace.
+ * Added more tests, including [tests in the
+   cri-tools](https://github.com/kubernetes-sigs/cri-tools/pull/1354)
+   repository.
+
+You can check the
+[documentation](/docs/concepts/workloads/pods/user-namespaces/#set-up-a-node-to-support-user-namespaces)
+on user namespaces for how to configure custom ranges for the mapping.
+
+## Demo
+
+A few months ago, [CVE-2024-21626][runc-cve] was disclosed. This **vulnerability
+score is 8.6 (HIGH)**. It allows an attacker to escape a container and
+**read/write to any path on the node and other pods hosted on the same node**.
+
+Rodrigo created a demo that exploits [CVE 2024-21626][runc-cve] and shows how
+the exploit, which works without user namespaces, **is mitigated when user
+namespaces are in use.**
+
+{{< youtube id="07y5bl5UDdA" title="Mitigation of CVE-2024-21626 on Kubernetes by enabling User Namespace support" class="youtube-quote-sm" >}}
+
+Please note that with user namespaces, an attacker can do on the host file system
+what the permission bits for "others" allow. Therefore, the CVE is not
+completely prevented, but the impact is greatly reduced.
+
+[runc-cve]: https://github.com/opencontainers/runc/security/advisories/GHSA-xr7r-f8xq-vfvv
+
+## Node system requirements
+
+There are requirements on the Linux kernel version and the container
+runtime to use this feature.
+
+On Linux you need Linux 6.3 or greater. This is because the feature relies on a
+kernel feature named idmap mounts, and support for using idmap mounts with tmpfs
+was merged in Linux 6.3.
+
+Suppose you are using [CRI-O][crio] with crun; as always, you can expect support for
+Kubernetes 1.30 with CRI-O 1.30. Please note you also need [crun][crun] 1.9 or
+greater. If you are using CRI-O with [runc][runc], this is still not supported.
+
+Containerd support is currently targeted for [containerd][containerd] 2.0, and
+the same crun version requirements apply. If you are using containerd with runc,
+this is still not supported.
+
+Please note that containerd 1.7 added _experimental_ support for user
+namespaces, as implemented in Kubernetes 1.25 and 1.26. We did a redesign in
+Kubernetes 1.27, which requires changes in the container runtime. Those changes
+are not present in containerd 1.7, so it only works with user namespaces
+support in Kubernetes 1.25 and 1.26.
+
+Another limitation of containerd 1.7 is that it needs to change the
+ownership of every file and directory inside the container image during Pod
+startup. This has a storage overhead and can significantly impact the
+container startup latency. Containerd 2.0 will probably include an implementation
+that will eliminate the added startup latency and storage overhead. Consider
+this if you plan to use containerd 1.7 with user namespaces in
+production.
+
+None of these containerd 1.7 limitations apply to CRI-O.
+
+[crio]: https://cri-o.io/
+[crun]: https://github.com/containers/crun
+[runc]: https://github.com/opencontainers/runc/
+[containerd]: https://containerd.io/
+
+## How do I get involved?
+
+You can reach SIG Node by several means:
+- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
+- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
+- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)
+
+You can also contact us directly:
+- GitHub: @rata @giuseppe @saschagrunert
+- Slack: @rata @giuseppe @sascha