Commit 93d8268

blog: Add post about userns for stateful pods
Signed-off-by: Rodrigo Campos <[email protected]>
1 parent 58c5199 commit 93d8268

1 file changed

+148
-0
lines changed
  • content/en/blog/_posts/2023-09-13-userns-stateful-pods

@@ -0,0 +1,148 @@
1+
---
2+
layout: blog
3+
title: "User Namespaces: Now Supports Running Stateful Pods in Alpha!"
4+
date: 2023-09-13
5+
slug: userns-alpha
6+
---
7+
8+
**Authors:** Rodrigo Campos Catelin (Microsoft), Giuseppe Scrivano (Red Hat), Sascha Grunert (Red Hat)
9+
10+
Kubernetes v1.25 introduced support for user namespaces, but only for stateless
11+
pods. Kubernetes 1.28 lifted that restriction, after some design changes were
12+
made in 1.27.
13+
14+
The beauty of this feature is that:
15+
* it is trivial to adopt (you just need to set a bool in the pod spec)
16+
* it doesn't need any changes for **most** applications
17+
* it improves security by _drastically_ enhancing the isolation of containers and
18+
mitigating CVEs rated HIGH and CRITICAL.
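
To make the "set a bool" step concrete: the field in question is `hostUsers` in the pod spec, and setting it to `false` asks for the pod to run in its own user namespace. The sketch below builds such a manifest as JSON, which `kubectl apply -f` accepts just like YAML. The pod name, container name, and image are placeholders chosen for illustration, not values from this post.

```python
import json


def pod_with_userns(name: str, image: str) -> dict:
    """Build a minimal Pod manifest that opts in to a user namespace."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # hostUsers: false means "do not share the host's user namespace".
            "hostUsers": False,
            # Placeholder container; any existing container spec goes here.
            "containers": [{"name": name, "image": image}],
        },
    }


# Hypothetical names, used only for this example.
manifest = pod_with_userns("userns-demo", "debian")
print(json.dumps(manifest, indent=2))
```

Nothing else in the pod spec has to change, which is what makes the feature cheap to try on a cluster where the feature gate and runtime requirements are met.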
19+
20+
This post explains the basics of user namespaces and also shows:
21+
* the changes that arrived in the recent Kubernetes v1.28 release
22+
* a **demo of a vulnerability rated as HIGH** that is not exploitable with user namespaces
23+
* the runtime requirements to use this feature
24+
* what you can expect in future releases regarding user namespaces.
25+
26+
## What is a user namespace?
27+
28+
A user namespace is a Linux feature that isolates the user and group identifiers
29+
(UIDs and GIDs) of the containers from the ones on the host. The identifiers
30+
in the container can be mapped to identifiers on the host in such a way that the
31+
host UID/GIDs used for different containers never overlap. Even more, the
32+
identifiers can be mapped to *unprivileged* non-overlapping UIDs and GIDs on the
33+
host. This basically means two things:
34+
35+
* As the UIDs and GIDs for different containers are mapped to different UIDs
36+
and GIDs on the host, containers have a harder time attacking each other even
37+
if they escape the container boundaries. For example, if container A is running
38+
with different UIDs and GIDs on the host than container B, the operations it
39+
can do on container B's files and processes are limited: it can only read/write what a
40+
file allows to others, as it will never match the file's owner or
41+
group (the UIDs/GIDs on the host are guaranteed to be different for
42+
different containers).
43+
44+
* As the UIDs and GIDs are mapped to unprivileged users on the host, if a
45+
container escapes the container boundaries, even if it is running as root
46+
inside the container, it has no privileges on the host. This greatly
47+
limits which host files it can read/write, which processes it can send signals
48+
to, etc.
49+
50+
Furthermore, capabilities granted are only valid inside the user namespace and
51+
not on the host.
52+
53+
Without a user namespace, a container running as root, in the case of a
54+
container breakout, has root privileges on the node. And if some capabilities
55+
were granted to the container, the capabilities are valid on the host too. None
56+
of this is true when using user namespaces (modulo bugs, of course 🙂).
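
The mapping described in this section can be sketched in a few lines. The kernel exposes each user namespace's mapping in `/proc/<pid>/uid_map` as lines of three numbers: first ID inside the namespace, first ID on the host, and length of the range. The host ranges below (`65536` and `131072`) are made-up values for illustration; the ranges actually assigned on a node will differ.

```python
from typing import List, NamedTuple, Optional


class IdMap(NamedTuple):
    inside: int   # first ID inside the user namespace
    outside: int  # first ID on the host
    length: int   # number of consecutive IDs in the range


def parse_id_map(text: str) -> List[IdMap]:
    """Parse the three-column format used by /proc/<pid>/uid_map."""
    return [IdMap(*map(int, line.split()))
            for line in text.splitlines() if line.strip()]


def to_host_id(maps: List[IdMap], container_id: int) -> Optional[int]:
    """Translate an ID seen inside the container to the host ID, if mapped."""
    for m in maps:
        if m.inside <= container_id < m.inside + m.length:
            return m.outside + (container_id - m.inside)
    return None  # unmapped IDs have no counterpart on the host


# Hypothetical, non-overlapping host ranges for two containers.
container_a = parse_id_map("0 65536 65536")
container_b = parse_id_map("0 131072 65536")

print(to_host_id(container_a, 0))      # 65536: root in A is unprivileged on the host
print(to_host_id(container_b, 0))      # 131072: root in B is a different host user
print(to_host_id(container_a, 65534))  # 131070
print(to_host_id(container_b, 65534))  # 196606
```

Both containers see themselves as UID 0 (or 65534), but on the host they are distinct, unprivileged users, which is why file ownership and signal permissions do not carry over between them.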
57+
58+
## Changes in 1.28
59+
60+
As already mentioned, starting from 1.28, Kubernetes supports user namespaces
61+
with stateful pods. This means that pods with user namespaces can use any type
62+
of volume; they are no longer limited to a few volume types, as they were before.
63+
64+
The feature gate to activate this feature was renamed: it is no longer
65+
`UserNamespacesStatelessPodsSupport` but from 1.28 onwards you should use
66+
`UserNamespacesSupport`. Many changes were made and the requirements on
67+
the node hosts changed, so with Kubernetes 1.28 the feature gate was renamed to
68+
reflect this.
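
If you maintain tooling that spans several Kubernetes releases, the rename can be captured with a small helper like the one below. This is only a sketch covering the releases discussed in this post (1.25 through 1.28); the function names are made up for the example, and later releases may change the gate again.

```python
def userns_feature_gate(minor: int) -> str:
    """Return the user namespaces feature gate name for Kubernetes 1.<minor>."""
    if minor < 25:
        raise ValueError("user namespaces support was introduced in Kubernetes 1.25")
    if minor < 28:
        # Name used while only stateless pods were supported.
        return "UserNamespacesStatelessPodsSupport"
    return "UserNamespacesSupport"


def feature_gates_flag(minor: int) -> str:
    """Format a --feature-gates argument that turns the gate on."""
    return f"--feature-gates={userns_feature_gate(minor)}=true"


print(feature_gates_flag(27))
print(feature_gates_flag(28))
```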
69+
70+
## Demo
71+
72+
Rodrigo created a demo that exploits [CVE-2022-0492][cve-link] and shows how
73+
the exploit can occur without user namespaces. He also shows how it is not
74+
possible to use this exploit from a Pod where the containers are using this
75+
feature.
76+
77+
This vulnerability is rated **HIGH** and allows **a container with no special
78+
privileges to read/write to any path on the host** and launch processes as root
79+
on the host too.
80+
81+
{{< youtube id="M4a2b4KkXN8" title="Mitigation of CVE-2022-0492 on Kubernetes by enabling User Namespace support">}}
82+
83+
Most applications in containers run as root today, or as a semi-predictable
84+
non-root user (user ID 65534 is a somewhat popular choice). When you run a Pod
85+
with containers using a userns, Kubernetes runs those containers as unprivileged
86+
users, with no changes needed in your app.
87+
88+
This means two containers running as user 65534 will effectively be mapped to
89+
different users on the host, limiting what they can do to each other in case of
90+
an escape, and if they are running as root, the privileges on the host are
91+
reduced to those of an unprivileged user.
92+
93+
[cve-link]: https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/
94+
95+
## Node system requirements
96+
97+
There are requirements on the Linux kernel version as well as the container
98+
runtime to use this feature.
99+
100+
For the kernel, you need Linux 6.3 or greater. This is because the feature relies on a
101+
kernel feature named idmap mounts, and support for using idmap mounts with tmpfs
102+
was merged in Linux 6.3.
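
A quick way to check the kernel side of this requirement on a node is to compare the running kernel release against 6.3. The helper below is a minimal sketch: the function name is invented for this example, it looks only at the `major.minor` prefix of the release string, and it says nothing about whether the container runtime meets its own requirements.

```python
import platform
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Return True if a kernel release string is at least major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    found = (int(match.group(1)), int(match.group(2)))
    # Tuple comparison handles cases like 6.10 > 6.3 correctly.
    return found >= (major, minor)


# platform.release() returns the same string as `uname -r`.
print(kernel_at_least(platform.release(), 6, 3))
```

Comparing the parsed numbers as a tuple avoids the classic mistake of comparing version strings lexically, where `"6.10"` would sort before `"6.3"`.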
103+
104+
If you are using CRI-O with crun, this is [supported in CRI-O
105+
1.28.1][CRIO-release] and crun 1.9 or greater. If you are using CRI-O with runc,
106+
this is not supported yet.
107+
108+
containerd support is currently targeted for containerd 2.0; it is likely that
109+
it won't matter if you use it with crun or runc.
110+
111+
Please note that containerd 1.7 added _experimental_ support for user
112+
namespaces as implemented in Kubernetes 1.25 and 1.26. The redesign done in 1.27
113+
is not supported by containerd 1.7; therefore it only works, in terms of user
114+
namespaces support, with Kubernetes 1.25 and 1.26.
115+
116+
One limitation present in containerd 1.7 is that it needs to change the
117+
ownership of every file and directory inside the container image, during Pod
118+
startup. This means it has a storage overhead and can significantly impact the
119+
container startup latency. Containerd 2.0 will probably include an implementation
120+
that will eliminate the added startup latency and the storage overhead. Take
121+
this into account if you plan to use containerd 1.7 with user namespaces in
122+
production.
123+
124+
None of these containerd limitations apply to [CRI-O 1.28][CRIO-release].
125+
126+
[CRIO-release]: https://github.com/cri-o/cri-o/releases/tag/v1.28.1
127+
128+
## What’s next?
129+
130+
Looking ahead to Kubernetes 1.29, the plan is to work with SIG Auth to integrate user
131+
namespaces into Pod Security Standards (PSS) and Pod Security Admission. For
132+
the time being, the plan is to relax checks in PSS policies when user namespaces are
133+
in use. This means that the `spec[.*].securityContext` fields `runAsUser`,
134+
`runAsNonRoot`, `allowPrivilegeEscalation` and `capabilities` will not trigger a
135+
violation if user namespaces are in use. The behavior will probably be controlled by
136+
using an API Server feature gate, like `UserNamespacesPodSecurityStandards`
137+
or similar.
138+
139+
## How do I get involved?
140+
141+
You can reach SIG Node by several means:
142+
- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
143+
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
144+
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)
145+
146+
You can also contact us directly:
147+
- GitHub: @rata @giuseppe @saschagrunert
148+
- Slack: @rata @giuseppe @sascha
