Commit 8c92014

Merge pull request #50 from brauner/work
wishlist: add a few more entries
2 parents cc39946 + baf5a17 commit 8c92014

1 file changed: +78 -0 lines changed

README.md

Lines changed: 78 additions & 0 deletions
@@ -16,6 +16,16 @@ associated problem space.
## In-Progress

### Create empty mount namespaces via `unshare(UNSHARE_EMPTY_MNTNS)` and `clone3(CLONE_EMPTY_MNTNS)`

Now that we have support for `nullfs`, it is trivial to allow the
creation of completely empty mount namespaces, i.e., mount namespaces
that only have the `nullfs` mount located at their root.

**Use-Case:** This allows tasks to be isolated in completely empty mount
namespaces. It also allows the caller to avoid copying its current mount
table, which is useless in the majority of container workload cases.

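As a rough illustration of the difference, the sketch below contrasts today's `unshare(CLONE_NEWNS)`, which copies the caller's mount table, with the proposed flag. `UNSHARE_EMPTY_MNTNS` is taken from the heading above and is not assumed to exist in installed headers, so it is guarded by `#ifdef`. The helper names are hypothetical, and passing the flag on its own is an assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/*
 * Today: unshare(CLONE_NEWNS) gives the caller a private copy of its
 * entire current mount table.
 */
int enter_copied_mntns(void)
{
	return unshare(CLONE_NEWNS);
}

/*
 * Proposed: start from a mount namespace that only holds the nullfs
 * root. The flag is only used if the installed headers define it
 * (assumption: it can be passed to unshare() on its own).
 */
int enter_empty_mntns(void)
{
#ifdef UNSHARE_EMPTY_MNTNS
	return unshare(UNSHARE_EMPTY_MNTNS);
#else
	errno = ENOSYS;	/* headers do not know the flag yet */
	return -1;
#endif
}
```

On systems whose headers lack the flag, `enter_empty_mntns()` simply reports `ENOSYS`, so a caller could fall back to `enter_copied_mntns()`.
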
### Ability to put user xattrs on `S_IFSOCK` socket entrypoint inodes in the file system

Currently, the kernel only allows extended attributes in the
@@ -95,6 +105,74 @@ This creates a mount namespace where "wootwoot" has become the rootfs. The
caller can `setns()` into this new mount namespace and assemble additional
mounts without copying and destroying the entire parent mount table.

### Add immutable rootfs (`nullfs`)

Currently `pivot_root()` doesn't work on the real rootfs because it
cannot be unmounted. Userspace has to do a recursive removal of the
initramfs contents manually before continuing the boot.

Add an immutable rootfs called `nullfs` that serves as the parent mount
for anything that is actually useful such as the tmpfs or ramfs for
initramfs unpacking or the rootfs itself. The kernel mounts a
tmpfs/ramfs on top of it, unpacks the initramfs and fires up userspace
which mounts the rootfs and can then simply do:

```c
chdir(rootfs);
pivot_root(".", ".");
umount2(".", MNT_DETACH);
```

This also means that the rootfs mount in unprivileged namespaces doesn't
need to become `MNT_LOCKED` anymore as it's guaranteed that the
immutable rootfs remains permanently empty so there cannot be anything
revealed by unmounting the covering mount.

**Use-Case:** Simplifies the boot process by enabling `pivot_root()` to
work directly on the real rootfs. Removes the need for traditional
`switch_root` workarounds. In the future this also allows us to create
completely empty mount namespaces without risk of leaking anything.

136+
### Allow `MOVE_MOUNT_BENEATH` on the rootfs
137+
138+
Allow `MOVE_MOUNT_BENEATH` to target the caller's rootfs, enabling
139+
root-switching without `pivot_root(2)`. The traditional approach to
140+
switching the rootfs involves `pivot_root(2)` or a `chroot_fs_refs()`-based
141+
mechanism that atomically updates `fs->root` for all tasks sharing the
142+
same `fs_struct`. This has consequences for `fork()`, `unshare(CLONE_FS)`,
143+
and `setns()`.
144+
145+
Instead, decompose root-switching into individually atomic, locally-scoped
146+
steps:
147+
148+
```c
149+
fd_tree = open_tree(-EBADF, "/newroot",
150+
OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
151+
fchdir(fd_tree);
152+
move_mount(fd_tree, "", AT_FDCWD, "/",
153+
MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH);
154+
chroot(".");
155+
umount2(".", MNT_DETACH);
156+
```
157+
158+
Since each step only modifies the caller's own state, the
159+
`fork()`/`unshare()`/`setns()` races are eliminated by design.
160+
161+
To make this work, `MNT_LOCKED` is transferred from the top mount to the
162+
mount beneath. The new mount takes over the job of protecting the parent
163+
mount from being revealed. This also makes it possible to safely modify
164+
an inherited mount table after `unshare(CLONE_NEWUSER | CLONE_NEWNS)`:
165+
166+
```sh
167+
mount --beneath -t tmpfs tmpfs /proc
168+
umount -l /proc
169+
```
170+
171+
**Use-Case:** Containers created with `unshare(CLONE_NEWUSER | CLONE_NEWNS)`
172+
can reshuffle an inherited mount table safely. `MOVE_MOUNT_BENEATH` on the
173+
rootfs makes it possible to switch out the rootfs without the costly
174+
`pivot_root(2)` and without cross-namespace vulnerabilities.
175+
### Query mount information via file descriptor with `statmount()`

Extend `struct mnt_id_req` to accept a file descriptor and introduce
