@@ -16,6 +16,16 @@ associated problem space.
 
 ## In-Progress
 
+### Create empty mount namespaces via `unshare(UNSHARE_EMPTY_MNTNS)` and `clone3(CLONE_EMPTY_MNTNS)`
+
+Now that `nullfs` is supported, it is trivial to allow the creation of
+completely empty mount namespaces, i.e., mount namespaces that contain
+only the `nullfs` mount located at their root.
+
+**Use-Case:** This allows tasks to be isolated in completely empty mount
+namespaces. It also lets the caller avoid copying its current mount
+table, which is useless in the majority of container workloads.
+
 ### Ability to put user xattrs on `S_IFSOCK` socket entrypoint inodes in the file system
 
 Currently, the kernel only allows extended attributes in the
@@ -95,6 +105,74 @@ This creates a mount namespace where "wootwoot" has become the rootfs. The
 caller can `setns()` into this new mount namespace and assemble additional
 mounts without copying and destroying the entire parent mount table.
 
+### Add immutable rootfs (`nullfs`)
+
+Currently `pivot_root()` doesn't work on the real rootfs because it
+cannot be unmounted. Userspace has to recursively remove the initramfs
+contents manually before continuing the boot.
+
+Add an immutable rootfs called `nullfs` that serves as the parent mount
+for anything that is actually useful, such as the tmpfs or ramfs for
+initramfs unpacking or the rootfs itself. The kernel mounts a
+tmpfs/ramfs on top of it, unpacks the initramfs, and fires up userspace,
+which mounts the rootfs and can then simply do:
+
+```c
+chdir(rootfs);
+pivot_root(".", ".");
+umount2(".", MNT_DETACH);
+```
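A minimal sketch of the sequence above with error handling added. The helper name `switch_to_rootfs()` is hypothetical and not part of any kernel or libc API. `pivot_root()` is invoked through `syscall()` because glibc provides no wrapper for it. The sketch assumes the caller holds `CAP_SYS_ADMIN` over its mount namespace and that `rootfs` is already a mount point:

```c
#define _GNU_SOURCE
#include <sys/mount.h>    /* umount2(), MNT_DETACH */
#include <sys/syscall.h>  /* SYS_pivot_root */
#include <unistd.h>       /* chdir(), syscall() */

/*
 * Hypothetical helper (illustration only): switch to the mounted rootfs
 * at `rootfs`. Returns 0 on success, -1 with errno set on failure.
 */
int switch_to_rootfs(const char *rootfs)
{
    /* Enter the new root so that "." refers to it in the calls below. */
    if (chdir(rootfs) < 0)
        return -1;

    /* Make "." the new root; the old root ends up stacked on top of it. */
    if (syscall(SYS_pivot_root, ".", ".") < 0)
        return -1;

    /* Lazily detach the old root that now covers the new one. */
    if (umount2(".", MNT_DETACH) < 0)
        return -1;

    return 0;
}
```

On failure the helper returns early, so a real caller would still need to decide how to recover if the working directory has already changed.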
+
+This also means that the rootfs mount in unprivileged namespaces doesn't
+need to become `MNT_LOCKED` anymore, as it's guaranteed that the
+immutable rootfs remains permanently empty, so nothing can be
+revealed by unmounting the covering mount.
+
+**Use-Case:** Simplifies the boot process by enabling `pivot_root()` to
+work directly on the real rootfs. Removes the need for traditional
+`switch_root` workarounds. In the future this also allows us to create
+completely empty mount namespaces without risk of leaking anything.
+
+### Allow `MOVE_MOUNT_BENEATH` on the rootfs
+
+Allow `MOVE_MOUNT_BENEATH` to target the caller's rootfs, enabling
+root-switching without `pivot_root(2)`. The traditional approach to
+switching the rootfs involves `pivot_root(2)` or a `chroot_fs_refs()`-based
+mechanism that atomically updates `fs->root` for all tasks sharing the
+same `fs_struct`. This has consequences for `fork()`, `unshare(CLONE_FS)`,
+and `setns()`.
+
+Instead, decompose root-switching into individually atomic, locally-scoped
+steps:
+
+```c
+fd_tree = open_tree(-EBADF, "/newroot",
+                    OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+fchdir(fd_tree);
+move_mount(fd_tree, "", AT_FDCWD, "/",
+           MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH);
+chroot(".");
+umount2(".", MNT_DETACH);
+```
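The steps above could be wrapped with error checking roughly as follows. This is a sketch under stated assumptions, not a reference implementation: the helper name `switch_root_beneath()` is hypothetical, the syscalls are issued through `syscall()` so the code does not depend on a recent glibc, and the fallback flag definitions are only used when the installed headers lack them. It only succeeds on a kernel that permits `MOVE_MOUNT_BENEATH` on the rootfs and for a caller with `CAP_SYS_ADMIN`:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>        /* AT_FDCWD, O_CLOEXEC */
#include <sys/mount.h>    /* umount2(), MNT_DETACH */
#include <sys/syscall.h>  /* SYS_open_tree, SYS_move_mount */
#include <unistd.h>

/* Fallbacks for older headers; values match the Linux UAPI definitions. */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200
#endif

/*
 * Hypothetical helper (illustration only): replace the caller's rootfs
 * with a detached clone of `newroot`. Returns 0 on success, -1 with
 * errno set on failure.
 */
int switch_root_beneath(const char *newroot)
{
    int saved_errno;
    int fd_tree;

    /* Step 1: create a detached copy of the mount at newroot. */
    fd_tree = (int)syscall(SYS_open_tree, -EBADF, newroot,
                           OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
    if (fd_tree < 0)
        return -1;

    /* Step 2: make the detached tree the current working directory. */
    if (fchdir(fd_tree) < 0)
        goto fail;

    /* Step 3: attach the tree beneath the current rootfs mount. */
    if (syscall(SYS_move_mount, fd_tree, "", AT_FDCWD, "/",
                MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH) < 0)
        goto fail;

    /* Step 4: switch this task's root to the new tree. */
    if (chroot(".") < 0)
        goto fail;

    /* Step 5: lazily detach the old rootfs that still sits on top. */
    if (umount2(".", MNT_DETACH) < 0)
        goto fail;

    close(fd_tree);
    return 0;

fail:
    saved_errno = errno;  /* keep the original error across close() */
    close(fd_tree);
    errno = saved_errno;
    return -1;
}
```

The helper does not undo earlier steps when a later one fails, so the working directory or root may already have changed when it returns -1.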
+
+Since each step only modifies the caller's own state, the
+`fork()`/`unshare()`/`setns()` races are eliminated by design.
+
+To make this work, `MNT_LOCKED` is transferred from the top mount to the
+mount beneath. The new mount takes over the job of protecting the parent
+mount from being revealed. This also makes it possible to safely modify
+an inherited mount table after `unshare(CLONE_NEWUSER | CLONE_NEWNS)`:
+
+```sh
+mount --beneath -t tmpfs tmpfs /proc
+umount -l /proc
+```
+
+**Use-Case:** Containers created with `unshare(CLONE_NEWUSER | CLONE_NEWNS)`
+can reshuffle an inherited mount table safely. `MOVE_MOUNT_BENEATH` on the
+rootfs makes it possible to switch out the rootfs without the costly
+`pivot_root(2)` and without cross-namespace vulnerabilities.
+
 ### Query mount information via file descriptor with `statmount()`
 
 Extend `struct mnt_id_req` to accept a file descriptor and introduce