@@ -10,21 +10,6 @@ associated problem space.
1010point that out explicitly and clearly in the associated patches and Cc
1111` Christian Brauner <brauner (at) kernel (dot) org> ` .**
1212
13- * [x] Ability to unmount obstructed mounts. (This means: you have a stack
14- of mounts on the very same inode, and you want to remove a mount in
15- the middle. Right now, you can only remove the topmost mount.)
16-
17- ** π Instead of the ability to unmount obstructed mounts we gained
18- the ability to mount beneath an existing mount, with mostly
19- equivalent outcome. `6ac392815628f317fcfdca1a39df00b9cc4ebc8b
20- ("fs: allow to mount beneath top mount")` π**
21-
22- ** Use-Case:** this is useful for replacing mounts atomically, for
23- example for upgrading versioned disk images: first an old version
24- of the image is mounted. Then a new version is mounted over the
25- existing mount point, and then the lower mount point is
26- removed. One such software would be ` systemd-sysext ` .
27-
2813* Ability to mount sub-directories of regular file systems instead of
2914 the top-level directory. i.e. for a file system ` /dev/sda1 ` which
3015 contains a sub-directory ` /foobar ` mount ` /foobar ` without having
@@ -77,17 +62,6 @@ point that out explicitly and clearly in the associated patches and Cc
7762 log and immediately exit, the cgroup information frequently cannot
7863 be acquired anymore by ` systemd-journald ` .
7964
80- * [x] ` SCM_PIDFD ` or similar auxiliary socket message, that is a modern
81- version of the ` SCM_CREDS ` message's ` .pid ` field, and provides a
82- ` pidfd ` file descriptor to the originating peer process.
83-
84- ** π ` 5e2ff6704a275be00 ("scm: add SO_PASSPIDFD and SCM_PIDFD") ` π**
85-
86- ** Use-Case:** security infrastructure (such as PolicyKit) can safely
87- reference clients this way without fearing PID
88- recycling. ` systemd-journald ` can acquire peer metadata this way in
89- a less racy fashion, in particular safe against PID recycling.
90-
9165* Ability to link an ` O_TMPFILE ` file into a directory while * replacing* an
9266 existing file. (Currently there's only the ability to link it in, if the
9367 file name doesn't exist yet.)
@@ -115,23 +89,6 @@ point that out explicitly and clearly in the associated patches and Cc
11589 pointed to ` file://dev/zero ` , not expecting an endless amount of
11690 data to read.
11791
118- * [x] ` IP_UNICAST_IF ` should be taken into account for routing decisions
119- at UDP ` connect() ` time (currently it isn't, only ` SO_BINDTOIFINDEX `
120- is, but that does so much more than just that, and one often
121- doesn't want that)
122-
123- ** π `0e4d354762cefd3e16b4cff8988ff276e45effc4 ("net-next: Fix
124- IP_UNICAST_IF option behavior for connected sockets")` π**
125-
126- ** Use-Case:** DNS resolvers that associate DNS configuration with
127- specific network interfaces (example: ` systemd-resolved ` ) typically
128- want to preferably route DNS traffic to the per-interface DNS
129- server via that interface, but not make further restrictions on the
130- origins of received replies, and all that without
131- privileges. ` IP_UNICAST_IF ` fulfills this role fine for TCP, but
132- for UDP it is not taken into account for the ` connect() ` routing
133- decision.
134-
13592* ` unlinkat3(dir_fd, name, inode_fd) ` : taking one file descriptor
13693 for the directory to remove a file in, and another one referring
13794 to the inode of the filename to remove. This call should only
@@ -354,33 +311,6 @@ point that out explicitly and clearly in the associated patches and Cc
354311 ** Use-Case:** block services or containers from re-opening/upgrading an
355312 ` O_PATH ` file descriptor through e.g. ` /proc/<pid>/fd/<nr> ` as ` O_WRONLY ` .
356313
357- * [x] Implement a mount-specific companion to ` statx() ` that puts at least the
358- following information into ` struct mount_info ` :
359-
360- ** π ` 46eae99ef73302f9fb3dddcd67c374b3dffe8fd6 ("add statmount(2) syscall") ` π**
361-
362- * mount flags: ` MOUNT_ATTR_RDONLY ` , ...
363- * time flags: ` MOUNT_ATTR_RELATIME ` , ...
364- Could probably be combined with mount flags.
365- * propagation setting: ` MS_SHARED ` , ...
366- * peer group
367- * mnt id of the mount
368- * mnt id of the mount's parent
369- * owning userns
370-
371- There's a bit more advanced stuff systemd would really want but which
372- I think is misplaced in a mountinfo system call including:
373- * list of primary and auxiliary block device major/minor
374- * diskseq value of those device nodes (This is a new block device feature
375- we added that allows preventing device recycling issues when e.g.
376- removing usb devices very quickly and is needed for udev.)
377- * uuid/fsid
378- * feature flags (` O_TMPFILE ` , ` RENAME_EXCHANGE ` supported etc.)
379-
380- ** Use-Case:** low-level userspace tools have to interact with advanced
381- mount information constantly. This is currently costly and brittle because
382- they have to go and parse ` /proc/<pid>/mountinfo ` .
383-
384314* Make quotas work with user namespaces. The quota codepaths in the kernel
385315 are currently broken and inconsistent, and most interesting operations are
386316 guarded behind ` capable(CAP_SYS_ADMIN) ` , i.e., require ` CAP_SYS_ADMIN ` in
@@ -449,12 +379,6 @@ point that out explicitly and clearly in the associated patches and Cc
449379 ** Use-Case:** Allow LSMs to make decisions about what mount properties to
450380 allow and what to deny.
451381
452- * [x] (kAPI) Add security hook to ` create_user_ns() ` .
453-
454- ** π ` 7cd4c5c2101c ("security, lsm: Introduce security_create_user_ns()") ` π**
455-
456- ** Use-Case:** Allow LSMs to monitor user namespace creation.
457-
458382* A per-cgroup knob for coredump sizes. Currently coredump size
459383 control is strictly per process, and primarily under control of
460384 the processes themselves. It would be good if we had a per-cgroup
@@ -522,27 +446,6 @@ point that out explicitly and clearly in the associated patches and Cc
522446 ** Use-Case:** Allow mounting images inside nspawn containers, and using
523447 RootImage= and friends in the systemd user manager.
524448
525- * [x] Support idmapped mounts for tmpfs
526-
527- ** π ` 7a80e5b8c6fa ("shmem: support idmapped mounts for tmpfs") ` π**
528-
529- ** Use-Case:** Runtimes such as Kubernetes use a lot of ` tmpfs ` mounts of
530- individual files or directories to expose information to containers/pods.
531- Instead of having to change ownership permanently allow them to use an
532- idmapped mount instead.
533-
534- @rata and @giuseppe brought this suggestion forward. For Kubernetes it is
535- sufficient to support idmapped mounts of ` tmpfs ` instances mounted in the
536- initial user namespace. However, in the future idmapped
537- mounts of ` tmpfs ` instances mounted in user namespaces should be supported.
538- Other container runtimes want to make use of this. The kernel is able to
539- support this since at least ` 5.17 ` .
540-
541- Things to remember are that ` tmpfs ` mounts can serve as lower- or upper
542- layers in ` overlayfs ` and care needs to be taken that this remains safe if
543- idmapped mounts of ` tmpfs ` instances mounted in user namespaces are
544- supported.
545-
546449* Support detached mounts with ` pivot_root() `
547450
548451 The new rootfs must currently refer to an attached mount. This restriction
@@ -613,18 +516,14 @@ point that out explicitly and clearly in the associated patches and Cc
613516 system extension with a key pair that is supposed to be good for
614517 container images only.
615518
616- * [x] Make statx() on a pidfd return additional recognizable identifiers
617- in ` .stx_btime ` and ` .stx_ino ` .
519+ * Make statx() on a pidfd return additional recognizable identifiers
520+ in ` .stx_btime ` .
618521
619522 ** π ` cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs ` π**
620523
621524 It would be fantastic if issuing statx() on any pidfd would return
622525 the start time of the process in ` .stx_btime ` even after the process
623- died, plus some reasonably stable 64bit identifier for the process
624- in ` .stx_ino ` . Together these two fields would be perfect to
625- identify processes pinned by a pidfd, and compare them, as the 96++
626- bit of information they expose should be unique enough for the
627- entire lifetime of the system to identify the processes.
526+ died.
628527
629528 These fields should in particular be queriable * after* the process
630529 already exited and has been reaped, i.e. after its PID has already
@@ -780,3 +679,124 @@ point that out explicitly and clearly in the associated patches and Cc
780679 which protocols they support on them. For example D-Bus sockets
781680 could carry ` user.dbus ` set to ` 1 ` , and Varlink sockets
782681 ` user.varlink ` set to ` 1 ` and so on.
682+
683+ ## Finished Items
684+
685+ * [x] Ability to unmount obstructed mounts. (This means: you have a stack
686+ of mounts on the very same inode, and you want to remove a mount in
687+ the middle. Right now, you can only remove the topmost mount.)
688+
689+ ** π Instead of the ability to unmount obstructed mounts we gained
690+ the ability to mount beneath an existing mount, with mostly
691+ equivalent outcome. `6ac392815628f317fcfdca1a39df00b9cc4ebc8b
692+ ("fs: allow to mount beneath top mount")` π**
693+
694+ ** Use-Case:** this is useful for replacing mounts atomically, for
695+ example for upgrading versioned disk images: first an old version
696+ of the image is mounted. Then a new version is mounted over the
697+ existing mount point, and then the lower mount point is
698+ removed. One such software would be ` systemd-sysext ` .
699+
700+ * [x] ` SCM_PIDFD ` or similar auxiliary socket message, that is a modern
701+ version of the ` SCM_CREDS ` message's ` .pid ` field, and provides a
702+ ` pidfd ` file descriptor to the originating peer process.
703+
704+ ** π ` 5e2ff6704a275be00 ("scm: add SO_PASSPIDFD and SCM_PIDFD") ` π**
705+
706+ ** Use-Case:** security infrastructure (such as PolicyKit) can safely
707+ reference clients this way without fearing PID
708+ recycling. ` systemd-journald ` can acquire peer metadata this way in
709+ a less racy fashion, in particular safe against PID recycling.
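A minimal sketch of how a receiver might pick the peer's pidfd out of the ancillary data, assuming a Linux kernel that carries the commit above. The numeric values of `SO_PASSPIDFD` and `SCM_PIDFD` are the asm-generic ones and are an assumption here (some architectures use different socket option numbers); the helper names are hypothetical.

```python
import socket
import struct

# Assumed Linux UAPI values (asm-generic). Python's socket module may not
# export these names, so they are spelled out here.
SO_PASSPIDFD = 76
SCM_PIDFD = 0x04

_INT_SIZE = struct.calcsize("i")


def extract_pidfd(ancdata):
    """Return the pidfd carried in an SCM_PIDFD control message, or None."""
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == SCM_PIDFD:
            (fd,) = struct.unpack("i", data[:_INT_SIZE])
            return fd
    return None


def enable_pidfd_passing(sock):
    """Ask the kernel to attach the sender's pidfd to received messages."""
    sock.setsockopt(socket.SOL_SOCKET, SO_PASSPIDFD, 1)


def recv_with_pidfd(sock, bufsize=4096):
    """Receive one message; return (payload, pidfd or None)."""
    data, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_SPACE(_INT_SIZE)
    )
    return data, extract_pidfd(ancdata)
```

The receiver would call `enable_pidfd_passing()` on its `AF_UNIX` socket before the peer sends, and must `os.close()` any pidfd it gets back.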
710+
711+ * [x] ` IP_UNICAST_IF ` should be taken into account for routing decisions
712+ at UDP ` connect() ` time (currently it isn't, only ` SO_BINDTOIFINDEX `
713+ is, but that does so much more than just that, and one often
714+ doesn't want that)
715+
716+ ** π `0e4d354762cefd3e16b4cff8988ff276e45effc4 ("net-next: Fix
717+ IP_UNICAST_IF option behavior for connected sockets")` π**
718+
719+ ** Use-Case:** DNS resolvers that associate DNS configuration with
720+ specific network interfaces (example: ` systemd-resolved ` ) typically
721+ want to preferably route DNS traffic to the per-interface DNS
722+ server via that interface, but not make further restrictions on the
723+ origins of received replies, and all that without
724+ privileges. ` IP_UNICAST_IF ` fulfills this role fine for TCP, but
725+ for UDP it is not taken into account for the ` connect() ` routing
726+ decision.
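A rough sketch of the resolver-side usage described above: set `IP_UNICAST_IF` on a UDP socket, then `connect()` it to the per-interface DNS server. The constant value is taken from `<linux/in.h>` and is spelled out as an assumption because Python's `socket` module may not export it; `connect_udp_via` and `unicast_if_optval` are hypothetical helper names.

```python
import socket
import struct

# Assumed Linux value from <linux/in.h>.
IP_UNICAST_IF = 50


def unicast_if_optval(ifindex):
    """Encode an interface index as IP_UNICAST_IF expects it (network byte order)."""
    return struct.pack("!I", ifindex)


def connect_udp_via(ifname, server, port=53):
    """Return a UDP socket connected to server, preferring interface ifname."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.setsockopt(
            socket.IPPROTO_IP,
            IP_UNICAST_IF,
            unicast_if_optval(socket.if_nametoindex(ifname)),
        )
        # With the fix referenced above, the routing decision made here
        # should take the IP_UNICAST_IF setting into account.
        sock.connect((server, port))
    except OSError:
        sock.close()
        raise
    return sock
```

Unlike `SO_BINDTOIFINDEX`, this only steers outgoing unicast traffic and does not restrict which interface replies may arrive on.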
727+
728+ * [x] Implement a mount-specific companion to ` statx() ` that puts at least the
729+ following information into ` struct mount_info ` :
730+
731+ ** π ` 46eae99ef73302f9fb3dddcd67c374b3dffe8fd6 ("add statmount(2) syscall") ` π**
732+
733+ * mount flags: ` MOUNT_ATTR_RDONLY ` , ...
734+ * time flags: ` MOUNT_ATTR_RELATIME ` , ...
735+ Could probably be combined with mount flags.
736+ * propagation setting: ` MS_SHARED ` , ...
737+ * peer group
738+ * mnt id of the mount
739+ * mnt id of the mount's parent
740+ * owning userns
741+
742+ There's a bit more advanced stuff systemd would really want but which
743+ I think is misplaced in a mountinfo system call including:
744+ * list of primary and auxiliary block device major/minor
745+ * diskseq value of those device nodes (This is a new block device feature
746+ we added that allows preventing device recycling issues when e.g.
747+ removing usb devices very quickly and is needed for udev.)
748+ * uuid/fsid
749+ * feature flags (` O_TMPFILE ` , ` RENAME_EXCHANGE ` supported etc.)
750+
751+ ** Use-Case:** low-level userspace tools have to interact with advanced
752+ mount information constantly. This is currently costly and brittle because
753+ they have to go and parse ` /proc/<pid>/mountinfo ` .
754+
755+ * [x] (kAPI) Add security hook to ` create_user_ns() ` .
756+
757+ ** π ` 7cd4c5c2101c ("security, lsm: Introduce security_create_user_ns()") ` π**
758+
759+ ** Use-Case:** Allow LSMs to monitor user namespace creation.
760+
761+ * [x] Support idmapped mounts for tmpfs
762+
763+ ** π ` 7a80e5b8c6fa ("shmem: support idmapped mounts for tmpfs") ` π**
764+
765+ ** Use-Case:** Runtimes such as Kubernetes use a lot of ` tmpfs ` mounts of
766+ individual files or directories to expose information to containers/pods.
767+ Instead of having to change ownership permanently allow them to use an
768+ idmapped mount instead.
769+
770+ @rata and @giuseppe brought this suggestion forward. For Kubernetes it is
771+ sufficient to support idmapped mounts of ` tmpfs ` instances mounted in the
772+ initial user namespace. However, in the future idmapped
773+ mounts of ` tmpfs ` instances mounted in user namespaces should be supported.
774+ Other container runtimes want to make use of this. The kernel is able to
775+ support this since at least ` 5.17 ` .
776+
777+ Things to remember are that ` tmpfs ` mounts can serve as lower- or upper
778+ layers in ` overlayfs ` and care needs to be taken that this remains safe if
779+ idmapped mounts of ` tmpfs ` instances mounted in user namespaces are
780+ supported.
781+
782+ * [x] Make statx() on a pidfd return additional recognizable identifiers
783+ in ` .stx_ino ` .
784+
785+ ** π ` cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs ` π**
786+
787+ It would be fantastic if issuing statx() on any pidfd would return some
788+ reasonably stable 64bit identifier for the process in ` .stx_ino ` . This would
789+ be perfect to identify processes pinned by a pidfd, and compare them.
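As an illustration of the comparison this enables, here is a small sketch that treats the inode number reported for a file descriptor as its identity. It uses `os.fstat()` rather than calling `statx()` directly, on the assumption that both report the same inode number for a pidfd; the helper names are hypothetical. The result is only meaningful for pidfds on kernels that include the pidfs commit above, since older kernels may report the same inode for every pidfd.

```python
import os


def fd_identity(fd):
    """Return (st_dev, st_ino) for the object an open file descriptor refers to."""
    st = os.fstat(fd)
    return (st.st_dev, st.st_ino)


def same_object(fd_a, fd_b):
    """True if both descriptors report the same device and inode number."""
    return fd_identity(fd_a) == fd_identity(fd_b)
```

A caller holding two pidfds, for example from `os.pidfd_open()`, could pass them to `same_object()` to check whether they pin the same process.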
790+
791+ * [x] Namespace ` binfmt_misc ` filesystem.
792+
793+ ** π ` 21ca59b365c0 ("binfmt_misc: enable sandboxed mounts") ` π**
794+
795+ ** Use-Case:** Allow containers and sandboxes to register their own binfmt
796+ handlers.
797+
798+ * [x] Support idmapped mounts for ` overlayfs `
799+
800+ ** π ` bc70682a497c ("ovl: support idmapped layers") ` π**
801+
802+ ** Use-Case:** Allow containers to use ` overlayfs ` with idmapped mounts.