@@ -10,32 +10,6 @@ associated problem space.
1010point that out explicitly and clearly in the associated patches and Cc
1111` Christian Brauner <brauner (at) kernel (dot) org ` .**
1212
13- ### Mount a subdirectory instead of the top-level directory
14-
15- [ x] Mount a subdirectory instead of the top-level directory
16-
17- Ability to mount a subdirectory of a regular file system instead of
18- the top-level directory. E.e. for a file system ` /dev/sda1 ` which
19- contains a sub-directory ` /foobar ` mount ` /foobar ` without having
20- to mount its parent directory first. Consider something like this:
21-
22- ```
23- mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar
24- ```
25-
26- (This is of course already possible via some mount namespacing
27- shenanigans, but this requires namespacing to be available, and is
28- not precisely obvious to implement. Explicit kernel support at mount
29- time would be much preferable.)
30-
31- ** 🙇 ` c5c12f871a30 ("fs: create detached mounts from detached mounts") ` 🙇**
32-
33- ** Use-Case:** ` systemd-homed ` currently mounts a sub-directory of
34- the per-user LUKS volume as the user's home directory (and not the
35- root directory of the per-user LUKS volume's file system!), and in
36- order to implement this invisibly from the host side requires a
37- complex mount namespace exercise.
38-
3913### inotify() events for BSD file locks
4014
4115BSD file locks (i.e. ` flock() ` , as opposed to POSIX ` F_SETLK ` and
@@ -247,26 +221,6 @@ to use `pidfd`s to remove PID recycling security issues, but
247221currently cannot as it also needs to generically wait for such
248222unexpected children.
249223
250- ### Mount notifications without rescanning of ` /proc/self/mountinfo `
251-
252- [ x] Mount notifications without rescanning of ` /proc/self/mountinfo `
253-
254- Mount notifications that do not require continuous rescanning of
255- ` /proc/self/mountinfo ` . Currently, if a program wants to track
256- mounts established on the system it can receive ` poll() ` able
257- events via a file descriptor to ` /proc/self/mountinfo ` . When
258- receiving them it needs to rescan the file from the top and
259- compare it with the previous scan. This is both slow and
260- racy. It's slow on systems with a large number of mounts as the
261- cost for re-scanning the table has to be paid for every change to
262- the mount table. It's racy because quickly added and removed
263- mounts might not be noticed.
264-
265- ** 🙇 ` 0f46d81f2bce ("fanotify: notify on mount attach and detach") ` 🙇**
266-
267- ** Use-Case:** ` systemd ` tracks the mount table to integrate the mounts
268- into it own dependency management.
269-
270224### Asynchronous ` close() `
271225
272226An asynchronous or forced ` close() ` , that guarantees that
@@ -375,51 +329,6 @@ user namespace. But this doesn't just lock a single mount or mount subtree
375329it locks all mounts in the mount namespace, i.e., the mount table cannot be
376330altered.
377331
378- ### Allow creating idmapped mounts from idmapped mounts
379-
380- [ x] Allow creating idmapped mounts from idmapped mounts
381-
382- Add a new ` OPEN_TREE_CLEAR ` flag to ` open_tree() ` that can only be
383- used in conjunction with ` OPEN_TREE_CLONE ` . When specified it will clear
384- all mount properties from that mount including the mount's idmapping.
385- Requires the caller to be ` ns_capable(mntns->user_ns) ` . If idmapped mounts
386- are encountered the caller must be ` ns_capable(sb->user_ns, CAP_SYS_ADMIN) `
387- in the filesystems user namespace.
388-
389- Locked mount properties cannot be changed. A mount's idmapping becomes
390- locked if it propagates across user namespaces.
391-
392- This is useful to get a new, clear mount and also allows the caller to
393- create a new detached mount with an idmapping attached to the mount. Iow,
394- the caller may idmap the mount afterwards.
395-
396- ** 🙇 ` c4a16820d901 ("fs: add open_tree_attr()") ` 🙇**
397-
398- ** Use-Case:** A user may already use an idmapped mount for their home
399- directory. And once a mount has been idmapped the idmapping cannot be
400- changed anymore. This allows for simple semantics and allows to avoid
401- lifetime complexity in order to account for scenarios where concurrent
402- readers or writers might still use a given user namespace while it is about
403- to be changed.
404- But this poses a problem when the user wants to attach an idmapping to
405- a mount that is already idmapped. The new flag allows to solve this
406- problem. A sufficiently privileged user such as a container manager can
407- create a user namespace for the container which expresses the desired
408- ownership. Then they can create a new detached mount without any prior
409- mount properties via OPEN_TREE_CLEAR and then attach the idmapping to this
410- mount.
411-
412- ### Require a user namespace to have an idmapping when attached
413-
414- [ x] Require a user namespace to have an idmapping when attached
415-
416- Enforce that the user namespace about to be attached to a mount must
417- have an idmapping written.
418-
419- ** 🙇 ` dacfd001eaf2 ("fs/mnt_idmapping.c: Return -EINVAL when no map is written") ` 🙇**
420-
421- ** Use-Case:** Tighten the semantics.
422-
423332### Extend ` setns() ` to allow attaching to all new namespaces of a process
424333
425334Add an extension to ` setns() ` to allow attaching to all namespaces of
@@ -591,77 +500,6 @@ different sources and it should not be possible to generate a
591500system extension with a key pair that is supposed to be good for
592501container images only.
593502
594- ### Make statx() on a pidfd return additional info
595-
596- Make statx() on a pidfd return additional recognizable identifiers in
597- ` .stx_btime ` .
598-
599- ** 🙇 ` cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs ` 🙇**
600-
601- It would be fantastic if issuing statx() on any pidfd would return
602- the start time of the process in ` .stx_btime ` even after the process
603- died.
604-
605- These fields should in particular be queriable * after* the process
606- already exited and has been reaped, i.e. after its PID has already
607- been recycled.
608-
609- ** Usecase:** In systemd we maintain lists of processes in a hash
610- table. Right now, the key is the PID, but this is less than ideal
611- because of PID recycling. By being able to use the ` .stx_btime `
612- and/or ` .stx_ino ` fields instead would be perfect to safely
613- identify, track and compare process even after they ceased to exist.
614-
615- ### API to determine the parent process ID of a pidfd
616-
617- [ x] API to determine the parent process ID of a pidfd
618-
619- An API to determine the parent process ID (ppid) of a pidfd would be
620- good.
621-
622- This information is relevant to code dealing with pidfds, since if
623- the ppid of a pidfd matches the process own pid it can call
624- ` waitid() ` on the process, if it doesn't it cannot and such a call
625- would fail. It would be very useful if this could be determined
626- easily before even calling that syscall.
627-
628- ** 🙇 ` cdda1f26e74b ("pidfd: add ioctl to retrieve pid info") ` 🙇**
629-
630- ** Usecase:** systemd manages a multitude of processes, most of which
631- are its own children, but many which are not. It would be great if
632- we could easily determine whether it is worth waiting for
633- ` SIGCHLD ` /` waitid() ` on them or whether waiting for ` POLLIN ` on
634- them is the only way to get exit notification.
635-
636- ### Set ` comm ` field before ` exec() `
637-
638- [ x] Set ` comm ` field before ` exec() `
639-
640- There should be a way to control the process' ` comm ` field if
641- started via ` fexecve() ` /` execveat() ` .
642-
643- Right now, when ` fexecve() ` /` execveat() ` is used, the ` comm ` field
644- (i.e. ` /proc/self/comm ` ) contains a name derived of the numeric fd,
645- which breaks ` ps -C … ` and various other tools. In particular when
646- the fd was opened with ` O_CLOEXEC ` , the number of the fd in the old
647- process is completely meaningless.
648-
649- The goal is add a way to tell ` fexecve() ` /` execveat() ` what Name to use.
650-
651- Since ` comm ` is under user control anyway (via ` PR_SET_NAME ` ), it
652- should be safe to also make it somehow configurable at fexecve()
653- time.
654-
655- See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294 ,
656- https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81 .
657-
658- ** 🙇 ` 543841d18060 ("exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case") ` 🙇**
659-
660- ** Usecase:** In systemd we generally would prefer using ` fexecve() `
661- to safely and race-freely invoke processes, but the fact that ` comm `
662- is useless after invoking a process that way makes the call
663- unfortunately hard to use for systemd.
664-
665503### Path-based ACL management in an LSM hook
666504
667505The LSM module API should have the ability to do path-based (not
@@ -790,16 +628,6 @@ Add an option to go from individual thread to thread-group leader.
790628** Use-Case:** Allow for a race free way to go from individual thread
791629to thread-group leader pidfd.
792630
793- ### Namespace ioctl to translate a PID between PID namespaces
794-
795- [ x] Namespace ioctl to translate a PID between PID namespaces
796-
797- ** 🙇 ` ca567df74a28a9fb368c6b2d93e864113f73f5c2 ("nsfs: add pid translation ioctls") ` 🙇**
798-
799- ** Use-Case:** This makes it possible to e.g., figure out what a given PID in
800- a PID namespace corresponds to in the caller's PID namespace. For example, to
801- figure out what the PID of PID 1 inside of a given PID namespace is.
802-
803631### Useful handling of LSM denials on SCM_RIGHTS
804632
805633Right now if some LSM such as SELinux denies an ` AF_UNIX ` socket peer
@@ -839,6 +667,177 @@ on received messages.
839667
840668## Finished Items
841669
670+ ### Namespace ioctl to translate a PID between PID namespaces
671+
672+ [ x] Namespace ioctl to translate a PID between PID namespaces
673+
674+ ** 🙇 ` ca567df74a28a9fb368c6b2d93e864113f73f5c2 ("nsfs: add pid translation ioctls") ` 🙇**
675+
676+ ** Use-Case:** This makes it possible to e.g., figure out what a given PID in
677+ a PID namespace corresponds to in the caller's PID namespace. For example, to
678+ figure out what the PID of PID 1 inside of a given PID namespace is.
679+
680+ ### API to determine the parent process ID of a pidfd
681+
682+ [ x] API to determine the parent process ID of a pidfd
683+
684+ An API to determine the parent process ID (ppid) of a pidfd would be
685+ good.
686+
687+ This information is relevant to code dealing with pidfds, since if
688+ the ppid of a pidfd matches the process's own pid it can call
689+ ` waitid() ` on the process, if it doesn't it cannot and such a call
690+ would fail. It would be very useful if this could be determined
691+ easily before even calling that syscall.
692+
693+ ** 🙇 ` cdda1f26e74b ("pidfd: add ioctl to retrieve pid info") ` 🙇**
694+
695+ ** Usecase:** systemd manages a multitude of processes, most of which
696+ are its own children, but many which are not. It would be great if
697+ we could easily determine whether it is worth waiting for
698+ ` SIGCHLD ` /` waitid() ` on them or whether waiting for ` POLLIN ` on
699+ them is the only way to get exit notification.
700+
701+ ### Set ` comm ` field before ` exec() `
702+
703+ [ x] Set ` comm ` field before ` exec() `
704+
705+ There should be a way to control the process' ` comm ` field if
706+ started via ` fexecve() ` /` execveat() ` .
707+
708+ Right now, when ` fexecve() ` /` execveat() ` is used, the ` comm ` field
709+ (i.e. ` /proc/self/comm ` ) contains a name derived of the numeric fd,
710+ which breaks ` ps -C … ` and various other tools. In particular when
711+ the fd was opened with ` O_CLOEXEC ` , the number of the fd in the old
712+ process is completely meaningless.
713+
714+ The goal is to add a way to tell ` fexecve() ` /` execveat() ` what name to use.
715+
716+ Since ` comm ` is under user control anyway (via ` PR_SET_NAME ` ), it
717+ should be safe to also make it somehow configurable at fexecve()
718+ time.
719+
720+ See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294 ,
721+ https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81 .
722+
723+ ** 🙇 ` 543841d18060 ("exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case") ` 🙇**
724+
725+ ** Usecase:** In systemd we generally would prefer using ` fexecve() `
726+ to safely and race-freely invoke processes, but the fact that ` comm `
727+ is useless after invoking a process that way makes the call
728+ unfortunately hard to use for systemd.
729+ ### Make statx() on a pidfd return additional info
730+
731+ Make statx() on a pidfd return additional recognizable identifiers in
732+ ` .stx_btime ` .
733+
734+ ** 🙇 ` cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs ` 🙇**
735+
736+ It would be fantastic if issuing statx() on any pidfd would return
737+ the start time of the process in ` .stx_btime ` even after the process
738+ died.
739+
740+ These fields should in particular be queriable * after* the process
741+ already exited and has been reaped, i.e. after its PID has already
742+ been recycled.
743+
744+ ** Usecase:** In systemd we maintain lists of processes in a hash
745+ table. Right now, the key is the PID, but this is less than ideal
746+ because of PID recycling. Being able to use the ` .stx_btime `
747+ and/or ` .stx_ino ` fields instead would be perfect to safely
748+ identify, track and compare processes even after they ceased to exist.
749+
750+ ### Allow creating idmapped mounts from idmapped mounts
751+
752+ [ x] Allow creating idmapped mounts from idmapped mounts
753+
754+ Add a new ` OPEN_TREE_CLEAR ` flag to ` open_tree() ` that can only be
755+ used in conjunction with ` OPEN_TREE_CLONE ` . When specified it will clear
756+ all mount properties from that mount including the mount's idmapping.
757+ Requires the caller to be ` ns_capable(mntns->user_ns) ` . If idmapped mounts
758+ are encountered the caller must be ` ns_capable(sb->user_ns, CAP_SYS_ADMIN) `
759+ in the filesystem's user namespace.
760+
761+ Locked mount properties cannot be changed. A mount's idmapping becomes
762+ locked if it propagates across user namespaces.
763+
764+ This is useful to get a new, clear mount and also allows the caller to
765+ create a new detached mount with an idmapping attached to the mount. Iow,
766+ the caller may idmap the mount afterwards.
767+
768+ ** 🙇 ` c4a16820d901 ("fs: add open_tree_attr()") ` 🙇**
769+
770+ ** Use-Case:** A user may already use an idmapped mount for their home
771+ directory. And once a mount has been idmapped the idmapping cannot be
772+ changed anymore. This allows for simple semantics and allows to avoid
773+ lifetime complexity in order to account for scenarios where concurrent
774+ readers or writers might still use a given user namespace while it is about
775+ to be changed.
776+ But this poses a problem when the user wants to attach an idmapping to
777+ a mount that is already idmapped. The new flag allows to solve this
778+ problem. A sufficiently privileged user such as a container manager can
779+ create a user namespace for the container which expresses the desired
780+ ownership. Then they can create a new detached mount without any prior
781+ mount properties via OPEN_TREE_CLEAR and then attach the idmapping to this
782+ mount.
783+
784+ ### Require a user namespace to have an idmapping when attached
785+
786+ [ x] Require a user namespace to have an idmapping when attached
787+
788+ Enforce that the user namespace about to be attached to a mount must
789+ have an idmapping written.
790+
791+ ** 🙇 ` dacfd001eaf2 ("fs/mnt_idmapping.c: Return -EINVAL when no map is written") ` 🙇**
792+
793+ ** Use-Case:** Tighten the semantics.
794+
795+ ### Mount notifications without rescanning of ` /proc/self/mountinfo `
796+
797+ [ x] Mount notifications without rescanning of ` /proc/self/mountinfo `
798+
799+ Mount notifications that do not require continuous rescanning of
800+ ` /proc/self/mountinfo ` . Currently, if a program wants to track
801+ mounts established on the system it can receive ` poll() ` able
802+ events via a file descriptor to ` /proc/self/mountinfo ` . When
803+ receiving them it needs to rescan the file from the top and
804+ compare it with the previous scan. This is both slow and
805+ racy. It's slow on systems with a large number of mounts as the
806+ cost for re-scanning the table has to be paid for every change to
807+ the mount table. It's racy because quickly added and removed
808+ mounts might not be noticed.
809+
810+ ** 🙇 ` 0f46d81f2bce ("fanotify: notify on mount attach and detach") ` 🙇**
811+
812+ ** Use-Case:** ` systemd ` tracks the mount table to integrate the mounts
813+ into its own dependency management.
814+
815+ ### Mount a subdirectory instead of the top-level directory
816+
817+ [ x] Mount a subdirectory instead of the top-level directory
818+
819+ Ability to mount a subdirectory of a regular file system instead of
820+ the top-level directory. E.g. for a file system ` /dev/sda1 ` which
821+ contains a sub-directory ` /foobar ` mount ` /foobar ` without having
822+ to mount its parent directory first. Consider something like this:
823+
824+ ```
825+ mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar
826+ ```
827+
828+ (This is of course already possible via some mount namespacing
829+ shenanigans, but this requires namespacing to be available, and is
830+ not precisely obvious to implement. Explicit kernel support at mount
831+ time would be much preferable.)
832+
833+ ** 🙇 ` c5c12f871a30 ("fs: create detached mounts from detached mounts") ` 🙇**
834+
835+ ** Use-Case:** ` systemd-homed ` currently mounts a sub-directory of
836+ the per-user LUKS volume as the user's home directory (and not the
837+ root directory of the per-user LUKS volume's file system!), and in
838+ order to implement this invisibly from the host side requires a
839+ complex mount namespace exercise.
840+
842841### Unmounting of obstructed mounts
843842
844843[ x] ability to unmount obstructed mounts. (this means: you have a stack