@@ -10,28 +10,6 @@ associated problem space.
1010point that out explicitly and clearly in the associated patches and Cc
1111` Christian Brauner <brauner (at) kernel (dot) org ` .**
1212
13- ### Mount a subdirectory instead of the top-level directory
14-
15- Ability to mount a subdirectory of a regular file system instead of
16- the top-level directory. E.e. for a file system ` /dev/sda1 ` which
17- contains a sub-directory ` /foobar ` mount ` /foobar ` without having
18- to mount its parent directory first. Consider something like this:
19-
20- ```
21- mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar
22- ```
23-
24- (This is of course already possible via some mount namespacing
25- shenanigans, but this requires namespacing to be available, and is
26- not precisely obvious to implement. Explicit kernel support at mount
27- time would be much preferable.)
28-
29- ** Use-Case:** ` systemd-homed ` currently mounts a sub-directory of
30- the per-user LUKS volume as the user's home directory (and not the
31- root directory of the per-user LUKS volume's file system!), and in
32- order to implement this invisibly from the host side requires a
33- complex mount namespace exercise.
34-
3513### inotify() events for BSD file locks
3614
3715BSD file locks (i.e. ` flock() ` , as opposed to POSIX ` F_SETLK ` and
@@ -243,22 +221,6 @@ to use `pidfd`s to remove PID recycling security issues, but
243221currently cannot as it also needs to generically wait for such
244222unexpected children.
245223
246- ### Mount notifications without rescanning of ` /proc/self/mountinfo `
247-
248- Mount notifications that do not require continuous rescanning of
249- ` /proc/self/mountinfo ` . Currently, if a program wants to track
250- mounts established on the system it can receive ` poll() ` able
251- events via a file descriptor to ` /proc/self/mountinfo ` . When
252- receiving them it needs to rescan the file from the top and
253- compare it with the previous scan. This is both slow and
254- racy. It's slow on systems with a large number of mounts as the
255- cost for re-scanning the table has to be paid for every change to
256- the mount table. It's racy because quickly added and removed
257- mounts might not be noticed.
258-
259- ** Use-Case:** ` systemd ` tracks the mount table to integrate the mounts
260- into it own dependency management.
261-
262224### Asynchronous ` close() `
263225
264226An asynchronous or forced ` close() ` , that guarantees that
@@ -367,43 +329,6 @@ user namespace. But this doesn't just lock a single mount or mount subtree
367329it locks all mounts in the mount namespace, i.e., the mount table cannot be
368330altered.
369331
370- ### Add ` OPEN_TREE_CLEAR ` flag to ` open_tree() `
371-
372- Add a new ` OPEN_TREE_CLEAR ` flag to ` open_tree() ` that can only be
373- used in conjunction with ` OPEN_TREE_CLONE ` . When specified it will clear
374- all mount properties from that mount including the mount's idmapping.
375- Requires the caller to be ` ns_capable(mntns->user_ns) ` . If idmapped mounts
376- are encountered the caller must be ` ns_capable(sb->user_ns, CAP_SYS_ADMIN) `
377- in the filesystems user namespace.
378-
379- Locked mount properties cannot be changed. A mount's idmapping becomes
380- locked if it propagates across user namespaces.
381-
382- This is useful to get a new, clear mount and also allows the caller to
383- create a new detached mount with an idmapping attached to the mount. Iow,
384- the caller may idmap the mount afterwards.
385-
386- ** Use-Case:** A user may already use an idmapped mount for their home
387- directory. And once a mount has been idmapped the idmapping cannot be
388- changed anymore. This allows for simple semantics and allows to avoid
389- lifetime complexity in order to account for scenarios where concurrent
390- readers or writers might still use a given user namespace while it is about
391- to be changed.
392- But this poses a problem when the user wants to attach an idmapping to
393- a mount that is already idmapped. The new flag allows to solve this
394- problem. A sufficiently privileged user such as a container manager can
395- create a user namespace for the container which expresses the desired
396- ownership. Then they can create a new detached mount without any prior
397- mount properties via OPEN_TREE_CLEAR and then attach the idmapping to this
398- mount.
399-
400- ### Require a user namespace to have an idmapping when attached
401-
402- Enforce that the user namespace about to be attached to a mount must
403- have an idmapping written.
404-
405- ** Use-Case:** Tighten the semantics.
406-
407332### Extend ` setns() ` to allow attaching to all new namespaces of a process
408333
409334Add an extension to ` setns() ` to allow attaching to all namespaces of
@@ -575,69 +500,6 @@ different sources and it should not be possible to generate a
575500system extension with a key pair that is supposed to be good for
576501container images only.
577502
578- ### Make statx() on a pidfd return additional info
579-
580- Make statx() on a pidfd return additional recognizable identifiers in
581- ` .stx_btime ` .
582-
583- ** 🙇 ` cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs ` 🙇**
584-
585- It would be fantastic if issuing statx() on any pidfd would return
586- the start time of the process in ` .stx_btime ` even after the process
587- died.
588-
589- These fields should in particular be queriable * after* the process
590- already exited and has been reaped, i.e. after its PID has already
591- been recycled.
592-
593- ** Usecase:** In systemd we maintain lists of processes in a hash
594- table. Right now, the key is the PID, but this is less than ideal
595- because of PID recycling. By being able to use the ` .stx_btime `
596- and/or ` .stx_ino ` fields instead would be perfect to safely
597- identify, track and compare process even after they ceased to exist.
598-
599- ### API to determine the parent process ID of a pidfd
600-
601- An API to determine the parent process ID (ppid) of a pidfd would be
602- good.
603-
604- This information is relevant to code dealing with pidfds, since if
605- the ppid of a pidfd matches the process own pid it can call
606- ` waitid() ` on the process, if it doesn't it cannot and such a call
607- would fail. It would be very useful if this could be determined
608- easily before even calling that syscall.
609-
610- ** Usecase:** systemd manages a multitude of processes, most of which
611- are its own children, but many which are not. It would be great if
612- we could easily determine whether it is worth waiting for
613- ` SIGCHLD ` /` waitid() ` on them or whether waiting for ` POLLIN ` on
614- them is the only way to get exit notification.
615-
616- ### Set ` comm ` field before ` exec() `
617-
618- There should be a way to control the process' ` comm ` field if
619- started via ` fexecve() ` /` execveat() ` .
620-
621- Right now, when ` fexecve() ` /` execveat() ` is used, the ` comm ` field
622- (i.e. ` /proc/self/comm ` ) contains a name derived of the numeric fd,
623- which breaks ` ps -C … ` and various other tools. In particular when
624- the fd was opened with ` O_CLOEXEC ` , the number of the fd in the old
625- process is completely meaningless.
626-
627- The goal is add a way to tell ` fexecve() ` /` execveat() ` what Name to use.
628-
629- Since ` comm ` is under user control anyway (via ` PR_SET_NAME ` ), it
630- should be safe to also make it somehow configurable at fexecve()
631- time.
632-
633- See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294 ,
634- https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81 .
635-
636- ** Usecase:** In systemd we generally would prefer using ` fexecve() `
637- to safely and race-freely invoke processes, but the fact that ` comm `
638- is useless after invoking a process that way makes the call
639- unfortunately hard to use for systemd.
640-
641503### Path-based ACL management in an LSM hook
642504
643505The LSM module API should have the ability to do path-based (not
@@ -766,12 +628,6 @@ Add an option to go from individual thread to thread-group leader.
766628** Use-Case:** Allow for a race free way to go from individual thread
767629to thread-group leader pidfd.
768630
769- ### Namespace ioctl to translate a PID between PID namespaces
770-
771- ** Use-Case:** This makes it possible to e.g., figure out what a given PID in
772- a PID namespace corresponds to in the caller's PID namespace. For example, to
773- figure out what the PID of PID 1 inside of a given PID namespace is.
774-
775631### Useful handling of LSM denials on SCM_RIGHTS
776632
777633Right now if some LSM such as SELinux denies an ` AF_UNIX ` socket peer
@@ -811,6 +667,177 @@ on received messages.
811667
812668## Finished Items
813669
670+ ### Namespace ioctl to translate a PID between PID namespaces
671+
672+ [ x] Namespace ioctl to translate a PID between PID namespaces
673+
674+ ** 🙇 ` ca567df74a28a9fb368c6b2d93e864113f73f5c2 ("nsfs: add pid translation ioctls") ` 🙇**
675+
676+ ** Use-Case:** This makes it possible to e.g., figure out what a given PID in
677+ a PID namespace corresponds to in the caller's PID namespace. For example, to
678+ figure out what the PID of PID 1 inside of a given PID namespace is.
679+
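As a rough illustration of the use-case above, the following Python sketch asks a PID namespace file descriptor which PID in the caller's namespace corresponds to a PID inside that namespace. The ioctl number is an assumption: it is computed here as `_IOR(0xb7, 0x6, int)` using the generic Linux ioctl encoding (x86-64, arm64) and should be checked against the `linux/nsfs.h` header of the target kernel. The helper names `_IOR` and `pid_from_pidns` are hypothetical.

```python
import fcntl
import os


def _IOR(magic: int, nr: int, size: int) -> int:
    """Encode a read-direction ioctl number (generic Linux layout only)."""
    IOC_READ = 2
    return (IOC_READ << 30) | (size << 16) | (magic << 8) | nr


# Assumed values; verify against linux/nsfs.h before relying on them.
NSIO = 0xB7
NS_GET_PID_FROM_PIDNS = _IOR(NSIO, 0x6, 4)  # argument type is a C int


def pid_from_pidns(pidns_path: str, pid: int):
    """Translate `pid` from the namespace at `pidns_path` into our namespace.

    Returns None if the namespace file cannot be opened or the running
    kernel does not support the ioctl.
    """
    try:
        fd = os.open(pidns_path, os.O_RDONLY | os.O_CLOEXEC)
    except OSError:
        return None
    try:
        # The PID is passed by value; the translated PID is the return value.
        return fcntl.ioctl(fd, NS_GET_PID_FROM_PIDNS, pid)
    except OSError:
        return None
    finally:
        os.close(fd)
```

For example, `pid_from_pidns("/proc/<pid>/ns/pid", 1)` would look up PID 1 of the namespace that process `<pid>` lives in, as seen from the caller.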
680+ ### API to determine the parent process ID of a pidfd
681+
682+ [ x] API to determine the parent process ID of a pidfd
683+
684+ An API to determine the parent process ID (ppid) of a pidfd would be
685+ good.
686+
687+ This information is relevant to code dealing with pidfds: if the
688+ ppid of a pidfd matches the caller's own PID, the caller can call
689+ ` waitid() ` on the process; if it doesn't, such a call would
690+ fail. It would be very useful if this could be determined
691+ easily before even calling that syscall.
692+
693+ ** 🙇 ` cdda1f26e74b ("pidfd: add ioctl to retrieve pid info") ` 🙇**
694+
695+ ** Usecase:** systemd manages a multitude of processes, most of which
696+ are its own children, but many of which are not. It would be great if
697+ we could easily determine whether it is worth waiting for
698+ ` SIGCHLD ` /` waitid() ` on them or whether waiting for ` POLLIN ` on
699+ them is the only way to get exit notification.
700+
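The decision described above can be sketched in Python as follows. This is a minimal sketch that assumes the ppid has already been obtained by some other means (for instance through the pidfd info ioctl referenced above, which is not shown here). The function names `wait_strategy` and `wait_for_exit` are hypothetical.

```python
import os
import select


def wait_strategy(ppid: int) -> str:
    """Return "waitid" if the process is our own child, otherwise "poll"."""
    return "waitid" if ppid == os.getpid() else "poll"


def wait_for_exit(pidfd: int, ppid: int):
    """Block until the process behind `pidfd` exits.

    Returns the exit status for our own children, or None for processes
    that we can only observe but not reap.
    """
    # A pidfd becomes readable once the process has exited, regardless of
    # whether we are its parent.
    poller = select.poll()
    poller.register(pidfd, select.POLLIN)
    poller.poll()

    if wait_strategy(ppid) == "waitid":
        # Only the parent may reap the child and read its exit status.
        info = os.waitid(os.P_PIDFD, pidfd, os.WEXITED)
        return info.si_status
    return None
```

`os.P_PIDFD` and `os.pidfd_open()` require Python 3.9 or newer and a kernel with pidfd support.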
701+ ### Set ` comm ` field before ` exec() `
702+
703+ [ x] Set ` comm ` field before ` exec() `
704+
705+ There should be a way to control the process' ` comm ` field if
706+ started via ` fexecve() ` /` execveat() ` .
707+
708+ Right now, when ` fexecve() ` /` execveat() ` is used, the ` comm ` field
709+ (i.e. ` /proc/self/comm ` ) contains a name derived from the numeric fd,
710+ which breaks ` ps -C … ` and various other tools. In particular when
711+ the fd was opened with ` O_CLOEXEC ` , the number of the fd in the old
712+ process is completely meaningless.
713+
714+ The goal is to add a way to tell ` fexecve() ` /` execveat() ` what name to use.
715+
716+ Since ` comm ` is under user control anyway (via ` PR_SET_NAME ` ), it
717+ should be safe to also make it somehow configurable at fexecve()
718+ time.
719+
720+ See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294 ,
721+ https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81 .
722+
723+ ** 🙇 ` 543841d18060 ("exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case") ` 🙇**
724+
725+ ** Usecase:** In systemd we generally would prefer using ` fexecve() `
726+ to safely and race-freely invoke processes, but the fact that ` comm `
727+ is useless after invoking a process that way makes the call
728+ unfortunately hard to use for systemd.
729+ ### Make statx() on a pidfd return additional info
730+
731+ Make statx() on a pidfd return additional recognizable identifiers in
732+ ` .stx_btime ` .
733+
734+ ** 🙇 ` cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b ("pidfd: add pidfs") ` 🙇**
735+
736+ It would be fantastic if issuing statx() on any pidfd would return
737+ the start time of the process in ` .stx_btime ` even after the process
738+ died.
739+
740+ These fields should in particular be queryable * after* the process
741+ already exited and has been reaped, i.e. after its PID has already
742+ been recycled.
743+
744+ ** Usecase:** In systemd we maintain lists of processes in a hash
745+ table. Right now, the key is the PID, but this is less than ideal
746+ because of PID recycling. Being able to use the ` .stx_btime `
747+ and/or ` .stx_ino ` fields instead would be perfect for safely
748+ identifying, tracking and comparing processes even after they have ceased to exist.
749+
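A hash table keyed on the pidfd inode rather than on the PID could look roughly like the sketch below. It relies on the assumption that the running kernel has pidfs, so that every process gets its own inode number. On older kernels all pidfds share one anonymous inode and the key is useless. `os.fstat()` exposes the inode as `st_ino` (the counterpart of `.stx_ino`). It does not expose `.stx_btime`, so that field is not shown. The function names are hypothetical.

```python
import os


def pidfd_identity(pidfd: int) -> int:
    """Return the inode number of a pidfd, usable as a recycling-safe key."""
    return os.fstat(pidfd).st_ino


def identity_of_pid(pid: int):
    """Open a pidfd for `pid` and return its identity, or None on failure."""
    if not hasattr(os, "pidfd_open"):
        return None  # Python older than 3.9 or a non-Linux platform
    try:
        fd = os.pidfd_open(pid)
    except OSError:
        return None  # kernel without pidfd support, or no such process
    try:
        return pidfd_identity(fd)
    finally:
        os.close(fd)
```

Two pidfds that refer to the same process yield the same key, so the value can replace the PID as the lookup key in a process table.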
750+ ### Allow creating idmapped mounts from idmapped mounts
751+
752+ [ x] Allow creating idmapped mounts from idmapped mounts
753+
754+ Add a new ` OPEN_TREE_CLEAR ` flag to ` open_tree() ` that can only be
755+ used in conjunction with ` OPEN_TREE_CLONE ` . When specified it will clear
756+ all mount properties from that mount including the mount's idmapping.
757+ Requires the caller to be ` ns_capable(mntns->user_ns) ` . If idmapped mounts
758+ are encountered the caller must be ` ns_capable(sb->user_ns, CAP_SYS_ADMIN) `
759+ in the filesystem's user namespace.
760+
761+ Locked mount properties cannot be changed. A mount's idmapping becomes
762+ locked if it propagates across user namespaces.
763+
764+ This is useful to get a new, clear mount and also allows the caller to
765+ create a new detached mount with an idmapping attached to the mount. In other
766+ words, the caller may idmap the mount afterwards.
767+
768+ ** 🙇 ` c4a16820d901 ("fs: add open_tree_attr()") ` 🙇**
769+
770+ ** Use-Case:** A user may already use an idmapped mount for their home
771+ directory. And once a mount has been idmapped the idmapping cannot be
772+ changed anymore. This allows for simple semantics and avoids the
773+ lifetime complexity of accounting for scenarios where concurrent
774+ readers or writers might still use a given user namespace while it is about
775+ to be changed.
776+ But this poses a problem when the user wants to attach an idmapping to
777+ a mount that is already idmapped. The new flag solves this
778+ problem. A sufficiently privileged user such as a container manager can
779+ create a user namespace for the container which expresses the desired
780+ ownership. Then they can create a new detached mount without any prior
781+ mount properties via ` OPEN_TREE_CLEAR ` and then attach the idmapping to this
782+ mount.
783+
784+ ### Require a user namespace to have an idmapping when attached
785+
786+ [ x] Require a user namespace to have an idmapping when attached
787+
788+ Enforce that the user namespace about to be attached to a mount must
789+ have an idmapping written.
790+
791+ ** 🙇 ` dacfd001eaf2 ("fs/mnt_idmapping.c: Return -EINVAL when no map is written") ` 🙇**
792+
793+ ** Use-Case:** Tighten the semantics.
794+
795+ ### Mount notifications without rescanning of ` /proc/self/mountinfo `
796+
797+ [ x] Mount notifications without rescanning of ` /proc/self/mountinfo `
798+
799+ Mount notifications that do not require continuous rescanning of
800+ ` /proc/self/mountinfo ` . Currently, if a program wants to track
801+ mounts established on the system it can receive ` poll() ` able
802+ events via a file descriptor to ` /proc/self/mountinfo ` . When
803+ receiving them it needs to rescan the file from the top and
804+ compare it with the previous scan. This is both slow and
805+ racy. It's slow on systems with a large number of mounts as the
806+ cost for re-scanning the table has to be paid for every change to
807+ the mount table. It's racy because quickly added and removed
808+ mounts might not be noticed.
809+
810+ ** 🙇 ` 0f46d81f2bce ("fanotify: notify on mount attach and detach") ` 🙇**
811+
812+ ** Use-Case:** ` systemd ` tracks the mount table to integrate the mounts
813+ into its own dependency management.
814+
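For comparison, the rescanning approach that this item wants to replace can be sketched as follows. This is a minimal illustration of the "rescan and compare" technique only, not of the newer notification interface. The function names `parse_mountinfo`, `diff_mounts` and `watch` are hypothetical.

```python
import select


def parse_mountinfo(text: str) -> dict:
    """Map mount ID to mount point for every line of a mountinfo dump."""
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        # Field 0 is the mount ID, field 4 is the mount point.
        mounts[int(fields[0])] = fields[4]
    return mounts


def diff_mounts(prev: dict, cur: dict):
    """Return (added, removed) mounts between two snapshots."""
    added = {mid: cur[mid] for mid in cur.keys() - prev.keys()}
    removed = {mid: prev[mid] for mid in prev.keys() - cur.keys()}
    return added, removed


def watch(path: str = "/proc/self/mountinfo"):
    """Yield (added, removed) after every mount table change."""
    with open(path) as f:
        prev = parse_mountinfo(f.read())
        poller = select.poll()
        # Mount table changes are signalled as POLLERR | POLLPRI.
        poller.register(f.fileno(), select.POLLERR | select.POLLPRI)
        while True:
            poller.poll()
            f.seek(0)
            # The whole table is re-read on every event (slow), and a mount
            # added and removed between two reads is never seen (racy).
            cur = parse_mountinfo(f.read())
            yield diff_mounts(prev, cur)
            prev = cur
```

Every event costs a full parse of the table, which is the overhead the text above describes for systems with many mounts.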
815+ ### Mount a subdirectory instead of the top-level directory
816+
817+ [ x] Mount a subdirectory instead of the top-level directory
818+
819+ Ability to mount a subdirectory of a regular file system instead of
820+ the top-level directory. E.g. for a file system ` /dev/sda1 ` which
821+ contains a sub-directory ` /foobar ` , mount ` /foobar ` without having
822+ to mount its parent directory first. Consider something like this:
823+
824+ ```
825+ mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar
826+ ```
827+
828+ (This is of course already possible via some mount namespacing
829+ shenanigans, but this requires namespacing to be available, and is
830+ not precisely obvious to implement. Explicit kernel support at mount
831+ time would be much preferable.)
832+
833+ ** 🙇 ` c5c12f871a30 ("fs: create detached mounts from detached mounts") ` 🙇**
834+
835+ ** Use-Case:** ` systemd-homed ` currently mounts a sub-directory of
836+ the per-user LUKS volume as the user's home directory (and not the
837+ root directory of the per-user LUKS volume's file system!), and in
838+ order to implement this invisibly from the host side requires a
839+ complex mount namespace exercise.
840+
814841### Unmounting of obstructed mounts
815842
816843[ x] ability to unmount obstructed mounts. (this means: you have a stack