Commit 171e5e3

wishlist: move all finished items into the correct section
Signed-off-by: Christian Brauner <[email protected]>
1 parent 662e7b7 commit 171e5e3

1 file changed

+171
-172
lines changed

README.md

Lines changed: 171 additions & 172 deletions
@@ -10,32 +10,6 @@ associated problem space.
1010
point that out explicitly and clearly in the associated patches and Cc
1111
`Christian Brauner <brauner (at) kernel (dot) org`.**
1212

13-
### Mount a subdirectory instead of the top-level directory
14-
15-
[x] Mount a subdirectory instead of the top-level directory
16-
17-
Ability to mount a subdirectory of a regular file system instead of
18-
the top-level directory. E.g. for a file system `/dev/sda1` which
19-
contains a sub-directory `/foobar` mount `/foobar` without having
20-
to mount its parent directory first. Consider something like this:
21-
22-
```
23-
mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar
24-
```
25-
26-
(This is of course already possible via some mount namespacing
27-
shenanigans, but this requires namespacing to be available, and is
28-
not precisely obvious to implement. Explicit kernel support at mount
29-
time would be much preferable.)
30-
31-
**🙇 `c5c12f871a30 ("fs: create detached mounts from detached mounts")` 🙇**
32-
33-
**Use-Case:** `systemd-homed` currently mounts a sub-directory of
34-
the per-user LUKS volume as the user's home directory (and not the
35-
root directory of the per-user LUKS volume's file system!), and in
36-
order to implement this invisibly from the host side requires a
37-
complex mount namespace exercise.
38-
3913
### inotify() events for BSD file locks
4014

4115
BSD file locks (i.e. `flock()`, as opposed to POSIX `F_SETLK` and
@@ -247,26 +221,6 @@ to use `pidfd`s to remove PID recycling security issues, but
247221
currently cannot as it also needs to generically wait for such
248222
unexpected children.
249223

250-
### Mount notifications without rescanning of `/proc/self/mountinfo`
251-
252-
[x] Mount notifications without rescanning of `/proc/self/mountinfo`
253-
254-
Mount notifications that do not require continuous rescanning of
255-
`/proc/self/mountinfo`. Currently, if a program wants to track
256-
mounts established on the system it can receive `poll()`able
257-
events via a file descriptor to `/proc/self/mountinfo`. When
258-
receiving them it needs to rescan the file from the top and
259-
compare it with the previous scan. This is both slow and
260-
racy. It's slow on systems with a large number of mounts as the
261-
cost for re-scanning the table has to be paid for every change to
262-
the mount table. It's racy because quickly added and removed
263-
mounts might not be noticed.
264-
265-
**🙇 `0f46d81f2bce ("fanotify: notify on mount attach and detach")` 🙇**
266-
267-
**Use-Case:** `systemd` tracks the mount table to integrate the mounts
268-
into its own dependency management.
269-
270224
### Asynchronous `close()`
271225

272226
An asynchronous or forced `close()`, that guarantees that
@@ -375,51 +329,6 @@ user namespace. But this doesn't just lock a single mount or mount subtree
375329
it locks all mounts in the mount namespace, i.e., the mount table cannot be
376330
altered.
377331

378-
### Allow creating idmapped mounts from idmapped mounts
379-
380-
[x] Allow creating idmapped mounts from idmapped mounts
381-
382-
Add a new `OPEN_TREE_CLEAR` flag to `open_tree()` that can only be
383-
used in conjunction with `OPEN_TREE_CLONE`. When specified it will clear
384-
all mount properties from that mount including the mount's idmapping.
385-
Requires the caller to be `ns_capable(mntns->user_ns)`. If idmapped mounts
386-
are encountered the caller must be `ns_capable(sb->user_ns, CAP_SYS_ADMIN)`
387-
in the filesystem's user namespace.
388-
389-
Locked mount properties cannot be changed. A mount's idmapping becomes
390-
locked if it propagates across user namespaces.
391-
392-
This is useful to get a new, clear mount and also allows the caller to
393-
create a new detached mount with an idmapping attached to the mount. Iow,
394-
the caller may idmap the mount afterwards.
395-
396-
**🙇 `c4a16820d901 ("fs: add open_tree_attr()")` 🙇**
397-
398-
**Use-Case:** A user may already use an idmapped mount for their home
399-
directory. And once a mount has been idmapped the idmapping cannot be
400-
changed anymore. This allows for simple semantics and allows to avoid
401-
lifetime complexity in order to account for scenarios where concurrent
402-
readers or writers might still use a given user namespace while it is about
403-
to be changed.
404-
But this poses a problem when the user wants to attach an idmapping to
405-
a mount that is already idmapped. The new flag allows to solve this
406-
problem. A sufficiently privileged user such as a container manager can
407-
create a user namespace for the container which expresses the desired
408-
ownership. Then they can create a new detached mount without any prior
409-
mount properties via OPEN_TREE_CLEAR and then attach the idmapping to this
410-
mount.
411-
412-
### Require a user namespace to have an idmapping when attached
413-
414-
[x] Require a user namespace to have an idmapping when attached
415-
416-
Enforce that the user namespace about to be attached to a mount must
417-
have an idmapping written.
418-
419-
**🙇 `dacfd001eaf2 ("fs/mnt_idmapping.c: Return -EINVAL when no map is written")` 🙇**
420-
421-
**Use-Case:** Tighten the semantics.
422-
423332
### Extend `setns()` to allow attaching to all new namespaces of a process
424333

425334
Add an extension to `setns()` to allow attaching to all namespaces of
@@ -591,77 +500,6 @@ different sources and it should not be possible to generate a
591500
system extension with a key pair that is supposed to be good for
592501
container images only.
593502

594-
### Make statx() on a pidfd return additional info
595-
596-
Make statx() on a pidfd return additional recognizable identifiers in
597-
`.stx_btime`.
598-
599-
**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**
600-
601-
It would be fantastic if issuing statx() on any pidfd would return
602-
the start time of the process in `.stx_btime` even after the process
603-
died.
604-
605-
These fields should in particular be queriable *after* the process
606-
already exited and has been reaped, i.e. after its PID has already
607-
been recycled.
608-
609-
**Usecase:** In systemd we maintain lists of processes in a hash
610-
table. Right now, the key is the PID, but this is less than ideal
611-
because of PID recycling. Being able to use the `.stx_btime`
612-
and/or `.stx_ino` fields instead would be perfect to safely
613-
identify, track and compare processes even after they ceased to exist.
614-
615-
### API to determine the parent process ID of a pidfd
616-
617-
[x] API to determine the parent process ID of a pidfd
618-
619-
An API to determine the parent process ID (ppid) of a pidfd would be
620-
good.
621-
622-
This information is relevant to code dealing with pidfds, since if
623-
the ppid of a pidfd matches the process's own pid it can call
624-
`waitid()` on the process, if it doesn't it cannot and such a call
625-
would fail. It would be very useful if this could be determined
626-
easily before even calling that syscall.
627-
628-
**🙇 `cdda1f26e74b ("pidfd: add ioctl to retrieve pid info")` 🙇**
629-
630-
**Usecase:** systemd manages a multitude of processes, most of which
631-
are its own children, but many which are not. It would be great if
632-
we could easily determine whether it is worth waiting for
633-
`SIGCHLD`/`waitid()` on them or whether waiting for `POLLIN` on
634-
them is the only way to get exit notification.
635-
636-
### Set `comm` field before `exec()`
637-
638-
[x] Set `comm` field before `exec()`
639-
640-
There should be a way to control the process' `comm` field if
641-
started via `fexecve()`/`execveat()`.
642-
643-
Right now, when `fexecve()`/`execveat()` is used, the `comm` field
644-
(i.e. `/proc/self/comm`) contains a name derived from the numeric fd,
645-
which breaks `ps -C …` and various other tools. In particular when
646-
the fd was opened with `O_CLOEXEC`, the number of the fd in the old
647-
process is completely meaningless.
648-
649-
The goal is to add a way to tell `fexecve()`/`execveat()` what name to use.
650-
651-
Since `comm` is under user control anyway (via `PR_SET_NAME`), it
652-
should be safe to also make it somehow configurable at fexecve()
653-
time.
654-
655-
See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294,
656-
https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81.
657-
658-
**🙇 `543841d18060 ("exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case")` 🙇**
659-
660-
**Usecase:** In systemd we generally would prefer using `fexecve()`
661-
to safely and race-freely invoke processes, but the fact that `comm`
662-
is useless after invoking a process that way makes the call
663-
unfortunately hard to use for systemd.
664-
665503
### Path-based ACL management in an LSM hook
666504

667505
The LSM module API should have the ability to do path-based (not
@@ -790,16 +628,6 @@ Add an option to go from individual thread to thread-group leader.
790628
**Use-Case:** Allow for a race free way to go from individual thread
791629
to thread-group leader pidfd.
792630

793-
### Namespace ioctl to translate a PID between PID namespaces
794-
795-
[x] Namespace ioctl to translate a PID between PID namespaces
796-
797-
**🙇 `ca567df74a28a9fb368c6b2d93e864113f73f5c2 ("nsfs: add pid translation ioctls")` 🙇**
798-
799-
**Use-Case:** This makes it possible to e.g., figure out what a given PID in
800-
a PID namespace corresponds to in the caller's PID namespace. For example, to
801-
figure out what the PID of PID 1 inside of a given PID namespace is.
802-
803631
### Useful handling of LSM denials on SCM_RIGHTS
804632

805633
Right now if some LSM such as SELinux denies an `AF_UNIX` socket peer
@@ -839,6 +667,177 @@ on received messages.
839667

840668
## Finished Items
841669

670+
### Namespace ioctl to translate a PID between PID namespaces
671+
672+
[x] Namespace ioctl to translate a PID between PID namespaces
673+
674+
**🙇 `ca567df74a28a9fb368c6b2d93e864113f73f5c2 ("nsfs: add pid translation ioctls")` 🙇**
675+
676+
**Use-Case:** This makes it possible to e.g., figure out what a given PID in
677+
a PID namespace corresponds to in the caller's PID namespace. For example, to
678+
figure out what the PID of PID 1 inside of a given PID namespace is.
679+
680+
### API to determine the parent process ID of a pidfd
681+
682+
[x] API to determine the parent process ID of a pidfd
683+
684+
An API to determine the parent process ID (ppid) of a pidfd would be
685+
good.
686+
687+
This information is relevant to code dealing with pidfds, since if
688+
the ppid of a pidfd matches the process's own pid it can call
689+
`waitid()` on the process, if it doesn't it cannot and such a call
690+
would fail. It would be very useful if this could be determined
691+
easily before even calling that syscall.
692+
693+
**🙇 `cdda1f26e74b ("pidfd: add ioctl to retrieve pid info")` 🙇**
694+
695+
**Usecase:** systemd manages a multitude of processes, most of which
696+
are its own children, but many which are not. It would be great if
697+
we could easily determine whether it is worth waiting for
698+
`SIGCHLD`/`waitid()` on them or whether waiting for `POLLIN` on
699+
them is the only way to get exit notification.
700+
701+
### Set `comm` field before `exec()`
702+
703+
[x] Set `comm` field before `exec()`
704+
705+
There should be a way to control the process' `comm` field if
706+
started via `fexecve()`/`execveat()`.
707+
708+
Right now, when `fexecve()`/`execveat()` is used, the `comm` field
709+
(i.e. `/proc/self/comm`) contains a name derived from the numeric fd,
710+
which breaks `ps -C …` and various other tools. In particular when
711+
the fd was opened with `O_CLOEXEC`, the number of the fd in the old
712+
process is completely meaningless.
713+
714+
The goal is to add a way to tell `fexecve()`/`execveat()` what name to use.
715+
716+
Since `comm` is under user control anyway (via `PR_SET_NAME`), it
717+
should be safe to also make it somehow configurable at fexecve()
718+
time.
719+
720+
See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294,
721+
https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81.
722+
723+
**🙇 `543841d18060 ("exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case")` 🙇**
724+
725+
**Usecase:** In systemd we generally would prefer using `fexecve()`
726+
to safely and race-freely invoke processes, but the fact that `comm`
727+
is useless after invoking a process that way makes the call
728+
unfortunately hard to use for systemd.
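
The remark that `comm` is already under user control via `PR_SET_NAME` can be illustrated with a minimal sketch. It assumes Linux and calls `prctl()` through `ctypes`; the helper name `set_and_get_comm` is hypothetical, and this shows only the pre-existing `prctl()` route, not the kernel change referenced in this item.

```python
import ctypes
import os

PR_SET_NAME = 15    # option values from <linux/prctl.h>
PR_GET_NAME = 16
TASK_COMM_LEN = 16  # kernel buffer size, including the trailing NUL


def set_and_get_comm(name: str) -> str:
    """Set the calling thread's comm and read it back (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    # The kernel silently truncates names longer than 15 bytes.
    if libc.prctl(PR_SET_NAME, ctypes.c_char_p(name.encode()), 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    buf = ctypes.create_string_buffer(TASK_COMM_LEN)
    if libc.prctl(PR_GET_NAME, buf, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return buf.value.decode()
```

Because any process can rename itself this way after `exec()`, letting the caller pick the name at `fexecve()` time would not grant anything new.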
729+
### Make statx() on a pidfd return additional info
730+
731+
Make statx() on a pidfd return additional recognizable identifiers in
732+
`.stx_btime`.
733+
734+
**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**
735+
736+
It would be fantastic if issuing statx() on any pidfd would return
737+
the start time of the process in `.stx_btime` even after the process
738+
died.
739+
740+
These fields should in particular be queriable *after* the process
741+
already exited and has been reaped, i.e. after its PID has already
742+
been recycled.
743+
744+
**Usecase:** In systemd we maintain lists of processes in a hash
745+
table. Right now, the key is the PID, but this is less than ideal
746+
because of PID recycling. Being able to use the `.stx_btime`
747+
and/or `.stx_ino` fields instead would be perfect to safely
748+
identify, track and compare processes even after they ceased to exist.
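
To make the PID-recycling concern concrete, here is a hedged sketch of a hash-table key that pairs the PID with the process start time. It is an assumption-laden stand-in: it parses a `/proc/<pid>/stat` line (field 22 is `starttime`, field 4 is `ppid`) rather than using `statx()` on a pidfd, and the names `parse_stat` and `process_key` are hypothetical.

```python
def parse_stat(stat_line: str) -> tuple[int, int]:
    """Return (ppid, starttime) from one /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the last ')' in the line.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    return int(rest[1]), int(rest[19])


def process_key(pid: int, stat_line: str) -> tuple[int, int]:
    """Key that stays distinct when the same PID is later reused."""
    _, starttime = parse_stat(stat_line)
    return (pid, starttime)
```

Unlike the pidfd-based identifiers asked for above, this key can only be computed while the process still exists, which is exactly the gap the wishlist item describes.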
749+
750+
### Allow creating idmapped mounts from idmapped mounts
751+
752+
[x] Allow creating idmapped mounts from idmapped mounts
753+
754+
Add a new `OPEN_TREE_CLEAR` flag to `open_tree()` that can only be
755+
used in conjunction with `OPEN_TREE_CLONE`. When specified it will clear
756+
all mount properties from that mount including the mount's idmapping.
757+
Requires the caller to be `ns_capable(mntns->user_ns)`. If idmapped mounts
758+
are encountered the caller must be `ns_capable(sb->user_ns, CAP_SYS_ADMIN)`
759+
in the filesystem's user namespace.
760+
761+
Locked mount properties cannot be changed. A mount's idmapping becomes
762+
locked if it propagates across user namespaces.
763+
764+
This is useful to get a new, clear mount and also allows the caller to
765+
create a new detached mount with an idmapping attached to the mount. Iow,
766+
the caller may idmap the mount afterwards.
767+
768+
**🙇 `c4a16820d901 ("fs: add open_tree_attr()")` 🙇**
769+
770+
**Use-Case:** A user may already use an idmapped mount for their home
771+
directory. And once a mount has been idmapped the idmapping cannot be
772+
changed anymore. This allows for simple semantics and allows to avoid
773+
lifetime complexity in order to account for scenarios where concurrent
774+
readers or writers might still use a given user namespace while it is about
775+
to be changed.
776+
But this poses a problem when the user wants to attach an idmapping to
777+
a mount that is already idmapped. The new flag allows to solve this
778+
problem. A sufficiently privileged user such as a container manager can
779+
create a user namespace for the container which expresses the desired
780+
ownership. Then they can create a new detached mount without any prior
781+
mount properties via OPEN_TREE_CLEAR and then attach the idmapping to this
782+
mount.
783+
784+
### Require a user namespace to have an idmapping when attached
785+
786+
[x] Require a user namespace to have an idmapping when attached
787+
788+
Enforce that the user namespace about to be attached to a mount must
789+
have an idmapping written.
790+
791+
**🙇 `dacfd001eaf2 ("fs/mnt_idmapping.c: Return -EINVAL when no map is written")` 🙇**
792+
793+
**Use-Case:** Tighten the semantics.
794+
795+
### Mount notifications without rescanning of `/proc/self/mountinfo`
796+
797+
[x] Mount notifications without rescanning of `/proc/self/mountinfo`
798+
799+
Mount notifications that do not require continuous rescanning of
800+
`/proc/self/mountinfo`. Currently, if a program wants to track
801+
mounts established on the system it can receive `poll()`able
802+
events via a file descriptor to `/proc/self/mountinfo`. When
803+
receiving them it needs to rescan the file from the top and
804+
compare it with the previous scan. This is both slow and
805+
racy. It's slow on systems with a large number of mounts as the
806+
cost for re-scanning the table has to be paid for every change to
807+
the mount table. It's racy because quickly added and removed
808+
mounts might not be noticed.
809+
810+
**🙇 `0f46d81f2bce ("fanotify: notify on mount attach and detach")` 🙇**
811+
812+
**Use-Case:** `systemd` tracks the mount table to integrate the mounts
813+
into its own dependency management.
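
The rescan-and-compare pattern described in this item can be sketched as follows. This is a minimal illustration, not systemd's implementation: the function names are hypothetical, and only the mount ID, mount point and filesystem type are extracted from each `/proc/self/mountinfo` line.

```python
def parse_mountinfo(text: str) -> dict[int, tuple[str, str]]:
    """Map mount ID -> (mount point, fs type) for a mountinfo snapshot."""
    mounts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Optional fields end at a lone "-"; the fs type follows it.
        left, _, right = line.partition(" - ")
        fields = left.split()
        fstype = right.split()[0] if right.split() else ""
        mounts[int(fields[0])] = (fields[4], fstype)
    return mounts


def diff_mounts(old: dict, new: dict) -> tuple[dict, dict]:
    """Return (added, removed) mounts between two full scans."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    return added, removed
```

Every change costs a full parse of the table, and a mount that appears and disappears between two scans produces an empty diff, which is the slowness and the race noted above.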
814+
815+
### Mount a subdirectory instead of the top-level directory
816+
817+
[x] Mount a subdirectory instead of the top-level directory
818+
819+
Ability to mount a subdirectory of a regular file system instead of
820+
the top-level directory. E.g. for a file system `/dev/sda1` which
821+
contains a sub-directory `/foobar` mount `/foobar` without having
822+
to mount its parent directory first. Consider something like this:
823+
824+
```
825+
mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar
826+
```
827+
828+
(This is of course already possible via some mount namespacing
829+
shenanigans, but this requires namespacing to be available, and is
830+
not precisely obvious to implement. Explicit kernel support at mount
831+
time would be much preferable.)
832+
833+
**🙇 `c5c12f871a30 ("fs: create detached mounts from detached mounts")` 🙇**
834+
835+
**Use-Case:** `systemd-homed` currently mounts a sub-directory of
836+
the per-user LUKS volume as the user's home directory (and not the
837+
root directory of the per-user LUKS volume's file system!), and in
838+
order to implement this invisibly from the host side requires a
839+
complex mount namespace exercise.
840+
842841
### Unmounting of obstructed mounts
843842

844843
[x] ability to unmount obstructed mounts. (this means: you have a stack
