Commit cfd3acb

Merge pull request #34 from brauner/tmp
whishlist: mark a bunch of items as done
2 parents c5a48c4 + 171e5e3 commit cfd3acb

1 file changed

+171
-144
lines changed

README.md

Lines changed: 171 additions & 144 deletions
@@ -10,28 +10,6 @@ associated problem space.
1010
point that out explicitly and clearly in the associated patches and Cc
1111
`Christian Brauner <brauner (at) kernel (dot) org>`.**
1212

13-
### Mount a subdirectory instead of the top-level directory
14-
15-
Ability to mount a subdirectory of a regular file system instead of
16-
the top-level directory. E.g. for a file system `/dev/sda1` which
17-
contains a sub-directory `/foobar`, mount `/foobar` without having
18-
to mount its parent directory first. Consider something like this:
19-
20-
```
21-
mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar
22-
```
23-
24-
(This is of course already possible via some mount namespacing
25-
shenanigans, but this requires namespacing to be available, and is
26-
not precisely obvious to implement. Explicit kernel support at mount
27-
time would be much preferable.)
28-
29-
**Use-Case:** `systemd-homed` currently mounts a sub-directory of
30-
the per-user LUKS volume as the user's home directory (and not the
31-
root directory of the per-user LUKS volume's file system!), and in
32-
order to implement this invisibly from the host side requires a
33-
complex mount namespace exercise.
34-
3513
### inotify() events for BSD file locks
3614

3715
BSD file locks (i.e. `flock()`, as opposed to POSIX `F_SETLK` and
@@ -243,22 +221,6 @@ to use `pidfd`s to remove PID recycling security issues, but
243221
currently cannot as it also needs to generically wait for such
244222
unexpected children.
245223

246-
### Mount notifications without rescanning of `/proc/self/mountinfo`
247-
248-
Mount notifications that do not require continuous rescanning of
249-
`/proc/self/mountinfo`. Currently, if a program wants to track
250-
mounts established on the system it can receive `poll()`able
251-
events via a file descriptor to `/proc/self/mountinfo`. When
252-
receiving them it needs to rescan the file from the top and
253-
compare it with the previous scan. This is both slow and
254-
racy. It's slow on systems with a large number of mounts as the
255-
cost for re-scanning the table has to be paid for every change to
256-
the mount table. It's racy because quickly added and removed
257-
mounts might not be noticed.
258-
259-
**Use-Case:** `systemd` tracks the mount table to integrate the mounts
260-
into its own dependency management.
261-
262224
### Asynchronous `close()`
263225

264226
An asynchronous or forced `close()`, that guarantees that
@@ -367,43 +329,6 @@ user namespace. But this doesn't just lock a single mount or mount subtree
367329
it locks all mounts in the mount namespace, i.e., the mount table cannot be
368330
altered.
369331

370-
### Add `OPEN_TREE_CLEAR` flag to `open_tree()`
371-
372-
Add a new `OPEN_TREE_CLEAR` flag to `open_tree()` that can only be
373-
used in conjunction with `OPEN_TREE_CLONE`. When specified it will clear
374-
all mount properties from that mount including the mount's idmapping.
375-
Requires the caller to be `ns_capable(mntns->user_ns)`. If idmapped mounts
376-
are encountered the caller must be `ns_capable(sb->user_ns, CAP_SYS_ADMIN)`
377-
in the filesystem's user namespace.
378-
379-
Locked mount properties cannot be changed. A mount's idmapping becomes
380-
locked if it propagates across user namespaces.
381-
382-
This is useful to get a new, clear mount and also allows the caller to
383-
create a new detached mount with an idmapping attached to the mount. Iow,
384-
the caller may idmap the mount afterwards.
385-
386-
**Use-Case:** A user may already use an idmapped mount for their home
387-
directory. And once a mount has been idmapped the idmapping cannot be
388-
changed anymore. This allows for simple semantics and allows to avoid
389-
lifetime complexity in order to account for scenarios where concurrent
390-
readers or writers might still use a given user namespace while it is about
391-
to be changed.
392-
But this poses a problem when the user wants to attach an idmapping to
393-
a mount that is already idmapped. The new flag allows to solve this
394-
problem. A sufficiently privileged user such as a container manager can
395-
create a user namespace for the container which expresses the desired
396-
ownership. Then they can create a new detached mount without any prior
397-
mount properties via OPEN_TREE_CLEAR and then attach the idmapping to this
398-
mount.
399-
400-
### Require a user namespace to have an idmapping when attached
401-
402-
Enforce that the user namespace about to be attached to a mount must
403-
have an idmapping written.
404-
405-
**Use-Case:** Tighten the semantics.
406-
407332
### Extend `setns()` to allow attaching to all new namespaces of a process
408333

409334
Add an extension to `setns()` to allow attaching to all namespaces of
@@ -575,69 +500,6 @@ different sources and it should not be possible to generate a
575500
system extension with a key pair that is supposed to be good for
576501
container images only.
577502

578-
### Make statx() on a pidfd return additional info
579-
580-
Make statx() on a pidfd return additional recognizable identifiers in
581-
`.stx_btime`.
582-
583-
**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**
584-
585-
It would be fantastic if issuing statx() on any pidfd would return
586-
the start time of the process in `.stx_btime` even after the process
587-
died.
588-
589-
These fields should in particular be queryable *after* the process
590-
has already exited and been reaped, i.e. after its PID has already
591-
been recycled.
592-
593-
**Usecase:** In systemd we maintain lists of processes in a hash
594-
table. Right now, the key is the PID, but this is less than ideal
595-
because of PID recycling. By being able to use the `.stx_btime`
596-
and/or `.stx_ino` fields instead would be perfect to safely
597-
identify, track and compare process even after they ceased to exist.
598-
599-
### API to determine the parent process ID of a pidfd
600-
601-
An API to determine the parent process ID (ppid) of a pidfd would be
602-
good.
603-
604-
This information is relevant to code dealing with pidfds, since if
605-
the ppid of a pidfd matches the process's own PID it can call
606-
`waitid()` on the process; if it doesn't, it cannot, and such a call
607-
would fail. It would be very useful if this could be determined
608-
easily before even calling that syscall.
609-
610-
**Usecase:** systemd manages a multitude of processes, most of which
611-
are its own children, but many which are not. It would be great if
612-
we could easily determine whether it is worth waiting for
613-
`SIGCHLD`/`waitid()` on them or whether waiting for `POLLIN` on
614-
them is the only way to get exit notification.
615-
616-
### Set `comm` field before `exec()`
617-
618-
There should be a way to control the process' `comm` field if
619-
started via `fexecve()`/`execveat()`.
620-
621-
Right now, when `fexecve()`/`execveat()` is used, the `comm` field
622-
(i.e. `/proc/self/comm`) contains a name derived from the numeric fd,
623-
which breaks `ps -C …` and various other tools. In particular when
624-
the fd was opened with `O_CLOEXEC`, the number of the fd in the old
625-
process is completely meaningless.
626-
627-
The goal is to add a way to tell `fexecve()`/`execveat()` what name to use.
628-
629-
Since `comm` is under user control anyway (via `PR_SET_NAME`), it
630-
should be safe to also make it somehow configurable at fexecve()
631-
time.
632-
633-
See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294,
634-
https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81.
635-
636-
**Usecase:** In systemd we generally would prefer using `fexecve()`
637-
to safely and race-freely invoke processes, but the fact that `comm`
638-
is useless after invoking a process that way makes the call
639-
unfortunately hard to use for systemd.
640-
641503
### Path-based ACL management in an LSM hook
642504

643505
The LSM module API should have the ability to do path-based (not
@@ -766,12 +628,6 @@ Add an option to go from individual thread to thread-group leader.
766628
**Use-Case:** Allow for a race free way to go from individual thread
767629
to thread-group leader pidfd.
768630

769-
### Namespace ioctl to translate a PID between PID namespaces
770-
771-
**Use-Case:** This makes it possible to e.g., figure out what a given PID in
772-
a PID namespace corresponds to in the caller's PID namespace. For example, to
773-
figure out what the PID of PID 1 inside of a given PID namespace is.
774-
775631
### Useful handling of LSM denials on SCM_RIGHTS
776632

777633
Right now if some LSM such as SELinux denies an `AF_UNIX` socket peer
@@ -811,6 +667,177 @@ on received messages.
811667

812668
## Finished Items
813669

670+
### Namespace ioctl to translate a PID between PID namespaces
671+
672+
[x] Namespace ioctl to translate a PID between PID namespaces
673+
674+
**🙇 `ca567df74a28a9fb368c6b2d93e864113f73f5c2 ("nsfs: add pid translation ioctls")` 🙇**
675+
676+
**Use-Case:** This makes it possible to e.g., figure out what a given PID in
677+
a PID namespace corresponds to in the caller's PID namespace. For example, to
678+
figure out what the PID of PID 1 inside of a given PID namespace is.
679+
680+
### API to determine the parent process ID of a pidfd
681+
682+
[x] API to determine the parent process ID of a pidfd
683+
684+
An API to determine the parent process ID (ppid) of a pidfd would be
685+
good.
686+
687+
This information is relevant to code dealing with pidfds, since if
688+
the ppid of a pidfd matches the process's own PID it can call
689+
`waitid()` on the process; if it doesn't, it cannot, and such a call
690+
would fail. It would be very useful if this could be determined
691+
easily before even calling that syscall.
692+
693+
**🙇 `cdda1f26e74b ("pidfd: add ioctl to retrieve pid info")` 🙇**
694+
695+
**Usecase:** systemd manages a multitude of processes, most of which
696+
are its own children, but many which are not. It would be great if
697+
we could easily determine whether it is worth waiting for
698+
`SIGCHLD`/`waitid()` on them or whether waiting for `POLLIN` on
699+
them is the only way to get exit notification.
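
The distinction drawn above (reap children with `waitid()`, watch everything else via `POLLIN`) can be sketched as follows. This is an illustration only: it does not use the pid info ioctl from the referenced commit but falls back on the `ECHILD` error to detect non-children. The helper names are hypothetical, and the syscall number fallback and the numeric `P_PIDFD` value are assumptions for older libc headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: generic syscall number */
#endif

/* Assumption: P_PIDFD has the kernel ABI value 3; older libcs lack the name. */
#define SKETCH_P_PIDFD ((idtype_t)3)

static int open_pidfd(pid_t pid)
{
	return (int)syscall(SYS_pidfd_open, pid, 0);
}

/*
 * Block until the process behind `pidfd` has exited.
 * Returns 1 if it was our child and has been reaped (*status is set),
 * 0 if it exited but is not our child, and -1 on error.
 */
static int await_exit(int pidfd, int *status)
{
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
	siginfo_t si;

	assert(status != NULL);

	/* A pidfd becomes readable on exit, for children and non-children. */
	while (poll(&pfd, 1, -1) < 0) {
		if (errno != EINTR)
			return -1;
	}

	/* Only a parent may collect the exit status. */
	memset(&si, 0, sizeof(si));
	if (waitid(SKETCH_P_PIDFD, (id_t)pidfd, &si, WEXITED) == 0) {
		*status = si.si_status;
		return 1;
	}
	return errno == ECHILD ? 0 : -1;
}
```

Knowing the ppid up front would let a caller skip the `waitid()` attempt for processes it can never reap.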
700+
701+
### Set `comm` field before `exec()`
702+
703+
[x] Set `comm` field before `exec()`
704+
705+
There should be a way to control the process' `comm` field if
706+
started via `fexecve()`/`execveat()`.
707+
708+
Right now, when `fexecve()`/`execveat()` is used, the `comm` field
709+
(i.e. `/proc/self/comm`) contains a name derived from the numeric fd,
710+
which breaks `ps -C …` and various other tools. In particular when
711+
the fd was opened with `O_CLOEXEC`, the number of the fd in the old
712+
process is completely meaningless.
713+
714+
The goal is to add a way to tell `fexecve()`/`execveat()` what name to use.
715+
716+
Since `comm` is under user control anyway (via `PR_SET_NAME`), it
717+
should be safe to also make it somehow configurable at fexecve()
718+
time.
719+
720+
See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294,
721+
https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81.
722+
723+
**🙇 `543841d18060 ("exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case")` 🙇**
724+
725+
**Usecase:** In systemd we generally would prefer using `fexecve()`
726+
to safely and race-freely invoke processes, but the fact that `comm`
727+
is useless after invoking a process that way makes the call
728+
unfortunately hard to use for systemd.
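
To see what `comm` looks like after an fd-based exec, a small experiment along these lines may help. It is a sketch, not systemd's code: it assumes `/bin/cat` exists and `/proc` is mounted, and the helper name `comm_after_fexecve()` is hypothetical. The child runs `cat /proc/self/comm` via `fexecve()` and the parent reads the result through a pipe.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Execute /bin/cat through an O_CLOEXEC fd and capture the comm value
 * the new process image reports about itself.
 * Returns 0 on success and stores a NUL-terminated string in `buf`.
 */
static int comm_after_fexecve(char *buf, size_t len)
{
	int pipefd[2], status;
	ssize_t n;
	pid_t pid;

	assert(buf != NULL && len > 1);

	if (pipe(pipefd) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(pipefd[0]);
		close(pipefd[1]);
		return -1;
	}

	if (pid == 0) {
		char *argv[] = { "cat", "/proc/self/comm", NULL };
		int fd = open("/bin/cat", O_RDONLY | O_CLOEXEC);

		if (fd < 0 || dup2(pipefd[1], STDOUT_FILENO) < 0)
			_exit(127);
		close(pipefd[0]);
		close(pipefd[1]);
		fexecve(fd, argv, environ);
		_exit(127); /* only reached if the exec failed */
	}

	close(pipefd[1]);
	n = read(pipefd[0], buf, len - 1);
	close(pipefd[0]);

	if (waitpid(pid, &status, 0) < 0 || n <= 0)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	buf[n] = '\0';
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

On kernels without the referenced fix the captured string is expected to be derived from the fd number, while kernels with it should report a name based on the executed file. Which one appears depends on the running kernel, so the sketch does not hard-code either.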
729+
### Make statx() on a pidfd return additional info
730+
731+
Make statx() on a pidfd return additional recognizable identifiers in
732+
`.stx_btime`.
733+
734+
**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**
735+
736+
It would be fantastic if issuing statx() on any pidfd would return
737+
the start time of the process in `.stx_btime` even after the process
738+
died.
739+
740+
These fields should in particular be queryable *after* the process
741+
has already exited and been reaped, i.e. after its PID has already
742+
been recycled.
743+
744+
**Usecase:** In systemd we maintain lists of processes in a hash
745+
table. Right now, the key is the PID, but this is less than ideal
746+
because of PID recycling. Being able to use the `.stx_btime`
747+
and/or `.stx_ino` fields instead would be perfect to safely
748+
identify, track and compare processes even after they ceased to exist.
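
A hash table key of the kind described here could be obtained roughly as below. This is a hedged sketch: the struct and helper names are hypothetical, the syscall number fallback is an assumption, and whether `stx_btime` is filled in depends on the kernel, so the code only trusts it when the kernel sets `STATX_BTIME` in the returned mask. The inode number is only a unique process identity on kernels with pidfs.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: generic syscall number */
#endif

/* Hypothetical key type for a process table. */
struct pidfd_identity {
	uint64_t ino;      /* inode number of the pidfd */
	bool has_btime;    /* true if the kernel reported a birth time */
	int64_t btime_sec; /* valid only if has_btime is true */
};

static int pidfd_open_raw(pid_t pid)
{
	return (int)syscall(SYS_pidfd_open, pid, 0);
}

/* Fill `id` from statx() on the pidfd itself. Returns 0 or -1. */
static int pidfd_get_identity(int pidfd, struct pidfd_identity *id)
{
	struct statx stx;

	assert(id != NULL);

	memset(&stx, 0, sizeof(stx));
	if (statx(pidfd, "", AT_EMPTY_PATH, STATX_INO | STATX_BTIME, &stx) < 0)
		return -1;
	if (!(stx.stx_mask & STATX_INO))
		return -1;

	id->ino = stx.stx_ino;
	id->has_btime = (stx.stx_mask & STATX_BTIME) != 0;
	id->btime_sec = id->has_btime ? (int64_t)stx.stx_btime.tv_sec : 0;
	return 0;
}
```

Two pidfds that refer to the same process should produce the same `ino`, which is what makes the value usable for comparison.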
749+
750+
### Allow creating idmapped mounts from idmapped mounts
751+
752+
[x] Allow creating idmapped mounts from idmapped mounts
753+
754+
Add a new `OPEN_TREE_CLEAR` flag to `open_tree()` that can only be
755+
used in conjunction with `OPEN_TREE_CLONE`. When specified it will clear
756+
all mount properties from that mount including the mount's idmapping.
757+
Requires the caller to be `ns_capable(mntns->user_ns)`. If idmapped mounts
758+
are encountered the caller must be `ns_capable(sb->user_ns, CAP_SYS_ADMIN)`
759+
in the filesystem's user namespace.
760+
761+
Locked mount properties cannot be changed. A mount's idmapping becomes
762+
locked if it propagates across user namespaces.
763+
764+
This is useful to get a new, clear mount and also allows the caller to
765+
create a new detached mount with an idmapping attached to the mount. Iow,
766+
the caller may idmap the mount afterwards.
767+
768+
**🙇 `c4a16820d901 ("fs: add open_tree_attr()")` 🙇**
769+
770+
**Use-Case:** A user may already use an idmapped mount for their home
771+
directory. And once a mount has been idmapped the idmapping cannot be
772+
changed anymore. This allows for simple semantics and allows to avoid
773+
lifetime complexity in order to account for scenarios where concurrent
774+
readers or writers might still use a given user namespace while it is about
775+
to be changed.
776+
But this poses a problem when the user wants to attach an idmapping to
777+
a mount that is already idmapped. The new flag allows to solve this
778+
problem. A sufficiently privileged user such as a container manager can
779+
create a user namespace for the container which expresses the desired
780+
ownership. Then they can create a new detached mount without any prior
781+
mount properties via OPEN_TREE_CLEAR and then attach the idmapping to this
782+
mount.
783+
784+
### Require a user namespace to have an idmapping when attached
785+
786+
[x] Require a user namespace to have an idmapping when attached
787+
788+
Enforce that the user namespace about to be attached to a mount must
789+
have an idmapping written.
790+
791+
**🙇 `dacfd001eaf2 ("fs/mnt_idmapping.c: Return -EINVAL when no map is written")` 🙇**
792+
793+
**Use-Case:** Tighten the semantics.
794+
795+
### Mount notifications without rescanning of `/proc/self/mountinfo`
796+
797+
[x] Mount notifications without rescanning of `/proc/self/mountinfo`
798+
799+
Mount notifications that do not require continuous rescanning of
800+
`/proc/self/mountinfo`. Currently, if a program wants to track
801+
mounts established on the system it can receive `poll()`able
802+
events via a file descriptor to `/proc/self/mountinfo`. When
803+
receiving them it needs to rescan the file from the top and
804+
compare it with the previous scan. This is both slow and
805+
racy. It's slow on systems with a large number of mounts as the
806+
cost for re-scanning the table has to be paid for every change to
807+
the mount table. It's racy because quickly added and removed
808+
mounts might not be noticed.
809+
810+
**🙇 `0f46d81f2bce ("fanotify: notify on mount attach and detach")` 🙇**
811+
812+
**Use-Case:** `systemd` tracks the mount table to integrate the mounts
813+
into its own dependency management.
814+
815+
### Mount a subdirectory instead of the top-level directory
816+
817+
[x] Mount a subdirectory instead of the top-level directory
818+
819+
Ability to mount a subdirectory of a regular file system instead of
820+
the top-level directory. E.g. for a file system `/dev/sda1` which
821+
contains a sub-directory `/foobar`, mount `/foobar` without having
822+
to mount its parent directory first. Consider something like this:
823+
824+
```
825+
mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar
826+
```
827+
828+
(This is of course already possible via some mount namespacing
829+
shenanigans, but this requires namespacing to be available, and is
830+
not precisely obvious to implement. Explicit kernel support at mount
831+
time would be much preferable.)
832+
833+
**🙇 `c5c12f871a30 ("fs: create detached mounts from detached mounts")` 🙇**
834+
835+
**Use-Case:** `systemd-homed` currently mounts a sub-directory of
836+
the per-user LUKS volume as the user's home directory (and not the
837+
root directory of the per-user LUKS volume's file system!), and in
838+
order to implement this invisibly from the host side requires a
839+
complex mount namespace exercise.
840+
814841
### Unmounting of obstructed mounts
815842

816843
[x] ability to unmount obstructed mounts. (this means: you have a stack
