Commit ed56543

brauner authored and keszybz committed
Move finished items into separate section
at the end of the list.

Signed-off-by: Christian Brauner <[email protected]>
1 parent add290c commit ed56543

1 file changed: +124 -104 lines changed

README.md

Lines changed: 124 additions & 104 deletions
@@ -10,21 +10,6 @@ associated problem space.
 point that out explicitly and clearly in the associated patches and Cc
 `Christian Brauner <brauner (at) kernel (dot) org`.**

-* [x] Ability to unmount obstructed mounts. (This means: you have a stack
-of mounts on the very same inode, and you want to remove a mount in
-the middle. Right now, you can only remove the topmost mount.)
-
-**🙇 Instead of the ability to unmount obstructured mounts we gained
-the ability to mount beneath an existing mount, with mostly
-equivalent outcome. `6ac392815628f317fcfdca1a39df00b9cc4ebc8b
-("fs: allow to mount beneath top mount") 🙇**
-
-**Use-Case:** this is useful for replacing mounts atomically, for
-example for upgrading versioned disk images: first an old version
-of the image is mounted. Then a new version is mounted over the
-existing mount point, and then the lower mount point is
-removed. One such software would be `systemd-sysext`.
-
 * Ability to mount sub-directories of regular file systems instead of
 the top-level directory. i.e. for a file system `/dev/sda1` which
 contains a sub-directory `/foobar` mount `/foobar` without having
@@ -77,17 +62,6 @@ point that out explicitly and clearly in the associated patches and Cc
 log and immediately exit, the cgroup information frequently cannot
 be acquired anymore by `systemd-journald`.

-* [x] `SCM_PIDFD` or similar auxiliary socket message, that is a modern
-version of the `SCM_CREDS` message's `.pid` field, and provides a
-`pidfd` file descriptor to the originating peer process.
-
-**🙇 `5e2ff6704a275be00 ("scm: add SO_PASSPIDFD and SCM_PIDFD)")` 🙇**
-
-**Use-Case:** security infrastructure (such as PolicyKit) can safely
-reference clients this way without fearing PID
-recycling. `systemd-journald` can acquire peer metadata this way in
-a less racy fashion, in particular safe against PID recycling.
-
 * Ability to link an `O_TMPFILE` file into a directory while *replacing* an
 existing file. (Currently there's only the ability to link it in, if the
 file name doesn't exist yet.)
@@ -115,23 +89,6 @@ point that out explicitly and clearly in the associated patches and Cc
 pointed to `file://dev/zero`, not expecting an endless amount of
 data to read.

-* [x] `IP_UNICAST_IF` should be taken into account for routing decisions
-at UDP `connect()` time (currently it isn't, only `SO_BINDTOINDEX`
-is, but that does so much more than just that, and one often
-doesn't want that)
-
-**🙇 `0e4d354762cefd3e16b4cff8988ff276e45effc4 ("net-next: Fix
-IP_UNICAST_IF option behavior for connected sockets")` 🙇**
-
-**Use-Case:** DNS resolvers that associate DNS configuration with
-specific network interfaces (example: `systemd-resolved`) typically
-want to preferably route DNS traffic to the per-interface DNS
-server via that interface, but not make further restrictions on the
-origins or received replies, and all that without
-privileges. `IP_UNICAST_IF` fulfills this role fine for TCP, but
-for UDP it is not taken into account for the `connect()` routing
-decision.
-
 * `unlinkat3(dir_fd, name, inode_fd)`: taking one file descriptor
 for the directory to remove a file in, and another one referring
 to the inode of the filename to remove. This call should only
@@ -354,33 +311,6 @@ point that out explicitly and clearly in the associated patches and Cc
 **Use-Case:** block services or containers from re-opening/upgrading an
 `O_PATH` file descriptor through e.g. `/proc/<pid>/fd/<nr` as `O_WRONLY`.

-* [x] Implement a mount-specific companion to `statx()` that puts at least the
-following information into `struct mount_info`:
-
-**🙇 46eae99ef73302f9fb3dddcd67c374b3dffe8fd6 ("add statmount(2) syscall")`` 🙇**
-
-* mount flags: `MOUNT_ATTR_RDONLY`, ...
-* time flags: `MOUNT_ATTR_RELATIME`, ...
-Could probably be combined with mount flags.
-* propagation setting: `MS_SHARED)`, ...
-* peer group
-* mnt id of the mount
-* mnt id of the mount's parent
-* owning userns
-
-There's a bit more advanced stuff systemd would really want but which
-I think is misplaced in a mountinfo system call including:
-* list of primary and auxiliary block device major/minor
-* diskseq value of those device nodes (This is a new block device feature
-we added that allows preventing device recycling issues when e.g.
-removing usb devices very quickly and is needed for udev.)
-* uuid/fsid
-* feature flags (`O_TMPFILE`, `RENAME_EXCHANGE` supported etc.)
-
-**Use-Case:** low-level userspace tools have to interact with advanced
-mount information constantly. This is currently costly and brittel because
-they have to go and parse `/proc/<pid>/mountinfo`.
-
 * Make quotas work with user namespaces. The quota codepaths in the kernel
 currently broken and inconsistent and most interesting operations are
 guarded behind `capable(CAP_SYS_ADMIN)`, i.e., require `CAP_SYS_ADMIN` in
@@ -449,12 +379,6 @@ point that out explicitly and clearly in the associated patches and Cc
 **Use-Case:** Allow LSMs to make decisions about what mount properties to
 allow and what to deny.

-* [x] (kAPI) Add security hook to `create_user_ns()`.
-
-**🙇 `7cd4c5c2101c ("security, lsm: Introduce security_create_user_ns()")` 🙇**
-
-**Use-Case:** Allow LSMs to monitor user namespace creation.
-
 * A per-cgroup knob for coredump sizes. Currently coredump size
 control is strictly per process, and primarily under control of
 the processes themselves. It would be good if we had a per-cgroup
@@ -522,27 +446,6 @@ point that out explicitly and clearly in the associated patches and Cc
 **Use-Case:** Allow mounting images inside nspawn containers, and using
 RootImage= and friends in the systemd user manager.

-* [x] Support idmapped mounts for tmpfs
-
-**🙇 `7a80e5b8c6fa ("shmem: support idmapped mounts for tmpfs")` 🙇**
-
-**Use-Case:** Runtimes such as Kubernetes use a lot of `tmpfs` mounts of
-individual files or directories to expose information to containers/pods.
-Instead of having to change ownership permanently allow them to use an
-idmapped mount instead.
-
-@rata and @giuseppe brought this suggestion forward. For Kubernetes it is
-sufficient to support idmapped mounts of `tmpfs` instances mounted in the
-initial user namespace. However, in the future idmapped
-mounts of `tmpfs` instances mounted in user namespaces should be supported.
-Other container runtimes want to make use of this. The kernel is able to
-support this since at least `5.17`.
-
-Things to remember are that `tmpfs` mounts can serve as lower- or upper
-layers in `overlayfs` and care needs to be taken that this remains safe if
-idmapped mounts of `tmpfs` instances mounted in user namespaces are
-supported.
-
 * Support detached mounts with `pivot_root()`

 The new rootfs must currently refer to an attached mount. This restriction
@@ -613,18 +516,14 @@ point that out explicitly and clearly in the associated patches and Cc
 system extension with a key pair that is supposed to be good for
 container images only.

-* [x] Make statx() on a pidfd return additional recognizable identifiers
-in `.stx_btime` and `.stx_ino`.
+* Make statx() on a pidfd return additional recognizable identifiers
+in `.stx_btime`.

 **🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**

 It would be fantastic if issuing statx() on any pidfd would return
 the start time of the process in `.stx_btime` even after the process
-died, plus some reasonably stable 64bit identifier for the process
-in `.stx_ino`. Together these two fields would be perfect to
-identify processes pinned by a pidfd, and compare them, as the 96++
-bit of information they expose should be unique enough for the
-entire lifetime of the system to identify the processes.
+died.

 These fields should in particular be queriable *after* the process
 already exited and has been reaped, i.e. after its PID has already
@@ -780,3 +679,124 @@ point that out explicitly and clearly in the associated patches and Cc
 which protocols they support on them. For example D-Bus sockets
 could carry `user.dbus` set to `1`, and Varlink sockets
 `user.varlink` set to `1` and so on.
+
+## Finished Items
+
+* [x] ability to unmount obstructed mounts. (this means: you have a stack
+of mounts on the very same inode, and you want to remove a mount in
+the middle. right now, you can only remove the topmost mount.)
+
+**🙇 instead of the ability to unmount obstructured mounts we gained
+the ability to mount beneath an existing mount, with mostly
+equivalent outcome. `6ac392815628f317fcfdca1a39df00b9cc4ebc8b
+("fs: allow to mount beneath top mount") 🙇**
+
+**use-case:** this is useful for replacing mounts atomically, for
+example for upgrading versioned disk images: first an old version
+of the image is mounted. then a new version is mounted over the
+existing mount point, and then the lower mount point is
+removed. One such software would be `systemd-sysext`.
+
+* [x] `SCM_PIDFD` or similar auxiliary socket message, that is a modern
+version of the `SCM_CREDS` message's `.pid` field, and provides a
+`pidfd` file descriptor to the originating peer process.
+
+**🙇 `5e2ff6704a275be00 ("scm: add SO_PASSPIDFD and SCM_PIDFD)")` 🙇**
+
+**Use-Case:** security infrastructure (such as PolicyKit) can safely
+reference clients this way without fearing PID
+recycling. `systemd-journald` can acquire peer metadata this way in
+a less racy fashion, in particular safe against PID recycling.
+
+* [x] `IP_UNICAST_IF` should be taken into account for routing decisions
+at UDP `connect()` time (currently it isn't, only `SO_BINDTOINDEX`
+is, but that does so much more than just that, and one often
+doesn't want that)
+
+**🙇 `0e4d354762cefd3e16b4cff8988ff276e45effc4 ("net-next: Fix
+IP_UNICAST_IF option behavior for connected sockets")` 🙇**
+
+**Use-Case:** DNS resolvers that associate DNS configuration with
+specific network interfaces (example: `systemd-resolved`) typically
+want to preferably route DNS traffic to the per-interface DNS
+server via that interface, but not make further restrictions on the
+origins or received replies, and all that without
+privileges. `IP_UNICAST_IF` fulfills this role fine for TCP, but
+for UDP it is not taken into account for the `connect()` routing
+decision.
+
+* [x] Implement a mount-specific companion to `statx()` that puts at least the
+following information into `struct mount_info`:
+
+**🙇 46eae99ef73302f9fb3dddcd67c374b3dffe8fd6 ("add statmount(2) syscall")`` 🙇**
+
+* mount flags: `MOUNT_ATTR_RDONLY`, ...
+* time flags: `MOUNT_ATTR_RELATIME`, ...
+Could probably be combined with mount flags.
+* propagation setting: `MS_SHARED)`, ...
+* peer group
+* mnt id of the mount
+* mnt id of the mount's parent
+* owning userns
+
+There's a bit more advanced stuff systemd would really want but which
+I think is misplaced in a mountinfo system call including:
+* list of primary and auxiliary block device major/minor
+* diskseq value of those device nodes (This is a new block device feature
+we added that allows preventing device recycling issues when e.g.
+removing usb devices very quickly and is needed for udev.)
+* uuid/fsid
+* feature flags (`O_TMPFILE`, `RENAME_EXCHANGE` supported etc.)
+
+**Use-Case:** low-level userspace tools have to interact with advanced
+mount information constantly. This is currently costly and brittel because
+they have to go and parse `/proc/<pid>/mountinfo`.
+
+* [x] (kAPI) Add security hook to `create_user_ns()`.
+
+**🙇 `7cd4c5c2101c ("security, lsm: Introduce security_create_user_ns()")` 🙇**
+
+**Use-Case:** Allow LSMs to monitor user namespace creation.
+
+* [x] Support idmapped mounts for tmpfs
+
+**🙇 `7a80e5b8c6fa ("shmem: support idmapped mounts for tmpfs")` 🙇**
+
+**Use-Case:** Runtimes such as Kubernetes use a lot of `tmpfs` mounts of
+individual files or directories to expose information to containers/pods.
+Instead of having to change ownership permanently allow them to use an
+idmapped mount instead.
+
+@rata and @giuseppe brought this suggestion forward. For Kubernetes it is
+sufficient to support idmapped mounts of `tmpfs` instances mounted in the
+initial user namespace. However, in the future idmapped
+mounts of `tmpfs` instances mounted in user namespaces should be supported.
+Other container runtimes want to make use of this. The kernel is able to
+support this since at least `5.17`.
+
+Things to remember are that `tmpfs` mounts can serve as lower- or upper
+layers in `overlayfs` and care needs to be taken that this remains safe if
+idmapped mounts of `tmpfs` instances mounted in user namespaces are
+supported.
+
+* [x] Make statx() on a pidfd return additional recognizable identifiers
+in `.stx_ino`.
+
+**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**
+
+It would be fantastic if issuing statx() on any pidfd would return some
+reasonably stable 64bit identifier for the process in `.stx_ino`. This would
+be perfect to identify processes pinned by a pidfd, and compare them.
+
+* [x] Namespace `binfmt_misc` filesystem.
+
+**🙇 `21ca59b365c0 ("binfmt_misc: enable sandboxed mounts")` 🙇**
+
+**Use-Case:** Allow containers and sandboxes to register their own binfmt
+handlers.
+
+* [x] Support idmapped mounts for `overlayfs`
+
+**🙇 `bc70682a497c ("ovl: support idmapped layers")` 🙇**
+
+**Use-Case:** Allow containers to use `overlayfs` with idmapped mounts.
