Commit 4cae5e2

kernel-wishlist: move completed items to the end
1 parent 95b7daf commit 4cae5e2

1 file changed, +67 -67 lines changed

README.md

Lines changed: 67 additions & 67 deletions
@@ -135,52 +135,6 @@ entirely read-only. To close this gap it would be great if such
 propagated mounts could implicitly gain `MS_RDONLY` as they are
 propagated.

-### Disabling reception of `SCM_RIGHTS` for `AF_UNIX` sockets
-
-[x] Ability to turn off `SCM_RIGHTS` reception for `AF_UNIX`
-sockets.
-
-**🙇 `77cbe1a6d8730a07f99f9263c2d5f2304cf5e830 ("af_unix: Introduce SO_PASSRIGHTS")` 🙇**
-
-Right now reception of file descriptors is always on when
-a process makes the mistake of invoking `recvmsg()` on such a
-socket. This is problematic since `SCM_RIGHTS` installs file
-descriptors in the recipient process' file descriptor
-table. Getting rid of these file descriptors is not necessarily
-easy, as they could refer to "slow-to-close" files (think: dirty
-file descriptor referring to a file on an unresponsive NFS server,
-or some device file descriptor), that might cause the recipient to
-block for a longer time when it tries to close them. Programs reading
-from an `AF_UNIX` socket currently have three options:
-
-1. Never use `recvmsg()`, and stick to `read()`, `recv()` and
-similar which do not install file descriptors in the recipient's
-file descriptor table.
-
-2. Ignore the problem, and simply `close()` the received file descriptors
-it didn't expect, thus possibly locking up for a longer time.
-
-3. Fork off a thread that invokes `close()`, which mitigates the
-risk of blocking, but still means a sender can cause resource
-exhaustion in a recipient by flooding it with file descriptors,
-as for each of them a thread needs to be spawned and a file
-descriptor is taken while it is in the process of being closed.
-
-(Another option of course is to never talk `AF_UNIX` to peers that
-are not trusted to not send unexpected file descriptors.)
-
-A simple knob that allows turning off `SCM_RIGHTS` reception
-would be useful to close this weakness, and would allow
-`recvmsg()` to be called without risking file descriptors to be
-installed in the file descriptor table, and thus risking a
-blocking `close()` or a form of potential resource exhaustion.
-
-**Use-Case:** any program that uses `AF_UNIX` sockets and uses (or
-would like to use) `recvmsg()` on it (which is useful to acquire
-other metadata). Example: logging daemons that want to collect
-timestamp or `SCM_CREDS` auxiliary data, or the D-Bus message
-broker and suchlike.
-
 ### Filtering on received file descriptors

 An alternative to the previous item could be if some form of filtering
@@ -191,27 +145,6 @@ received" may be expressed. (BPF?).

 **Use-Case:** as above.

-### A reliable way to check for PID namespacing
-
-[x] A reliable (non-heuristic) way to detect from userspace if the
-current process is running in a PID namespace that is not the main
-PID namespace. PID namespaces are probably the primary type of
-namespace that identifies a container environment. While many
-heuristics exist to determine generically whether one is executed
-inside a container, it would be good to have a correct,
-well-defined way to determine this.
-
-**🙇 The inode number of the root PID namespace is fixed (0xEFFFFFFC)
-and now considered API. It can be used to distinguish the root PID
-namespace from all others. 🙇**
-
-**Use-Case:** tools such as `systemd-detect-virt` exist to determine
-container execution, but typically resolve to checking for
-specific implementations. It would be much nicer and universally
-applicable if such a check could be done generically. It would
-probably suffice to provide an `ioctl()` call on the `pidns` file
-descriptor that reveals this kind of information in some form.
-
 ### Excluding processes watched via `pidfd` from `waitid(P_ALL, …)`

 **Use-Case:** various programs use `waitid(P_ALL, …)` to collect exit
@@ -1007,3 +940,70 @@ handlers.
 **🙇 `bc70682a497c ("ovl: support idmapped layers")` 🙇**

 **Use-Case:** Allow containers to use `overlayfs` with idmapped mounts.
+
+### Disabling reception of `SCM_RIGHTS` for `AF_UNIX` sockets
+
+[x] Ability to turn off `SCM_RIGHTS` reception for `AF_UNIX`
+sockets.
+
+**🙇 `77cbe1a6d8730a07f99f9263c2d5f2304cf5e830 ("af_unix: Introduce SO_PASSRIGHTS")` 🙇**
+
+Right now reception of file descriptors is always on when
+a process makes the mistake of invoking `recvmsg()` on such a
+socket. This is problematic since `SCM_RIGHTS` installs file
+descriptors in the recipient process' file descriptor
+table. Getting rid of these file descriptors is not necessarily
+easy, as they could refer to "slow-to-close" files (think: dirty
+file descriptor referring to a file on an unresponsive NFS server,
+or some device file descriptor), that might cause the recipient to
+block for a longer time when it tries to close them. Programs reading
+from an `AF_UNIX` socket currently have three options:
+
+1. Never use `recvmsg()`, and stick to `read()`, `recv()` and
+similar which do not install file descriptors in the recipient's
+file descriptor table.
+
+2. Ignore the problem, and simply `close()` the received file descriptors
+it didn't expect, thus possibly locking up for a longer time.
+
+3. Fork off a thread that invokes `close()`, which mitigates the
+risk of blocking, but still means a sender can cause resource
+exhaustion in a recipient by flooding it with file descriptors,
+as for each of them a thread needs to be spawned and a file
+descriptor is taken while it is in the process of being closed.
+
+(Another option of course is to never talk `AF_UNIX` to peers that
+are not trusted to not send unexpected file descriptors.)
+
+A simple knob that allows turning off `SCM_RIGHTS` reception
+would be useful to close this weakness, and would allow
+`recvmsg()` to be called without risking file descriptors to be
+installed in the file descriptor table, and thus risking a
+blocking `close()` or a form of potential resource exhaustion.
+
+**Use-Case:** any program that uses `AF_UNIX` sockets and uses (or
+would like to use) `recvmsg()` on it (which is useful to acquire
+other metadata). Example: logging daemons that want to collect
+timestamp or `SCM_CREDS` auxiliary data, or the D-Bus message
+broker and suchlike.
+
+### A reliable way to check for PID namespacing
+
+[x] A reliable (non-heuristic) way to detect from userspace if the
+current process is running in a PID namespace that is not the main
+PID namespace. PID namespaces are probably the primary type of
+namespace that identifies a container environment. While many
+heuristics exist to determine generically whether one is executed
+inside a container, it would be good to have a correct,
+well-defined way to determine this.
+
+**🙇 The inode number of the root PID namespace is fixed (0xEFFFFFFC)
+and now considered API. It can be used to distinguish the root PID
+namespace from all others. 🙇**
+
+**Use-Case:** tools such as `systemd-detect-virt` exist to determine
+container execution, but typically resolve to checking for
+specific implementations. It would be much nicer and universally
+applicable if such a check could be done generically. It would
+probably suffice to provide an `ioctl()` call on the `pidns` file
+descriptor that reveals this kind of information in some form.
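
The `SCM_RIGHTS` item in the diff above is marked as completed by the `SO_PASSRIGHTS` commit. A minimal sketch of how a receiver might turn off file descriptor reception from Python follows, assuming a kernel that carries that commit. The numeric fallback `83` is an assumption taken from the generic socket header (some architectures use other values), and `disable_fd_reception` is a hypothetical helper name, not part of any library.

```python
import socket

# Assumption: 83 is the asm-generic value of SO_PASSRIGHTS. A few
# architectures (for example parisc, sparc, mips, alpha) number socket
# options differently. Prefer the constant if the socket module has it.
SO_PASSRIGHTS = getattr(socket, "SO_PASSRIGHTS", 83)


def disable_fd_reception(sock: socket.socket) -> bool:
    """Try to turn off SCM_RIGHTS reception on an AF_UNIX socket.

    Returns True if the kernel accepted the option, and False if the
    setsockopt() call failed, which is what happens on kernels that do
    not know the option (typically with ENOPROTOOPT).
    """
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_PASSRIGHTS, 0)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    receiver, sender = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    if disable_fd_reception(receiver):
        print("SCM_RIGHTS reception disabled on the receiving socket")
    else:
        print("SO_PASSRIGHTS not available; descriptors may still arrive")
    receiver.close()
    sender.close()
```

When the helper returns `False`, the caller still has to fall back to one of the three options listed in the README text.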

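The PID namespace item in the diff above states that the root PID namespace has the fixed inode number 0xEFFFFFFC. A check built on that could look roughly like the sketch below. Reading the inode with `stat()` on `/proc/self/ns/pid` is an assumption about how a program would obtain it, and the names `ROOT_PID_NS_INO` and `in_root_pid_namespace` are hypothetical.

```python
import os
from typing import Optional

# Fixed inode number of the root PID namespace, as stated in the README text.
ROOT_PID_NS_INO = 0xEFFFFFFC


def in_root_pid_namespace(path: str = "/proc/self/ns/pid") -> Optional[bool]:
    """Report whether the calling process is in the root PID namespace.

    Returns True or False when the namespace inode can be read, and None
    when it cannot (for example if /proc is not mounted).
    """
    try:
        # stat() follows the namespace link and yields the namespace inode.
        return os.stat(path).st_ino == ROOT_PID_NS_INO
    except OSError:
        return None


if __name__ == "__main__":
    result = in_root_pid_namespace()
    if result is None:
        print("cannot determine PID namespace")
    elif result:
        print("running in the root PID namespace")
    else:
        print("running in a non-root PID namespace")
```

Returning `None` keeps "cannot tell" separate from "not the root namespace", so callers do not mistake a missing `/proc` for a container.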