Skip to content

Commit 7d7a103

Browse files
committed
Merge tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner: "Features: - Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD socket option SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for the process that called connect() or listen(). This is heavily used to safely authenticate clients in userspace avoiding security bugs due to pid recycling races (dbus, polkit, systemd, etc.) SO_PEERPIDFD currently doesn't support handing out pidfds if the sk->sk_peer_pid thread-group leader has already been reaped. In this case it currently returns EINVAL. Userspace still wants to get a pidfd for a reaped process to have a stable handle it can pass on. This is especially useful now that it is possible to retrieve exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s PIDFD_INFO_EXIT flag Another summary has been provided by David Rheinsberg: > A pidfd can outlive the task it refers to, and thus user-space > must already be prepared that the task underlying a pidfd is > gone at the time they get their hands on the pidfd. For > instance, resolving the pidfd to a PID via the fdinfo must be > prepared to read `-1`. > > Despite user-space knowing that a pidfd might be stale, several > kernel APIs currently add another layer that checks for this. In > particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was > already reaped, but returns a stale pidfd if the task is reaped > immediately after the respective alive-check. > > This has the unfortunate effect that user-space now has two ways > to check for the exact same scenario: A syscall might return > EINVAL/ESRCH/... *or* the pidfd might be stale, even though > there is no particular reason to distinguish both cases. This > also propagates through user-space APIs, which pass on pidfds. > They must be prepared to pass on `-1` *or* the pidfd, because > there is no guaranteed way to get a stale pidfd from the kernel. > > Userspace must already deal with a pidfd referring to a reaped > task as the task may exit and get reaped at any time will there > are still many pidfds referring to it In order to allow handing out reaped pidfd SO_PEERPIDFD needs to ensure that PIDFD_INFO_EXIT information is available whenever a pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi promises that reaped pidfds are only handed out if it is guaranteed that the caller sees the exit information: TEST_F(pidfd_info, success_reaped) { struct pidfd_info info = { .mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT, }; /* * Process has already been reaped and PIDFD_INFO_EXIT been set. * Verify that we can retrieve the exit status of the process. */ ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0); ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS)); ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT)); ASSERT_TRUE(WIFEXITED(info.exit_code)); ASSERT_EQ(WEXITSTATUS(info.exit_code), 0); } To hand out pidfds for reaped processes we thus allocate a pidfs entry for the relevant sk->sk_peer_pid at the time the sk->sk_peer_pid is stashed and drop it when the socket is destroyed. This guarantees that exit information will always be recorded for the sk->sk_peer_pid task and we can hand out pidfds for reaped processes - Hand a pidfd to the coredump usermode helper process Give userspace a way to instruct the kernel to install a pidfd for the crashing process into the process started as a usermode helper. There's still tricky race-windows that cannot be easily or sometimes not closed at all by userspace. There's various ways like looking at the start time of a process to make sure that the usermode helper process is started after the crashing process but it's all very very brittle and fraught with peril The crashed-but-not-reaped process can be killed by userspace before coredump processing programs like systemd-coredump have had time to manually open a PIDFD from the PID the kernel provides them, which means they can be tricked into reading from an arbitrary process, and they run with full privileges as they are usermode helper processes Even if that specific race-window wouldn't exist it's still the safest and cleanest way to let the kernel provide the pidfd directly instead of requiring userspace to do it manually. In parallel with this commit we already have systemd adding support for this in [1] When the usermode helper process is forked we install a pidfd file descriptor three into the usermode helper's file descriptor table so it's available to the exec'd program Since usermode helpers are either children of the system_unbound_wq workqueue or kthreadd we know that the file descriptor table is empty and can thus always use three as the file descriptor number Note, that we'll install a pidfd for the thread-group leader even if a subthread is calling do_coredump(). We know that task linkage hasn't been removed yet and even if this @current isn't the actual thread-group leader we know that the thread-group leader cannot be reaped until @current has exited - Allow telling when a task has not been found from finding the wrong task when creating a pidfd We currently report EINVAL whenever a struct pid has no tasked attached anymore thereby conflating two concepts: (1) The task has already been reaped (2) The caller requested a pidfd for a thread-group leader but the pid actually references a struct pid that isn't used as a thread-group leader This is causing issues for non-threaded workloads as in where they expect ESRCH to be reported, not EINVAL So allow userspace to reliably distinguish between (1) and (2) - Make it possible to detect when a pidfs entry would outlive the struct pid it pinned - Add a range of new selftests Cleanups: - Remove unneeded NULL check from pidfd_prepare() for passed struct pid - Avoid pointless reference count bump during release_task() Fixes: - Various fixes to the pidfd and coredump selftests - Fix error handling for replace_fd() when spawning coredump usermode helper" * tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pidfs: detect refcount bugs coredump: hand a pidfd to the usermode coredump helper coredump: fix error handling for replace_fd() pidfs: move O_RDWR into pidfs_alloc_file() selftests: coredump: Raise timeout to 2 minutes selftests: coredump: Fix test failure for slow machines selftests: coredump: Properly initialize pointer net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid pidfs: get rid of __pidfd_prepare() net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid pidfs: register pid in pidfs net, pidfd: report EINVAL for ESRCH release_task: kill the no longer needed get/put_pid(thread_pid) pidfs: ensure consistent ENOENT/ESRCH reporting exit: move wake_up_all() pidfd waiters into __unhash_process() selftest/pidfd: add test for thread-group leader pidfd open for thread pidfd: improve uapi when task isn't found pidfd: remove unneeded NULL check from pidfd_prepare() selftests/pidfd: adapt to recent changes
2 parents 2ca3534 + db56723 commit 7d7a103

File tree

13 files changed

+286
-93
lines changed

13 files changed

+286
-93
lines changed

fs/coredump.c

Lines changed: 59 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,8 @@
4343
#include <linux/timekeeping.h>
4444
#include <linux/sysctl.h>
4545
#include <linux/elf.h>
46+
#include <linux/pidfs.h>
47+
#include <uapi/linux/pidfd.h>
4648

4749
#include <linux/uaccess.h>
4850
#include <asm/mmu_context.h>
@@ -60,6 +62,12 @@ static void free_vma_snapshot(struct coredump_params *cprm);
6062
#define CORE_FILE_NOTE_SIZE_DEFAULT (4*1024*1024)
6163
/* Define a reasonable max cap */
6264
#define CORE_FILE_NOTE_SIZE_MAX (16*1024*1024)
65+
/*
66+
* File descriptor number for the pidfd for the thread-group leader of
67+
* the coredumping task installed into the usermode helper's file
68+
* descriptor table.
69+
*/
70+
#define COREDUMP_PIDFD_NUMBER 3
6371

6472
static int core_uses_pid;
6573
static unsigned int core_pipe_limit;
@@ -339,6 +347,27 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
339347
case 'C':
340348
err = cn_printf(cn, "%d", cprm->cpu);
341349
break;
350+
/* pidfd number */
351+
case 'F': {
352+
/*
353+
* Installing a pidfd only makes sense if
354+
* we actually spawn a usermode helper.
355+
*/
356+
if (!ispipe)
357+
break;
358+
359+
/*
360+
* Note that we'll install a pidfd for the
361+
* thread-group leader. We know that task
362+
* linkage hasn't been removed yet and even if
363+
* this @current isn't the actual thread-group
364+
* leader we know that the thread-group leader
365+
* cannot be reaped until @current has exited.
366+
*/
367+
cprm->pid = task_tgid(current);
368+
err = cn_printf(cn, "%d", COREDUMP_PIDFD_NUMBER);
369+
break;
370+
}
342371
default:
343372
break;
344373
}
@@ -493,7 +522,7 @@ static void wait_for_dump_helpers(struct file *file)
493522
}
494523

495524
/*
496-
* umh_pipe_setup
525+
* umh_coredump_setup
497526
* helper function to customize the process used
498527
* to collect the core in userspace. Specifically
499528
* it sets up a pipe and installs it as fd 0 (stdin)
@@ -503,22 +532,46 @@ static void wait_for_dump_helpers(struct file *file)
503532
* is a special value that we use to trap recursive
504533
* core dumps
505534
*/
506-
static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
535+
static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
507536
{
508537
struct file *files[2];
509538
struct coredump_params *cp = (struct coredump_params *)info->data;
510-
int err = create_pipe_files(files, 0);
539+
int err;
540+
541+
if (cp->pid) {
542+
struct file *pidfs_file __free(fput) = NULL;
543+
544+
pidfs_file = pidfs_alloc_file(cp->pid, 0);
545+
if (IS_ERR(pidfs_file))
546+
return PTR_ERR(pidfs_file);
547+
548+
/*
549+
* Usermode helpers are childen of either
550+
* system_unbound_wq or of kthreadd. So we know that
551+
* we're starting off with a clean file descriptor
552+
* table. So we should always be able to use
553+
* COREDUMP_PIDFD_NUMBER as our file descriptor value.
554+
*/
555+
err = replace_fd(COREDUMP_PIDFD_NUMBER, pidfs_file, 0);
556+
if (err < 0)
557+
return err;
558+
}
559+
560+
err = create_pipe_files(files, 0);
511561
if (err)
512562
return err;
513563

514564
cp->file = files[1];
515565

516566
err = replace_fd(0, files[0], 0);
517567
fput(files[0]);
568+
if (err < 0)
569+
return err;
570+
518571
/* and disallow core files too */
519572
current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
520573

521-
return err;
574+
return 0;
522575
}
523576

524577
void do_coredump(const kernel_siginfo_t *siginfo)
@@ -593,7 +646,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
593646
}
594647

595648
if (cprm.limit == 1) {
596-
/* See umh_pipe_setup() which sets RLIMIT_CORE = 1.
649+
/* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
597650
*
598651
* Normally core limits are irrelevant to pipes, since
599652
* we're not writing to the file system, but we use
@@ -632,7 +685,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
632685
retval = -ENOMEM;
633686
sub_info = call_usermodehelper_setup(helper_argv[0],
634687
helper_argv, NULL, GFP_KERNEL,
635-
umh_pipe_setup, NULL, &cprm);
688+
umh_coredump_setup, NULL, &cprm);
636689
if (sub_info)
637690
retval = call_usermodehelper_exec(sub_info,
638691
UMH_WAIT_EXEC);

fs/pidfs.c

Lines changed: 73 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -746,7 +746,7 @@ static inline bool pidfs_pid_valid(struct pid *pid, const struct path *path,
746746
{
747747
enum pid_type type;
748748

749-
if (flags & PIDFD_CLONE)
749+
if (flags & PIDFD_STALE)
750750
return true;
751751

752752
/*
@@ -755,10 +755,14 @@ static inline bool pidfs_pid_valid(struct pid *pid, const struct path *path,
755755
* pidfd has been allocated perform another check that the pid
756756
* is still alive. If it is exit information is available even
757757
* if the task gets reaped before the pidfd is returned to
758-
* userspace. The only exception is PIDFD_CLONE where no task
759-
* linkage has been established for @pid yet and the kernel is
760-
* in the middle of process creation so there's nothing for
761-
* pidfs to miss.
758+
* userspace. The only exception are indicated by PIDFD_STALE:
759+
*
760+
* (1) The kernel is in the middle of task creation and thus no
761+
* task linkage has been established yet.
762+
* (2) The caller knows @pid has been registered in pidfs at a
763+
* time when the task was still alive.
764+
*
765+
* In both cases exit information will have been reported.
762766
*/
763767
if (flags & PIDFD_THREAD)
764768
type = PIDTYPE_PID;
@@ -852,11 +856,11 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
852856
int ret;
853857

854858
/*
855-
* Ensure that PIDFD_CLONE can be passed as a flag without
859+
* Ensure that PIDFD_STALE can be passed as a flag without
856860
* overloading other uapi pidfd flags.
857861
*/
858-
BUILD_BUG_ON(PIDFD_CLONE == PIDFD_THREAD);
859-
BUILD_BUG_ON(PIDFD_CLONE == PIDFD_NONBLOCK);
862+
BUILD_BUG_ON(PIDFD_STALE == PIDFD_THREAD);
863+
BUILD_BUG_ON(PIDFD_STALE == PIDFD_NONBLOCK);
860864

861865
ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path);
862866
if (ret < 0)
@@ -865,7 +869,8 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
865869
if (!pidfs_pid_valid(pid, &path, flags))
866870
return ERR_PTR(-ESRCH);
867871

868-
flags &= ~PIDFD_CLONE;
872+
flags &= ~PIDFD_STALE;
873+
flags |= O_RDWR;
869874
pidfd_file = dentry_open(&path, flags, current_cred());
870875
/* Raise PIDFD_THREAD explicitly as do_dentry_open() strips it. */
871876
if (!IS_ERR(pidfd_file))
@@ -874,6 +879,65 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
874879
return pidfd_file;
875880
}
876881

882+
/**
883+
* pidfs_register_pid - register a struct pid in pidfs
884+
* @pid: pid to pin
885+
*
886+
* Register a struct pid in pidfs. Needs to be paired with
887+
* pidfs_put_pid() to not risk leaking the pidfs dentry and inode.
888+
*
889+
* Return: On success zero, on error a negative error code is returned.
890+
*/
891+
int pidfs_register_pid(struct pid *pid)
892+
{
893+
struct path path __free(path_put) = {};
894+
int ret;
895+
896+
might_sleep();
897+
898+
if (!pid)
899+
return 0;
900+
901+
ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path);
902+
if (unlikely(ret))
903+
return ret;
904+
/* Keep the dentry and only put the reference to the mount. */
905+
path.dentry = NULL;
906+
return 0;
907+
}
908+
909+
/**
910+
* pidfs_get_pid - pin a struct pid through pidfs
911+
* @pid: pid to pin
912+
*
913+
* Similar to pidfs_register_pid() but only valid if the caller knows
914+
* there's a reference to the @pid through a dentry already that can't
915+
* go away.
916+
*/
917+
void pidfs_get_pid(struct pid *pid)
918+
{
919+
if (!pid)
920+
return;
921+
WARN_ON_ONCE(!stashed_dentry_get(&pid->stashed));
922+
}
923+
924+
/**
925+
* pidfs_put_pid - drop a pidfs reference
926+
* @pid: pid to drop
927+
*
928+
* Drop a reference to @pid via pidfs. This is only safe if the
929+
* reference has been taken via pidfs_get_pid().
930+
*/
931+
void pidfs_put_pid(struct pid *pid)
932+
{
933+
might_sleep();
934+
935+
if (!pid)
936+
return;
937+
VFS_WARN_ON_ONCE(!pid->stashed);
938+
dput(pid->stashed);
939+
}
940+
877941
static void pidfs_inode_init_once(void *data)
878942
{
879943
struct pidfs_inode *pi = data;

include/linux/coredump.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,7 @@ struct coredump_params {
2828
int vma_count;
2929
size_t vma_data_size;
3030
struct core_vma_metadata *vma_meta;
31+
struct pid *pid;
3132
};
3233

3334
extern unsigned int core_file_note_size_limit;

include/linux/pid.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -77,7 +77,7 @@ struct file;
7777
struct pid *pidfd_pid(const struct file *file);
7878
struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags);
7979
struct task_struct *pidfd_get_task(int pidfd, unsigned int *flags);
80-
int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret);
80+
int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret_file);
8181
void do_notify_pidfd(struct task_struct *task);
8282

8383
static inline struct pid *get_pid(struct pid *pid)

include/linux/pidfs.h

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,5 +8,8 @@ void pidfs_add_pid(struct pid *pid);
88
void pidfs_remove_pid(struct pid *pid);
99
void pidfs_exit(struct task_struct *tsk);
1010
extern const struct dentry_operations pidfs_dentry_operations;
11+
int pidfs_register_pid(struct pid *pid);
12+
void pidfs_get_pid(struct pid *pid);
13+
void pidfs_put_pid(struct pid *pid);
1114

1215
#endif /* _LINUX_PID_FS_H */

include/uapi/linux/pidfd.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@
1212
#define PIDFD_THREAD O_EXCL
1313
#ifdef __KERNEL__
1414
#include <linux/sched.h>
15-
#define PIDFD_CLONE CLONE_PIDFD
15+
#define PIDFD_STALE CLONE_PIDFD
1616
#endif
1717

1818
/* Flags for pidfd_send_signal(). */

kernel/exit.c

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -133,8 +133,13 @@ struct release_task_post {
133133
static void __unhash_process(struct release_task_post *post, struct task_struct *p,
134134
bool group_dead)
135135
{
136+
struct pid *pid = task_pid(p);
137+
136138
nr_threads--;
139+
137140
detach_pid(post->pids, p, PIDTYPE_PID);
141+
wake_up_all(&pid->wait_pidfd);
142+
138143
if (group_dead) {
139144
detach_pid(post->pids, p, PIDTYPE_TGID);
140145
detach_pid(post->pids, p, PIDTYPE_PGID);
@@ -253,7 +258,8 @@ void release_task(struct task_struct *p)
253258
pidfs_exit(p);
254259
cgroup_release(p);
255260

256-
thread_pid = get_pid(p->thread_pid);
261+
/* Retrieve @thread_pid before __unhash_process() may set it to NULL. */
262+
thread_pid = task_pid(p);
257263

258264
write_lock_irq(&tasklist_lock);
259265
ptrace_release_task(p);
@@ -282,8 +288,8 @@ void release_task(struct task_struct *p)
282288
}
283289

284290
write_unlock_irq(&tasklist_lock);
291+
/* @thread_pid can't go away until free_pids() below */
285292
proc_flush_pid(thread_pid);
286-
put_pid(thread_pid);
287293
add_device_randomness(&p->se.sum_exec_runtime,
288294
sizeof(p->se.sum_exec_runtime));
289295
free_pids(post.pids);

0 commit comments

Comments
 (0)