Skip to content

Commit 672dcda

Browse files
committed
Merge tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner: - persistent info Persist exit and coredump information independent of whether anyone currently holds a pidfd for the struct pid. The current scheme allocated pidfs dentries on-demand repeatedly. This scheme is reaching it's limits as it makes it impossible to pin information that needs to be available after the task has exited or coredumped and that should not be lost simply because the pidfd got closed temporarily. The next opener should still see the stashed information. This is also a prerequisite for supporting extended attributes on pidfds to allow attaching meta information to them. If someone opens a pidfd for a struct pid a pidfs dentry is allocated and stashed in pid->stashed. Once the last pidfd for the struct pid is closed the pidfs dentry is released and removed from pid->stashed. So if 10 callers create a pidfs dentry for the same struct pid sequentially, i.e., each closing the pidfd before the other creates a new one then a new pidfs dentry is allocated every time. Because multiple tasks acquiring and releasing a pidfd for the same struct pid can race with each another a task may still find a valid pidfs entry from the previous task in pid->stashed and reuse it. Or it might find a dead dentry in there and fail to reuse it and so stashes a new pidfs dentry. Multiple tasks may race to stash a new pidfs dentry but only one will succeed, the other ones will put their dentry. The current scheme aims to ensure that a pidfs dentry for a struct pid can only be created if the task is still alive or if a pidfs dentry already existed before the task was reaped and so exit information has been was stashed in the pidfs inode. That's great except that it's buggy. If a pidfs dentry is stashed in pid->stashed after pidfs_exit() but before __unhash_process() is called we will return a pidfd for a reaped task without exit information being available. The pidfds_pid_valid() check does not guard against this race as it doens't sync at all with pidfs_exit(). The pid_has_task() check might be successful simply because we're before __unhash_process() but after pidfs_exit(). Introduce a new scheme where the lifetime of information associated with a pidfs entry (coredump and exit information) isn't bound to the lifetime of the pidfs inode but the struct pid itself. The first time a pidfs dentry is allocated for a struct pid a struct pidfs_attr will be allocated which will be used to store exit and coredump information. If all pidfs for the pidfs dentry are closed the dentry and inode can be cleaned up but the struct pidfs_attr will stick until the struct pid itself is freed. This will ensure minimal memory usage while persisting relevant information. The new scheme has various advantages. First, it allows to close the race where we end up handing out a pidfd for a reaped task for which no exit information is available. Second, it minimizes memory usage. Third, it allows to remove complex lifetime tracking via dentries when registering a struct pid with pidfs. There's no need to get or put a reference. Instead, the lifetime of exit and coredump information associated with a struct pid is bound to the lifetime of struct pid itself. - extended attributes Now that we have a way to persist information for pidfs dentries we can start supporting extended attributes on pidfds. This will allow userspace to attach meta information to tasks. One natural extension would be to introduce a custom pidfs.* extended attribute space and allow for the inheritance of extended attributes across fork() and exec(). The first simple scheme will allow privileged userspace to set trusted extended attributes on pidfs inodes. - Allow autonomous pidfs file handles Various filesystems such as pidfs and drm support opening file handles without having to require a file descriptor to identify the filesystem. The filesystem are global single instances and can be trivially identified solely on the information encoded in the file handle. This makes it possible to not have to keep or acquire a sentinal file descriptor just to pass it to open_by_handle_at() to identify the filesystem. That's especially useful when such sentinel file descriptor cannot or should not be acquired. For pidfs this means a file handle can function as full replacement for storing a pid in a file. Instead a file handle can be stored and reopened purely based on the file handle. Such autonomous file handles can be opened with or without specifying a a file descriptor. If no proper file descriptor is used the FD_PIDFS_ROOT sentinel must be passed. This allows us to define further special negative fd sentinels in the future. Userspace can trivially test for support by trying to open the file handle with an invalid file descriptor. - Allow pidfds for reaped tasks with SCM_PIDFD messages This is a logical continuation of the earlier work to create pidfds for reaped tasks through the SO_PEERPIDFD socket option merged in 923ea4d ("Merge patch series "net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid""). - Two minor fixes: * Fold fs_struct->{lock,seq} into a seqlock * Don't bother with path_{get,put}() in unix_open_file() * tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (37 commits) don't bother with path_get()/path_put() in unix_open_file() fold fs_struct->{lock,seq} into a seqlock selftests: net: extend SCM_PIDFD test to cover stale pidfds af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFD af_unix: stash pidfs dentry when needed af_unix/scm: fix whitespace errors af_unix: introduce and use scm_replace_pid() helper af_unix: introduce unix_skb_to_scm helper af_unix: rework unix_maybe_add_creds() to allow sleep selftests/pidfd: decode pidfd file handles withou having to specify an fd fhandle, pidfs: support open_by_handle_at() purely based on file handle uapi/fcntl: add FD_PIDFS_ROOT uapi/fcntl: add FD_INVALID fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP uapi/fcntl: mark range as reserved fhandle: reflow get_path_anchor() pidfs: add pidfs_root_path() helper fhandle: rename to get_path_anchor() fhandle: hoist copy_from_user() above get_path_from_fd() fhandle: raise FILEID_IS_DIR in handle_type ...
2 parents 7031769 + 1f531e3 commit 672dcda

File tree

26 files changed

+894
-381
lines changed

26 files changed

+894
-381
lines changed

fs/coredump.c

Lines changed: 0 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -710,11 +710,6 @@ static bool coredump_sock_connect(struct core_name *cn, struct coredump_params *
710710

711711
retval = kernel_connect(socket, (struct sockaddr *)(&addr), addr_len,
712712
O_NONBLOCK | SOCK_COREDUMP);
713-
/*
714-
* ... Make sure to only put our reference after connect() took
715-
* its own reference keeping the pidfs entry alive ...
716-
*/
717-
pidfs_put_pid(cprm->pid);
718713

719714
if (retval) {
720715
if (retval == -EAGAIN)

fs/d_path.c

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -241,9 +241,9 @@ static void get_fs_root_rcu(struct fs_struct *fs, struct path *root)
241241
unsigned seq;
242242

243243
do {
244-
seq = read_seqcount_begin(&fs->seq);
244+
seq = read_seqbegin(&fs->seq);
245245
*root = fs->root;
246-
} while (read_seqcount_retry(&fs->seq, seq));
246+
} while (read_seqretry(&fs->seq, seq));
247247
}
248248

249249
/**
@@ -385,10 +385,10 @@ static void get_fs_root_and_pwd_rcu(struct fs_struct *fs, struct path *root,
385385
unsigned seq;
386386

387387
do {
388-
seq = read_seqcount_begin(&fs->seq);
388+
seq = read_seqbegin(&fs->seq);
389389
*root = fs->root;
390390
*pwd = fs->pwd;
391-
} while (read_seqcount_retry(&fs->seq, seq));
391+
} while (read_seqretry(&fs->seq, seq));
392392
}
393393

394394
/*

fs/exec.c

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1515,7 +1515,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
15151515
* state is protected by cred_guard_mutex we hold.
15161516
*/
15171517
n_fs = 1;
1518-
spin_lock(&p->fs->lock);
1518+
read_seqlock_excl(&p->fs->seq);
15191519
rcu_read_lock();
15201520
for_other_threads(p, t) {
15211521
if (t->fs == p->fs)
@@ -1528,7 +1528,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
15281528
bprm->unsafe |= LSM_UNSAFE_SHARE;
15291529
else
15301530
p->fs->in_exec = 1;
1531-
spin_unlock(&p->fs->lock);
1531+
read_sequnlock_excl(&p->fs->seq);
15321532
}
15331533

15341534
static void bprm_fill_uid(struct linux_binprm *bprm, struct file *file)

fs/fhandle.c

Lines changed: 30 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -88,7 +88,7 @@ static long do_sys_name_to_handle(const struct path *path,
8888
if (fh_flags & EXPORT_FH_CONNECTABLE) {
8989
handle->handle_type |= FILEID_IS_CONNECTABLE;
9090
if (d_is_dir(path->dentry))
91-
fh_flags |= FILEID_IS_DIR;
91+
handle->handle_type |= FILEID_IS_DIR;
9292
}
9393
retval = 0;
9494
}
@@ -168,23 +168,28 @@ SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name,
168168
return err;
169169
}
170170

171-
static int get_path_from_fd(int fd, struct path *root)
171+
static int get_path_anchor(int fd, struct path *root)
172172
{
173-
if (fd == AT_FDCWD) {
174-
struct fs_struct *fs = current->fs;
175-
spin_lock(&fs->lock);
176-
*root = fs->pwd;
177-
path_get(root);
178-
spin_unlock(&fs->lock);
179-
} else {
173+
if (fd >= 0) {
180174
CLASS(fd, f)(fd);
181175
if (fd_empty(f))
182176
return -EBADF;
183177
*root = fd_file(f)->f_path;
184178
path_get(root);
179+
return 0;
185180
}
186181

187-
return 0;
182+
if (fd == AT_FDCWD) {
183+
get_fs_pwd(current->fs, root);
184+
return 0;
185+
}
186+
187+
if (fd == FD_PIDFS_ROOT) {
188+
pidfs_get_root(root);
189+
return 0;
190+
}
191+
192+
return -EBADF;
188193
}
189194

190195
static int vfs_dentry_acceptable(void *context, struct dentry *dentry)
@@ -323,13 +328,24 @@ static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
323328
{
324329
int retval = 0;
325330
struct file_handle f_handle;
326-
struct file_handle *handle = NULL;
331+
struct file_handle *handle __free(kfree) = NULL;
327332
struct handle_to_path_ctx ctx = {};
328333
const struct export_operations *eops;
329334

330-
retval = get_path_from_fd(mountdirfd, &ctx.root);
335+
if (copy_from_user(&f_handle, ufh, sizeof(struct file_handle)))
336+
return -EFAULT;
337+
338+
if ((f_handle.handle_bytes > MAX_HANDLE_SZ) ||
339+
(f_handle.handle_bytes == 0))
340+
return -EINVAL;
341+
342+
if (f_handle.handle_type < 0 ||
343+
FILEID_USER_FLAGS(f_handle.handle_type) & ~FILEID_VALID_USER_FLAGS)
344+
return -EINVAL;
345+
346+
retval = get_path_anchor(mountdirfd, &ctx.root);
331347
if (retval)
332-
goto out_err;
348+
return retval;
333349

334350
eops = ctx.root.mnt->mnt_sb->s_export_op;
335351
if (eops && eops->permission)
@@ -339,21 +355,6 @@ static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
339355
if (retval)
340356
goto out_path;
341357

342-
if (copy_from_user(&f_handle, ufh, sizeof(struct file_handle))) {
343-
retval = -EFAULT;
344-
goto out_path;
345-
}
346-
if ((f_handle.handle_bytes > MAX_HANDLE_SZ) ||
347-
(f_handle.handle_bytes == 0)) {
348-
retval = -EINVAL;
349-
goto out_path;
350-
}
351-
if (f_handle.handle_type < 0 ||
352-
FILEID_USER_FLAGS(f_handle.handle_type) & ~FILEID_VALID_USER_FLAGS) {
353-
retval = -EINVAL;
354-
goto out_path;
355-
}
356-
357358
handle = kmalloc(struct_size(handle, f_handle, f_handle.handle_bytes),
358359
GFP_KERNEL);
359360
if (!handle) {
@@ -366,7 +367,7 @@ static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
366367
&ufh->f_handle,
367368
f_handle.handle_bytes)) {
368369
retval = -EFAULT;
369-
goto out_handle;
370+
goto out_path;
370371
}
371372

372373
/*
@@ -384,11 +385,8 @@ static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
384385
handle->handle_type &= ~FILEID_USER_FLAGS_MASK;
385386
retval = do_handle_to_path(handle, path, &ctx);
386387

387-
out_handle:
388-
kfree(handle);
389388
out_path:
390389
path_put(&ctx.root);
391-
out_err:
392390
return retval;
393391
}
394392

fs/fs_struct.c

Lines changed: 14 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -17,12 +17,10 @@ void set_fs_root(struct fs_struct *fs, const struct path *path)
1717
struct path old_root;
1818

1919
path_get(path);
20-
spin_lock(&fs->lock);
21-
write_seqcount_begin(&fs->seq);
20+
write_seqlock(&fs->seq);
2221
old_root = fs->root;
2322
fs->root = *path;
24-
write_seqcount_end(&fs->seq);
25-
spin_unlock(&fs->lock);
23+
write_sequnlock(&fs->seq);
2624
if (old_root.dentry)
2725
path_put(&old_root);
2826
}
@@ -36,12 +34,10 @@ void set_fs_pwd(struct fs_struct *fs, const struct path *path)
3634
struct path old_pwd;
3735

3836
path_get(path);
39-
spin_lock(&fs->lock);
40-
write_seqcount_begin(&fs->seq);
37+
write_seqlock(&fs->seq);
4138
old_pwd = fs->pwd;
4239
fs->pwd = *path;
43-
write_seqcount_end(&fs->seq);
44-
spin_unlock(&fs->lock);
40+
write_sequnlock(&fs->seq);
4541

4642
if (old_pwd.dentry)
4743
path_put(&old_pwd);
@@ -67,16 +63,14 @@ void chroot_fs_refs(const struct path *old_root, const struct path *new_root)
6763
fs = p->fs;
6864
if (fs) {
6965
int hits = 0;
70-
spin_lock(&fs->lock);
71-
write_seqcount_begin(&fs->seq);
66+
write_seqlock(&fs->seq);
7267
hits += replace_path(&fs->root, old_root, new_root);
7368
hits += replace_path(&fs->pwd, old_root, new_root);
74-
write_seqcount_end(&fs->seq);
7569
while (hits--) {
7670
count++;
7771
path_get(new_root);
7872
}
79-
spin_unlock(&fs->lock);
73+
write_sequnlock(&fs->seq);
8074
}
8175
task_unlock(p);
8276
}
@@ -99,10 +93,10 @@ void exit_fs(struct task_struct *tsk)
9993
if (fs) {
10094
int kill;
10195
task_lock(tsk);
102-
spin_lock(&fs->lock);
96+
read_seqlock_excl(&fs->seq);
10397
tsk->fs = NULL;
10498
kill = !--fs->users;
105-
spin_unlock(&fs->lock);
99+
read_sequnlock_excl(&fs->seq);
106100
task_unlock(tsk);
107101
if (kill)
108102
free_fs_struct(fs);
@@ -116,16 +110,15 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
116110
if (fs) {
117111
fs->users = 1;
118112
fs->in_exec = 0;
119-
spin_lock_init(&fs->lock);
120-
seqcount_spinlock_init(&fs->seq, &fs->lock);
113+
seqlock_init(&fs->seq);
121114
fs->umask = old->umask;
122115

123-
spin_lock(&old->lock);
116+
read_seqlock_excl(&old->seq);
124117
fs->root = old->root;
125118
path_get(&fs->root);
126119
fs->pwd = old->pwd;
127120
path_get(&fs->pwd);
128-
spin_unlock(&old->lock);
121+
read_sequnlock_excl(&old->seq);
129122
}
130123
return fs;
131124
}
@@ -140,10 +133,10 @@ int unshare_fs_struct(void)
140133
return -ENOMEM;
141134

142135
task_lock(current);
143-
spin_lock(&fs->lock);
136+
read_seqlock_excl(&fs->seq);
144137
kill = !--fs->users;
145138
current->fs = new_fs;
146-
spin_unlock(&fs->lock);
139+
read_sequnlock_excl(&fs->seq);
147140
task_unlock(current);
148141

149142
if (kill)
@@ -162,7 +155,6 @@ EXPORT_SYMBOL(current_umask);
162155
/* to be mentioned only in INIT_TASK */
163156
struct fs_struct init_fs = {
164157
.users = 1,
165-
.lock = __SPIN_LOCK_UNLOCKED(init_fs.lock),
166-
.seq = SEQCNT_SPINLOCK_ZERO(init_fs.seq, &init_fs.lock),
158+
.seq = __SEQLOCK_UNLOCKED(init_fs.seq),
167159
.umask = 0022,
168160
};

fs/internal.h

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -323,12 +323,15 @@ struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
323323
struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
324324
void mnt_idmap_put(struct mnt_idmap *idmap);
325325
struct stashed_operations {
326+
struct dentry *(*stash_dentry)(struct dentry **stashed,
327+
struct dentry *dentry);
326328
void (*put_data)(void *data);
327329
int (*init_inode)(struct inode *inode, void *data);
328330
};
329331
int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
330332
struct path *path);
331333
void stashed_dentry_prune(struct dentry *dentry);
334+
struct dentry *stash_dentry(struct dentry **stashed, struct dentry *dentry);
332335
struct dentry *stashed_dentry_get(struct dentry **stashed);
333336
/**
334337
* path_mounted - check whether path is mounted
@@ -351,3 +354,4 @@ int anon_inode_getattr(struct mnt_idmap *idmap, const struct path *path,
351354
unsigned int query_flags);
352355
int anon_inode_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
353356
struct iattr *attr);
357+
void pidfs_get_root(struct path *path);

fs/libfs.c

Lines changed: 22 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -2143,6 +2143,8 @@ struct dentry *stashed_dentry_get(struct dentry **stashed)
21432143
dentry = rcu_dereference(*stashed);
21442144
if (!dentry)
21452145
return NULL;
2146+
if (IS_ERR(dentry))
2147+
return dentry;
21462148
if (!lockref_get_not_dead(&dentry->d_lockref))
21472149
return NULL;
21482150
return dentry;
@@ -2175,7 +2177,6 @@ static struct dentry *prepare_anon_dentry(struct dentry **stashed,
21752177

21762178
/* Notice when this is changed. */
21772179
WARN_ON_ONCE(!S_ISREG(inode->i_mode));
2178-
WARN_ON_ONCE(!IS_IMMUTABLE(inode));
21792180

21802181
dentry = d_alloc_anon(sb);
21812182
if (!dentry) {
@@ -2191,8 +2192,7 @@ static struct dentry *prepare_anon_dentry(struct dentry **stashed,
21912192
return dentry;
21922193
}
21932194

2194-
static struct dentry *stash_dentry(struct dentry **stashed,
2195-
struct dentry *dentry)
2195+
struct dentry *stash_dentry(struct dentry **stashed, struct dentry *dentry)
21962196
{
21972197
guard(rcu)();
21982198
for (;;) {
@@ -2233,14 +2233,16 @@ static struct dentry *stash_dentry(struct dentry **stashed,
22332233
int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
22342234
struct path *path)
22352235
{
2236-
struct dentry *dentry;
2236+
struct dentry *dentry, *res;
22372237
const struct stashed_operations *sops = mnt->mnt_sb->s_fs_info;
22382238

22392239
/* See if dentry can be reused. */
2240-
path->dentry = stashed_dentry_get(stashed);
2241-
if (path->dentry) {
2240+
res = stashed_dentry_get(stashed);
2241+
if (IS_ERR(res))
2242+
return PTR_ERR(res);
2243+
if (res) {
22422244
sops->put_data(data);
2243-
goto out_path;
2245+
goto make_path;
22442246
}
22452247

22462248
/* Allocate a new dentry. */
@@ -2249,14 +2251,22 @@ int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
22492251
return PTR_ERR(dentry);
22502252

22512253
/* Added a new dentry. @data is now owned by the filesystem. */
2252-
path->dentry = stash_dentry(stashed, dentry);
2253-
if (path->dentry != dentry)
2254+
if (sops->stash_dentry)
2255+
res = sops->stash_dentry(stashed, dentry);
2256+
else
2257+
res = stash_dentry(stashed, dentry);
2258+
if (IS_ERR(res)) {
2259+
dput(dentry);
2260+
return PTR_ERR(res);
2261+
}
2262+
if (res != dentry)
22542263
dput(dentry);
22552264

2256-
out_path:
2257-
WARN_ON_ONCE(path->dentry->d_fsdata != stashed);
2258-
WARN_ON_ONCE(d_inode(path->dentry)->i_private != data);
2265+
make_path:
2266+
path->dentry = res;
22592267
path->mnt = mntget(mnt);
2268+
VFS_WARN_ON_ONCE(path->dentry->d_fsdata != stashed);
2269+
VFS_WARN_ON_ONCE(d_inode(path->dentry)->i_private != data);
22602270
return 0;
22612271
}
22622272

0 commit comments

Comments
 (0)