Commit 303cc57

Author: Christian Brauner
nsproxy: attach to namespaces via pidfds
For quite a while we have been thinking about using pidfds to attach to namespaces. This patchset has existed for about a year already but we've wanted to wait to see how the general api would be received and adopted. Now that more and more programs in userspace have started using pidfds for process management it's time to send this one out.

This patch makes it possible to use pidfds to attach to the namespaces of another process, i.e. they can be passed as the first argument to the setns() syscall. When only a single namespace type is specified the semantics are equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However, when a pidfd is passed, multiple namespace flags can be specified in the second setns() argument and setns() will attach the caller to all the specified namespaces all at once or to none of them. Specifying 0 is not valid together with a pidfd.

Here are just two obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

Allowing callers to also attach to subsets of namespaces supports various use-cases where callers setns to a subset of namespaces to retain privilege, perform an action and then re-attach another subset of namespaces.

If the need arises, as Eric suggested, we can extend this patchset to assume even more context than just attaching all namespaces. His suggestion specifically was about assuming the process' root directory when setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just keep it flexible in terms of supporting subsets of namespaces but let's wait until we have users asking for even more context to be assumed. At that point we can add an extension.
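As an illustration of the calling convention described above, here is a minimal user-space sketch. The helper name `attach_via_pidfd` is hypothetical and not part of the patch; it assumes a kernel and libc that provide `pidfd_open` (called through `syscall()`, with a fallback syscall number that holds on most architectures) and a `setns()` that accepts pidfds.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback for older headers; 434 is the pidfd_open number on most architectures. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/*
 * Hypothetical helper: attach the caller to the namespaces of `pid` that are
 * selected by `flags` (an OR of CLONE_NEW* bits; 0 is rejected for pidfds).
 * Returns 0 on success or a negative errno value. On failure no namespace
 * has been switched.
 */
static int attach_via_pidfd(pid_t pid, int flags)
{
	int pidfd, ret = 0;

	/* One stable reference to the target process. */
	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -errno;

	/* One syscall for all requested namespaces. */
	if (setns(pidfd, flags) < 0)
		ret = -errno;

	close(pidfd);
	return ret;
}
```

A caller would typically pass something like `CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET` and treat any negative return value as "nothing changed".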
The obvious example where this is useful is a standard container manager interacting with a running container: pushing and pulling files or directories, injecting mounts, attaching/execing any kind of process, managing network devices: all these operations require attaching to all or at least multiple namespaces at the same time. Given that nowadays most containers are spawned with all namespaces enabled we're currently looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns> nsfds, another 7 to actually perform the namespace switch. With time namespaces we're looking at about 16 syscalls.

(We could amortize the first 7 or 8 syscalls for opening the nsfds by stashing them in each container's monitor process but that would mean we need to send around those file descriptors through unix sockets every time we want to interact with the container or keep on-disk state. Even in scenarios where a caller wants to join a particular namespace in a particular order callers still profit from batching other namespaces. That mostly applies to the user namespace but all container runtimes I found join the user namespace first no matter if it privileges or deprivileges the container, similar to how unshare behaves.)

With pidfds this becomes a single syscall no matter how many namespaces are supposed to be attached to.

A decently designed, large-scale container manager usually isn't the parent of any of the containers it spawns so the containers don't die when it crashes or needs to update or reinitialize. This means that interacting with containers through pids is inherently racy for the manager, especially on systems where the maximum pid number is not significantly bumped. This is even more problematic since we often spawn and manage thousands or tens of thousands of containers. Interacting with a container through a pid thus can become risky quite quickly.
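For comparison, the nsfd-based sequence counted above (7 opens plus 7 switches) might look roughly like the following sketch. The helper names `ns_path` and `attach_legacy` and the exact ordering of `ns_names` are illustrative assumptions, not code from any particular container runtime.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* /proc/<pid>/ns entry names, user namespace first (illustrative ordering). */
static const char *const ns_names[] = {
	"user", "mnt", "pid", "uts", "ipc", "net", "cgroup",
};
#define NS_COUNT (sizeof(ns_names) / sizeof(ns_names[0]))

/* Format "/proc/<pid>/ns/<name>" into buf; returns 0, or -1 if buf is too small. */
static int ns_path(char *buf, size_t len, pid_t pid, const char *name)
{
	int n = snprintf(buf, len, "/proc/%d/ns/%s", (int)pid, name);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Legacy attach: NS_COUNT open() calls followed by NS_COUNT setns() calls.
 * If a later setns() fails, the earlier switches are not rolled back, so the
 * caller can end up half-attached. *joined reports how many switches happened.
 * Returns 0 on success or a negative errno value.
 */
static int attach_legacy(pid_t pid, size_t *joined)
{
	int fds[NS_COUNT];
	char path[64];
	size_t i, opened = 0;
	int ret = 0;

	*joined = 0;

	/* Pass 1: open every nsfd before switching anything. */
	for (i = 0; i < NS_COUNT; i++) {
		if (ns_path(path, sizeof(path), pid, ns_names[i]) < 0) {
			ret = -ENAMETOOLONG;
			goto out;
		}
		fds[i] = open(path, O_RDONLY | O_CLOEXEC);
		if (fds[i] < 0) {
			ret = -errno;
			goto out;
		}
		opened++;
	}

	/* Pass 2: switch one namespace at a time; there is no rollback. */
	for (i = 0; i < NS_COUNT; i++) {
		if (setns(fds[i], 0) < 0) {
			ret = -errno;
			break;
		}
		(*joined)++;
	}

out:
	for (i = 0; i < opened; i++)
		close(fds[i]);
	return ret;
}
```

With seven entries in `ns_names` this issues 2 * 7 = 14 syscalls, not counting `close()`, and relies on `pid` still referring to the same process for every `open()`.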
This holds especially since we allow an administrator to enable advanced features such as syscall interception where we're performing syscalls in lieu of the container. In all of those cases we use pidfds if they are available and we pass them around as stable references. Using them to setns() to the target process' namespaces is as reliable as using nsfds. Either the target process is already dead and we get ESRCH or we manage to attach to its namespaces but we can't accidentally attach to another process' namespaces. So pidfds lend themselves to be used with this api.

The other main advantage is that with this change the pidfd becomes the only relevant token for most container interactions and it's the only token we need to create and send around.

Apart from significantly reducing the number of syscalls from double digit to single digit, which is a decent reason post-spectre/meltdown, this also allows switching to a set of namespaces atomically, i.e. either attaching to all the specified namespaces succeeds or we fail. If we fail we haven't changed a single namespace.

There are currently three namespaces that can fail (other than for ENOMEM which really is not very interesting since we then have other problems anyway) for non-trivial reasons: user, mount, and pid namespaces. We can fail to attach to a pid namespace if it is not our current active pid namespace or a descendant of it. We can fail to attach to a user namespace because we are multi-threaded or because our current mount namespace shares filesystem state with other tasks, or because we're trying to setns() to the same user namespace, i.e. the target task has the same user namespace as we do. We can fail to attach to a mount namespace because it shares filesystem state with other tasks or because we fail to lookup the new root for the new mount namespace. In most non-pathological scenarios these issues can be somewhat mitigated.
But there are cases where we're half-attached to some namespaces and failing to attach to another one. I've talked about some of these problems during the hallway track (something only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles in 2018(?). Even if all these issues could be avoided with super careful userspace coding it would be nicer to have this done in-kernel. Pidfds seem to lend themselves nicely for this.

The other neat thing about this is that setns() becomes an actual counterpart to the namespace bits of unshare().

Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
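The all-or-nothing behaviour described in the message can be pictured with a small user-space toy model. This is only a sketch of the stage, validate, then commit pattern; `struct ns_state`, `attach_all_or_nothing` and `deny_slot_two` are hypothetical names, and the three integer "slots" merely stand in for real namespaces.

```c
#include <errno.h>
#include <stddef.h>

#define SLOT_COUNT 3	/* hypothetical: three namespace "slots" */

struct ns_state {
	int id[SLOT_COUNT];	/* which namespace each slot currently points at */
};

/* Validator callback: return 0 to allow installing `id` into `slot`, else -errno. */
typedef int (*install_fn)(size_t slot, int id);

/*
 * Attach *self to every slot of *target selected in `flags` (bit N selects
 * slot N), or to none of them if any validation step fails.
 */
static int attach_all_or_nothing(struct ns_state *self,
				 const struct ns_state *target,
				 unsigned int flags, install_fn install)
{
	struct ns_state staged = *self;	/* work on a private copy */
	size_t slot;
	int ret;

	for (slot = 0; slot < SLOT_COUNT; slot++) {
		if (!(flags & (1u << slot)))
			continue;
		ret = install(slot, target->id[slot]);
		if (ret)
			return ret;	/* staged copy is dropped, *self untouched */
		staged.id[slot] = target->id[slot];
	}

	*self = staged;	/* point of no return: commit everything at once */
	return 0;
}

/* Example validator: refuses slot 2, standing in for a failing namespace check. */
static int deny_slot_two(size_t slot, int id)
{
	(void)id;
	return slot == 2 ? -EPERM : 0;
}
```

In this model a failed request leaves `*self` exactly as it was, which is the property the legacy one-namespace-at-a-time sequence cannot offer.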
1 parent f2a8d52 commit 303cc57

5 files changed: +226 -16 lines changed

fs/namespace.c

Lines changed: 5 additions & 0 deletions
@@ -1733,6 +1733,11 @@ static struct mnt_namespace *to_mnt_ns(struct ns_common *ns)
 	return container_of(ns, struct mnt_namespace, ns);
 }
 
+struct ns_common *from_mnt_ns(struct mnt_namespace *mnt)
+{
+	return &mnt->ns;
+}
+
 static bool mnt_ns_loop(struct dentry *dentry)
 {
 	/* Could bind mounting the mount namespace inode cause a

fs/nsfs.c

Lines changed: 5 additions & 0 deletions
@@ -229,6 +229,11 @@ int ns_get_name(char *buf, size_t size, struct task_struct *task,
 	return res;
 }
 
+bool proc_ns_file(const struct file *file)
+{
+	return file->f_op == &ns_file_operations;
+}
+
 struct file *proc_ns_fget(int fd)
 {
 	struct file *file;

include/linux/mnt_namespace.h

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ struct ns_common;
 extern struct mnt_namespace *copy_mnt_ns(unsigned long, struct mnt_namespace *,
 		struct user_namespace *, struct fs_struct *);
 extern void put_mnt_ns(struct mnt_namespace *ns);
+extern struct ns_common *from_mnt_ns(struct mnt_namespace *);
 
 extern const struct file_operations proc_mounts_operations;
 extern const struct file_operations proc_mountinfo_operations;

include/linux/proc_fs.h

Lines changed: 2 additions & 0 deletions
@@ -179,4 +179,6 @@ static inline struct pid_namespace *proc_pid_ns(const struct inode *inode)
 	return inode->i_sb->s_fs_info;
 }
 
+bool proc_ns_file(const struct file *file);
+
 #endif /* _LINUX_PROC_FS_H */

kernel/nsproxy.c

Lines changed: 213 additions & 16 deletions
@@ -20,6 +20,7 @@
 #include <linux/ipc_namespace.h>
 #include <linux/time_namespace.h>
 #include <linux/fs_struct.h>
+#include <linux/proc_fs.h>
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
@@ -258,42 +259,221 @@ void exit_task_namespaces(struct task_struct *p)
 	switch_task_namespaces(p, NULL);
 }
 
+static int check_setns_flags(unsigned long flags)
+{
+	if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
+				 CLONE_NEWNET | CLONE_NEWUSER | CLONE_NEWPID |
+				 CLONE_NEWCGROUP)))
+		return -EINVAL;
+
+#ifndef CONFIG_USER_NS
+	if (flags & CLONE_NEWUSER)
+		return -EINVAL;
+#endif
+#ifndef CONFIG_PID_NS
+	if (flags & CLONE_NEWPID)
+		return -EINVAL;
+#endif
+#ifndef CONFIG_UTS_NS
+	if (flags & CLONE_NEWUTS)
+		return -EINVAL;
+#endif
+#ifndef CONFIG_IPC_NS
+	if (flags & CLONE_NEWIPC)
+		return -EINVAL;
+#endif
+#ifndef CONFIG_CGROUPS
+	if (flags & CLONE_NEWCGROUP)
+		return -EINVAL;
+#endif
+#ifndef CONFIG_NET_NS
+	if (flags & CLONE_NEWNET)
+		return -EINVAL;
+#endif
+
+	return 0;
+}
+
 static void put_nsset(struct nsset *nsset)
 {
 	unsigned flags = nsset->flags;
 
 	if (flags & CLONE_NEWUSER)
 		put_cred(nsset_cred(nsset));
+	/*
+	 * We only created a temporary copy if we attached to more than just
+	 * the mount namespace.
+	 */
+	if (nsset->fs && (flags & CLONE_NEWNS) && (flags & ~CLONE_NEWNS))
+		free_fs_struct(nsset->fs);
 	if (nsset->nsproxy)
 		free_nsproxy(nsset->nsproxy);
 }
 
-static int prepare_nsset(int nstype, struct nsset *nsset)
+static int prepare_nsset(unsigned flags, struct nsset *nsset)
 {
 	struct task_struct *me = current;
 
 	nsset->nsproxy = create_new_namespaces(0, me, current_user_ns(), me->fs);
 	if (IS_ERR(nsset->nsproxy))
 		return PTR_ERR(nsset->nsproxy);
 
-	if (nstype == CLONE_NEWUSER)
+	if (flags & CLONE_NEWUSER)
 		nsset->cred = prepare_creds();
 	else
 		nsset->cred = current_cred();
 	if (!nsset->cred)
 		goto out;
 
-	if (nstype == CLONE_NEWNS)
+	/* Only create a temporary copy of fs_struct if we really need to. */
+	if (flags == CLONE_NEWNS) {
 		nsset->fs = me->fs;
+	} else if (flags & CLONE_NEWNS) {
+		nsset->fs = copy_fs_struct(me->fs);
+		if (!nsset->fs)
+			goto out;
+	}
 
-	nsset->flags = nstype;
+	nsset->flags = flags;
 	return 0;
 
 out:
 	put_nsset(nsset);
 	return -ENOMEM;
 }
 
+static inline int validate_ns(struct nsset *nsset, struct ns_common *ns)
+{
+	return ns->ops->install(nsset, ns);
+}
+
+/*
+ * This is the inverse operation to unshare().
+ * Ordering is equivalent to the standard ordering used everywhere else
+ * during unshare and process creation. The switch to the new set of
+ * namespaces occurs at the point of no return after installation of
+ * all requested namespaces was successful in commit_nsset().
+ */
+static int validate_nsset(struct nsset *nsset, struct pid *pid)
+{
+	int ret = 0;
+	unsigned flags = nsset->flags;
+	struct user_namespace *user_ns = NULL;
+	struct pid_namespace *pid_ns = NULL;
+	struct nsproxy *nsp;
+	struct task_struct *tsk;
+
+	/* Take a "snapshot" of the target task's namespaces. */
+	rcu_read_lock();
+	tsk = pid_task(pid, PIDTYPE_PID);
+	if (!tsk) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	if (!ptrace_may_access(tsk, PTRACE_MODE_READ_REALCREDS)) {
+		rcu_read_unlock();
+		return -EPERM;
+	}
+
+	task_lock(tsk);
+	nsp = tsk->nsproxy;
+	if (nsp)
+		get_nsproxy(nsp);
+	task_unlock(tsk);
+	if (!nsp) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+#ifdef CONFIG_PID_NS
+	if (flags & CLONE_NEWPID) {
+		pid_ns = task_active_pid_ns(tsk);
+		if (unlikely(!pid_ns)) {
+			rcu_read_unlock();
+			ret = -ESRCH;
+			goto out;
+		}
+		get_pid_ns(pid_ns);
+	}
+#endif
+
+#ifdef CONFIG_USER_NS
+	if (flags & CLONE_NEWUSER)
+		user_ns = get_user_ns(__task_cred(tsk)->user_ns);
+#endif
+	rcu_read_unlock();
+
+	/*
+	 * Install requested namespaces. The caller will have
+	 * verified earlier that the requested namespaces are
+	 * supported on this kernel. We don't report errors here
+	 * if a namespace is requested that isn't supported.
+	 */
+#ifdef CONFIG_USER_NS
+	if (flags & CLONE_NEWUSER) {
+		ret = validate_ns(nsset, &user_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
+	if (flags & CLONE_NEWNS) {
+		ret = validate_ns(nsset, from_mnt_ns(nsp->mnt_ns));
+		if (ret)
+			goto out;
+	}
+
+#ifdef CONFIG_UTS_NS
+	if (flags & CLONE_NEWUTS) {
+		ret = validate_ns(nsset, &nsp->uts_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
+#ifdef CONFIG_IPC_NS
+	if (flags & CLONE_NEWIPC) {
+		ret = validate_ns(nsset, &nsp->ipc_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
+#ifdef CONFIG_PID_NS
+	if (flags & CLONE_NEWPID) {
+		ret = validate_ns(nsset, &pid_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
+#ifdef CONFIG_CGROUPS
+	if (flags & CLONE_NEWCGROUP) {
+		ret = validate_ns(nsset, &nsp->cgroup_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
+#ifdef CONFIG_NET_NS
+	if (flags & CLONE_NEWNET) {
+		ret = validate_ns(nsset, &nsp->net_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
+out:
+	if (pid_ns)
+		put_pid_ns(pid_ns);
+	if (nsp)
+		put_nsproxy(nsp);
+	put_user_ns(user_ns);
+
+	return ret;
+}
+
 /*
  * This is the point of no return. There are just a few namespaces
  * that do some actual work here and it's sufficiently minimal that
@@ -316,6 +496,12 @@ static void commit_nsset(struct nsset *nsset)
 	}
 #endif
 
+	/* We only need to commit if we have used a temporary fs_struct. */
+	if ((flags & CLONE_NEWNS) && (flags & ~CLONE_NEWNS)) {
+		set_fs_root(me->fs, &nsset->fs->root);
+		set_fs_pwd(me->fs, &nsset->fs->pwd);
+	}
+
 #ifdef CONFIG_IPC_NS
 	if (flags & CLONE_NEWIPC)
 		exit_sem(me);
@@ -326,27 +512,38 @@ static void commit_nsset(struct nsset *nsset)
 	nsset->nsproxy = NULL;
 }
 
-SYSCALL_DEFINE2(setns, int, fd, int, nstype)
+SYSCALL_DEFINE2(setns, int, fd, int, flags)
 {
 	struct file *file;
-	struct ns_common *ns;
+	struct ns_common *ns = NULL;
 	struct nsset nsset = {};
-	int err;
-
-	file = proc_ns_fget(fd);
-	if (IS_ERR(file))
-		return PTR_ERR(file);
+	int err = 0;
 
-	err = -EINVAL;
-	ns = get_proc_ns(file_inode(file));
-	if (nstype && (ns->ops->type != nstype))
+	file = fget(fd);
+	if (!file)
+		return -EBADF;
+
+	if (proc_ns_file(file)) {
+		ns = get_proc_ns(file_inode(file));
+		if (flags && (ns->ops->type != flags))
+			err = -EINVAL;
+		flags = ns->ops->type;
+	} else if (!IS_ERR(pidfd_pid(file))) {
+		err = check_setns_flags(flags);
+	} else {
+		err = -EBADF;
+	}
+	if (err)
 		goto out;
 
-	err = prepare_nsset(ns->ops->type, &nsset);
+	err = prepare_nsset(flags, &nsset);
 	if (err)
 		goto out;
 
-	err = ns->ops->install(&nsset, ns);
+	if (proc_ns_file(file))
+		err = validate_ns(&nsset, ns);
+	else
+		err = validate_nsset(&nsset, file->private_data);
 	if (!err) {
 		commit_nsset(&nsset);
 		perf_event_namespaces(current);
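To make the flag handling in kernel/nsproxy.c easier to follow, here is a user-space mirror of two predicates from the diff. It is a sketch only: `check_flags_model` reproduces the mask test of check_setns_flags() without the per-CONFIG checks, and `needs_temp_fs` reproduces the condition under which prepare_nsset() copies the fs_struct and commit_nsset() later applies it. Both function names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sched.h>

/* Fallback for older libc headers. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/* The namespace flags this patch accepts together with a pidfd. */
#define SETNS_PIDFD_FLAGS						\
	(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET |	\
	 CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWCGROUP)

/* Mirrors the mask test: 0 and unknown bits are both rejected. */
static int check_flags_model(unsigned long flags)
{
	if (!flags || (flags & ~(unsigned long)SETNS_PIDFD_FLAGS))
		return -EINVAL;
	return 0;
}

/*
 * Mirrors the temporary fs_struct rule: a copy is only needed when the
 * mount namespace is requested together with at least one other namespace.
 */
static int needs_temp_fs(unsigned long flags)
{
	return (flags & CLONE_NEWNS) && (flags & ~(unsigned long)CLONE_NEWNS);
}
```

For example, `CLONE_NEWNS` alone operates directly on the caller's fs_struct, while `CLONE_NEWNS | CLONE_NEWNET` stages root and pwd in a copy so that a later failure leaves the caller's filesystem state untouched.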
