Commit 6aee4ba

Merge branch 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull openat2 support from Al Viro:
 "This is the openat2() series from Aleksa Sarai.

  I'm afraid that the rest of namei stuff will have to wait - it got
  zero review the last time I'd posted #work.namei, and there had been a
  leak in the posted series I'd caught only last weekend. I was going to
  repost it on Monday, but the window opened and the odds of getting any
  review during that... Oh, well.

  Anyway, openat2 part should be ready; that _did_ get sane amount of
  review and public testing, so here it comes"

From Aleksa's description of the series:
 "For a very long time, extending openat(2) with new features has been
  incredibly frustrating. This stems from the fact that openat(2) is
  possibly the most famous counter-example to the mantra "don't silently
  accept garbage from userspace" -- it doesn't check whether unknown
  flags are present[1].

  This means that (generally) the addition of new flags to openat(2) has
  been fraught with backwards-compatibility issues (O_TMPFILE has to be
  defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
  kernels gave errors, since it's insecure to silently ignore the
  flag[2]). All new security-related flags therefore have a tough road
  to being added to openat(2).

  Furthermore, the need for some sort of control over VFS's path
  resolution (to avoid malicious paths resulting in inadvertent
  breakouts) has been a very long-standing desire of many userspace
  applications.

  This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
  (which was a variant of David Drysdale's O_BENEATH patchset[4] which
  was a spin-off of the Capsicum project[5]) with a few additions and
  changes made based on the previous discussion within [6] as well as
  others I felt were useful.

  In line with the conclusions of the original discussion of
  AT_NO_JUMPS, the flag has been split up into separate flags. However,
  instead of being an openat(2) flag it is provided through a new
  syscall openat2(2) which provides several other improvements to the
  openat(2) interface (see the patch description for more details). The
  following new LOOKUP_* flags are added:

  LOOKUP_NO_XDEV:

     Blocks all mountpoint crossings (upwards, downwards, or through
     absolute links). Absolute pathnames alone in openat(2) do not
     trigger this. Magic-link traversal which implies a vfsmount jump is
     also blocked (though magic-link jumps on the same vfsmount are
     permitted).

  LOOKUP_NO_MAGICLINKS:

     Blocks resolution through /proc/$pid/fd-style links. This is done
     by blocking the usage of nd_jump_link() during resolution in a
     filesystem. The term "magic-links" is used to match with the only
     reference to these links in Documentation/, but I'm happy to change
     the name.

     It should be noted that this is different to the scope of
     ~LOOKUP_FOLLOW in that it applies to all path components. However,
     you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
     will *not* fail (assuming that no parent component was a
     magic-link), and you will have an fd for the magic-link.

     In order to correctly detect magic-links, the introduction of a new
     LOOKUP_MAGICLINK_JUMPED state flag was required.

  LOOKUP_BENEATH:

     Disallows escapes to outside the starting dirfd's tree, using
     techniques such as ".." or absolute links. Absolute paths in
     openat(2) are also disallowed.

     Conceptually this flag is to ensure you "stay below" a certain
     point in the filesystem tree -- but this requires some additional
     work to protect against various races that would allow escape using
     "..".

     Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
     can trivially beam you around the filesystem (breaking the
     protection). In future, there might be similar safety checks done
     as in LOOKUP_IN_ROOT, but that requires more discussion.

  In addition, two new flags are added that expand on the above ideas:

  LOOKUP_NO_SYMLINKS:

     Does what it says on the tin. No symlink resolution is allowed at
     all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
     can still be used with NOFOLLOW to open an fd for the symlink as
     long as no parent path had a symlink component.

  LOOKUP_IN_ROOT:

     This is an extension of LOOKUP_BENEATH that, rather than blocking
     attempts to move past the root, forces all such movements to be
     scoped to the starting point. This provides chroot(2)-like
     protection but without the cost of a chroot(2) for each filesystem
     operation, as well as being safe against race attacks that
     chroot(2) is not.

     If a race is detected (as with LOOKUP_BENEATH) then an error is
     generated, and similar to LOOKUP_BENEATH it is not permitted to
     cross magic-links with LOOKUP_IN_ROOT.

  The primary need for this is from container runtimes, which currently
  need to do symlink scoping in userspace[7] when opening paths in a
  potentially malicious container.

  There is a long list of CVEs that could have been mitigated by having
  RESOLVE_THIS_ROOT (such as CVE-2017-1002101, CVE-2017-1002102,
  CVE-2018-15664, and CVE-2019-5736, just to name a few).

  In order to make all of the above more usable, I'm working on
  libpathrs[8] which is a C-friendly library for safe path resolution.
  It features a userspace-emulated backend if the kernel doesn't support
  openat2(2). Hopefully we can get userspace to switch to using it, and
  thus get openat2(2) support for free once it's ready.

  Future work would include implementing things like
  RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
  programs to be sure they don't hit DoSes through stale NFS handles)"

* 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Documentation: path-lookup: include new LOOKUP flags
  selftests: add openat2(2) selftests
  open: introduce openat2(2) syscall
  namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  namei: LOOKUP_NO_XDEV: block mountpoint crossing
  namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
  namei: LOOKUP_NO_SYMLINKS: block symlink resolution
  namei: allow set_root() to produce errors
  namei: allow nd_jump_link() to produce errors
  nsfs: clean-up ns_get_path() signature to return int
  namei: only return -ECHILD from follow_dotdot_rcu()
2 parents 15d6632 + b55eef8 commit 6aee4ba
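
As a rough illustration of the interface the message describes, the sketch below issues the new syscall directly through syscall(2) and asks for "stay beneath dirfd" resolution. It is not taken from the series. The names `sketch_open_how`, `SKETCH_RESOLVE_BENEATH` and `open_beneath()` are hypothetical, the struct is a local mirror of the three 64-bit fields (flags, mode, resolve) that I understand the uapi header include/uapi/linux/openat2.h to define, and 0x08 is assumed to be the value that header gives RESOLVE_BENEATH. The bounded retry on EAGAIN is likewise an assumption about how a caller might react to the race detection mentioned above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback when libc headers predate openat2. 437 matches most syscall
 * tables touched by this merge; alpha uses 547, so this is not portable. */
#ifndef SYS_openat2
#define SYS_openat2 437
#endif

/* Local mirror of the uapi struct open_how (assumed layout). */
struct sketch_open_how {
    uint64_t flags;   /* O_* flags */
    uint64_t mode;    /* only meaningful when creating a file */
    uint64_t resolve; /* RESOLVE_* restrictions */
};

#define SKETCH_RESOLVE_BENEATH 0x08 /* assumed value of RESOLVE_BENEATH */

/* Hypothetical helper: open an existing path that must stay below dirfd.
 * Returns a file descriptor, or -1 with errno set. */
int open_beneath(int dirfd, const char *path, int flags)
{
    struct sketch_open_how how = {
        .flags = (uint64_t)(unsigned int)flags | O_CLOEXEC,
        .resolve = SKETCH_RESOLVE_BENEATH,
    };

    assert(sizeof(how) == 24); /* three 64-bit fields, no padding */

    /* A concurrent rename or mount can make the kernel give up with
     * EAGAIN; retry a few times rather than looping forever. */
    for (int attempt = 0; attempt < 8; attempt++) {
        long fd = syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
        if (fd >= 0)
            return (int)fd;
        if (errno != EAGAIN)
            return -1;
    }
    return -1; /* errno is still EAGAIN here */
}
```

On a kernel that has openat2(2), a call such as `open_beneath(dirfd, "../etc/passwd", O_RDONLY)` should fail instead of escaping the directory. On older kernels the same call fails with ENOSYS, which is the case a userspace-emulated fallback would have to cover.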

44 files changed, 1696 insertions(+), 116 deletions(-)
CREDITS

Lines changed: 3 additions & 1 deletion
@@ -3302,7 +3302,9 @@ S: France
 N: Aleksa Sarai

 W: https://www.cyphar.com/
-D: `pids` cgroup subsystem
+D: /sys/fs/cgroup/pids
+D: openat2(2)
+S: Sydney, Australia

 N: Dipankar Sarma

Documentation/filesystems/path-lookup.rst

Lines changed: 62 additions & 6 deletions
@@ -13,6 +13,7 @@ It has subsequently been updated to reflect changes in the kernel
 including:

 - per-directory parallel name lookup.
+- ``openat2()`` resolution restriction flags.

 Introduction to pathname lookup
 ===============================
@@ -235,6 +236,13 @@ renamed. If ``d_lookup`` finds that a rename happened while it
 unsuccessfully scanned a chain in the hash table, it simply tries
 again.

+``rename_lock`` is also used to detect and defend against potential attacks
+against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
+the parent directory is moved outside the root, bypassing the ``path_equal()``
+check). If ``rename_lock`` is updated during the lookup and the path encounters
+a "..", a potential attack occurred and ``handle_dots()`` will bail out with
+``-EAGAIN``.
+
 inode->i_rwsem
 ~~~~~~~~~~~~~~

@@ -348,6 +356,13 @@ any changes to any mount points while stepping up. This locking is
 needed to stabilize the link to the mounted-on dentry, which the
 refcount on the mount itself doesn't ensure.

+``mount_lock`` is also used to detect and defend against potential attacks
+against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
+the parent directory is moved outside the root, bypassing the ``path_equal()``
+check). If ``mount_lock`` is updated during the lookup and the path encounters
+a "..", a potential attack occurred and ``handle_dots()`` will bail out with
+``-EAGAIN``.
+
 RCU
 ~~~

@@ -405,6 +420,10 @@ is requested. Keeping a reference in the ``nameidata`` ensures that
 only one root is in effect for the entire path walk, even if it races
 with a ``chroot()`` system call.

+It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
+``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
+passed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).
+
 The root is needed when either of two conditions holds: (1) either the
 pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
 component is being handled, since "``..``" from the root must always stay
@@ -1149,7 +1168,7 @@ so ``NULL`` is returned to indicate that the symlink can be released and
 the stack frame discarded.

 The other case involves things in ``/proc`` that look like symlinks but
-aren't really::
+aren't really (and are therefore commonly referred to as "magic-links")::

 $ ls -l /proc/self/fd/1
 lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
@@ -1286,7 +1305,9 @@ A few flags
 A suitable way to wrap up this tour of pathname walking is to list
 the various flags that can be stored in the ``nameidata`` to guide the
 lookup process. Many of these are only meaningful on the final
-component, others reflect the current state of the pathname lookup.
+component, others reflect the current state of the pathname lookup, and some
+apply restrictions to all path components encountered in the path lookup.
+
 And then there is ``LOOKUP_EMPTY``, which doesn't fit conceptually with
 the others. If this is not set, an empty pathname causes an error
 very early on. If it is set, empty pathnames are not considered to be
@@ -1310,13 +1331,48 @@ longer needed.
 ``LOOKUP_JUMPED`` means that the current dentry was chosen not because
 it had the right name but for some other reason. This happens when
 following "``..``", following a symlink to ``/``, crossing a mount point
-or accessing a "``/proc/$PID/fd/$FD``" symlink. In this case the
-filesystem has not been asked to revalidate the name (with
-``d_revalidate()``). In such cases the inode may still need to be
-revalidated, so ``d_op->d_weak_revalidate()`` is called if
+or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
+link"). In this case the filesystem has not been asked to revalidate the
+name (with ``d_revalidate()``). In such cases the inode may still need
+to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
 ``LOOKUP_JUMPED`` is set when the look completes - which may be at the
 final component or, when creating, unlinking, or renaming, at the penultimate component.

+Resolution-restriction flags
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to allow userspace to protect itself against certain race conditions
+and attack scenarios involving changing path components, a series of flags are
+available which apply restrictions to all path components encountered during
+path lookup. These flags are exposed through ``openat2()``'s ``resolve`` field.
+
+``LOOKUP_NO_SYMLINKS`` blocks all symlink traversals (including magic-links).
+This is distinctly different from ``LOOKUP_FOLLOW``, because the latter only
+relates to restricting the following of trailing symlinks.
+
+``LOOKUP_NO_MAGICLINKS`` blocks all magic-link traversals. Filesystems must
+ensure that they return errors from ``nd_jump_link()``, because that is how
+``LOOKUP_NO_MAGICLINKS`` and other magic-link restrictions are implemented.
+
+``LOOKUP_NO_XDEV`` blocks all ``vfsmount`` traversals (this includes both
+bind-mounts and ordinary mounts). Note that the ``vfsmount`` which contains the
+lookup is determined by the first mountpoint the path lookup reaches --
+absolute paths start with the ``vfsmount`` of ``/``, and relative paths start
+with the ``dfd``'s ``vfsmount``. Magic-links are only permitted if the
+``vfsmount`` of the path is unchanged.
+
+``LOOKUP_BENEATH`` blocks any path components which resolve outside the
+starting point of the resolution. This is done by blocking ``nd_jump_root()``
+as well as blocking ".." if it would jump outside the starting point.
+``rename_lock`` and ``mount_lock`` are used to detect attacks against the
+resolution of "..". Magic-links are also blocked.
+
+``LOOKUP_IN_ROOT`` resolves all path components as though the starting point
+were the filesystem root. ``nd_jump_root()`` brings the resolution back to to
+the starting point, and ".." at the starting point will act as a no-op. As with
+``LOOKUP_BENEATH``, ``rename_lock`` and ``mount_lock`` are used to detect
+attacks against ".." resolution. Magic-links are also blocked.
+
 Final-component flags
 ~~~~~~~~~~~~~~~~~~~~~
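
The difference between ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` described in this file can be illustrated with a small userspace model. This is only a lexical toy, not the kernel's algorithm: it ignores symlinks, mount points and the ``rename_lock``/``mount_lock`` race checks, and merely tracks how many directories below the starting point a path ends up. The names ``model_scoped_walk()`` and ``enum scope_mode`` are hypothetical, and the use of ``-EXDEV`` for a blocked escape is an assumption about the error reported (the documentation above only names ``-EAGAIN``, for detected races).

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stand-ins for the two scoping flags. */
enum scope_mode { SCOPE_BENEATH, SCOPE_IN_ROOT };

/* Walk `path` component by component and return the final depth below the
 * starting point (0 means the starting point itself), or -EXDEV when the
 * walk would leave it in SCOPE_BENEATH mode. */
int model_scoped_walk(const char *path, enum scope_mode mode)
{
    int depth = 0;
    const char *p = path;

    assert(path != NULL);

    /* An absolute path jumps to the root: BENEATH refuses, IN_ROOT treats
     * the starting point as the root, so depth simply stays 0. */
    if (*p == '/' && mode == SCOPE_BENEATH)
        return -EXDEV;

    while (*p != '\0') {
        size_t len;

        while (*p == '/')
            p++;
        len = strcspn(p, "/");
        if (len == 0)
            break;

        if (len == 2 && p[0] == '.' && p[1] == '.') {
            if (depth > 0)
                depth--;
            else if (mode == SCOPE_BENEATH)
                return -EXDEV; /* ".." would escape the starting point */
            /* SCOPE_IN_ROOT: ".." at the starting point is a no-op */
        } else if (!(len == 1 && p[0] == '.')) {
            depth++; /* ordinary component: one level deeper */
        }
        p += len;
    }
    return depth;
}
```

In this model ``model_scoped_walk("a/../..", SCOPE_BENEATH)`` is rejected, while the same path under ``SCOPE_IN_ROOT`` ends at the starting point, which mirrors the "block" versus "clamp" behaviour of the two flags.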

MAINTAINERS

Lines changed: 1 addition & 0 deletions
@@ -6458,6 +6458,7 @@ F: fs/*
 F: include/linux/fs.h
 F: include/linux/fs_types.h
 F: include/uapi/linux/fs.h
+F: include/uapi/linux/openat2.h

 FINTEK F75375S HARDWARE MONITOR AND FAN CONTROLLER DRIVER
 M: Riku Voipio <[email protected]>

arch/alpha/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -475,3 +475,4 @@
 543 common fspick sys_fspick
 544 common pidfd_open sys_pidfd_open
 # 545 reserved for clone3
+547 common openat2 sys_openat2

arch/arm/tools/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -449,3 +449,4 @@
 433 common fspick sys_fspick
 434 common pidfd_open sys_pidfd_open
 435 common clone3 sys_clone3
+437 common openat2 sys_openat2

arch/arm64/include/asm/unistd.h

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)

-#define __NR_compat_syscalls 436
+#define __NR_compat_syscalls 438
 #endif

 #define __ARCH_WANT_SYS_CLONE

arch/arm64/include/asm/unistd32.h

Lines changed: 2 additions & 0 deletions
@@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick)
 __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 #define __NR_clone3 435
 __SYSCALL(__NR_clone3, sys_clone3)
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)

 /*
  * Please add new compat syscalls above this comment and update

arch/ia64/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -356,3 +356,4 @@
 433 common fspick sys_fspick
 434 common pidfd_open sys_pidfd_open
 # 435 reserved for clone3
+437 common openat2 sys_openat2

arch/m68k/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -435,3 +435,4 @@
 433 common fspick sys_fspick
 434 common pidfd_open sys_pidfd_open
 435 common clone3 __sys_clone3
+437 common openat2 sys_openat2

arch/microblaze/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -441,3 +441,4 @@
 433 common fspick sys_fspick
 434 common pidfd_open sys_pidfd_open
 435 common clone3 sys_clone3
+437 common openat2 sys_openat2
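
Since the syscall is wired up under a fixed number in these tables, userspace can check at run time whether the running kernel actually provides it before choosing between openat2(2) and an emulated fallback. The sketch below is one possible probe, not part of this merge. `probe_openat2()` and `sketch_open_how` are hypothetical names, the struct layout is assumed to match the uapi header, and the fallback number 437 is only right for architectures whose table uses it (alpha, for example, uses 547).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback when libc headers predate openat2 (not valid on alpha). */
#ifndef SYS_openat2
#define SYS_openat2 437
#endif

/* Local mirror of the uapi struct open_how (assumed layout). */
struct sketch_open_how {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;
};

/* Hypothetical probe. Returns 1 if the kernel accepted an openat2 call,
 * 0 if the syscall is missing (ENOSYS), and -1 if the result is
 * inconclusive, for example when a sandbox denies the call. */
int probe_openat2(void)
{
    struct sketch_open_how how = { .flags = O_RDONLY | O_CLOEXEC };
    long fd;

    assert(sizeof(how) == 24); /* three 64-bit fields, no padding */

    /* Opening the current directory read-only has no side effects. */
    fd = syscall(SYS_openat2, AT_FDCWD, ".", &how, sizeof(how));
    if (fd >= 0) {
        close((int)fd);
        return 1;
    }
    return errno == ENOSYS ? 0 : -1;
}
```

A caller would typically run the probe once at startup and cache the answer. Treating the inconclusive case the same as "missing" is the conservative choice.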
