Commit fb3c538

Christian Brauner authored and kees committed
seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE
This allows the seccomp notifier to continue a syscall. A positive discussion about this feature was triggered by a post to the ksummit-discuss mailing list (cf. [3]) and took place during KSummit (cf. [1]) and again at the containers/checkpoint-restore micro-conference at Linux Plumbers.

Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF (cf. [4]) which enables a process (watchee) to retrieve an fd for its seccomp filter. This fd can then be handed to another (usually more privileged) process (watcher). The watcher will then be able to receive seccomp messages about the syscalls performed by the watchee.

This feature is heavily used in some userspace workloads. For example, it is currently used to intercept mknod() syscalls in user namespaces, i.e. in containers. The mknod() syscall can be easily filtered based on dev_t. This allows us to only intercept a very specific subset of mknod() syscalls. Furthermore, mknod() is not possible in user namespaces toto coelo, and so accidentally intercepting and denying syscalls that are not in the whitelist is not a big deal. The watchee won't notice a difference.

In contrast to mknod(), many other syscalls we intercept (e.g. setxattr()) cannot be easily filtered like mknod() because they have pointer arguments. Additionally, some of them might actually succeed in user namespaces (e.g. setxattr() for all "user.*" xattrs). Since we currently cannot tell seccomp to continue from a user notifier, we are stuck with performing all of the syscalls in lieu of the container. This is a huge security liability since it is extremely difficult to correctly assume all of the necessary privileges of the calling task such that the syscall can be successfully emulated without escaping other additional security restrictions (think missing CAP_MKNOD for mknod(), or MS_NODEV on a filesystem etc.). This can be solved by telling seccomp to resume the syscall.

One thing that came up in the discussion was the problem that another thread could change the memory after userspace has decided to let the syscall continue. This is a well-known TOCTOU with seccomp which is present in other ways already. The discussion showed that this feature is already very useful for any syscall without pointer arguments. For any accidentally intercepted non-pointer syscall it is safe to continue. For syscalls with pointer arguments there is a race, but for any cautious userspace and the main use cases the race doesn't matter.

The notifier is intended to be used in a scenario where a more privileged watcher supervises the syscalls of a less privileged watchee to allow it to get around kernel-enforced limitations by performing the syscall for it whenever deemed safe by the watcher. Hence, if a user tricks the watcher into allowing a syscall, they will either get a deny based on kernel-enforced restrictions later or they will have changed the arguments in such a way that they manage to perform a syscall with arguments that they would've been allowed to use anyway.

In general, it is good to point out again that the notifier fd was not intended to allow userspace to implement a security policy but rather to work around kernel security mechanisms in cases where the watcher knows that a given action is safe to perform.

/* References */
[1]: https://linuxplumbersconf.org/event/4/contributions/560
[2]: https://linuxplumbersconf.org/event/4/contributions/477
[3]: https://lore.kernel.org/r/[email protected]
[4]: commit 6a21cc5 ("seccomp: add a return code to trap to userspace")

Co-developed-by: Kees Cook <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: Tycho Andersen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Will Drewry <[email protected]>
CC: Tyler Hicks <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
1 parent 223e660 commit fb3c538

2 files changed: +51 -6 lines changed

include/uapi/linux/seccomp.h

Lines changed: 29 additions & 0 deletions
@@ -76,6 +76,35 @@ struct seccomp_notif {
 	struct seccomp_data data;
 };
 
+/*
+ * Valid flags for struct seccomp_notif_resp
+ *
+ * Note, the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with caution!
+ * If set by the process supervising the syscalls of another process the
+ * syscall will continue. This is problematic because of an inherent TOCTOU.
+ * An attacker can exploit the time while the supervised process is waiting on
+ * a response from the supervising process to rewrite syscall arguments which
+ * are passed as pointers of the intercepted syscall.
+ * It should be absolutely clear that this means that the seccomp notifier
+ * _cannot_ be used to implement a security policy! It should only ever be used
+ * in scenarios where a more privileged process supervises the syscalls of a
+ * lesser privileged process to get around kernel-enforced security
+ * restrictions when the privileged process deems this safe. In other words,
+ * in order to continue a syscall the supervising process should be sure that
+ * another security mechanism or the kernel itself will sufficiently block
+ * syscalls if arguments are rewritten to something unsafe.
+ *
+ * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
+ * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
+ * same syscall, the most recently added filter takes precedence. This means
+ * that the new SECCOMP_RET_USER_NOTIF filter can override any
+ * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
+ * such filtered syscalls to be executed by sending the response
+ * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
+ * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
+ */
+#define SECCOMP_USER_NOTIF_FLAG_CONTINUE BIT(0)
+
 struct seccomp_notif_resp {
 	__u64 id;
 	__s64 val;
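
From the watcher's side, using the new flag could look roughly like the sketch below. It is a minimal illustration, not code from this commit: the helper name respond_continue is hypothetical, the notify_fd and id are assumed to come from an earlier SECCOMP_IOCTL_NOTIF_RECV, and the #ifndef fallback is only there in case the installed uapi headers predate the flag.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Fallback for older uapi headers; bit 0 matches the definition in this commit. */
#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/*
 * Hypothetical helper: ask the kernel to let the intercepted syscall run.
 * notify_fd is the seccomp notifier fd; id is taken from the
 * struct seccomp_notif the watcher received for this syscall.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int respond_continue(int notify_fd, __u64 id)
{
	struct seccomp_notif_resp resp;

	/* error and val must remain 0 when the CONTINUE flag is set. */
	memset(&resp, 0, sizeof(resp));
	resp.id = id;
	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;

	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
}
```

As the header comment stresses, a watcher should only call something like this when the kernel or another security mechanism would still block the syscall if the watchee rewrote its pointer arguments after the check.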

kernel/seccomp.c

Lines changed: 22 additions & 6 deletions
@@ -75,6 +75,7 @@ struct seccomp_knotif {
 	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
 	int error;
 	long val;
+	u32 flags;
 
 	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
 	struct completion ready;
@@ -732,11 +733,12 @@ static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
 	return filter->notif->next_id++;
 }
 
-static void seccomp_do_user_notification(int this_syscall,
-					 struct seccomp_filter *match,
-					 const struct seccomp_data *sd)
+static int seccomp_do_user_notification(int this_syscall,
+					struct seccomp_filter *match,
+					const struct seccomp_data *sd)
 {
 	int err;
+	u32 flags = 0;
 	long ret = 0;
 	struct seccomp_knotif n = {};
 
@@ -764,6 +766,7 @@ static void seccomp_do_user_notification(int this_syscall,
 	if (err == 0) {
 		ret = n.val;
 		err = n.error;
+		flags = n.flags;
 	}
 
 	/*
@@ -780,8 +783,14 @@ static void seccomp_do_user_notification(int this_syscall,
 	list_del(&n.list);
 out:
 	mutex_unlock(&match->notify_lock);
+
+	/* Userspace requests to continue the syscall. */
+	if (flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE)
+		return 0;
+
 	syscall_set_return_value(current, task_pt_regs(current),
 				 err, ret);
+	return -1;
 }
 
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
@@ -867,8 +876,10 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 		return 0;
 
 	case SECCOMP_RET_USER_NOTIF:
-		seccomp_do_user_notification(this_syscall, match, sd);
-		goto skip;
+		if (seccomp_do_user_notification(this_syscall, match, sd))
+			goto skip;
+
+		return 0;
 
 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
@@ -1087,7 +1098,11 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
 	if (copy_from_user(&resp, buf, sizeof(resp)))
 		return -EFAULT;
 
-	if (resp.flags)
+	if (resp.flags & ~SECCOMP_USER_NOTIF_FLAG_CONTINUE)
+		return -EINVAL;
+
+	if ((resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE) &&
+	    (resp.error || resp.val))
 		return -EINVAL;
 
 	ret = mutex_lock_interruptible(&filter->notify_lock);
@@ -1116,6 +1131,7 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
 	knotif->state = SECCOMP_NOTIFY_REPLIED;
 	knotif->error = resp.error;
 	knotif->val = resp.val;
+	knotif->flags = resp.flags;
 	complete(&knotif->ready);
 out:
 	mutex_unlock(&filter->notify_lock);
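
To make the new rules in kernel/seccomp.c easier to follow, here is a small userspace model of them. It is only a sketch: struct resp_model, FLAG_CONTINUE, validate_resp and notification_result are hypothetical stand-ins, not kernel symbols. The first function mirrors the two checks added to seccomp_notify_send(), the second mirrors the return convention of seccomp_do_user_notification() that __seccomp_filter() now acts on.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in for SECCOMP_USER_NOTIF_FLAG_CONTINUE (bit 0). */
#define FLAG_CONTINUE (1U << 0)

/* Hypothetical local mirror of the response fields the checks look at. */
struct resp_model {
	uint64_t id;
	int64_t val;
	int32_t error;
	uint32_t flags;
};

/*
 * Models the validation added to seccomp_notify_send():
 * unknown flag bits are rejected, and CONTINUE may not be combined
 * with a non-zero error or return value.
 */
static int validate_resp(const struct resp_model *resp)
{
	if (resp->flags & ~FLAG_CONTINUE)
		return -EINVAL;

	if ((resp->flags & FLAG_CONTINUE) && (resp->error || resp->val))
		return -EINVAL;

	return 0;
}

/*
 * Models the return convention of seccomp_do_user_notification():
 * 0 means "let the syscall run", -1 means "skip it and use the
 * watcher-supplied error/val as the result".
 */
static int notification_result(uint32_t flags)
{
	return (flags & FLAG_CONTINUE) ? 0 : -1;
}
```

Rejecting error or val together with CONTINUE keeps the response unambiguous: the watcher either supplies the result itself or lets the kernel produce it, never both.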
