Commit 74858ab

Merge tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull checkpoint-restore updates from Christian Brauner:

"This enables unprivileged checkpoint/restore of processes.

Given that this work has been going on for quite some time the first sentence in this summary is hopefully more exciting than the actual final code changes required. Unprivileged checkpoint/restore has seen a steady increase in interest over the last two years and has thus been one of the main topics for the combined containers & checkpoint/restore microconference since at least 2018 (cf. [1]).

Here are just the three most frequent use-cases that were brought forward:

- The JVM developers are integrating checkpoint/restore into a Java VM to significantly decrease the startup time.

- In high-performance computing environments a resource manager will typically be distributing jobs where users are always running as non-root. Long-running and "large" processes with significant startup times are supposed to be checkpointed and restored with CRIU.

- Container migration as a non-root user.

In all of these scenarios it is either desirable or required to run without CAP_SYS_ADMIN. The userspace implementation of checkpoint/restore, CRIU, already has the pull request for supporting unprivileged checkpoint/restore up (cf. [2]).

To enable unprivileged checkpoint/restore a new dedicated capability CAP_CHECKPOINT_RESTORE is introduced. This solution was last discussed in 2019 in a talk by Google at Linux Plumbers (cf. [1] "Update on Task Migration at Google Using CRIU") with Adrian and Nicolas providing the implementation now over the last months. In essence, this allows the CRIU binary to be installed with the CAP_CHECKPOINT_RESTORE vfs capability set, thereby enabling unprivileged users to restore processes.

To make this possible the following permissions are altered:

- Selecting a specific PID via clone3() set_tid relaxed from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

- Selecting a specific PID via /proc/sys/kernel/ns_last_pid relaxed from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

- Accessing /proc/pid/map_files relaxed from init userns CAP_SYS_ADMIN to init userns CAP_CHECKPOINT_RESTORE.

- Changing /proc/self/exe from userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.

Of these four changes the /proc/self/exe change deserves a few words because the reasoning behind even restricting /proc/self/exe changes in the first place is just full of historical quirks and tracking this down was a questionable version of fun that I'd like to spare others.

In short, it is trivial to change /proc/self/exe as an unprivileged user, i.e. without userns CAP_SYS_ADMIN right now. Either via ptrace() or by simply intercepting the elf loader in userspace during exec. Nicolas was nice enough to even provide a POC for the latter (cf. [3]) to illustrate this fact.

The original patchset which introduced PR_SET_MM_MAP had no permissions around changing the exe link. They too argued that it is trivial to spoof the exe link already, which is true. The argument brought up against this was that the Tomoyo LSM uses the exe link in tomoyo_manager() to detect whether the calling process is a policy manager. This caused changing the exe links to be guarded by userns CAP_SYS_ADMIN.

All in all this rather seems like a "better guard it with something rather than nothing" argument which imho doesn't qualify as a great security policy. Again, because spoofing the exe link is possible for the calling process so even if this were security relevant it was broken back then and would be broken today. So technically, dropping all permissions around changing the exe link would probably be possible and would send a clearer message to any userspace that relies on /proc/self/exe for security reasons that they should stop doing this, but for now we're only relaxing the exe link permissions from userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.

There's a final uapi change in here. Changing the exe link used to accidentally return EINVAL when the caller lacked the necessary permissions instead of the more correct EPERM. This PR contains a commit fixing this. I assume that userspace won't notice or care and if they do I will revert this commit. But since we are changing the permissions anyway it seems like a good opportunity to try this fix.

With these changes merged unprivileged checkpoint/restore will be possible and has already been tested by various users"

[1] LPC 2018
    1. "Task Migration at Google Using CRIU"
       https://www.youtube.com/watch?v=yI_1cuhoDgA&t=12095
    2. "Securely Migrating Untrusted Workloads with CRIU"
       https://www.youtube.com/watch?v=yI_1cuhoDgA&t=14400
    LPC 2019
    1. "CRIU and the PID dance"
       https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2m48s
    2. "Update on Task Migration at Google Using CRIU"
       https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h2m8s

[2] checkpoint-restore/criu#1155

[3] https://github.com/nviennot/run_as_exe

* tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests: add clone3() CAP_CHECKPOINT_RESTORE test
  prctl: exe link permission error changed from -EINVAL to -EPERM
  prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
  proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
  pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
  pid: use checkpoint_restore_ns_capable() for set_tid
  capabilities: Introduce CAP_CHECKPOINT_RESTORE
2 parents 9ba2741 + 1d27a0b commit 74858ab

10 files changed, 217 additions and 15 deletions


fs/proc/base.c

Lines changed: 4 additions & 4 deletions
@@ -2189,16 +2189,16 @@ struct map_files_info {
 };
 
 /*
- * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
- * symlinks may be used to bypass permissions on ancestor directories in the
- * path to the file in question.
+ * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
+ * to concerns about how the symlinks may be used to bypass permissions on
+ * ancestor directories in the path to the file in question.
  */
 static const char *
 proc_map_files_get_link(struct dentry *dentry,
 			struct inode *inode,
 			struct delayed_call *done)
 {
-	if (!capable(CAP_SYS_ADMIN))
+	if (!checkpoint_restore_ns_capable(&init_user_ns))
 		return ERR_PTR(-EPERM);
 
 	return proc_pid_get_link(dentry, inode, done);

include/linux/capability.h

Lines changed: 6 additions & 0 deletions
@@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
 	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
 }
 
+static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
+{
+	return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
+		ns_capable(ns, CAP_SYS_ADMIN);
+}
+
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
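
The helper above passes when either capability is held in the given user namespace. As a rough user-space model of that "either one suffices" rule (not kernel code), the check can be expressed over a 64-bit effective-capability mask. The names `mask_has_cap` and `cr_capable_mask` and the mask representation are hypothetical and exist only for this sketch; the capability numbers 21 (CAP_SYS_ADMIN) and 40 (CAP_CHECKPOINT_RESTORE) come from the uapi header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Capability numbers from include/uapi/linux/capability.h. */
#define SKETCH_CAP_SYS_ADMIN          21
#define SKETCH_CAP_CHECKPOINT_RESTORE 40

/* Hypothetical helper: test a single capability bit in a 64-bit mask. */
static bool mask_has_cap(uint64_t effective, int cap)
{
	return ((effective >> cap) & 1u) != 0;
}

/*
 * Model of checkpoint_restore_ns_capable(): the new, narrow capability is
 * enough, and CAP_SYS_ADMIN keeps working for existing users.
 */
bool cr_capable_mask(uint64_t effective)
{
	return mask_has_cap(effective, SKETCH_CAP_CHECKPOINT_RESTORE) ||
	       mask_has_cap(effective, SKETCH_CAP_SYS_ADMIN);
}
```

The real helper additionally scopes the check to a user namespace through ns_capable(), which this model leaves out.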

include/uapi/linux/capability.h

Lines changed: 8 additions & 1 deletion
@@ -408,7 +408,14 @@ struct vfs_ns_cap_data {
  */
 #define CAP_BPF 39
 
-#define CAP_LAST_CAP CAP_BPF
+
+/* Allow checkpoint/restore related operations */
+/* Allow PID selection during clone3() */
+/* Allow writing to ns_last_pid */
+
+#define CAP_CHECKPOINT_RESTORE 40
+
+#define CAP_LAST_CAP CAP_CHECKPOINT_RESTORE
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
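
Because the new capability has number 40, it no longer fits in the first 32-bit word of a capability set. The sketch below mirrors the arithmetic of the kernel's CAP_TO_INDEX() and CAP_TO_MASK() macros to show where it lands; the `sketch_` names are hypothetical and only the numeric values are taken from the hunk above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined by the hunk above; the SKETCH_ prefix is ours. */
#define SKETCH_CAP_CHECKPOINT_RESTORE 40
#define SKETCH_CAP_LAST_CAP           SKETCH_CAP_CHECKPOINT_RESTORE

/* Same range check as cap_valid(). */
bool sketch_cap_valid(int cap)
{
	return cap >= 0 && cap <= SKETCH_CAP_LAST_CAP;
}

/* Index of the 32-bit word that holds this capability. */
unsigned int sketch_cap_word(int cap)
{
	assert(sketch_cap_valid(cap));
	return (unsigned int)cap >> 5;
}

/* Bit mask of the capability inside its 32-bit word. */
uint32_t sketch_cap_mask(int cap)
{
	assert(sketch_cap_valid(cap));
	return UINT32_C(1) << (cap & 31);
}
```

For capability 40 this gives word 1 and bit 8, which is the same position the selftest in this commit sets by hand with `1 << (40 - 32)` on `data[1]`.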

kernel/pid.c

Lines changed: 1 addition & 1 deletion
@@ -199,7 +199,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			if (tid != 1 && !tmp->child_reaper)
 				goto out_free;
 			retval = -EPERM;
-			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+			if (!checkpoint_restore_ns_capable(tmp->user_ns))
 				goto out_free;
 			set_tid_size--;
 		}
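
From user space, the check in this hunk is reached by passing a set_tid array to clone3(). The following is a minimal sketch of such a call, assuming a kernel that supports clone3() with set_tid; `sketch_clone_args` is a local copy of the first 80 bytes of the clone3() argument layout and `clone3_with_pid` is a hypothetical wrapper, neither of which is part of this commit.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_clone3
#define __NR_clone3 435
#endif

/* Local copy of the clone3() arguments up to and including set_tid_size. */
struct sketch_clone_args {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
	uint64_t set_tid;
	uint64_t set_tid_size;
};

/*
 * Hypothetical wrapper: fork a child that should get PID `want` in the
 * caller's PID namespace. Returns the child's PID on success or a
 * negative errno value on failure.
 */
long clone3_with_pid(pid_t want)
{
	pid_t set_tid[1] = { want };
	struct sketch_clone_args args = {
		.exit_signal = SIGCHLD,
		.set_tid = (uint64_t)(uintptr_t)set_tid,
		.set_tid_size = 1,
	};
	long pid = syscall(__NR_clone3, &args, sizeof(args));

	if (pid < 0)
		return -errno;
	if (pid == 0)
		_exit(0);		/* child: nothing to do in this sketch */

	waitpid((pid_t)pid, NULL, 0);	/* parent: reap the child */
	return pid;
}
```

With this change applied, a caller lacking both CAP_CHECKPOINT_RESTORE and CAP_SYS_ADMIN in the user namespace that owns the PID namespace should see -EPERM from such a call, which is the case the selftest in this commit exercises.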

kernel/pid_namespace.c

Lines changed: 1 addition & 1 deletion
@@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
 	struct ctl_table tmp = *table;
 	int ret, next;
 
-	if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+	if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
 		return -EPERM;
 
 	/*
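
The handler patched here backs /proc/sys/kernel/ns_last_pid. A restorer that wants its next child to receive a given PID writes that PID minus one to the file and then forks. The sketch below illustrates the idea under that assumption; `format_last_pid` and `request_next_pid` are hypothetical helper names, and the approach is racy unless other forks in the namespace are serialized.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Format the value to write so that the next allocated PID is `want`.
 * Returns the string length, or a negative errno value.
 */
int format_last_pid(pid_t want, char *buf, size_t len)
{
	int n;

	assert(buf != NULL);
	if (want < 1)
		return -EINVAL;

	n = snprintf(buf, len, "%d", (int)want - 1);
	if (n < 0 || (size_t)n >= len)
		return -ENOSPC;
	return n;
}

/*
 * Ask the kernel to hand out `want` as the next PID. Fails with -EPERM
 * when the caller has neither CAP_CHECKPOINT_RESTORE nor CAP_SYS_ADMIN
 * in the user namespace owning the PID namespace.
 */
int request_next_pid(pid_t want)
{
	char buf[32];
	int fd, n, ret = 0;

	n = format_last_pid(want, buf, sizeof(buf));
	if (n < 0)
		return n;

	fd = open("/proc/sys/kernel/ns_last_pid", O_WRONLY);
	if (fd < 0)
		return -errno;
	if (write(fd, buf, (size_t)n) < 0)
		ret = -errno;
	close(fd);
	return ret;
}
```

A caller would invoke `request_next_pid(pid)` immediately before fork() and then verify that the child actually received the requested PID.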

kernel/sys.c

Lines changed: 8 additions & 5 deletions
@@ -2007,12 +2007,15 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
 
 	if (prctl_map.exe_fd != (u32)-1) {
 		/*
-		 * Make sure the caller has the rights to
-		 * change /proc/pid/exe link: only local sys admin should
-		 * be allowed to.
+		 * Check if the current user is checkpoint/restore capable.
+		 * At the time of this writing, it checks for CAP_SYS_ADMIN
+		 * or CAP_CHECKPOINT_RESTORE.
+		 * Note that a user with access to ptrace can masquerade an
+		 * arbitrary program as any executable, even setuid ones.
+		 * This may have implications in the tomoyo subsystem.
 		 */
-		if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
-			return -EINVAL;
+		if (!checkpoint_restore_ns_capable(current_user_ns()))
+			return -EPERM;
 
 		error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
 		if (error)

security/selinux/include/classmap.h

Lines changed: 3 additions & 2 deletions
@@ -27,9 +27,10 @@
 	    "audit_control", "setfcap"
 
 #define COMMON_CAP2_PERMS "mac_override", "mac_admin", "syslog", \
-		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
+		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
+		"checkpoint_restore"
 
-#if CAP_LAST_CAP > CAP_BPF
+#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
 #error New capability defined, please update COMMON_CAP2_PERMS.
 #endif
 

tools/testing/selftests/clone3/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@
 clone3
 clone3_clear_sighand
 clone3_set_tid
+clone3_cap_checkpoint_restore

tools/testing/selftests/clone3/Makefile

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -g -I../../../../usr/include/
+LDLIBS += -lcap
 
-TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
+TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
+	clone3_cap_checkpoint_restore
 
 include ../lib.mk

tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c

Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests are assuming to be running in the host's
+ * PID namespace.
+ */
+
+/* capabilities related code based on selftests/bpf/test_verifier.c */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <sys/capability.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "../kselftest_harness.h"
+#include "clone3_selftests.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static void child_exit(int ret)
+{
+	fflush(stdout);
+	fflush(stderr);
+	_exit(ret);
+}
+
+static int call_clone3_set_tid(struct __test_metadata *_metadata,
+			       pid_t *set_tid, size_t set_tid_size)
+{
+	int status;
+	pid_t pid = -1;
+
+	struct clone_args args = {
+		.exit_signal = SIGCHLD,
+		.set_tid = ptr_to_u64(set_tid),
+		.set_tid_size = set_tid_size,
+	};
+
+	pid = sys_clone3(&args, sizeof(struct clone_args));
+	if (pid < 0) {
+		TH_LOG("%s - Failed to create new process", strerror(errno));
+		return -errno;
+	}
+
+	if (pid == 0) {
+		int ret;
+		char tmp = 0;
+
+		TH_LOG("I am the child, my PID is %d (expected %d)", getpid(), set_tid[0]);
+
+		if (set_tid[0] != getpid())
+			child_exit(EXIT_FAILURE);
+		child_exit(EXIT_SUCCESS);
+	}
+
+	TH_LOG("I am the parent (%d). My child's pid is %d", getpid(), pid);
+
+	if (waitpid(pid, &status, 0) < 0) {
+		TH_LOG("Child returned %s", strerror(errno));
+		return -errno;
+	}
+
+	if (!WIFEXITED(status))
+		return -1;
+
+	return WEXITSTATUS(status);
+}
+
+static int test_clone3_set_tid(struct __test_metadata *_metadata,
+			       pid_t *set_tid, size_t set_tid_size)
+{
+	int ret;
+
+	TH_LOG("[%d] Trying clone3() with CLONE_SET_TID to %d", getpid(), set_tid[0]);
+	ret = call_clone3_set_tid(_metadata, set_tid, set_tid_size);
+	TH_LOG("[%d] clone3() with CLONE_SET_TID %d says:%d", getpid(), set_tid[0], ret);
+	return ret;
+}
+
+struct libcap {
+	struct __user_cap_header_struct hdr;
+	struct __user_cap_data_struct data[2];
+};
+
+static int set_capability(void)
+{
+	cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
+	struct libcap *cap;
+	int ret = -1;
+	cap_t caps;
+
+	caps = cap_get_proc();
+	if (!caps) {
+		perror("cap_get_proc");
+		return -1;
+	}
+
+	/* Drop all capabilities */
+	if (cap_clear(caps)) {
+		perror("cap_clear");
+		goto out;
+	}
+
+	cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
+	cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
+
+	cap = (struct libcap *) caps;
+
+	/* 40 -> CAP_CHECKPOINT_RESTORE */
+	cap->data[1].effective |= 1 << (40 - 32);
+	cap->data[1].permitted |= 1 << (40 - 32);
+
+	if (cap_set_proc(caps)) {
+		perror("cap_set_proc");
+		goto out;
+	}
+	ret = 0;
+out:
+	if (cap_free(caps))
+		perror("cap_free");
+	return ret;
+}
+
+TEST(clone3_cap_checkpoint_restore)
+{
+	pid_t pid;
+	int status;
+	int ret = 0;
+	pid_t set_tid[1];
+
+	test_clone3_supported();
+
+	EXPECT_EQ(getuid(), 0)
+		XFAIL(return, "Skipping all tests as non-root\n");
+
+	memset(&set_tid, 0, sizeof(set_tid));
+
+	/* Find the current active PID */
+	pid = fork();
+	if (pid == 0) {
+		TH_LOG("Child has PID %d", getpid());
+		child_exit(EXIT_SUCCESS);
+	}
+	ASSERT_GT(waitpid(pid, &status, 0), 0)
+		TH_LOG("Waiting for child %d failed", pid);
+
+	/* After the child has finished, its PID should be free. */
+	set_tid[0] = pid;
+
+	ASSERT_EQ(set_capability(), 0)
+		TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
+
+	ASSERT_EQ(prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0), 0);
+
+	EXPECT_EQ(setgid(65534), 0)
+		TH_LOG("Failed to setgid(65534)");
+	ASSERT_EQ(setuid(65534), 0);
+
+	set_tid[0] = pid;
+	/* This would fail without CAP_CHECKPOINT_RESTORE */
+	ASSERT_EQ(test_clone3_set_tid(_metadata, set_tid, 1), -EPERM);
+	ASSERT_EQ(set_capability(), 0)
+		TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
+	/* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
+	ASSERT_EQ(test_clone3_set_tid(_metadata, set_tid, 1), 0);
+}
+
+TEST_HARNESS_MAIN
