Commit 9b62498

ucounts: Count rlimits in each user namespace
Preface
-------
These patches bind the rlimit counters to a user in a user namespace. This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.12-rc4

Problem
-------
The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE rlimit implementations place their counters in user_struct [1]. These limits are global between processes and persist for the lifetime of the process, even if the processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not fork. Some service-A wants to run this program as user X in multiple containers. Since the program never forks, the service wants to set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

Service-A sets RLIMIT_NPROC=1 and runs the program in container1. When service-A tries to run the program with RLIMIT_NPROC=1 in container2, it fails because user X already has one running process.

The problem is not that the limit from container1 affects container2. The problem is that the limit is verified against the global counter that reflects the number of processes in all containers.

This problem can be worked around by using a different user for each container, but then we face a different problem: uid mapping when transferring files from one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
------------------
To address the problem, we bind rlimit counters to the user namespace. Each counter reflects the number of processes for a given uid in a given user namespace. The result is a tree of rlimit counters with the biggest value at the root (aka init_user_ns). The limit is considered exceeded if it is exceeded anywhere up the tree.
[1]: https://lore.kernel.org/containers/[email protected]/
[2]: https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3]: https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
---------
v11:
* Revert most of changes in signal.c to fix performance issues and remove unnecessary memory allocations.
* Fixed issue found by lkp robot (again).

v10:
* Fixed memory leak in __sigqueue_alloc.
* Handled an unlikely situation when all consumers will return ucounts at once.
* Addressed other review comments from Eric W. Biederman.

v9:
* Used a negative value to check that the ucounts->count is close to overflow.
* Rebased onto v5.12-rc4.

v8:
* Used atomic_t for ucounts reference counting. Also added counter overflow check (thanks to Linus Torvalds for the idea).
* Fixed other issues found by lkp-tests project in the patch that Reimplements RLIMIT_MEMLOCK on top of ucounts.

v7:
* Fixed issues found by lkp-tests project in the patch that Reimplements RLIMIT_MEMLOCK on top of ucounts.

v6:
* Fixed issues found by lkp-tests project.
* Rebased onto v5.11.

v5:
* Split the first commit into two commits: change ucounts.count type to atomic_long_t and add ucounts to cred. These commits were merged by mistake during the rebase.
* The __get_ucounts() renamed to alloc_ucounts().
* The cred.ucounts update has been moved from commit_creds() as it did not allow to handle errors.
* Added error handling of set_cred_ucounts().

v4:
* Reverted the type change of ucounts.count to refcount_t.
* Fixed typo in the kernel/cred.c

v3:
* Added get_ucounts() function to increase the reference count. The existing get_counts() function renamed to __get_ucounts().
* The type of ucounts.count changed from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for pair uid and user namespace into cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (9):
  Increase size of ucounts to atomic_long_t
  Add a reference to ucounts for each cred
  Use atomic_t for ucounts reference counting
  Reimplement RLIMIT_NPROC on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  kselftests: Add test to check for rlimit changes in different user namespaces
  ucounts: Set ucount_max to the largest positive value the type can hold

 fs/exec.c                                  |   6 +-
 fs/hugetlbfs/inode.c                       |  16 +-
 fs/proc/array.c                            |   2 +-
 include/linux/cred.h                       |   4 +
 include/linux/hugetlb.h                    |   4 +-
 include/linux/mm.h                         |   4 +-
 include/linux/sched/user.h                 |   7 -
 include/linux/shmem_fs.h                   |   2 +-
 include/linux/signal_types.h               |   4 +-
 include/linux/user_namespace.h             |  31 +++-
 ipc/mqueue.c                               |  40 ++---
 ipc/shm.c                                  |  26 +--
 kernel/cred.c                              |  50 +++++-
 kernel/exit.c                              |   2 +-
 kernel/fork.c                              |  18 +-
 kernel/signal.c                            |  25 +--
 kernel/sys.c                               |  14 +-
 kernel/ucount.c                            | 116 ++++++++++---
 kernel/user.c                              |   3 -
 kernel/user_namespace.c                    |   9 +-
 mm/memfd.c                                 |   4 +-
 mm/mlock.c                                 |  22 ++-
 mm/mmap.c                                  |   4 +-
 mm/shmem.c                                 |  10 +-
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/rlimits/.gitignore |   2 +
 tools/testing/selftests/rlimits/Makefile   |   6 +
 tools/testing/selftests/rlimits/config     |   1 +
 .../selftests/rlimits/rlimits-per-userns.c | 161 ++++++++++++++++++
 29 files changed, 467 insertions(+), 127 deletions(-)
 create mode 100644 tools/testing/selftests/rlimits/.gitignore
 create mode 100644 tools/testing/selftests/rlimits/Makefile
 create mode 100644 tools/testing/selftests/rlimits/config
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

Link: https://lkml.kernel.org/r/[email protected]
2 parents 9f4ad9e + c1ada3d commit 9b62498
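
The counter tree described in the commit message can be pictured with a small user-space model. This is only a sketch under stated assumptions: `model_counter`, `model_inc`, `model_dec` and `model_try_charge` are hypothetical names, the code is single-threaded (the kernel uses atomics), and it is not the code from kernel/ucount.c. It assumes that a charge is applied at the leaf and at every ancestor, that each level is bounded by a per-namespace ceiling, and that the caller's own rlimit is compared against the leaf value only.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical model of one (uid, user namespace) counter. */
struct model_counter {
	struct model_counter *parent; /* same user in the parent namespace; NULL at the root */
	long count;                   /* units charged at this level */
	long ns_max;                  /* per-namespace ceiling for this level */
};

/*
 * Charge v units at the leaf and at every ancestor.
 * Returns the new leaf value, or LONG_MAX if any level went above its
 * ceiling. The charge is applied in both cases, so the caller has to
 * undo it with model_dec() when it decides to fail.
 */
static long model_inc(struct model_counter *leaf, long v)
{
	long ret = 0;

	assert(leaf != NULL && v > 0);
	for (struct model_counter *it = leaf; it != NULL; it = it->parent) {
		it->count += v;
		if (it->count > it->ns_max)
			ret = LONG_MAX;
		else if (it == leaf)
			ret = it->count;
	}
	return ret;
}

/* Undo a charge of v units along the same path. */
static void model_dec(struct model_counter *leaf, long v)
{
	for (struct model_counter *it = leaf; it != NULL; it = it->parent)
		it->count -= v;
}

/* Try to account one more process against rlimit; 0 on success, -1 on failure. */
static int model_try_charge(struct model_counter *leaf, long rlimit)
{
	if (model_inc(leaf, 1) > rlimit) {
		model_dec(leaf, 1);
		return -1;
	}
	return 0;
}
```

With one root counter and two child counters (one per container), each container can charge one process under a limit of 1 while the root counter reads 2, which is the service-A scenario from the commit message.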

29 files changed, +467 -127 lines changed

fs/exec.c

Lines changed: 5 additions & 1 deletion
@@ -1360,6 +1360,10 @@ int begin_new_exec(struct linux_binprm * bprm)
 	WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
 	flush_signal_handlers(me, 0);

+	retval = set_cred_ucounts(bprm->cred);
+	if (retval < 0)
+		goto out_unlock;
+
 	/*
 	 * install the new credentials for this executable
 	 */
@@ -1874,7 +1878,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	 * whether NPROC limit is still exceeded.
 	 */
 	if ((current->flags & PF_NPROC_EXCEEDED) &&
-	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
+	    is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
 		retval = -EAGAIN;
 		goto out_ret;
 	}
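
The second hunk swaps a single-counter comparison for a call to is_ucounts_overlimit(). Its body lives in kernel/ucount.c, which is not shown here, so the following is only a user-space sketch of the semantics the commit message suggests ("exceeded up in the tree"). `model_level` and `model_overlimit` are hypothetical names, and the rule that the caller-supplied limit applies to the leaf while each level is also bound by its namespace ceiling is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one level in the per-user counter tree. */
struct model_level {
	struct model_level *parent; /* NULL at the root namespace */
	long count;                 /* current value at this level */
	long ns_max;                /* per-namespace ceiling */
};

/* Return true if the leaf is above rlimit or any level is above its ceiling. */
static bool model_overlimit(const struct model_level *leaf, unsigned long rlimit)
{
	assert(leaf != NULL);

	/* The caller's own rlimit applies only to the leaf counter. */
	if (leaf->count > 0 && (unsigned long)leaf->count > rlimit)
		return true;

	/* Every level, leaf included, is also bound by its namespace ceiling. */
	for (const struct model_level *it = leaf; it != NULL; it = it->parent) {
		if (it->count > it->ns_max)
			return true;
	}
	return false;
}
```

In this model a container whose own counter is within its limit still reports "over limit" if an ancestor namespace has exhausted its ceiling.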

fs/hugetlbfs/inode.c

Lines changed: 8 additions & 8 deletions
@@ -1443,7 +1443,7 @@ static int get_hstate_idx(int page_size_log)
  * otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
  */
 struct file *hugetlb_file_setup(const char *name, size_t size,
-				vm_flags_t acctflag, struct user_struct **user,
+				vm_flags_t acctflag, struct ucounts **ucounts,
 				int creat_flags, int page_size_log)
 {
 	struct inode *inode;
@@ -1455,20 +1455,20 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
 	if (hstate_idx < 0)
 		return ERR_PTR(-ENODEV);

-	*user = NULL;
+	*ucounts = NULL;
 	mnt = hugetlbfs_vfsmount[hstate_idx];
 	if (!mnt)
 		return ERR_PTR(-ENOENT);

 	if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
-		*user = current_user();
-		if (user_shm_lock(size, *user)) {
+		*ucounts = current_ucounts();
+		if (user_shm_lock(size, *ucounts)) {
 			task_lock(current);
 			pr_warn_once("%s (%d): Using mlock ulimits for SHM_HUGETLB is deprecated\n",
 				current->comm, current->pid);
 			task_unlock(current);
 		} else {
-			*user = NULL;
+			*ucounts = NULL;
 			return ERR_PTR(-EPERM);
 		}
 	}
@@ -1495,9 +1495,9 @@ struct file *hugetlb_file_setup(const char *name, size_t size,

 	iput(inode);
 out:
-	if (*user) {
-		user_shm_unlock(size, *user);
-		*user = NULL;
+	if (*ucounts) {
+		user_shm_unlock(size, *ucounts);
+		*ucounts = NULL;
 	}
 	return file;
 }

fs/proc/array.c

Lines changed: 1 addition & 1 deletion
@@ -284,7 +284,7 @@ static inline void task_sig(struct seq_file *m, struct task_struct *p)
 		collect_sigign_sigcatch(p, &ignored, &caught);
 		num_threads = get_nr_threads(p);
 		rcu_read_lock();  /* FIXME: is this correct? */
-		qsize = atomic_read(&__task_cred(p)->user->sigpending);
+		qsize = get_ucounts_value(task_ucounts(p), UCOUNT_RLIMIT_SIGPENDING);
 		rcu_read_unlock();
 		qlim = task_rlimit(p, RLIMIT_SIGPENDING);
 		unlock_task_sighand(p, &flags);
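
The qsize and qlim values computed here are what user space sees in the SigQ field of /proc/<pid>/status, printed as "queued/limit". As a hedged illustration of how to observe the effect of this hunk from user space, the sketch below parses that field. `parse_sigq` and `read_self_sigq` are hypothetical helper names, and the parser assumes both numbers are non-negative decimals.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one line of /proc/<pid>/status.
 * Returns 0 and fills *queued and *limit if the line is a SigQ entry,
 * -1 otherwise.
 */
static int parse_sigq(const char *line, unsigned long *queued, unsigned long *limit)
{
	assert(line != NULL && queued != NULL && limit != NULL);
	/* The space in the format matches the tab that follows "SigQ:". */
	return sscanf(line, "SigQ: %lu/%lu", queued, limit) == 2 ? 0 : -1;
}

/* Read the SigQ entry of the calling process; 0 on success, -1 on failure. */
static int read_self_sigq(unsigned long *queued, unsigned long *limit)
{
	char buf[256];
	int ret = -1;
	FILE *f = fopen("/proc/self/status", "r");

	if (f == NULL)
		return -1;
	while (fgets(buf, sizeof(buf), f) != NULL) {
		if (parse_sigq(buf, queued, limit) == 0) {
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}
```

Running read_self_sigq() inside and outside a user namespace for the same uid is one way to check whether the queued-signal count is tracked per namespace.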

include/linux/cred.h

Lines changed: 4 additions & 0 deletions
@@ -144,6 +144,7 @@ struct cred {
 #endif
 	struct user_struct *user; /* real user ID subscription */
 	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
+	struct ucounts *ucounts;
 	struct group_info *group_info; /* supplementary groups for euid/fsgid */
 	/* RCU deletion */
 	union {
@@ -170,6 +171,7 @@ extern int set_security_override_from_ctx(struct cred *, const char *);
 extern int set_create_files_as(struct cred *, struct inode *);
 extern int cred_fscmp(const struct cred *, const struct cred *);
 extern void __init cred_init(void);
+extern int set_cred_ucounts(struct cred *);

 /*
  * check for validity of credentials
@@ -370,6 +372,7 @@ static inline void put_cred(const struct cred *_cred)

 #define task_uid(task) (task_cred_xxx((task), uid))
 #define task_euid(task) (task_cred_xxx((task), euid))
+#define task_ucounts(task) (task_cred_xxx((task), ucounts))

 #define current_cred_xxx(xxx) \
 ({ \
@@ -386,6 +389,7 @@ static inline void put_cred(const struct cred *_cred)
 #define current_fsgid() (current_cred_xxx(fsgid))
 #define current_cap() (current_cred_xxx(cap_effective))
 #define current_user() (current_cred_xxx(user))
+#define current_ucounts() (current_cred_xxx(ucounts))

 extern struct user_namespace init_user_ns;
 #ifdef CONFIG_USER_NS

include/linux/hugetlb.h

Lines changed: 2 additions & 2 deletions
@@ -434,7 +434,7 @@ static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
 extern const struct file_operations hugetlbfs_file_operations;
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
-				struct user_struct **user, int creat_flags,
+				struct ucounts **ucounts, int creat_flags,
 				int page_size_log);

 static inline bool is_file_hugepages(struct file *file)
@@ -454,7 +454,7 @@ static inline struct hstate *hstate_inode(struct inode *i)
 #define is_file_hugepages(file) false
 static inline struct file *
 hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
-		struct user_struct **user, int creat_flags,
+		struct ucounts **ucounts, int creat_flags,
 		int page_size_log)
 {
 	return ERR_PTR(-ENOSYS);

include/linux/mm.h

Lines changed: 2 additions & 2 deletions
@@ -1670,8 +1670,8 @@ extern bool can_do_mlock(void);
 #else
 static inline bool can_do_mlock(void) { return false; }
 #endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, struct ucounts *);
+extern void user_shm_unlock(size_t, struct ucounts *);

 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.

include/linux/sched/user.h

Lines changed: 0 additions & 7 deletions
@@ -12,19 +12,12 @@
  */
 struct user_struct {
 	refcount_t __count; /* reference count */
-	atomic_t processes; /* How many processes does this user have? */
-	atomic_t sigpending; /* How many pending signals does this user have? */
 #ifdef CONFIG_FANOTIFY
 	atomic_t fanotify_listeners;
 #endif
 #ifdef CONFIG_EPOLL
 	atomic_long_t epoll_watches; /* The number of file descriptors currently watched */
 #endif
-#ifdef CONFIG_POSIX_MQUEUE
-	/* protected by mq_lock */
-	unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
-#endif
-	unsigned long locked_shm; /* How many pages of mlocked shm ? */
 	unsigned long unix_inflight; /* How many files in flight in unix sockets */
 	atomic_long_t pipe_bufs; /* how many pages are allocated in pipe buffers */

include/linux/shmem_fs.h

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
 extern int shmem_zero_setup(struct vm_area_struct *);
 extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
-extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern int shmem_lock(struct file *file, int lock, struct ucounts *ucounts);
 #ifdef CONFIG_SHMEM
 extern const struct address_space_operations shmem_aops;
 static inline bool shmem_mapping(struct address_space *mapping)

include/linux/signal_types.h

Lines changed: 3 additions & 1 deletion
@@ -13,6 +13,8 @@ typedef struct kernel_siginfo {
 	__SIGINFO;
 } kernel_siginfo_t;

+struct ucounts;
+
 /*
  * Real Time signals may be queued.
  */
@@ -21,7 +23,7 @@ struct sigqueue {
 	struct list_head list;
 	int flags;
 	kernel_siginfo_t info;
-	struct user_struct *user;
+	struct ucounts *ucounts;
 };

 /* flags values. */

include/linux/user_namespace.h

Lines changed: 28 additions & 3 deletions
@@ -50,9 +50,15 @@ enum ucount_type {
 	UCOUNT_INOTIFY_INSTANCES,
 	UCOUNT_INOTIFY_WATCHES,
 #endif
+	UCOUNT_RLIMIT_NPROC,
+	UCOUNT_RLIMIT_MSGQUEUE,
+	UCOUNT_RLIMIT_SIGPENDING,
+	UCOUNT_RLIMIT_MEMLOCK,
 	UCOUNT_COUNTS,
 };

+#define MAX_PER_NAMESPACE_UCOUNTS UCOUNT_RLIMIT_NPROC
+
 struct user_namespace {
 	struct uid_gid_map uid_map;
 	struct uid_gid_map gid_map;
@@ -88,23 +94,42 @@ struct user_namespace {
 	struct ctl_table_header *sysctls;
 #endif
 	struct ucounts *ucounts;
-	int ucount_max[UCOUNT_COUNTS];
+	long ucount_max[UCOUNT_COUNTS];
 } __randomize_layout;

 struct ucounts {
 	struct hlist_node node;
 	struct user_namespace *ns;
 	kuid_t uid;
-	int count;
-	atomic_t ucount[UCOUNT_COUNTS];
+	atomic_t count;
+	atomic_long_t ucount[UCOUNT_COUNTS];
 };

 extern struct user_namespace init_user_ns;
+extern struct ucounts init_ucounts;

 bool setup_userns_sysctls(struct user_namespace *ns);
 void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
+struct ucounts * __must_check get_ucounts(struct ucounts *ucounts);
+void put_ucounts(struct ucounts *ucounts);
+
+static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type type)
+{
+	return atomic_long_read(&ucounts->ucount[type]);
+}
+
+long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
+bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
+bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max);
+
+static inline void set_rlimit_ucount_max(struct user_namespace *ns,
+		enum ucount_type type, unsigned long max)
+{
+	ns->ucount_max[type] = max <= LONG_MAX ? max : LONG_MAX;
+}

 #ifdef CONFIG_USER_NS
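
set_rlimit_ucount_max() stores an unsigned long limit into the now signed `long ucount_max[]` array, clamping anything above LONG_MAX. The standalone sketch below reproduces only that expression so the edge cases can be checked in user space; `clamp_rlimit_to_long` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <limits.h>

/*
 * Mirror of the clamping expression in set_rlimit_ucount_max():
 * values that do not fit in a signed long saturate at LONG_MAX.
 */
static long clamp_rlimit_to_long(unsigned long max)
{
	long v = max <= LONG_MAX ? (long)max : LONG_MAX;

	assert(v >= 0); /* a stored ceiling is never negative */
	return v;
}
```

An "unlimited" rlimit, which is the all-ones unsigned long value on common Linux ABIs, therefore becomes LONG_MAX, so a signed counter compared against the stored ceiling never sees a negative maximum.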
