Commit 08ef26e

fs: add file_ref
As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop, it has
O(N^2) behaviour under contention with N concurrent operations, and it sits
in a hot path in __fget_files_rcu().

The rcuref infrastructure remedies this problem by using an unconditional
increment, relying on safe and dead zones to make this work and requiring
RCU protection for the data structure in question. This not only scales
better, it also introduces overflow protection.

However, in contrast to generic rcuref, files require a memory barrier and
thus cannot rely on *_relaxed() atomic operations. They also need to be
built on atomic_long_t, as having massive amounts of references isn't
unheard of, even if it is just an attack. As suggested by Linus, add a
file-specific variant instead of making this a generic library.

Files are SLAB_TYPESAFE_BY_RCU and thus don't have "regular" RCU
protection. In short, freeing of files isn't delayed until a grace period
has elapsed. Instead, they are freed immediately and thus can be reused
(multiple times) within the same grace period.

So when picking a file from the file descriptor table via its file
descriptor number, it is possible to see an elevated reference count on
file->f_count even though the file has already been recycled, possibly
multiple times, by another task.

To guard against this, the vfs will pick the file from the file descriptor
table twice. Once before the refcount increment and once after, to compare
the pointers (grossly simplified). If they match, the file is still valid.
If not, the caller needs to fput() it.
The unconditional increment makes the following race possible, as
illustrated by rcuref:

> Deconstruction race
> ===================
>
> The release operation must be protected by prohibiting a grace period in
> order to prevent a possible use after free:
>
>         T1                              T2
>         put()                           get()
>         // ref->refcnt = ONEREF
>         if (!atomic_add_negative(-1, &ref->refcnt))
>                 return false;                           <- Not taken
>
>         // ref->refcnt == NOREF
>         --> preemption
>                                         // Elevates ref->refcnt to ONEREF
>                                         if (!atomic_add_negative(1, &ref->refcnt))
>                                                 return true;    <- taken
>
>                                         if (put(&p->ref)) { <-- Succeeds
>                                                 remove_pointer(p);
>                                                 kfree_rcu(p, rcu);
>                                         }
>
>                 RCU grace period ends, object is freed
>
>         atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);      <- UAF
>
> [...] it prevents the grace period which keeps the object alive until
> all put() operations complete.

Having files be SLAB_TYPESAFE_BY_RCU shouldn't cause any problems for this
deconstruction race. Afaict, the only interesting case would be someone
freeing the file and someone immediately recycling it within the same grace
period and reinitializing file->f_count to ONEREF while a concurrent fput()
is doing atomic_cmpxchg(&ref->refcnt, NOREF, DEAD) as in the race above.
But this is safe from SLAB_TYPESAFE_BY_RCU's perspective and it should be
safe from rcuref's perspective.

        T1                      T2                      T3
        fput()                  fget()
        // f_count->refcnt = ONEREF
        if (!atomic_add_negative(-1, &f_count->refcnt))
                return false;                           <- Not taken

        // f_count->refcnt == NOREF
        --> preemption
                                // Elevates f_count->refcnt to ONEREF
                                if (!atomic_add_negative(1, &f_count->refcnt))
                                        return true;    <- taken

                                if (put(&f_count)) { <-- Succeeds
                                        remove_pointer(p);
                                        /*
                                         * Cache is SLAB_TYPESAFE_BY_RCU
                                         * so this is freed without a
                                         * grace period.
                                         */
                                        kmem_cache_free(p);
                                }

                                                        kmem_cache_alloc()
                                                        init_file()
                                                        {
                                                                // Sets f_count->refcnt to ONEREF
                                                                rcuref_long_init(&f->f_count, 1);
                                                        }

        Object has been reused within the same grace period via
        kmem_cache_alloc()'s SLAB_TYPESAFE_BY_RCU.
        /*
         * With SLAB_TYPESAFE_BY_RCU this would be a safe UAF access and
         * it would work correctly because the atomic_cmpxchg()
         * will fail because the refcount has been reset to ONEREF by T3.
         */
        atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);      <- UAF

However, there are other cases to consider:

(1) Benign race due to multiple atomic_long_read()

        CPU1                                    CPU2
        file_ref_put()
        // last reference
        // => count goes negative/FILE_REF_NOREF
        atomic_long_add_negative_release(-1, &ref->refcnt)
        -> __file_ref_put()
                                                file_ref_get()
                                                // goes back from negative/FILE_REF_NOREF to 0
                                                // and file_ref_get() succeeds
                                                atomic_long_add_negative(1, &ref->refcnt)

                                                // This is immediately followed by file_ref_put()
                                                // managing to set FILE_REF_DEAD
                                                file_ref_put()

        // __file_ref_put() continues and sees
        // cnt > FILE_REF_RELEASED
        // and splats with
        // "imbalanced put on file reference count"
        cnt = atomic_long_read(&ref->refcnt);

    The race, however, is benign; the problem is the separate
    atomic_long_read(). Instead of performing a separate read, use
    atomic_long_dec_return() and pass the resulting value to
    __file_ref_put(). Thanks to Linus for pointing out that braino.

(2) SLAB_TYPESAFE_BY_RCU may cause recycled files to be marked dead

    When a file is recycled the following race exists:

        CPU1                                    CPU2
        // @file is already dead and thus
        // cnt >= FILE_REF_RELEASED.
        file_ref_get(file)
        atomic_long_add_negative(1, &ref->refcnt)
        // We thus call into __file_ref_get()
        -> __file_ref_get()
        // which sees cnt >= FILE_REF_RELEASED
        cnt = atomic_long_read(&ref->refcnt);
                                                // In the meantime @file gets freed
                                                kmem_cache_free()
                                                // and is immediately recycled
                                                file = kmem_cache_zalloc()
                                                // and the reference count is reinitialized
                                                // and the file is alive again in someone
                                                // else's file descriptor table
                                                file_ref_init(&ref->refcnt, 1);

        // The __file_ref_get() slowpath now continues
        // and as it saw earlier that cnt >= FILE_REF_RELEASED
        // it wants to ensure that we're staying in the middle
        // of the deadzone and unconditionally sets
        // FILE_REF_DEAD.
        // This marks @file dead for CPU2...
        atomic_long_set(&ref->refcnt, FILE_REF_DEAD);

                                                // Caller issues a close() system call to close @file
                                                close(fd)
                                                file = file_close_fd_locked()
                                                filp_flush()
                                                // The caller sees that cnt >= FILE_REF_RELEASED
                                                // and warns the first time...
                                                CHECK_DATA_CORRUPTION(file_count(file) == 0)

                                                // and then splats a second time because
                                                // __file_ref_put() sees cnt >= FILE_REF_RELEASED
                                                file_ref_put(&ref->refcnt);
                                                -> __file_ref_put()

    My initial inclination was to replace the unconditional
    atomic_long_set() with an atomic_long_try_cmpxchg(), but Linus pointed
    out that:

    > I think we should just make file_ref_get() do a simple
    >
    >         return !atomic_long_add_negative(1, &ref->refcnt);
    >
    > and nothing else. Yes, multiple CPU's can race, and you can increment
    > more than once, but the gap - even on 32-bit - between DEAD and
    > becoming close to REF_RELEASED is so big that we simply don't care.
    > That's the point of having a gap.

I've been testing this with will-it-scale using fstat() on a machine that
Jens gave me access to (thank you very much!):

        processor       : 511
        vendor_id       : AuthenticAMD
        cpu family      : 25
        model           : 160
        model name      : AMD EPYC 9754 128-Core Processor

and I consistently get a 3-5% improvement on 256+ threads.

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Closes: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
1 parent 8b1bc25 commit 08ef26e

2 files changed: 240 additions, 0 deletions

fs/file.c

Lines changed: 63 additions & 0 deletions
@@ -20,10 +20,73 @@
 #include <linux/spinlock.h>
 #include <linux/rcupdate.h>
 #include <linux/close_range.h>
+#include <linux/file_ref.h>
 #include <net/sock.h>

 #include "internal.h"

+/**
+ * __file_ref_put - Slowpath of file_ref_put()
+ * @ref:        Pointer to the reference count
+ * @cnt:        Current reference count
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ *      True if this was the last reference with no future references
+ *      possible. This signals the caller that it can safely schedule the
+ *      object, which is protected by the reference counter, for
+ *      deconstruction.
+ *
+ *      False if there are still active references or the put() raced
+ *      with a concurrent get()/put() pair. Caller is not allowed to
+ *      deconstruct the protected object.
+ */
+bool __file_ref_put(file_ref_t *ref, unsigned long cnt)
+{
+        /* Did this drop the last reference? */
+        if (likely(cnt == FILE_REF_NOREF)) {
+                /*
+                 * Carefully try to set the reference count to FILE_REF_DEAD.
+                 *
+                 * This can fail if a concurrent get() operation has
+                 * elevated it again or the corresponding put() even marked
+                 * it dead already. Both are valid situations and do not
+                 * require a retry. If this fails the caller is not
+                 * allowed to deconstruct the object.
+                 */
+                if (!atomic_long_try_cmpxchg_release(&ref->refcnt, &cnt, FILE_REF_DEAD))
+                        return false;
+
+                /*
+                 * The caller can safely schedule the object for
+                 * deconstruction. Provide acquire ordering.
+                 */
+                smp_acquire__after_ctrl_dep();
+                return true;
+        }
+
+        /*
+         * If the reference count was already in the dead zone, then this
+         * put() operation is imbalanced. Warn, put the reference count back to
+         * DEAD and tell the caller to not deconstruct the object.
+         */
+        if (WARN_ONCE(cnt >= FILE_REF_RELEASED, "imbalanced put on file reference count")) {
+                atomic_long_set(&ref->refcnt, FILE_REF_DEAD);
+                return false;
+        }
+
+        /*
+         * This is a put() operation on a saturated refcount. Restore the
+         * mean saturation value and tell the caller to not deconstruct the
+         * object.
+         */
+        if (cnt > FILE_REF_MAXREF)
+                atomic_long_set(&ref->refcnt, FILE_REF_SATURATED);
+        return false;
+}
+EXPORT_SYMBOL_GPL(__file_ref_put);
+
 unsigned int sysctl_nr_open __read_mostly = 1024*1024;
 unsigned int sysctl_nr_open_min = BITS_PER_LONG;
 /* our min() is unusable in constant expressions ;-/ */

include/linux/file_ref.h

Lines changed: 177 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,177 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_FILE_REF_H
+#define _LINUX_FILE_REF_H
+
+#include <linux/atomic.h>
+#include <linux/preempt.h>
+#include <linux/types.h>
+
+/*
+ * file_ref is a reference count implementation specifically for use by
+ * files. It takes inspiration from rcuref but differs in key aspects
+ * such as support for SLAB_TYPESAFE_BY_RCU type caches.
+ *
+ * FILE_REF_ONEREF                                   FILE_REF_MAXREF
+ * 0x0000000000000000UL                              0x7FFFFFFFFFFFFFFFUL
+ * <------------------valid ------------------->
+ *
+ *                       FILE_REF_SATURATED
+ * 0x8000000000000000UL  0xA000000000000000UL        0xBFFFFFFFFFFFFFFFUL
+ * <-----------------------saturation zone---------------------->
+ *
+ * FILE_REF_RELEASED                                 FILE_REF_DEAD
+ * 0xC000000000000000UL                              0xE000000000000000UL
+ * <-------------------dead zone------------------->
+ *
+ * FILE_REF_NOREF
+ * 0xFFFFFFFFFFFFFFFFUL
+ */
+
+#ifdef CONFIG_64BIT
+#define FILE_REF_ONEREF         0x0000000000000000UL
+#define FILE_REF_MAXREF         0x7FFFFFFFFFFFFFFFUL
+#define FILE_REF_SATURATED      0xA000000000000000UL
+#define FILE_REF_RELEASED       0xC000000000000000UL
+#define FILE_REF_DEAD           0xE000000000000000UL
+#define FILE_REF_NOREF          0xFFFFFFFFFFFFFFFFUL
+#else
+#define FILE_REF_ONEREF         0x00000000U
+#define FILE_REF_MAXREF         0x7FFFFFFFU
+#define FILE_REF_SATURATED      0xA0000000U
+#define FILE_REF_RELEASED       0xC0000000U
+#define FILE_REF_DEAD           0xE0000000U
+#define FILE_REF_NOREF          0xFFFFFFFFU
+#endif
+
+typedef struct {
+#ifdef CONFIG_64BIT
+        atomic64_t refcnt;
+#else
+        atomic_t refcnt;
+#endif
+} file_ref_t;
+
+/**
+ * file_ref_init - Initialize a file reference count
+ * @ref:        Pointer to the reference count
+ * @cnt:        The initial reference count, typically '1'
+ */
+static inline void file_ref_init(file_ref_t *ref, unsigned long cnt)
+{
+        atomic_long_set(&ref->refcnt, cnt - 1);
+}
+
+bool __file_ref_put(file_ref_t *ref, unsigned long cnt);
+
+/**
+ * file_ref_get - Acquire one reference on a file
+ * @ref:        Pointer to the reference count
+ *
+ * Similar to atomic_inc_not_zero() but saturates at FILE_REF_MAXREF.
+ *
+ * Provides full memory ordering.
+ *
+ * Return: False if the attempt to acquire a reference failed. This happens
+ *         when the last reference has been put already. True if a reference
+ *         was successfully acquired.
+ */
+static __always_inline __must_check bool file_ref_get(file_ref_t *ref)
+{
+        /*
+         * Unconditionally increase the reference count with full
+         * ordering. The saturation and dead zones provide enough
+         * tolerance for this.
+         *
+         * If this indicates negative, the file in question may already
+         * have been freed and immediately reused due to
+         * SLAB_TYPESAFE_BY_RCU. Hence, unconditionally altering the
+         * file reference count, e.g., to reset it back to the middle
+         * of the deadzone, risks marking someone else's file as dead
+         * behind their back.
+         *
+         * It would be possible to do a careful:
+         *
+         * cnt = atomic_long_inc_return();
+         * if (likely(cnt >= 0))
+         *      return true;
+         *
+         * and then something like:
+         *
+         * if (cnt >= FILE_REF_RELEASED)
+         *      atomic_long_try_cmpxchg(&ref->refcnt, &cnt, FILE_REF_DEAD),
+         *
+         * to set the value back to the middle of the deadzone. But it's
+         * practically impossible to go from FILE_REF_DEAD to
+         * FILE_REF_ONEREF. It would need 2305843009213693952 (2^61)
+         * file_ref_get()s to resurrect such a dead file.
+         */
+        return !atomic_long_add_negative(1, &ref->refcnt);
+}
+
+/**
+ * file_ref_inc - Acquire one reference on a file
+ * @ref:        Pointer to the reference count
+ *
+ * Acquire an additional reference on a file. Warns if the caller didn't
+ * already hold a reference.
+ */
+static __always_inline void file_ref_inc(file_ref_t *ref)
+{
+        long prior = atomic_long_fetch_inc_relaxed(&ref->refcnt);
+        WARN_ONCE(prior < 0, "file_ref_inc() on a released file reference");
+}
+
+/**
+ * file_ref_put - Release a file reference
+ * @ref:        Pointer to the reference count
+ *
+ * Provides release memory ordering, such that prior loads and stores
+ * are done before, and provides an acquire ordering on success such
+ * that free() must come after.
+ *
+ * Return: True if this was the last reference with no future references
+ *         possible. This signals the caller that it can safely release
+ *         the object which is protected by the reference counter.
+ *         False if there are still active references or the put() raced
+ *         with a concurrent get()/put() pair. Caller is not allowed to
+ *         release the protected object.
+ */
+static __always_inline __must_check bool file_ref_put(file_ref_t *ref)
+{
+        long cnt;
+
+        /*
+         * While files are SLAB_TYPESAFE_BY_RCU and thus file_ref_put()
+         * calls don't risk UAFs when a file is recycled, it is still
+         * vulnerable to UAFs caused by freeing the whole slab page once
+         * it becomes unused. Preventing file_ref_put() from being
+         * preempted protects against this.
+         */
+        guard(preempt)();
+        /*
+         * Unconditionally decrease the reference count. The saturation
+         * and dead zones provide enough tolerance for this. If this
+         * fails then we need to handle the last reference drop and
+         * cases inside the saturation and dead zones.
+         */
+        cnt = atomic_long_dec_return(&ref->refcnt);
+        if (cnt >= 0)
+                return false;
+        return __file_ref_put(ref, cnt);
+}
+
+/**
+ * file_ref_read - Read the number of file references
+ * @ref:        Pointer to the reference count
+ *
+ * Return: The number of held references (0 ... N)
+ */
+static inline unsigned long file_ref_read(file_ref_t *ref)
+{
+        unsigned long c = atomic_long_read(&ref->refcnt);
+
+        /* Return 0 if within the DEAD zone. */
+        return c >= FILE_REF_RELEASED ? 0 : c + 1;
+}
+
+#endif
