Skip to content

Commit 4f07ec0

Browse files
committed
futex: Prevent inconsistent state and exit race
The recent rework of the requeue PI code introduced a possibility for going back to user space in inconsistent state: CPU 0 CPU 1 requeue_futex() if (lock_pifutex_user()) { dequeue_waiter(); wake_waiter(task); sched_in(task); return_from_futex_syscall(); ---> Inconsistent state because PI state is not established It becomes worse if the woken up task immediately exits: sys_exit(); attach_pistate(vpid); <--- FAIL Attach the pi state before dequeuing and waking the waiter. If the waiter gets a spurious wakeup before the dequeue operation it will wait in futex_requeue_pi_wakeup_sync() and therefore cannot return and exit. Fixes: 07d91ef ("futex: Prevent requeue_pi() lock nesting issue on RT") Reported-by: [email protected] Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
1 parent a974b54 commit 4f07ec0

File tree

1 file changed

+55
-43
lines changed

1 file changed

+55
-43
lines changed

kernel/futex.c

Lines changed: 55 additions & 43 deletions
Original file line numberDiff line numberDiff line change
@@ -1454,8 +1454,23 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
14541454
newval |= FUTEX_WAITERS;
14551455

14561456
ret = lock_pi_update_atomic(uaddr, uval, newval);
1457-
/* If the take over worked, return 1 */
1458-
return ret < 0 ? ret : 1;
1457+
if (ret)
1458+
return ret;
1459+
1460+
/*
1461+
* If the waiter bit was requested the caller also needs PI
1462+
* state attached to the new owner of the user space futex.
1463+
*
1464+
* @task is guaranteed to be alive and it cannot be exiting
1465+
* because it is either sleeping or waiting in
1466+
* futex_requeue_pi_wakeup_sync().
1467+
*/
1468+
if (set_waiters) {
1469+
ret = attach_to_pi_owner(uaddr, newval, key, ps,
1470+
exiting);
1471+
WARN_ON(ret);
1472+
}
1473+
return 1;
14591474
}
14601475

14611476
/*
@@ -2036,17 +2051,24 @@ futex_proxy_trylock_atomic(u32 __user *pifutex, struct futex_hash_bucket *hb1,
20362051
return -EAGAIN;
20372052

20382053
/*
2039-
* Try to take the lock for top_waiter. Set the FUTEX_WAITERS bit in
2040-
* the contended case or if set_waiters is 1. The pi_state is returned
2041-
* in ps in contended cases.
2054+
* Try to take the lock for top_waiter and set the FUTEX_WAITERS bit
2055+
* in the contended case or if @set_waiters is true.
2056+
*
2057+
* In the contended case PI state is attached to the lock owner. If
2058+
* the user space lock can be acquired then PI state is attached to
2059+
* the new owner (@top_waiter->task) when @set_waiters is true.
20422060
*/
20432061
vpid = task_pid_vnr(top_waiter->task);
20442062
ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task,
20452063
exiting, set_waiters);
20462064
if (ret == 1) {
2047-
/* Dequeue, wake up and update top_waiter::requeue_state */
2065+
/*
2066+
* Lock was acquired in user space and PI state was
2067+
* attached to @top_waiter->task. That means state is fully
2068+
* consistent and the waiter can return to user space
2069+
* immediately after the wakeup.
2070+
*/
20482071
requeue_pi_wake_futex(top_waiter, key2, hb2);
2049-
return vpid;
20502072
} else if (ret < 0) {
20512073
/* Rewind top_waiter::requeue_state */
20522074
futex_requeue_pi_complete(top_waiter, ret);
@@ -2208,19 +2230,26 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
22082230
&exiting, nr_requeue);
22092231

22102232
/*
2211-
* At this point the top_waiter has either taken uaddr2 or is
2212-
* waiting on it. If the former, then the pi_state will not
2213-
* exist yet, look it up one more time to ensure we have a
2214-
* reference to it. If the lock was taken, @ret contains the
2215-
* VPID of the top waiter task.
2216-
* If the lock was not taken, we have pi_state and an initial
2217-
* refcount on it. In case of an error we have nothing.
2233+
* At this point the top_waiter has either taken uaddr2 or
2234+
* is waiting on it. In both cases pi_state has been
2235+
* established and an initial refcount on it. In case of an
2236+
* error there's nothing.
22182237
*
22192238
* The top waiter's requeue_state is up to date:
22202239
*
2221-
* - If the lock was acquired atomically (ret > 0), then
2240+
* - If the lock was acquired atomically (ret == 1), then
22222241
* the state is Q_REQUEUE_PI_LOCKED.
22232242
*
2243+
* The top waiter has been dequeued and woken up and can
2244+
* return to user space immediately. The kernel/user
2245+
* space state is consistent. In case that there must be
2246+
* more waiters requeued the WAITERS bit in the user
2247+
* space futex is set so the top waiter task has to go
2248+
* into the syscall slowpath to unlock the futex. This
2249+
* will block until this requeue operation has been
2250+
* completed and the hash bucket locks have been
2251+
* dropped.
2252+
*
22242253
* - If the trylock failed with an error (ret < 0) then
22252254
* the state is either Q_REQUEUE_PI_NONE, i.e. "nothing
22262255
* happened", or Q_REQUEUE_PI_IGNORE when there was an
@@ -2234,36 +2263,20 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
22342263
* the same sanity checks for requeue_pi as the loop
22352264
* below does.
22362265
*/
2237-
if (ret > 0) {
2238-
WARN_ON(pi_state);
2239-
task_count++;
2240-
/*
2241-
* If futex_proxy_trylock_atomic() acquired the
2242-
* user space futex, then the user space value
2243-
* @uaddr2 has been set to the @hb1's top waiter
2244-
* task VPID. This task is guaranteed to be alive
2245-
* and cannot be exiting because it is either
2246-
* sleeping or blocked on @hb2 lock.
2247-
*
2248-
* The @uaddr2 futex cannot have waiters either as
2249-
* otherwise futex_proxy_trylock_atomic() would not
2250-
* have succeeded.
2251-
*
2252-
* In order to requeue waiters to @hb2, pi state is
2253-
* required. Hand in the VPID value (@ret) and
2254-
* allocate PI state with an initial refcount on
2255-
* it.
2256-
*/
2257-
ret = attach_to_pi_owner(uaddr2, ret, &key2, &pi_state,
2258-
&exiting);
2259-
WARN_ON(ret);
2260-
}
2261-
22622266
switch (ret) {
22632267
case 0:
22642268
/* We hold a reference on the pi state. */
22652269
break;
22662270

2271+
case 1:
2272+
/*
2273+
* futex_proxy_trylock_atomic() acquired the user space
2274+
* futex. Adjust task_count.
2275+
*/
2276+
task_count++;
2277+
ret = 0;
2278+
break;
2279+
22672280
/*
22682281
* If the above failed, then pi_state is NULL and
22692282
* waiter::requeue_state is correct.
@@ -2395,9 +2408,8 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
23952408
}
23962409

23972410
/*
2398-
* We took an extra initial reference to the pi_state either in
2399-
* futex_proxy_trylock_atomic() or in attach_to_pi_owner(). We need
2400-
* to drop it here again.
2411+
* We took an extra initial reference to the pi_state in
2412+
* futex_proxy_trylock_atomic(). We need to drop it here again.
24012413
*/
24022414
put_pi_state(pi_state);
24032415

0 commit comments

Comments
 (0)