Commit b6e13e8

Peter Zijlstra authored and suryasaimadhu committed
sched/core: Fix ttwu() race
Paul reported rcutorture occasionally hitting a NULL deref:

    sched_ttwu_pending()
      ttwu_do_wakeup()
        check_preempt_curr() := check_preempt_wakeup()
          find_matching_se()
            is_same_group()
              if (se->cfs_rq == pse->cfs_rq) <-- *BOOM*

Debugging showed that this only appears to happen when we take the new
code-path from commit:

    2ebb177 ("sched/core: Offload wakee task activation if it the wakee is descheduling")

and only when @cpu == smp_processor_id(). Something which should not be
possible, because p->on_cpu can only be true for remote tasks.
Similarly, without the new code-path from commit:

    c6e7bd7 ("sched/core: Optimize ttwu() spinning on p->on_cpu")

this would've unconditionally hit:

    smp_cond_load_acquire(&p->on_cpu, !VAL);

and if: 'cpu == smp_processor_id() && p->on_cpu' is possible, this would
result in an instant live-lock (with IRQs disabled), something that
hasn't been reported.

The NULL deref can be explained however if the task_cpu(p) load at the
beginning of try_to_wake_up() returns an old value, and this old value
happens to be smp_processor_id(). Further assume that the p->on_cpu load
accurately returns 1, it really is still running, just not here.

Then, when we enqueue the task locally, we can crash in exactly the
observed manner because p->se.cfs_rq != rq->cfs_rq, because p's cfs_rq
is from the wrong CPU, therefore we'll iterate into the non-existent
parents and NULL deref.

The closest semi-plausible scenario I've managed to contrive is somewhat
elaborate (then again, actual reproduction takes many CPU hours of
rcutorture, so it can't be anything obvious):

    X->cpu = 1
    rq(1)->curr = X

    CPU0                              CPU1                              CPU2

                                      // switch away from X
                                      LOCK rq(1)->lock
                                      smp_mb__after_spinlock
                                      dequeue_task(X)
                                        X->on_rq = 0
                                      switch_to(Z)
                                        X->on_cpu = 0
                                      UNLOCK rq(1)->lock

                                                                        // migrate X to cpu 0
                                                                        LOCK rq(1)->lock
                                                                        dequeue_task(X)
                                                                        set_task_cpu(X, 0)
                                                                          X->cpu = 0
                                                                        UNLOCK rq(1)->lock

                                                                        LOCK rq(0)->lock
                                                                        enqueue_task(X)
                                                                          X->on_rq = 1
                                                                        UNLOCK rq(0)->lock

    // switch to X
    LOCK rq(0)->lock
    smp_mb__after_spinlock
    switch_to(X)
      X->on_cpu = 1
    UNLOCK rq(0)->lock

    // X goes sleep
    X->state = TASK_UNINTERRUPTIBLE
    smp_mb();                         // wake X
                                      ttwu()
                                        LOCK X->pi_lock
                                        smp_mb__after_spinlock

                                        if (p->state)

                                        cpu = X->cpu; // =? 1

                                        smp_rmb()

    // X calls schedule()
    LOCK rq(0)->lock
    smp_mb__after_spinlock
    dequeue_task(X)
      X->on_rq = 0

                                        if (p->on_rq)

                                        smp_rmb();

                                        if (p->on_cpu && ttwu_queue_wakelist(..)) [*]

                                        smp_cond_load_acquire(&p->on_cpu, !VAL)

                                        cpu = select_task_rq(X, X->wake_cpu, ...)
                                        if (X->cpu != cpu)
    switch_to(Y)
      X->on_cpu = 0
    UNLOCK rq(0)->lock

However I'm having trouble convincing myself that's actually possible on
x86_64 -- after all, every LOCK implies an smp_mb() there, so if ttwu
observes ->state != RUNNING, it must also observe ->cpu != 1.

(Most of the previous ttwu() races were found on very large PowerPC)

Nevertheless, this fully explains the observed failure case.

Fix it by ordering the task_cpu(p) load after the p->on_cpu load, which
is easy since nothing actually uses @cpu before this.

Fixes: c6e7bd7 ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Paul E. McKenney <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
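The ordering the fix relies on can be modelled in user space with C11 atomics. This is only a minimal sketch and not kernel code: `struct task_model`, `model_switch_in()`, `model_switch_out()` and `model_running_cpu()` are hypothetical names, and the release/acquire pair is assumed to stand in for the (stronger) rq->lock plus full barrier on the scheduler side and for smp_load_acquire() on the waker side.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the two task_struct fields involved. */
struct task_model {
	atomic_int cpu;     /* models p->cpu, written by set_task_cpu() */
	atomic_int on_cpu;  /* models p->on_cpu, set when the task is switched in */
};

/* Scheduler side: publish the CPU first, then mark the task as running. */
static void model_switch_in(struct task_model *p, int cpu)
{
	atomic_store_explicit(&p->cpu, cpu, memory_order_relaxed);
	/* Assumption: release ordering approximates the kernel's stronger barrier. */
	atomic_store_explicit(&p->on_cpu, 1, memory_order_release);
}

/* Scheduler side: the task has been switched out. */
static void model_switch_out(struct task_model *p)
{
	atomic_store_explicit(&p->on_cpu, 0, memory_order_release);
}

/*
 * Waker side, in the fixed order: load on_cpu with acquire semantics first,
 * and only then load cpu. If on_cpu reads as 1, the cpu value published
 * before that store is guaranteed to be visible.
 * Returns true and fills *cpu when the task is observed running.
 */
static bool model_running_cpu(struct task_model *p, int *cpu)
{
	if (!atomic_load_explicit(&p->on_cpu, memory_order_acquire))
		return false;
	*cpu = atomic_load_explicit(&p->cpu, memory_order_relaxed);
	return true;
}
```

In this model the pre-fix code corresponds to reading `cpu` before `on_cpu`, which allows a stale CPU number to be paired with a fresh `on_cpu == 1`.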
1 parent 740797c commit b6e13e8

1 file changed: +28 -5 lines changed


kernel/sched/core.c

Lines changed: 28 additions & 5 deletions
@@ -2293,8 +2293,15 @@ void sched_ttwu_pending(void *arg)
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
-	llist_for_each_entry_safe(p, t, llist, wake_entry)
+	llist_for_each_entry_safe(p, t, llist, wake_entry) {
+		if (WARN_ON_ONCE(p->on_cpu))
+			smp_cond_load_acquire(&p->on_cpu, !VAL);
+
+		if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
+			set_task_cpu(p, cpu_of(rq));
+
 		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
+	}
 
 	rq_unlock_irqrestore(rq, &rf);
 }
@@ -2378,6 +2385,9 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
 	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
+		if (WARN_ON_ONCE(cpu == smp_processor_id()))
+			return false;
+
 		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
 		__ttwu_queue_wakelist(p, cpu, wake_flags);
 		return true;
@@ -2528,7 +2538,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 			goto out;
 
 		success = 1;
-		cpu = task_cpu(p);
 		trace_sched_waking(p);
 		p->state = TASK_RUNNING;
 		trace_sched_wakeup(p);
@@ -2550,7 +2559,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 
 	/* We're going to change ->state: */
 	success = 1;
-	cpu = task_cpu(p);
 
 	/*
 	 * Ensure we load p->on_rq _after_ p->state, otherwise it would
@@ -2614,8 +2622,21 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * which potentially sends an IPI instead of spinning on p->on_cpu to
 	 * let the waker make forward progress. This is safe because IRQs are
 	 * disabled and the IPI will deliver after on_cpu is cleared.
+	 *
+	 * Ensure we load task_cpu(p) after p->on_cpu:
+	 *
+	 * set_task_cpu(p, cpu);
+	 *   STORE p->cpu = @cpu
+	 * __schedule() (switch to task 'p')
+	 *   LOCK rq->lock
+	 *   smp_mb__after_spin_lock()          smp_cond_load_acquire(&p->on_cpu)
+	 *   STORE p->on_cpu = 1                LOAD p->cpu
+	 *
+	 * to ensure we observe the correct CPU on which the task is currently
+	 * scheduling.
 	 */
-	if (READ_ONCE(p->on_cpu) && ttwu_queue_wakelist(p, cpu, wake_flags | WF_ON_RQ))
+	if (smp_load_acquire(&p->on_cpu) &&
+	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_RQ))
 		goto unlock;
 
 	/*
/*
@@ -2635,14 +2656,16 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		psi_ttwu_dequeue(p);
 		set_task_cpu(p, cpu);
 	}
+#else
+	cpu = task_cpu(p);
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);
 unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 out:
 	if (success)
-		ttwu_stat(p, cpu, wake_flags);
+		ttwu_stat(p, task_cpu(p), wake_flags);
 	preempt_enable();
 
 	return success;
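
Besides reordering the loads, the diff adds two defensive checks: ttwu_queue_wakelist() refuses to queue a wakeup for the current CPU, and sched_ttwu_pending() corrects a task whose recorded CPU does not match the runqueue being drained. The decision logic alone might be sketched in plain C as follows. All names here (`struct wake_model`, `model_may_queue_remote()`, `model_fixup_cpu()`) are hypothetical, and the sketch leaves out the WARN_ON_ONCE() reporting, the locking, and the wait on p->on_cpu.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the one task field these checks look at. */
struct wake_model {
	int cpu;  /* models task_cpu(p) */
};

/*
 * Models the new bail-out in ttwu_queue_wakelist(): a wakeup is only
 * handed to another CPU's wake list, never to the CPU doing the wakeup.
 */
static bool model_may_queue_remote(int target_cpu, int this_cpu)
{
	return target_cpu != this_cpu;
}

/*
 * Models the new check in sched_ttwu_pending(): if the task's recorded
 * CPU disagrees with the runqueue being processed, correct it before
 * activation. Returns true when a correction was needed.
 */
static bool model_fixup_cpu(struct wake_model *p, int rq_cpu)
{
	if (p->cpu == rq_cpu)
		return false;
	p->cpu = rq_cpu;
	return true;
}
```

In the kernel both conditions are expected never to trigger after the fix, which is presumably why they are wrapped in WARN_ON_ONCE() rather than handled silently.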
