
Commit e787644

Authored by Frederic Weisbecker, committed by neeraju
rcu: Defer RCU kthreads wakeup when CPU is dying
When the CPU goes idle for the last time during the CPU down hotplug process, RCU reports a final quiescent state for the current CPU. If this quiescent state propagates up to the top, some tasks may then be woken up to complete the grace period: the main grace-period kthread and/or the expedited main workqueue (or kworker).

If those kthreads have a SCHED_FIFO policy, the wake up can indirectly arm the RT bandwidth timer to the local offline CPU. Since this happens after hrtimers have been migrated at the CPUHP_AP_HRTIMERS_DYING stage, the timer gets ignored. Therefore, if the RCU kthreads are waiting for RT bandwidth to be available, they may never actually be scheduled.

This triggers TREE03 rcutorture hangs:

    rcu: INFO: rcu_preempt self-detected stall on CPU
    rcu:     4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
    rcu:     (t=21035 jiffies g=938281 q=40787 ncpus=6)
    rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
    rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
    rcu: RCU grace-period kthread stack dump:
    task:rcu_preempt     state:R  running task     stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
    Call Trace:
     <TASK>
     __schedule+0x2eb/0xa80
     schedule+0x1f/0x90
     schedule_timeout+0x163/0x270
     ? __pfx_process_timeout+0x10/0x10
     rcu_gp_fqs_loop+0x37c/0x5b0
     ? __pfx_rcu_gp_kthread+0x10/0x10
     rcu_gp_kthread+0x17c/0x200
     kthread+0xde/0x110
     ? __pfx_kthread+0x10/0x10
     ret_from_fork+0x2b/0x40
     ? __pfx_kthread+0x10/0x10
     ret_from_fork_asm+0x1b/0x30
     </TASK>

The situation can't be solved by simply unpinning the timer. The hrtimer infrastructure and the nohz heuristics involved in finding the best remote target for an unpinned timer would then also need to handle enqueues from an offline CPU in the most horrendous way. So fix this on the RCU side instead and defer the wake up to an online CPU if it's too late for the local one.

Reported-by: Paul E. McKenney <[email protected]>
Fixes: 5c0930c ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Neeraj Upadhyay (AMD) <[email protected]>
1 parent 6613476 commit e787644

File tree: 2 files changed, +34 -3 lines

kernel/rcu/tree.c

Lines changed: 33 additions & 1 deletion
@@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
 	return needmore;
 }
 
+static void swake_up_one_online_ipi(void *arg)
+{
+	struct swait_queue_head *wqh = arg;
+
+	swake_up_one(wqh);
+}
+
+static void swake_up_one_online(struct swait_queue_head *wqh)
+{
+	int cpu = get_cpu();
+
+	/*
+	 * If called from rcutree_report_cpu_starting(), wake up
+	 * is dangerous that late in the CPU-down hotplug process. The
+	 * scheduler might queue an ignored hrtimer. Defer the wake up
+	 * to an online CPU instead.
+	 */
+	if (unlikely(cpu_is_offline(cpu))) {
+		int target;
+
+		target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
+					 cpu_online_mask);
+
+		smp_call_function_single(target, swake_up_one_online_ipi,
+					 wqh, 0);
+		put_cpu();
+	} else {
+		put_cpu();
+		swake_up_one(wqh);
+	}
+}
+
 /*
  * Awaken the grace-period kthread. Don't do a self-awaken (unless in an
  * interrupt or softirq handler, in which case we just might immediately
@@ -1037,7 +1069,7 @@ static void rcu_gp_kthread_wake(void)
 		return;
 	WRITE_ONCE(rcu_state.gp_wake_time, jiffies);
 	WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq));
-	swake_up_one(&rcu_state.gp_wq);
+	swake_up_one_online(&rcu_state.gp_wq);
 }
 
 /*

kernel/rcu/tree_exp.h

Lines changed: 1 addition & 2 deletions
@@ -173,7 +173,6 @@ static bool sync_rcu_exp_done_unlocked(struct rcu_node *rnp)
 	return ret;
 }
 
-
 /*
  * Report the exit from RCU read-side critical section for the last task
  * that queued itself during or before the current expedited preemptible-RCU
@@ -201,7 +200,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
 			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 			if (wake) {
 				smp_mb(); /* EGP done before wake_up(). */
-				swake_up_one(&rcu_state.expedited_wq);
+				swake_up_one_online(&rcu_state.expedited_wq);
 			}
 			break;
 		}
