Commit 98f887f

npiggin authored and htejun committed
workqueue: Improve scalability of workqueue watchdog touch
On a ~2000 CPU powerpc system, hard lockups have been observed in the
workqueue code when stop_machine runs (in this case due to CPU hotplug).
This is due to lots of CPUs spinning in multi_cpu_stop, calling
touch_nmi_watchdog() which ends up calling wq_watchdog_touch().
wq_watchdog_touch() writes to the global variable wq_watchdog_touched,
and that can find itself in the same cacheline as other important
workqueue data, which slows down operations to the point of lockups.

In the case of the following abridged trace, worker_pool_idr was in
the hot line, causing the lockups to always appear at idr_find.

  watchdog: CPU 1125 self-detected hard LOCKUP @ idr_find
  Call Trace:
  get_work_pool
  __queue_work
  call_timer_fn
  run_timer_softirq
  __do_softirq
  do_softirq_own_stack
  irq_exit
  timer_interrupt
  decrementer_common_virt
  * interrupt: 900 (timer) at multi_cpu_stop
  multi_cpu_stop
  cpu_stopper_thread
  smpboot_thread_fn
  kthread

Fix this by having wq_watchdog_touch() only write to the line if the
last time a touch was recorded exceeds 1/4 of the watchdog threshold.

Reported-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
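The effect of the fix described above can be illustrated with a small userspace sketch. This is not the kernel code: global_touched, global_stores, throttled_touch() and simulate_touches() are hypothetical names introduced here, time is modelled as a plain tick counter rather than jiffies, and there is no real concurrency. It only shows how gating the store on thresh / 4 cuts the number of writes to the shared variable.

```c
#include <stdbool.h>

/* Wraparound-safe "a is after b", in the spirit of the kernel's time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Hypothetical stand-ins for the kernel's global state. */
static unsigned long global_touched;	/* plays the role of wq_watchdog_touched */
static unsigned long global_stores;	/* counts writes to the shared variable */

/*
 * Record a touch at tick `now`, storing to the shared variable only
 * when the last recorded touch is more than thresh / 4 ticks old.
 */
static void throttled_touch(unsigned long now, unsigned long thresh)
{
	if (tick_after(now, global_touched + thresh / 4)) {
		global_touched = now;
		global_stores++;
	}
}

/* Simulate `calls` touches, one per tick starting at tick 1; return the store count. */
static unsigned long simulate_touches(unsigned long calls, unsigned long thresh)
{
	global_touched = 0;
	global_stores = 0;
	for (unsigned long t = 1; t <= calls; t++)
		throttled_touch(t, thresh);
	return global_stores;
}
```

With a threshold of 400 ticks, simulate_touches(1000, 400) performs 9 stores instead of 1000; with a threshold of 0 the gate never holds a store back and every call writes.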
1 parent 18e24de commit 98f887f

1 file changed, 8 insertions(+), 2 deletions(-)

kernel/workqueue.c

@@ -7524,12 +7524,18 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
 
 notrace void wq_watchdog_touch(int cpu)
 {
+	unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
+	unsigned long touch_ts = READ_ONCE(wq_watchdog_touched);
+	unsigned long now = jiffies;
+
 	if (cpu >= 0)
-		per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
+		per_cpu(wq_watchdog_touched_cpu, cpu) = now;
 	else
 		WARN_ONCE(1, "%s should be called with valid CPU", __func__);
 
-	wq_watchdog_touched = jiffies;
+	/* Don't unnecessarily store to global cacheline */
+	if (time_after(now, touch_ts + thresh / 4))
+		WRITE_ONCE(wq_watchdog_touched, jiffies);
 }
 
 static void wq_watchdog_set_thresh(unsigned long thresh)
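
The new check uses time_after() rather than a plain ">" comparison. The userspace sketch below shows why that matters once a tick counter wraps around. wrap_safe_after() and naive_after() are hypothetical helpers written for this illustration; the first follows the signed-difference idea behind time_after() and is not the kernel macro itself.

```c
#include <limits.h>
#include <stdbool.h>

/* Plain comparison: gives the wrong answer once the counter has wrapped. */
static bool naive_after(unsigned long a, unsigned long b)
{
	return a > b;
}

/*
 * Signed-difference comparison: true when a is later than b, provided
 * the two values are less than half the counter range apart.
 */
static bool wrap_safe_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* A timestamp taken shortly before the counter wraps to zero. */
static unsigned long stamp_before_wrap(void)
{
	return ULONG_MAX - 5;
}

/* The same counter 10 ticks later; unsigned arithmetic wraps to 4. */
static unsigned long stamp_after_wrap(void)
{
	return stamp_before_wrap() + 10;
}
```

Here stamp_after_wrap() is numerically smaller than stamp_before_wrap(), so naive_after() reports that no time has passed, while wrap_safe_after() still orders the two timestamps correctly.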
