
Commit a5aabac

jgross1 authored and Ingo Molnar committed
locking/csd_lock: Add more data to CSD lock debugging
In order to help identify problems with IPI handling and remote function execution, add some more data to the IPI debugging code.

There have been multiple reports of CPUs looping for long times (many seconds) in smp_call_function_many() waiting for another CPU executing a function like tlb flushing. Most of these reports have been for cases where the kernel was running as a guest on top of KVM or Xen (there are rumours of that happening under VMWare, too, and even on bare metal).

Finding the root cause hasn't been successful yet, even after more than 2 years of chasing this bug by different developers.

Commit 35feb60 ("kernel/smp: Provide CSD lock timeout diagnostics") tried to address this by adding some debug code and by issuing another IPI when a hang was detected. This helped mitigate the problem (the repeated IPI unlocks the hang), but the root cause is still unknown.

Currently available data suggests that either an IPI wasn't sent when it should have been, or that the IPI didn't result in the target CPU executing the queued function (due to the IPI not reaching the CPU, the IPI handler not being called, or the handler not seeing the queued request).

Try to add more diagnostic data by introducing a global atomic counter which is incremented when doing critical operations (before and after queueing a new request, when sending an IPI, and when dequeueing a request). The counter value is stored in percpu variables which can be printed out when a hang is detected.

The data of the last event (consisting of sequence counter, source CPU, target CPU, and event type) is stored in a global variable. When a new event is to be traced, the data of the last event is stored in the event-related percpu location and the global data is updated with the new event's data. This allows tracking two events in one data location: one by the value of the event data (the event before the current one), and one by the location itself (the current event).

A typical printout with a detected hang will look like this:

    csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for CPU#06 scf_handler_1+0x0/0x50(0xffffa2a881bb1410).
    csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xffffa2a8813823c0) request.
    csd: cnt(00008cc): ffff->0000 dequeue (src cpu 0 == empty)
    csd: cnt(00008cd): ffff->0006 idle
    csd: cnt(0003668): 0001->0006 queue
    csd: cnt(0003669): 0001->0006 ipi
    csd: cnt(0003e0f): 0007->000a queue
    csd: cnt(0003e10): 0001->ffff ping
    csd: cnt(0003e71): 0003->0000 ping
    csd: cnt(0003e72): ffff->0006 gotipi
    csd: cnt(0003e73): ffff->0006 handle
    csd: cnt(0003e74): ffff->0006 dequeue (src cpu 0 == empty)
    csd: cnt(0003e7f): 0004->0006 ping
    csd: cnt(0003e80): 0001->ffff pinged
    csd: cnt(0003eb2): 0005->0001 noipi
    csd: cnt(0003eb3): 0001->0006 queue
    csd: cnt(0003eb4): 0001->0006 noipi
    csd: cnt now: 0003f00

The idea is to print only relevant entries. Those are all events which are associated with the hang (so sender-side events for the source CPU of the hanging request, and receiver-side events for the target CPU), and the related events just before those (for adding data needed to identify a possible race). Printing all available data would be possible, but this would add large amounts of data printed on larger configurations.

Signed-off-by: Juergen Gross <[email protected]>
[ Minor readability edits. Breaks col80 but is far more readable. ]
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
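
As an illustration of the scheme described above, here is a minimal user-space sketch of the sequence-counter idea: a single 64-bit word packs source CPU, destination CPU, event type and a 28-bit counter, is advanced with a compare-and-swap loop, and each call returns the previous value so that a per-CPU slot can hold the prior event while the global word holds the current one. This is not the kernel code: it uses C11 atomics in place of the kernel's READ_ONCE()/cmpxchg(), the names are only modelled on the patch, and u64-wide bit-fields are assumed to be supported by the compiler (as with GCC/Clang).

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Packed event record: who did what to whom, plus a global order. */
    union seq_cnt {
        uint64_t val;
        struct {
            uint64_t src:16;   /* sending CPU, 0xffff == none */
            uint64_t dst:16;   /* target CPU,  0xffff == none */
            uint64_t type:4;   /* queue/ipi/ping/... */
            uint64_t cnt:28;   /* global sequence number */
        } u;
    };

    static _Atomic uint64_t seq;   /* global sequence word */

    /*
     * Bump the counter and tag the new event; return the *previous* word,
     * so the caller's per-CPU slot ends up holding the prior event while
     * the global word holds the current one ("two events per location").
     */
    static uint64_t seq_inc(unsigned int src, unsigned int dst, unsigned int type)
    {
        union seq_cnt old, new;

        new.u.src = src;
        new.u.dst = dst;
        new.u.type = type;
        do {
            old.val = atomic_load(&seq);
            new.u.cnt = old.u.cnt + 1;
        } while (!atomic_compare_exchange_weak(&seq, &old.val, new.val));

        return old.val;
    }

    int main(void)
    {
        uint64_t queue_slot = seq_inc(1, 6, 0); /* CPU1 queues work for CPU6 */
        /* CPU1 sends the IPI; the call hands back the queue event above. */
        union seq_cnt prev = { .val = seq_inc(1, 6, 1) };

        printf("cnt(%07x): %04x->%04x type %u\n",
               (unsigned int)prev.u.cnt, (unsigned int)prev.u.src,
               (unsigned int)prev.u.dst, (unsigned int)prev.u.type);
        (void)queue_slot;
        return 0;
    }

In the actual patch the returned value goes into a per-CPU field such as pcpu->seq_queue via the cfd_seq_store() macro, and csd_lock_print_extended() later merges those per-CPU snapshots into the sorted event list shown in the printout above.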
1 parent de7b09e commit a5aabac

2 files changed: +226 -4 lines changed


Documentation/admin-guide/kernel-parameters.txt

Lines changed: 4 additions & 0 deletions
@@ -789,6 +789,10 @@
             printed to the console in case a hanging CPU is
             detected, and that CPU is pinged again in order to try
             to resolve the hang situation.
+            0: disable csdlock debugging (default)
+            1: enable basic csdlock debugging (minor impact)
+            ext: enable extended csdlock debugging (more impact,
+                 but more data)
 
     dasd=       [HW,NET]
             See header of drivers/s390/block/dasd_devmap.c.
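
As a usage illustration (a hypothetical command line: the image path and other parameters are made up, only csdlock_debug itself comes from the documentation above), the extended debug code is activated by booting with csdlock_debug=ext, for example:

    linux /boot/vmlinuz root=/dev/sda1 ro quiet csdlock_debug=ext

csdlock_debug=1 keeps only the basic hang detection and re-ping behaviour documented above, while csdlock_debug=ext additionally records and prints the per-event sequence counters added by this patch.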

kernel/smp.c

Lines changed: 222 additions & 4 deletions
@@ -31,8 +31,59 @@
 
 #define CSD_TYPE(_csd) ((_csd)->node.u_flags & CSD_FLAG_TYPE_MASK)
 
+#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
+union cfd_seq_cnt {
+    u64 val;
+    struct {
+        u64 src:16;
+        u64 dst:16;
+#define CFD_SEQ_NOCPU   0xffff
+        u64 type:4;
+#define CFD_SEQ_QUEUE   0
+#define CFD_SEQ_IPI     1
+#define CFD_SEQ_NOIPI   2
+#define CFD_SEQ_PING    3
+#define CFD_SEQ_PINGED  4
+#define CFD_SEQ_HANDLE  5
+#define CFD_SEQ_DEQUEUE 6
+#define CFD_SEQ_IDLE    7
+#define CFD_SEQ_GOTIPI  8
+#define CFD_SEQ_HDLEND  9
+        u64 cnt:28;
+    } u;
+};
+
+static char *seq_type[] = {
+    [CFD_SEQ_QUEUE]   = "queue",
+    [CFD_SEQ_IPI]     = "ipi",
+    [CFD_SEQ_NOIPI]   = "noipi",
+    [CFD_SEQ_PING]    = "ping",
+    [CFD_SEQ_PINGED]  = "pinged",
+    [CFD_SEQ_HANDLE]  = "handle",
+    [CFD_SEQ_DEQUEUE] = "dequeue (src CPU 0 == empty)",
+    [CFD_SEQ_IDLE]    = "idle",
+    [CFD_SEQ_GOTIPI]  = "gotipi",
+    [CFD_SEQ_HDLEND]  = "hdlend (src CPU 0 == early)",
+};
+
+struct cfd_seq_local {
+    u64 ping;
+    u64 pinged;
+    u64 handle;
+    u64 dequeue;
+    u64 idle;
+    u64 gotipi;
+    u64 hdlend;
+};
+#endif
+
 struct cfd_percpu {
     call_single_data_t csd;
+#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
+    u64 seq_queue;
+    u64 seq_ipi;
+    u64 seq_noipi;
+#endif
 };
 
 struct call_function_data {
@@ -108,12 +159,18 @@ void __init call_function_init(void)
 #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
 
 static DEFINE_STATIC_KEY_FALSE(csdlock_debug_enabled);
+static DEFINE_STATIC_KEY_FALSE(csdlock_debug_extended);
 
 static int __init csdlock_debug(char *str)
 {
     unsigned int val = 0;
 
-    get_option(&str, &val);
+    if (str && !strcmp(str, "ext")) {
+        val = 1;
+        static_branch_enable(&csdlock_debug_extended);
+    } else
+        get_option(&str, &val);
+
     if (val)
         static_branch_enable(&csdlock_debug_enabled);
 
@@ -124,9 +181,34 @@ early_param("csdlock_debug", csdlock_debug);
 static DEFINE_PER_CPU(call_single_data_t *, cur_csd);
 static DEFINE_PER_CPU(smp_call_func_t, cur_csd_func);
 static DEFINE_PER_CPU(void *, cur_csd_info);
+static DEFINE_PER_CPU(struct cfd_seq_local, cfd_seq_local);
 
 #define CSD_LOCK_TIMEOUT (5ULL * NSEC_PER_SEC)
 static atomic_t csd_bug_count = ATOMIC_INIT(0);
+static u64 cfd_seq;
+
+#define CFD_SEQ(s, d, t, c) \
+    (union cfd_seq_cnt){ .u.src = s, .u.dst = d, .u.type = t, .u.cnt = c }
+
+static u64 cfd_seq_inc(unsigned int src, unsigned int dst, unsigned int type)
+{
+    union cfd_seq_cnt new, old;
+
+    new = CFD_SEQ(src, dst, type, 0);
+
+    do {
+        old.val = READ_ONCE(cfd_seq);
+        new.u.cnt = old.u.cnt + 1;
+    } while (cmpxchg(&cfd_seq, old.val, new.val) != old.val);
+
+    return old.val;
+}
+
+#define cfd_seq_store(var, src, dst, type)                      \
+    do {                                                        \
+        if (static_branch_unlikely(&csdlock_debug_extended))    \
+            var = cfd_seq_inc(src, dst, type);                  \
+    } while (0)
 
 /* Record current CSD work for current CPU, NULL to erase. */
 static void __csd_lock_record(call_single_data_t *csd)
@@ -160,6 +242,80 @@ static int csd_lock_wait_getcpu(call_single_data_t *csd)
     return -1;
 }
 
+static void cfd_seq_data_add(u64 val, unsigned int src, unsigned int dst,
+                             unsigned int type, union cfd_seq_cnt *data,
+                             unsigned int *n_data, unsigned int now)
+{
+    union cfd_seq_cnt new[2];
+    unsigned int i, j, k;
+
+    new[0].val = val;
+    new[1] = CFD_SEQ(src, dst, type, new[0].u.cnt + 1);
+
+    for (i = 0; i < 2; i++) {
+        if (new[i].u.cnt <= now)
+            new[i].u.cnt |= 0x80000000U;
+        for (j = 0; j < *n_data; j++) {
+            if (new[i].u.cnt == data[j].u.cnt) {
+                /* Direct read value trumps generated one. */
+                if (i == 0)
+                    data[j].val = new[i].val;
+                break;
+            }
+            if (new[i].u.cnt < data[j].u.cnt) {
+                for (k = *n_data; k > j; k--)
+                    data[k].val = data[k - 1].val;
+                data[j].val = new[i].val;
+                (*n_data)++;
+                break;
+            }
+        }
+        if (j == *n_data) {
+            data[j].val = new[i].val;
+            (*n_data)++;
+        }
+    }
+}
+
+static const char *csd_lock_get_type(unsigned int type)
+{
+    return (type >= ARRAY_SIZE(seq_type)) ? "?" : seq_type[type];
+}
+
+static void csd_lock_print_extended(call_single_data_t *csd, int cpu)
+{
+    struct cfd_seq_local *seq = &per_cpu(cfd_seq_local, cpu);
+    unsigned int srccpu = csd->node.src;
+    struct call_function_data *cfd = per_cpu_ptr(&cfd_data, srccpu);
+    struct cfd_percpu *pcpu = per_cpu_ptr(cfd->pcpu, cpu);
+    unsigned int now;
+    union cfd_seq_cnt data[2 * ARRAY_SIZE(seq_type)];
+    unsigned int n_data = 0, i;
+
+    data[0].val = READ_ONCE(cfd_seq);
+    now = data[0].u.cnt;
+
+    cfd_seq_data_add(pcpu->seq_queue, srccpu, cpu, CFD_SEQ_QUEUE, data, &n_data, now);
+    cfd_seq_data_add(pcpu->seq_ipi, srccpu, cpu, CFD_SEQ_IPI, data, &n_data, now);
+    cfd_seq_data_add(pcpu->seq_noipi, srccpu, cpu, CFD_SEQ_NOIPI, data, &n_data, now);
+
+    cfd_seq_data_add(per_cpu(cfd_seq_local.ping, srccpu), srccpu, CFD_SEQ_NOCPU, CFD_SEQ_PING, data, &n_data, now);
+    cfd_seq_data_add(per_cpu(cfd_seq_local.pinged, srccpu), srccpu, CFD_SEQ_NOCPU, CFD_SEQ_PINGED, data, &n_data, now);
+
+    cfd_seq_data_add(seq->idle, CFD_SEQ_NOCPU, cpu, CFD_SEQ_IDLE, data, &n_data, now);
+    cfd_seq_data_add(seq->gotipi, CFD_SEQ_NOCPU, cpu, CFD_SEQ_GOTIPI, data, &n_data, now);
+    cfd_seq_data_add(seq->handle, CFD_SEQ_NOCPU, cpu, CFD_SEQ_HANDLE, data, &n_data, now);
+    cfd_seq_data_add(seq->dequeue, CFD_SEQ_NOCPU, cpu, CFD_SEQ_DEQUEUE, data, &n_data, now);
+    cfd_seq_data_add(seq->hdlend, CFD_SEQ_NOCPU, cpu, CFD_SEQ_HDLEND, data, &n_data, now);
+
+    for (i = 0; i < n_data; i++) {
+        pr_alert("\tcsd: cnt(%07x): %04x->%04x %s\n",
+                 data[i].u.cnt & ~0x80000000U, data[i].u.src,
+                 data[i].u.dst, csd_lock_get_type(data[i].u.type));
+    }
+    pr_alert("\tcsd: cnt now: %07x\n", now);
+}
+
 /*
  * Complain if too much time spent waiting. Note that only
  * the CSD_TYPE_SYNC/ASYNC types provide the destination CPU,
@@ -209,6 +365,8 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
              *bug_id, !cpu_cur_csd ? "unresponsive" : "handling this request");
     }
     if (cpu >= 0) {
+        if (static_branch_unlikely(&csdlock_debug_extended))
+            csd_lock_print_extended(csd, cpu);
         if (!trigger_single_cpu_backtrace(cpu))
             dump_cpu_task(cpu);
         if (!cpu_cur_csd) {
@@ -252,7 +410,27 @@ static __always_inline void csd_lock_wait(call_single_data_t *csd)
 
     smp_cond_load_acquire(&csd->node.u_flags, !(VAL & CSD_FLAG_LOCK));
 }
+
+static void __smp_call_single_queue_debug(int cpu, struct llist_node *node)
+{
+    unsigned int this_cpu = smp_processor_id();
+    struct cfd_seq_local *seq = this_cpu_ptr(&cfd_seq_local);
+    struct call_function_data *cfd = this_cpu_ptr(&cfd_data);
+    struct cfd_percpu *pcpu = per_cpu_ptr(cfd->pcpu, cpu);
+
+    cfd_seq_store(pcpu->seq_queue, this_cpu, cpu, CFD_SEQ_QUEUE);
+    if (llist_add(node, &per_cpu(call_single_queue, cpu))) {
+        cfd_seq_store(pcpu->seq_ipi, this_cpu, cpu, CFD_SEQ_IPI);
+        cfd_seq_store(seq->ping, this_cpu, cpu, CFD_SEQ_PING);
+        send_call_function_single_ipi(cpu);
+        cfd_seq_store(seq->pinged, this_cpu, cpu, CFD_SEQ_PINGED);
+    } else {
+        cfd_seq_store(pcpu->seq_noipi, this_cpu, cpu, CFD_SEQ_NOIPI);
+    }
+}
 #else
+#define cfd_seq_store(var, src, dst, type)
+
 static void csd_lock_record(call_single_data_t *csd)
 {
 }
@@ -290,6 +468,19 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
 
 void __smp_call_single_queue(int cpu, struct llist_node *node)
 {
+#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
+    if (static_branch_unlikely(&csdlock_debug_extended)) {
+        unsigned int type;
+
+        type = CSD_TYPE(container_of(node, call_single_data_t,
+                         node.llist));
+        if (type == CSD_TYPE_SYNC || type == CSD_TYPE_ASYNC) {
+            __smp_call_single_queue_debug(cpu, node);
+            return;
+        }
+    }
+#endif
+
     /*
      * The list addition should be visible before sending the IPI
      * handler locks the list to pull the entry off it because of
@@ -348,6 +539,8 @@ static int generic_exec_single(int cpu, call_single_data_t *csd)
  */
 void generic_smp_call_function_single_interrupt(void)
 {
+    cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->gotipi, CFD_SEQ_NOCPU,
+                  smp_processor_id(), CFD_SEQ_GOTIPI);
     flush_smp_call_function_queue(true);
 }
 
@@ -375,7 +568,13 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
     lockdep_assert_irqs_disabled();
 
     head = this_cpu_ptr(&call_single_queue);
+    cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->handle, CFD_SEQ_NOCPU,
+                  smp_processor_id(), CFD_SEQ_HANDLE);
     entry = llist_del_all(head);
+    cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->dequeue,
+                  /* Special meaning of source cpu: 0 == queue empty */
+                  entry ? CFD_SEQ_NOCPU : 0,
+                  smp_processor_id(), CFD_SEQ_DEQUEUE);
     entry = llist_reverse_order(entry);
 
     /* There shouldn't be any pending callbacks on an offline CPU. */
@@ -434,8 +633,12 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
         }
     }
 
-    if (!entry)
+    if (!entry) {
+        cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->hdlend,
+                      0, smp_processor_id(),
+                      CFD_SEQ_HDLEND);
         return;
+    }
 
     /*
      * Second; run all !SYNC callbacks.
@@ -473,6 +676,9 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
      */
     if (entry)
        sched_ttwu_pending(entry);
+
+    cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->hdlend, CFD_SEQ_NOCPU,
+                  smp_processor_id(), CFD_SEQ_HDLEND);
 }
 
 void flush_smp_call_function_from_idle(void)
@@ -482,6 +688,8 @@ void flush_smp_call_function_from_idle(void)
     if (llist_empty(this_cpu_ptr(&call_single_queue)))
         return;
 
+    cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->idle, CFD_SEQ_NOCPU,
+                  smp_processor_id(), CFD_SEQ_IDLE);
     local_irq_save(flags);
     flush_smp_call_function_queue(true);
     if (local_softirq_pending())
@@ -698,7 +906,8 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 
     cpumask_clear(cfd->cpumask_ipi);
     for_each_cpu(cpu, cfd->cpumask) {
-        call_single_data_t *csd = &per_cpu_ptr(cfd->pcpu, cpu)->csd;
+        struct cfd_percpu *pcpu = per_cpu_ptr(cfd->pcpu, cpu);
+        call_single_data_t *csd = &pcpu->csd;
 
         if (cond_func && !cond_func(cpu, info))
             continue;
@@ -712,12 +921,21 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
         csd->node.src = smp_processor_id();
         csd->node.dst = cpu;
 #endif
-        if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu)))
+        cfd_seq_store(pcpu->seq_queue, this_cpu, cpu, CFD_SEQ_QUEUE);
+        if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
            __cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+            cfd_seq_store(pcpu->seq_ipi, this_cpu, cpu, CFD_SEQ_IPI);
+        } else {
+            cfd_seq_store(pcpu->seq_noipi, this_cpu, cpu, CFD_SEQ_NOIPI);
+        }
     }
 
     /* Send a message to all CPUs in the map */
+    cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->ping, this_cpu,
+                  CFD_SEQ_NOCPU, CFD_SEQ_PING);
     arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
+    cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->pinged, this_cpu,
+                  CFD_SEQ_NOCPU, CFD_SEQ_PINGED);
 
     if (wait) {
         for_each_cpu(cpu, cfd->cpumask) {