@@ -390,3 +390,91 @@ for example, "P3421".
390
390
391
391
It is entirely possible to see stall warnings from normal and from
392
392
expedited grace periods at about the same time during the same run.
393
+
394
+ RCU_CPU_STALL_CPUTIME
395
+ =====================
396
+
397
+ In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted with
398
+ rcupdate.rcu_cpu_stall_cputime=1, the following additional information
399
+ is supplied with each RCU CPU stall warning::
400
+
401
+ rcu: hardirqs softirqs csw/system
402
+ rcu: number: 624 45 0
403
+ rcu: cputime: 69 1 2425 ==> 2500(ms)
404
+
405
+ These statistics are collected during the sampling period. The values
406
+ in row "number:" are the number of hard interrupts, number of soft
407
+ interrupts, and number of context switches on the stalled CPU. The
408
+ first three values in row "cputime:" indicate the CPU time in
409
+ milliseconds consumed by hard interrupts, soft interrupts, and tasks
410
+ on the stalled CPU. The last number is the measurement interval, again
411
+ in milliseconds. Because user-mode tasks normally do not cause RCU CPU
412
+ stalls, these tasks are typically kernel tasks, which is why only the
413
+ system CPU time are considered.
414
+
415
+ The sampling period is shown as follows:
416
+ :<------------first timeout---------->:<-----second timeout----->:
417
+ :<--half timeout-->:<--half timeout-->: :
418
+ : :<--first period-->: :
419
+ : :<-----------second sampling period---------->:
420
+ : : : :
421
+ : snapshot time point 1st-stall 2nd-stall
422
+
423
+
424
+ The following describes four typical scenarios:
425
+
426
+ 1. A CPU looping with interrupts disabled.::
427
+
428
+ rcu: hardirqs softirqs csw/system
429
+ rcu: number: 0 0 0
430
+ rcu: cputime: 0 0 0 ==> 2500(ms)
431
+
432
+ Because interrupts have been disabled throughout the measurement
433
+ interval, there are no interrupts and no context switches.
434
+ Furthermore, because CPU time consumption was measured using interrupt
435
+ handlers, the system CPU consumption is misleadingly measured as zero.
436
+ This scenario will normally also have "(0 ticks this GP)" printed on
437
+ this CPU's summary line.
438
+
439
+ 2. A CPU looping with bottom halves disabled.
440
+
441
+ This is similar to the previous example, but with non-zero number of
442
+ and CPU time consumed by hard interrupts, along with non-zero CPU
443
+ time consumed by in-kernel execution.::
444
+
445
+ rcu: hardirqs softirqs csw/system
446
+ rcu: number: 624 0 0
447
+ rcu: cputime: 49 0 2446 ==> 2500(ms)
448
+
449
+ The fact that there are zero softirqs gives a hint that these were
450
+ disabled, perhaps via local_bh_disable(). It is of course possible
451
+ that there were no softirqs, perhaps because all events that would
452
+ result in softirq execution are confined to other CPUs. In this case,
453
+ the diagnosis should continue as shown in the next example.
454
+
455
+ 3. A CPU looping with preemption disabled.
456
+
457
+ Here, only the number of context switches is zero.::
458
+
459
+ rcu: hardirqs softirqs csw/system
460
+ rcu: number: 624 45 0
461
+ rcu: cputime: 69 1 2425 ==> 2500(ms)
462
+
463
+ This situation hints that the stalled CPU was looping with preemption
464
+ disabled.
465
+
466
+ 4. No looping, but massive hard and soft interrupts.::
467
+
468
+ rcu: hardirqs softirqs csw/system
469
+ rcu: number: xx xx 0
470
+ rcu: cputime: xx xx 0 ==> 2500(ms)
471
+
472
+ Here, the number and CPU time of hard interrupts are all non-zero,
473
+ but the number of context switches and the in-kernel CPU time consumed
474
+ are zero. The number and cputime of soft interrupts will usually be
475
+ non-zero, but could be zero, for example, if the CPU was spinning
476
+ within a single hard interrupt handler.
477
+
478
+ If this type of RCU CPU stall warning can be reproduced, you can
479
+ narrow it down by looking at /proc/interrupts or by writing code to
480
+ trace each interrupt, for example, by referring to show_interrupts().
0 commit comments