Patch series "sysfs: add counters for lockups and stalls", v2.
Commits 9db89b4 ("exit: Expose "oops_count" to sysfs") and
8b05aa2 ("panic: Expose "warn_count" to sysfs") added counters for
oopses and warnings to sysfs, and these two patches do the same for
hard/soft lockups and RCU stalls.
All of these counters are useful for monitoring tools to detect whether
the machine is healthy. If the kernel has experienced a lockup or a
stall, it's probably due to a kernel bug, and I'd like to detect that
quickly and easily. There is currently no way to detect that other than
parsing dmesg or observing indirect effects, such as certain tasks not
responding. But then I need to observe all tasks, and it may take a while
until these effects become visible/measurable. I'd rather be able to
detect the primary cause more quickly, possibly before everything falls
apart.
This patch (of 2):
There are /proc/sys/kernel/hung_task_detect_count, /sys/kernel/warn_count
and /sys/kernel/oops_count, but there is no userspace-accessible counter
for hard/soft lockups. Having one is useful for monitoring tools.
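As a rough illustration of how a monitoring tool might consume such counters, here is a minimal Python sketch. It reads only the three files named above; the paths of the new lockup counters are not spelled out in this message, so they are deliberately left out and would have to be added to the table. The names COUNTER_PATHS, read_counter, read_counters and increased are hypothetical helpers for this example, not part of any kernel or library API.

```python
from pathlib import Path

# Counter files named in the commit message. The new lockup counter
# paths are not given there, so they are not listed in this sketch.
COUNTER_PATHS = {
    "hung_task_detect_count": "/proc/sys/kernel/hung_task_detect_count",
    "warn_count": "/sys/kernel/warn_count",
    "oops_count": "/sys/kernel/oops_count",
}


def read_counter(path):
    """Return the integer stored in `path`, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Missing file (older kernel, option disabled) or unexpected content.
        return None


def read_counters(paths=COUNTER_PATHS):
    """Take one snapshot of all configured counters."""
    return {name: read_counter(path) for name, path in paths.items()}


def increased(before, after):
    """Return the sorted names of counters that grew between two snapshots."""
    return sorted(
        name
        for name, new in after.items()
        if new is not None
        and before.get(name) is not None
        and new > before[name]
    )


if __name__ == "__main__":
    for name, value in read_counters().items():
        print(f"{name}: {'unavailable' if value is None else value}")
```

A monitoring loop could take a snapshot with read_counters() at a fixed interval and raise an alert whenever increased(previous, current) is non-empty, which avoids parsing dmesg.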
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Max Kellermann <[email protected]>
Cc:
Cc: Corey Minyard <[email protected]>
Cc: Doug Anderson <[email protected]>
Cc: Joel Granados <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>