Commit c926087

rikvanriel authored and suryasaimadhu committed
x86/mm: Print likely CPU at segfault time
In a large enough fleet of computers, it is common to have a few bad CPUs. Those can often be identified by seeing that some commonly run kernel code, which runs fine everywhere else, keeps crashing on the same CPU core on one particular bad system.

However, the failure modes in CPUs that have gone bad over the years are often oddly specific, and the only bad behavior seen might be segfaults in programs like bash, python, or various system daemons that run fine everywhere else.

Add a printk() to show_signal_msg() to print the CPU, core, and socket at segfault time. This is not perfect, since the task might get rescheduled on another CPU between when the fault hit, and when the message is printed, but in practice this has been good enough to help people identify several bad CPU cores.

For example:

  segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in \
	  segfault[401000+1000] likely on CPU 0 (core 0, socket 0)

This printk can be controlled through /proc/sys/debug/exception-trace.

  [ bp: Massage a bit, add "likely" to the printed line to denote that
    the CPU number is not always reliable. ]

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
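
To act on the new suffix across a fleet, the logged lines have to be turned back into numbers. The following is a minimal userspace sketch, not part of the commit, that extracts the CPU, core, and socket from one log line in the format shown in the example above. The names `struct segv_cpu` and `parse_segv_cpu` are hypothetical, and the sketch assumes the whole message arrives as a single line (the backslash in the example is only a line wrap in the commit message).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three values the new printk appends. */
struct segv_cpu {
	int cpu;
	int core;
	int socket;
};

/*
 * Search one log line for the "likely on CPU N (core N, socket N)" suffix.
 * Returns 1 and fills *out when all three numbers are found, 0 otherwise.
 */
static int parse_segv_cpu(const char *line, struct segv_cpu *out)
{
	const char *p = strstr(line, "likely on CPU ");

	if (!p)
		return 0;

	return sscanf(p, "likely on CPU %d (core %d, socket %d)",
		      &out->cpu, &out->core, &out->socket) == 3;
}
```

Feeding every segfault line from a host's kernel log through `parse_segv_cpu()` and counting hits per (socket, core, CPU) triple is one way to spot a core that shows up far more often than its neighbours. Since the kernel itself labels the value "likely", a single hit is weak evidence; repeated hits on the same core are what matter.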
1 parent 0db7058 commit c926087

1 file changed: +10 -0 lines changed

arch/x86/mm/fault.c

Lines changed: 10 additions & 0 deletions
@@ -769,6 +769,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 		unsigned long address, struct task_struct *tsk)
 {
 	const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
+	/* This is a racy snapshot, but it's better than nothing. */
+	int cpu = raw_smp_processor_id();
 
 	if (!unhandled_signal(tsk, SIGSEGV))
 		return;
@@ -782,6 +784,14 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 
 	print_vma_addr(KERN_CONT " in ", regs->ip);
 
+	/*
+	 * Dump the likely CPU where the fatal segfault happened.
+	 * This can help identify faulty hardware.
+	 */
+	printk(KERN_CONT " likely on CPU %d (core %d, socket %d)", cpu,
+	       topology_core_id(cpu), topology_physical_package_id(cpu));
+
+
 	printk(KERN_CONT "\n");
 
 	show_opcodes(regs, loglvl);

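The commit message notes that this printk is controlled through /proc/sys/debug/exception-trace, so the "likely on CPU" suffix only appears when that sysctl is enabled. The following is a small userspace sketch, not part of the commit, for checking the setting; `read_exception_trace` is a hypothetical helper name, and the sketch assumes the file holds a single integer.

```c
#include <stdio.h>

/*
 * Read an integer sysctl value from the given path.
 * Returns the value on success, or -1 if the file cannot be opened
 * or does not start with an integer.
 */
static int read_exception_trace(const char *path)
{
	FILE *f = fopen(path, "r");
	int value;

	if (!f)
		return -1;

	if (fscanf(f, "%d", &value) != 1)
		value = -1;

	fclose(f);
	return value;
}
```

Calling `read_exception_trace("/proc/sys/debug/exception-trace")` on a machine running a kernel with this change indicates whether segfault lines, including the CPU suffix, should be expected in the kernel log. The path is passed as a parameter so the helper can also be pointed at a test file.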