Commit 9efcdaa

Ganesh Goudar authored and mpe committed
powerpc/eeh: Set channel state after notifying the drivers
When a PCI error is encountered for the 6th time in an hour, we set the channel state to perm_failure and notify the driver about the permanent failure.

However, after upstream commit 38ddc01 ("powerpc/eeh: Make permanently failed devices non-actionable"), the EEH handler stops calling any routine once the device is marked as permanently failed. This can lead to fatal consequences, such as a kernel hang, with certain PCI devices.

The following logs are observed with the lpfc driver, without and with this change. Without this change, the kernel hangs if a PCI error is encountered 6 times for a device in an hour.

Without the change

 EEH: Beginning: 'error_detected(permanent failure)'
 PCI 0132:60:00.0#600000: EEH: not actionable (1,1,1)
 PCI 0132:60:00.1#600000: EEH: not actionable (1,1,1)
 EEH: Finished:'error_detected(permanent failure)'

With the change

 EEH: Beginning: 'error_detected(permanent failure)'
 EEH: Invoking lpfc->error_detected(permanent failure)
 EEH: lpfc driver reports: 'disconnect'
 EEH: Invoking lpfc->error_detected(permanent failure)
 EEH: lpfc driver reports: 'disconnect'
 EEH: Finished:'error_detected(permanent failure)'

To fix the issue, set the channel state to permanent failure after notifying the drivers.

Fixes: 38ddc01 ("powerpc/eeh: Make permanently failed devices non-actionable")
Suggested-by: Mahesh Salgaonkar <[email protected]>
Signed-off-by: Ganesh Goudar <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
1 parent 4f11410 commit 9efcdaa

1 file changed

+2
-2
lines changed
arch/powerpc/kernel/eeh_driver.c

Lines changed: 2 additions & 2 deletions
@@ -1065,10 +1065,10 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 	eeh_slot_error_detail(pe, EEH_LOG_PERM);
 
 	/* Notify all devices that they're about to go down. */
-	eeh_set_channel_state(pe, pci_channel_io_perm_failure);
 	eeh_set_irq_state(pe, false);
 	eeh_pe_report("error_detected(permanent failure)", pe,
 		      eeh_report_failure, NULL);
+	eeh_set_channel_state(pe, pci_channel_io_perm_failure);
 
 	/* Mark the PE to be removed permanently */
 	eeh_pe_state_mark(pe, EEH_PE_REMOVED);
@@ -1185,10 +1185,10 @@ void eeh_handle_special_event(void)
 
 	/* Notify all devices to be down */
 	eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true);
-	eeh_set_channel_state(pe, pci_channel_io_perm_failure);
 	eeh_pe_report(
 		"error_detected(permanent failure)", pe,
 		eeh_report_failure, NULL);
+	eeh_set_channel_state(pe, pci_channel_io_perm_failure);
 
 	pci_lock_rescan_remove();
 	list_for_each_entry(hose, &hose_list, list_node) {
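
The ordering problem fixed by both hunks can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `struct mock_dev`, `mock_dev_actionable()`, `mock_report_failure()`, `notify_old_order()` and `notify_new_order()` are hypothetical names, and the "not actionable" gate is a simplified stand-in for the behaviour the commit message attributes to commit 38ddc01.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-in for the PCI channel states (hypothetical names). */
enum mock_channel_state {
	MOCK_CHANNEL_IO_NORMAL,
	MOCK_CHANNEL_IO_FROZEN,
	MOCK_CHANNEL_IO_PERM_FAILURE,
};

struct mock_dev {
	enum mock_channel_state state;
	int notified;	/* number of times the driver callback ran */
};

/* Assumed gate: a permanently failed device is no longer actionable. */
bool mock_dev_actionable(const struct mock_dev *dev)
{
	return dev->state != MOCK_CHANNEL_IO_PERM_FAILURE;
}

/* Models the report step: skip devices that are not actionable. */
void mock_report_failure(struct mock_dev *dev)
{
	assert(dev != NULL);
	if (!mock_dev_actionable(dev)) {
		printf("EEH (model): not actionable\n");
		return;
	}
	dev->notified++;
	printf("EEH (model): driver notified of permanent failure\n");
}

/* Order before the change: mark first, then try to notify. */
int notify_old_order(struct mock_dev *dev)
{
	dev->state = MOCK_CHANNEL_IO_PERM_FAILURE;
	mock_report_failure(dev);	/* skipped: already marked */
	return dev->notified;
}

/* Order after the change: notify first, then mark. */
int notify_new_order(struct mock_dev *dev)
{
	mock_report_failure(dev);	/* runs: still actionable */
	dev->state = MOCK_CHANNEL_IO_PERM_FAILURE;
	return dev->notified;
}
```

In this model, calling `notify_old_order()` on a frozen device leaves its `notified` count at 0, while `notify_new_order()` raises it to 1. Both leave the device in the permanent-failure state, which mirrors the two logs quoted in the commit message.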