Commit eef5c7b

ming4li authored and djbw committed
cxl/pci: Skip to handle RAS errors if CXL.mem device is detached
The PCI AER model is an awkward fit for CXL error handling. While the
expectation is that a PCI device can escalate to link reset to recover
from an AER event, the same reset on CXL amounts to a surprise memory
hotplug of massive amounts of memory.

At present, the CXL error handler attempts some optimistic error
handling to unbind the device from the cxl_mem driver after reaping some
RAS register values. This results in a "hopeful" attempt to unplug the
memory, but there is no guarantee that will succeed.

A subsequent AER notification after the memdev unbind event can no
longer assume the registers are mapped. Check for memdev bind before
reaping status register values to avoid crashes of the form:

    BUG: unable to handle page fault for address: ffa00000195e9100
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    [...]
    RIP: 0010:__cxl_handle_ras+0x30/0x110 [cxl_core]
    [...]
    Call Trace:
     <TASK>
     ? __die+0x24/0x70
     ? page_fault_oops+0x82/0x160
     ? kernelmode_fixup_or_oops+0x84/0x110
     ? exc_page_fault+0x113/0x170
     ? asm_exc_page_fault+0x26/0x30
     ? __pfx_dpc_reset_link+0x10/0x10
     ? __cxl_handle_ras+0x30/0x110 [cxl_core]
     ? find_cxl_port+0x59/0x80 [cxl_core]
     cxl_handle_rp_ras+0xbc/0xd0 [cxl_core]
     cxl_error_detected+0x6c/0xf0 [cxl_core]
     report_error_detected+0xc7/0x1c0
     pci_walk_bus+0x73/0x90
     pcie_do_recovery+0x23f/0x330

Longer term, the unbind and PCI_ERS_RESULT_DISCONNECT behavior might
need to be replaced with a new PCI_ERS_RESULT_PANIC.

Fixes: 6ac0788 ("cxl/pci: Add RCH downstream port error logging")
Cc: [email protected]
Suggested-by: Dan Williams <[email protected]>
Signed-off-by: Li Ming <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Dan Williams <[email protected]>
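The "check for memdev bind before reaping status register values" idea can be illustrated outside the kernel. The following is a minimal userspace sketch, not the driver code: `fake_memdev`, `fake_unbind()`, `fake_error_detected()` and the `ERS_RESULT_*` values are hypothetical stand-ins, and a pthread mutex plays the role of the device lock. It assumes, as the patch does, that unbind and the error handler serialize on the same lock.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
enum ers_result { ERS_RESULT_HANDLED, ERS_RESULT_DISCONNECT };

struct fake_memdev {
	pthread_mutex_t lock;	/* plays the role of the device lock */
	bool bound;		/* plays the role of dev->driver != NULL */
	const uint32_t *ras;	/* register mapping, valid only while bound */
};

/* Unbind drops the register mapping while holding the lock. */
static void fake_unbind(struct fake_memdev *md)
{
	pthread_mutex_lock(&md->lock);
	md->bound = false;
	md->ras = NULL;
	pthread_mutex_unlock(&md->lock);
}

/*
 * Error handler: check the bind state under the same lock and only then
 * read the status register, so an unbind cannot race with the read.
 */
static enum ers_result fake_error_detected(struct fake_memdev *md,
					   uint32_t *status)
{
	enum ers_result res = ERS_RESULT_DISCONNECT;

	pthread_mutex_lock(&md->lock);
	if (md->bound) {
		*status = md->ras[0];
		res = ERS_RESULT_HANDLED;
	}
	pthread_mutex_unlock(&md->lock);
	return res;
}
```

Calling `fake_error_detected()` after `fake_unbind()` returns `ERS_RESULT_DISCONNECT` without dereferencing the stale mapping, which is the behavior the patch aims for in `cxl_error_detected()`.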
1 parent 41bccc9 commit eef5c7b

1 file changed: +31 -12 lines changed

drivers/cxl/core/pci.c

Lines changed: 31 additions & 12 deletions
@@ -932,11 +932,21 @@ static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
 void cxl_cor_error_detected(struct pci_dev *pdev)
 {
 	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+	struct device *dev = &cxlds->cxlmd->dev;
+
+	scoped_guard(device, dev) {
+		if (!dev->driver) {
+			dev_warn(&pdev->dev,
+				 "%s: memdev disabled, abort error handling\n",
+				 dev_name(dev));
+			return;
+		}
 
-	if (cxlds->rcd)
-		cxl_handle_rdport_errors(cxlds);
+		if (cxlds->rcd)
+			cxl_handle_rdport_errors(cxlds);
 
-	cxl_handle_endpoint_cor_ras(cxlds);
+		cxl_handle_endpoint_cor_ras(cxlds);
+	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL);
 

@@ -948,16 +958,25 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 	struct device *dev = &cxlmd->dev;
 	bool ue;
 
-	if (cxlds->rcd)
-		cxl_handle_rdport_errors(cxlds);
+	scoped_guard(device, dev) {
+		if (!dev->driver) {
+			dev_warn(&pdev->dev,
+				 "%s: memdev disabled, abort error handling\n",
+				 dev_name(dev));
+			return PCI_ERS_RESULT_DISCONNECT;
+		}
+
+		if (cxlds->rcd)
+			cxl_handle_rdport_errors(cxlds);
+		/*
+		 * A frozen channel indicates an impending reset which is fatal to
+		 * CXL.mem operation, and will likely crash the system. On the off
+		 * chance the situation is recoverable dump the status of the RAS
+		 * capability registers and bounce the active state of the memdev.
+		 */
+		ue = cxl_handle_endpoint_ras(cxlds);
+	}
 
-	/*
-	 * A frozen channel indicates an impending reset which is fatal to
-	 * CXL.mem operation, and will likely crash the system. On the off
-	 * chance the situation is recoverable dump the status of the RAS
-	 * capability registers and bounce the active state of the memdev.
-	 */
-	ue = cxl_handle_endpoint_ras(cxlds);
 
 	switch (state) {
 	case pci_channel_io_normal:
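
Both hunks return from inside a `scoped_guard(device, dev)` block. That is safe because the kernel's guard helpers build on the compiler's `cleanup` variable attribute, so the lock is dropped on every exit from the scope. The following userspace sketch shows the mechanism under stated assumptions: `LOCK_GUARD`, `unlock_cleanup()`, `guarded_read()` and `demo_lock_is_free()` are hypothetical names, a pthread mutex stands in for the device lock, and the code needs GCC or Clang for `__attribute__((cleanup))`.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical guard helper: unlocks the mutex when the variable leaves scope. */
static void unlock_cleanup(pthread_mutex_t **m)
{
	if (*m)
		pthread_mutex_unlock(*m);
}

/* Declares a scope-bound lock; relies on the GCC/Clang cleanup attribute. */
#define LOCK_GUARD(name, mtx) \
	pthread_mutex_t *name __attribute__((cleanup(unlock_cleanup))) = \
		(pthread_mutex_lock(mtx), (mtx))

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns -1 early when "unbound"; the guard still releases the lock. */
static int guarded_read(bool bound, int value)
{
	LOCK_GUARD(g, &demo_lock);

	if (!bound)
		return -1;	/* cleanup handler runs on this path too */
	return value;
}

/* Reports whether demo_lock is currently free by trying to take it. */
static bool demo_lock_is_free(void)
{
	if (pthread_mutex_trylock(&demo_lock) != 0)
		return false;
	pthread_mutex_unlock(&demo_lock);
	return true;
}
```

After either return path of `guarded_read()`, `demo_lock_is_free()` reports the mutex as available, mirroring how the early `return PCI_ERS_RESULT_DISCONNECT` in the patch leaves the device unlocked.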
