Commit 434fdb5

karolherbst authored and Ben Skeggs committed
drm/nouveau: workaround runpm fail by disabling PCI power management on certain intel bridges
Fixes the infamous 'runtime PM' bug many users are facing on laptops with
Nvidia Pascal GPUs by skipping said PCI power state changes on the GPU.

Depending on the kernel in use, dmesg might contain messages like these:

"nouveau 0000:01:00.0: Refused to change power state, currently in D3"
"nouveau 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)"

followed by backtraces of kernel crashes or timeouts within nouveau.

It's still unknown why this issue exists, but this is a reliable workaround
and solves a very annoying issue for users having to choose between a
crashing kernel and higher power consumption of their laptops.

Signed-off-by: Karol Herbst <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Lyude Paul <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Mika Westerberg <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205623
Signed-off-by: Ben Skeggs <[email protected]>
1 parent bc7b188 commit 434fdb5
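
To check whether a saved kernel log shows the symptom described in the commit message, a line matcher along the following lines may help. This is only a sketch: the function name `is_runpm_failure_line` is hypothetical, and the two substrings are taken verbatim from the messages quoted above. Other kernel versions may word them differently.

```c
#include <string.h>

/*
 * Hypothetical helper: returns 1 if a single log line contains one of the
 * two power-state messages quoted in the commit message, 0 otherwise.
 */
int is_runpm_failure_line(const char *line)
{
	static const char *const needles[] = {
		"Refused to change power state, currently in D3",
		"can't change power state from D3cold to D0 (config space inaccessible)",
	};
	size_t i;

	for (i = 0; i < sizeof(needles) / sizeof(needles[0]); i++) {
		if (strstr(line, needles[i]))
			return 1;
	}
	return 0;
}
```

Feeding each line of a saved `dmesg` output through this function gives a quick yes/no answer. It does not tell you which bridge is involved.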

2 files changed: 65 additions, 0 deletions


drivers/gpu/drm/nouveau/nouveau_drm.c

Lines changed: 63 additions & 0 deletions
@@ -618,6 +618,64 @@ nouveau_drm_device_fini(struct drm_device *dev)
 	kfree(drm);
 }
 
+/*
+ * On some Intel PCIe bridge controllers doing a
+ * D0 -> D3hot -> D3cold -> D0 sequence causes Nvidia GPUs to not reappear.
+ * Skipping the intermediate D3hot step seems to make it work again. This is
+ * probably caused by not meeting the expectation the involved AML code has
+ * when the GPU is put into D3hot state before invoking it.
+ *
+ * This leads to various manifestations of this issue:
+ *  - AML code execution to power on the GPU hits an infinite loop (as the
+ *    code waits on device memory to change).
+ *  - kernel crashes, as all PCI reads return -1, which most code isn't able
+ *    to handle well enough.
+ *
+ * In all cases dmesg will contain at least one line like this:
+ * 'nouveau 0000:01:00.0: Refused to change power state, currently in D3'
+ * followed by a lot of nouveau timeouts.
+ *
+ * In the \_SB.PCI0.PEG0.PG00._OFF code deeper down writes bit 0x80 to the not
+ * documented PCI config space register 0x248 of the Intel PCIe bridge
+ * controller (0x1901) in order to change the state of the PCIe link between
+ * the PCIe port and the GPU. There are alternative code paths using other
+ * registers, which seem to work fine (executed pre Windows 8):
+ *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
+ *  - 0xb0 bit 0x10 (link disable)
+ * Changing the conditions inside the firmware by poking into the relevant
+ * addresses does resolve the issue, but it seemed to be ACPI private memory
+ * and not any device accessible memory at all, so there is no portable way of
+ * changing the conditions.
+ * On a XPS 9560 that means bits [0,3] on \CPEX need to be cleared.
+ *
+ * The only systems where this behavior can be seen are hybrid graphics laptops
+ * with a secondary Nvidia Maxwell, Pascal or Turing GPU. It's unclear whether
+ * this issue only occurs in combination with listed Intel PCIe bridge
+ * controllers and the mentioned GPUs or other devices as well.
+ *
+ * documentation on the PCIe bridge controller can be found in the
+ * "7th Generation Intel® Processor Families for H Platforms Datasheet Volume 2"
+ * Section "12 PCI Express* Controller (x16) Registers"
+ */
+
+static void quirk_broken_nv_runpm(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct nouveau_drm *drm = nouveau_drm(dev);
+	struct pci_dev *bridge = pci_upstream_bridge(pdev);
+
+	if (!bridge || bridge->vendor != PCI_VENDOR_ID_INTEL)
+		return;
+
+	switch (bridge->device) {
+	case 0x1901:
+		drm->old_pm_cap = pdev->pm_cap;
+		pdev->pm_cap = 0;
+		NV_INFO(drm, "Disabling PCI power management to avoid bug\n");
+		break;
+	}
+}
+
 static int nouveau_drm_probe(struct pci_dev *pdev,
 			     const struct pci_device_id *pent)
 {
@@ -699,6 +757,7 @@ static int nouveau_drm_probe(struct pci_dev *pdev,
 	if (ret)
 		goto fail_drm_dev_init;
 
+	quirk_broken_nv_runpm(pdev);
 	return 0;
 
 fail_drm_dev_init:
@@ -734,7 +793,11 @@ static void
 nouveau_drm_remove(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct nouveau_drm *drm = nouveau_drm(dev);
 
+	/* revert our workaround */
+	if (drm->old_pm_cap)
+		pdev->pm_cap = drm->old_pm_cap;
 	nouveau_drm_device_remove(dev);
 	pci_disable_device(pdev);
 }
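
The save, clear, and restore pattern in this file can be modelled outside the kernel. The sketch below is not the driver code: `model_pci_dev`, `model_drm`, `model_quirk_apply` and `model_quirk_revert` are hypothetical stand-ins that keep only the fields the workaround touches. It assumes, as in the diff, that a `pm_cap` value of 0 means "no PCI power management capability".

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_VENDOR_INTEL 0x8086 /* PCI vendor ID of Intel */

/* Hypothetical stand-in for struct pci_dev, reduced to the relevant fields. */
struct model_pci_dev {
	uint16_t vendor;
	uint16_t device;
	uint8_t pm_cap;               /* PM capability offset, 0 = none */
	struct model_pci_dev *bridge; /* upstream bridge, NULL if none */
};

/* Hypothetical stand-in for struct nouveau_drm. */
struct model_drm {
	uint8_t old_pm_cap;           /* saved offset, 0 = quirk not applied */
};

/* Mirrors the probe-time quirk: hide the PM capability behind bridge 0x1901. */
void model_quirk_apply(struct model_drm *drm, struct model_pci_dev *pdev)
{
	struct model_pci_dev *bridge = pdev->bridge;

	if (!bridge || bridge->vendor != MODEL_VENDOR_INTEL)
		return;

	switch (bridge->device) {
	case 0x1901:
		drm->old_pm_cap = pdev->pm_cap;
		pdev->pm_cap = 0;
		break;
	}
}

/* Mirrors the remove-time revert: restore only if something was saved. */
void model_quirk_revert(struct model_drm *drm, struct model_pci_dev *pdev)
{
	if (drm->old_pm_cap)
		pdev->pm_cap = drm->old_pm_cap;
}
```

In this model, a device behind any other bridge keeps its `pm_cap` untouched and `old_pm_cap` stays 0, so the revert does nothing. The diff relies on the same property by using a non-zero `old_pm_cap` as the "quirk was applied" flag.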

drivers/gpu/drm/nouveau/nouveau_drv.h

Lines changed: 2 additions & 0 deletions
@@ -140,6 +140,8 @@ struct nouveau_drm {
 
 	struct list_head clients;
 
+	u8 old_pm_cap;
+
 	struct {
 		struct agp_bridge_data *bridge;
 		u32 base;
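
The new field is a `u8` because `pm_cap` holds a byte offset into the device's PCI configuration space, namely the position of the Power Management capability. As a rough userspace illustration of where such an offset comes from, the sketch below walks the standard capability list in a 256-byte configuration space snapshot. The function name `find_pm_cap` and the `CFG_*` / `CAP_ID_PM` macros are hypothetical names chosen for this example. The register offsets and the capability ID follow the standard PCI layout, and this is not the kernel's own lookup code.

```c
#include <stdint.h>

#define CFG_STATUS          0x06 /* Status register, low byte */
#define CFG_STATUS_CAP_LIST 0x10 /* "capabilities list present" bit */
#define CFG_CAP_PTR         0x34 /* offset of the first capability */
#define CAP_ID_PM           0x01 /* Power Management capability ID */

/*
 * Hypothetical helper: returns the offset of the Power Management
 * capability inside a 256-byte config space snapshot, or 0 if the
 * device does not advertise one.
 */
uint8_t find_pm_cap(const uint8_t cfg[256])
{
	uint8_t pos;
	int ttl = 48; /* bound the walk in case the list loops */

	if (!(cfg[CFG_STATUS] & CFG_STATUS_CAP_LIST))
		return 0;

	/* Capability pointers are dword aligned; mask the low two bits. */
	pos = cfg[CFG_CAP_PTR] & 0xfc;
	while (pos >= 0x40 && ttl-- > 0) {
		if (cfg[pos] == CAP_ID_PM)
			return pos;
		pos = cfg[pos + 1] & 0xfc; /* next capability pointer */
	}
	return 0;
}
```

A return value of 0 doubles as "not found", which matches how the workaround in this commit uses `pdev->pm_cap = 0` to make the device look as if it had no PCI power management capability.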
