Skip to content

Commit ad9001f

Browse files
westeribjorn-helgaas
authored andcommitted
PCI/PM: Add missing link delays required by the PCIe spec
Currently Linux does not follow PCIe spec regarding the required delays after reset. A concrete example is a Thunderbolt add-in-card that consists of a PCIe switch and two PCIe endpoints: +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller +-01.0-[04-36]-- DS hotplug port +-02.0-[37]----00.0 xHCI controller \-04.0-[38-6b]-- DS hotplug port The root port (1b.0) and the PCIe switch downstream ports are all PCIe Gen3 so they support 8GT/s link speeds. We wait for the PCIe hierarchy to enter D3cold (runtime): pcieport 0000:00:1b.0: power state changed by ACPI to D3cold When it wakes up from D3cold, according to the PCIe 5.0 section 5.8 the PCIe switch is put to reset and its power is re-applied. This means that we must follow the rules in PCIe 5.0 section 6.6.1. For the PCIe Gen3 ports we are dealing with here, the following applies: With a Downstream Port that supports Link speeds greater than 5.0 GT/s, software must wait a minimum of 100 ms after Link training completes before sending a Configuration Request to the device immediately below that Port. Software can determine when Link training completes by polling the Data Link Layer Link Active bit or by setting up an associated interrupt (see Section 6.7.3.3). Translating this into the above topology we would need to do this (DLLLA stands for Data Link Layer Link Active): 0000:00:1b.0: wait for 100 ms after DLLLA is set before access to 0000:01:00.0 0000:02:00.0: wait for 100 ms after DLLLA is set before access to 0000:03:00.0 0000:02:02.0: wait for 100 ms after DLLLA is set before access to 0000:37:00.0 I've instrumented the kernel with some additional logging so we can see the actual delays performed: pcieport 0000:00:1b.0: power state changed by ACPI to D0 pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms For the switch upstream port (01:00.0 reachable through 00:1b.0 root port) we wait for 100 ms but not taking into account the DLLLA requirement. We then wait 10 ms for D3hot -> D0 transition of the root port and the two downstream hotplug ports. This means that we deviate from what the spec requires. Performing the same check for system sleep (s2idle) transitions it turns out to be even worse. None of the mandatory delays are performed. If this would be S3 instead of s2idle then according to PCI FW spec 3.2 section 4.6.8. there is a specific _DSM that allows the OS to skip the delays but this platform does not provide the _DSM and does not go to S3 anyway so no firmware is involved that could already handle these delays. On this particular platform these delays are not actually needed because there is an additional delay as part of the ACPI power resource that is used to turn on power to the hierarchy but since that additional delay is not required by any of standards (PCIe, ACPI) it is not present in the Intel Ice Lake, for example where missing the mandatory delays causes pciehp to start tearing down the stack too early (links are not yet trained). Below is an example how it looks like when this happens: pcieport 0000:83:04.0: pciehp: Slot(4): Card not present pcieport 0000:87:04.0: PME# disabled pcieport 0000:83:04.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:86:00 pcieport 0000:86:00.0: Refused to change power state, currently in D3 pcieport 0000:86:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x201ff) pcieport 0000:86:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0) ... There is also one reported case (see the bugzilla link below) where the missing delay causes xHCI on a Titan Ridge controller fail to runtime resume when USB-C dock is plugged. This does not involve pciehp but instead the PCI core fails to runtime resume the xHCI device: pcieport 0000:04:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020) pcieport 0000:04:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406) xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3 xhci_hcd 0000:39:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x1ff) xhci_hcd 0000:39:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0) ... Add a new function pci_bridge_wait_for_secondary_bus() that is called on PCI core resume and runtime resume paths accordingly if the bridge entered D3cold (and thus went through reset). This is second attempt to add the missing delays. The previous solution in c2bf1fc ("PCI: Add missing link delays required by the PCIe spec") was reverted because of two issues it caused: 1. One system become unresponsive after S3 resume due to PME service spinning in pcie_pme_work_fn(). The root port in question reports that the xHCI sent PME but the xHCI device itself does not have PME status set. The PME status bit is never cleared in the root port resulting the indefinite loop in pcie_pme_work_fn(). 2. Slows down resume if the root/downstream port does not support Data Link Layer Active Reporting because pcie_wait_for_link_delay() waits 1100 ms in that case. This version should avoid the above issues because we restrict the delay to happen only if the port went into D3cold. Link: https://lore.kernel.org/linux-pci/SL2P216MB01878BBCD75F21D882AEEA2880C60@SL2P216MB0187.KORP216.PROD.OUTLOOK.COM/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=203885 Link: https://lore.kernel.org/r/[email protected] Reported-by: Kai-Heng Feng <[email protected]> Tested-by: Kai-Heng Feng <[email protected]> Signed-off-by: Mika Westerberg <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]>
1 parent 4827d63 commit ad9001f

File tree

3 files changed

+133
-7
lines changed

3 files changed

+133
-7
lines changed

drivers/pci/pci-driver.c

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -890,6 +890,8 @@ static int pci_pm_resume_noirq(struct device *dev)
890890
{
891891
struct pci_dev *pci_dev = to_pci_dev(dev);
892892
const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
893+
pci_power_t prev_state = pci_dev->current_state;
894+
bool skip_bus_pm = pci_dev->skip_bus_pm;
893895

894896
if (dev_pm_may_skip_resume(dev))
895897
return 0;
@@ -908,12 +910,15 @@ static int pci_pm_resume_noirq(struct device *dev)
908910
* configuration here and attempting to put them into D0 again is
909911
* pointless, so avoid doing that.
910912
*/
911-
if (!(pci_dev->skip_bus_pm && pm_suspend_no_platform()))
913+
if (!(skip_bus_pm && pm_suspend_no_platform()))
912914
pci_pm_default_resume_early(pci_dev);
913915

914916
pci_fixup_device(pci_fixup_resume_early, pci_dev);
915917
pcie_pme_root_status_cleanup(pci_dev);
916918

919+
if (!skip_bus_pm && prev_state == PCI_D3cold)
920+
pci_bridge_wait_for_secondary_bus(pci_dev);
921+
917922
if (pci_has_legacy_pm_support(pci_dev))
918923
return 0;
919924

@@ -1299,6 +1304,7 @@ static int pci_pm_runtime_resume(struct device *dev)
12991304
{
13001305
struct pci_dev *pci_dev = to_pci_dev(dev);
13011306
const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
1307+
pci_power_t prev_state = pci_dev->current_state;
13021308
int error = 0;
13031309

13041310
/*
@@ -1314,6 +1320,9 @@ static int pci_pm_runtime_resume(struct device *dev)
13141320
pci_fixup_device(pci_fixup_resume_early, pci_dev);
13151321
pci_pm_default_resume(pci_dev);
13161322

1323+
if (prev_state == PCI_D3cold)
1324+
pci_bridge_wait_for_secondary_bus(pci_dev);
1325+
13171326
if (pm && pm->runtime_resume)
13181327
error = pm->runtime_resume(dev);
13191328

drivers/pci/pci.c

Lines changed: 122 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1021,14 +1021,11 @@ int pci_power_up(struct pci_dev *dev)
10211021
pci_platform_power_transition(dev, PCI_D0);
10221022

10231023
/*
1024-
* Mandatory power management transition delays, see PCI Express Base
1025-
* Specification Revision 2.0 Section 6.6.1: Conventional Reset. Do not
1026-
* delay for devices powered on/off by corresponding bridge, because
1027-
* have already delayed for the bridge.
1024+
* Mandatory power management transition delays are handled in
1025+
* pci_pm_resume_noirq() and pci_pm_runtime_resume() of the
1026+
* corresponding bridge.
10281027
*/
10291028
if (dev->runtime_d3cold) {
1030-
if (dev->d3cold_delay && !dev->imm_ready)
1031-
msleep(dev->d3cold_delay);
10321029
/*
10331030
* When powering on a bridge from D3cold, the whole hierarchy
10341031
* may be powered on into D0uninitialized state, resume them to
@@ -4652,6 +4649,125 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
46524649
return pcie_wait_for_link_delay(pdev, active, 100);
46534650
}
46544651

4652+
/*
4653+
* Find maximum D3cold delay required by all the devices on the bus. The
4654+
* spec says 100 ms, but firmware can lower it and we allow drivers to
4655+
* increase it as well.
4656+
*
4657+
* Called with @pci_bus_sem locked for reading.
4658+
*/
4659+
static int pci_bus_max_d3cold_delay(const struct pci_bus *bus)
4660+
{
4661+
const struct pci_dev *pdev;
4662+
int min_delay = 100;
4663+
int max_delay = 0;
4664+
4665+
list_for_each_entry(pdev, &bus->devices, bus_list) {
4666+
if (pdev->d3cold_delay < min_delay)
4667+
min_delay = pdev->d3cold_delay;
4668+
if (pdev->d3cold_delay > max_delay)
4669+
max_delay = pdev->d3cold_delay;
4670+
}
4671+
4672+
return max(min_delay, max_delay);
4673+
}
4674+
4675+
/**
4676+
* pci_bridge_wait_for_secondary_bus - Wait for secondary bus to be accessible
4677+
* @dev: PCI bridge
4678+
*
4679+
* Handle necessary delays before access to the devices on the secondary
4680+
* side of the bridge are permitted after D3cold to D0 transition.
4681+
*
4682+
* For PCIe this means the delays in PCIe 5.0 section 6.6.1. For
4683+
* conventional PCI it means Tpvrh + Trhfa specified in PCI 3.0 section
4684+
* 4.3.2.
4685+
*/
4686+
void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
4687+
{
4688+
struct pci_dev *child;
4689+
int delay;
4690+
4691+
if (pci_dev_is_disconnected(dev))
4692+
return;
4693+
4694+
if (!pci_is_bridge(dev) || !dev->bridge_d3)
4695+
return;
4696+
4697+
down_read(&pci_bus_sem);
4698+
4699+
/*
4700+
* We only deal with devices that are present currently on the bus.
4701+
* For any hot-added devices the access delay is handled in pciehp
4702+
* board_added(). In case of ACPI hotplug the firmware is expected
4703+
* to configure the devices before OS is notified.
4704+
*/
4705+
if (!dev->subordinate || list_empty(&dev->subordinate->devices)) {
4706+
up_read(&pci_bus_sem);
4707+
return;
4708+
}
4709+
4710+
/* Take d3cold_delay requirements into account */
4711+
delay = pci_bus_max_d3cold_delay(dev->subordinate);
4712+
if (!delay) {
4713+
up_read(&pci_bus_sem);
4714+
return;
4715+
}
4716+
4717+
child = list_first_entry(&dev->subordinate->devices, struct pci_dev,
4718+
bus_list);
4719+
up_read(&pci_bus_sem);
4720+
4721+
/*
4722+
* Conventional PCI and PCI-X we need to wait Tpvrh + Trhfa before
4723+
* accessing the device after reset (that is 1000 ms + 100 ms). In
4724+
* practice this should not be needed because we don't do power
4725+
* management for them (see pci_bridge_d3_possible()).
4726+
*/
4727+
if (!pci_is_pcie(dev)) {
4728+
pci_dbg(dev, "waiting %d ms for secondary bus\n", 1000 + delay);
4729+
msleep(1000 + delay);
4730+
return;
4731+
}
4732+
4733+
/*
4734+
* For PCIe downstream and root ports that do not support speeds
4735+
* greater than 5 GT/s need to wait minimum 100 ms. For higher
4736+
* speeds (gen3) we need to wait first for the data link layer to
4737+
* become active.
4738+
*
4739+
* However, 100 ms is the minimum and the PCIe spec says the
4740+
* software must allow at least 1s before it can determine that the
4741+
* device that did not respond is a broken device. There is
4742+
* evidence that 100 ms is not always enough, for example certain
4743+
* Titan Ridge xHCI controller does not always respond to
4744+
* configuration requests if we only wait for 100 ms (see
4745+
* https://bugzilla.kernel.org/show_bug.cgi?id=203885).
4746+
*
4747+
* Therefore we wait for 100 ms and check for the device presence.
4748+
* If it is still not present give it an additional 100 ms.
4749+
*/
4750+
if (!pcie_downstream_port(dev))
4751+
return;
4752+
4753+
if (pcie_get_speed_cap(dev) <= PCIE_SPEED_5_0GT) {
4754+
pci_dbg(dev, "waiting %d ms for downstream link\n", delay);
4755+
msleep(delay);
4756+
} else {
4757+
pci_dbg(dev, "waiting %d ms for downstream link, after activation\n",
4758+
delay);
4759+
if (!pcie_wait_for_link_delay(dev, true, delay)) {
4760+
/* Did not train, no need to wait any further */
4761+
return;
4762+
}
4763+
}
4764+
4765+
if (!pci_device_is_present(child)) {
4766+
pci_dbg(child, "waiting additional %d ms to become accessible\n", delay);
4767+
msleep(delay);
4768+
}
4769+
}
4770+
46554771
void pci_reset_secondary_bus(struct pci_dev *dev)
46564772
{
46574773
u16 ctrl;

drivers/pci/pci.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -104,6 +104,7 @@ void pci_allocate_cap_save_buffers(struct pci_dev *dev);
104104
void pci_free_cap_save_buffers(struct pci_dev *dev);
105105
bool pci_bridge_d3_possible(struct pci_dev *dev);
106106
void pci_bridge_d3_update(struct pci_dev *dev);
107+
void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev);
107108

108109
static inline void pci_wakeup_event(struct pci_dev *dev)
109110
{

0 commit comments

Comments
 (0)