|
| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | +
|
| 3 | +Debugging AMD Zen systems |
| 4 | ++++++++++++++++++++++++++ |
| 5 | + |
| 6 | +Introduction |
| 7 | +============ |
| 8 | + |
| 9 | +This document describes techniques that are useful for debugging issues with |
| 10 | +AMD Zen systems. It is intended for use by developers and technical users |
| 11 | +to help identify and resolve issues. |
| 12 | + |
| 13 | +S3 vs s2idle |
| 14 | +============ |
| 15 | + |
| 16 | +On AMD systems, it's not possible to simultaneously support suspend-to-RAM (S3) |
| 17 | +and suspend-to-idle (s2idle). To confirm which mode your system supports, |
| 18 | +run ``cat /sys/power/mem_sleep``. If it shows ``s2idle [deep]`` then |
| 19 | +*S3* is supported. If it shows ``[s2idle]`` then *s2idle* is |
| 20 | +supported. |
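
The check above can be scripted. The following is a minimal sketch, not part of the original patch; the function names ``parse_mem_sleep`` and ``supported_suspend`` are hypothetical, and the rule that a listed ``deep`` entry means *S3* is taken directly from the paragraph above.

```python
def parse_mem_sleep(text):
    """Return (active_mode, all_modes) from the contents of /sys/power/mem_sleep.

    The active mode is the entry wrapped in square brackets.
    """
    modes = []
    active = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        modes.append(token)
    return active, modes


def supported_suspend(text):
    """Apply the rule from the text: if 'deep' is listed, S3 is supported,
    otherwise only s2idle is."""
    _, modes = parse_mem_sleep(text)
    return "S3" if "deep" in modes else "s2idle"
```

On a real system the input would come from reading ``/sys/power/mem_sleep``, for example ``supported_suspend(open("/sys/power/mem_sleep").read())``.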
| 21 | + |
| 22 | +On systems that support *S3*, the firmware will be utilized to put all hardware into |
| 23 | +the appropriate low power state. |
| 24 | + |
| 25 | +On systems that support *s2idle*, the kernel will be responsible for transitioning devices |
| 26 | +into the appropriate low power state. When all devices are in the appropriate low |
| 27 | +power state, the hardware will transition into a hardware sleep state. |
| 28 | + |
| 29 | +After a suspend cycle you can tell how much time was spent in a hardware sleep |
| 30 | +state by running ``cat /sys/power/suspend_stats/last_hw_sleep``. |
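
To put that number in context, it can be compared against the length of the suspend cycle. This is a small sketch under two assumptions: that ``last_hw_sleep`` is reported in microseconds (as the sysfs ABI documentation describes it), and that you measured the wall-clock suspend duration yourself. The helper name ``hw_sleep_summary`` is hypothetical.

```python
def hw_sleep_summary(last_hw_sleep_us, suspend_duration_s):
    """Return (seconds in hardware sleep, percentage of the suspend cycle).

    last_hw_sleep_us: value read from last_hw_sleep (assumed microseconds).
    suspend_duration_s: wall-clock length of the suspend cycle in seconds.
    """
    hw_sleep_s = last_hw_sleep_us / 1_000_000
    if suspend_duration_s <= 0:
        return hw_sleep_s, 0.0
    return hw_sleep_s, 100.0 * hw_sleep_s / suspend_duration_s
```

A low percentage suggests the system spent most of the cycle outside the hardware sleep state.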
| 31 | + |
| 32 | +This flowchart explains how the AMD s2idle suspend flow works. |
| 33 | + |
| 34 | +.. kernel-figure:: suspend.svg |
| 35 | + |
| 36 | +This flowchart explains how the AMD s2idle resume flow works. |
| 37 | + |
| 38 | +.. kernel-figure:: resume.svg |
| 39 | + |
| 40 | +s2idle debugging tool |
| 41 | +===================== |
| 42 | + |
| 43 | +Because problems can occur in many places, a debugging tool has been |
| 44 | +created at |
| 45 | +`amd-debug-tools <https://git.kernel.org/pub/scm/linux/kernel/git/superm1/amd-debug-tools.git/about/>`_ |
| 46 | +that can help test for common problems and offer suggestions. |
| 47 | + |
| 48 | +If you have an s2idle issue, it's best to start with this tool and follow the |
| 49 | +instructions from its findings. If the issue persists, file a bug with the |
| 50 | +report generated by this tool at |
| 51 | +`drm/amd gitlab <https://gitlab.freedesktop.org/drm/amd/-/issues/new?issuable_template=s2idle_BUG_TEMPLATE>`_. |
| 52 | + |
| 53 | +Spurious s2idle wakeups from an IRQ |
| 54 | +=================================== |
| 55 | +Spurious wakeups will generally have an IRQ reported in ``/sys/power/pm_wakeup_irq``. |
| 56 | +This can be matched to ``/proc/interrupts`` to determine what device woke the system. |
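
The matching step can be sketched as follows. This is an illustration only: ``describe_irq`` is a hypothetical helper, and it assumes the usual ``/proc/interrupts`` layout of an IRQ label, per-CPU counters, then the controller and device names.

```python
def describe_irq(irq, interrupts_text):
    """Return the description of IRQ `irq` from /proc/interrupts text, or None.

    The description is everything after the per-CPU counters on that row.
    """
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or label.strip() != str(irq):
            continue
        fields = rest.split()
        # Skip the per-CPU interrupt counters, which are plain integers.
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1
        return " ".join(fields[i:])
    return None
```

On a real system, the IRQ number would be read from ``/sys/power/pm_wakeup_irq`` after a resume and the text from ``/proc/interrupts``.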
| 57 | + |
| 58 | +If this isn't enough to debug the problem, then the following sysfs files |
| 59 | +can be set to add more verbosity to the wakeup process: :: |
| 60 | + |
| 61 | + $ echo 1 | sudo tee /sys/power/pm_debug_messages |
| 62 | + $ echo 1 | sudo tee /sys/power/pm_print_times |
| 63 | + |
| 64 | +After making those changes, the kernel will display messages that can |
| 65 | +be traced back to kernel s2idle loop code as well as display any active |
| 66 | +GPIO sources while waking up. |
| 67 | + |
| 68 | +If the wakeup is caused by the ACPI SCI, additional ACPI debugging may be |
| 69 | +needed. These commands can enable additional trace data: :: |
| 70 | + |
| 71 | + $ echo enable | sudo tee /sys/module/acpi/parameters/trace_state |
| 72 | + $ echo 1 | sudo tee /sys/module/acpi/parameters/aml_debug_output |
| 73 | + $ echo 0x0800000f | sudo tee /sys/module/acpi/parameters/debug_level |
| 74 | + $ echo 0xffff0000 | sudo tee /sys/module/acpi/parameters/debug_layer |
| 75 | + |
| 76 | +Spurious s2idle wakeups from a GPIO |
| 77 | +=================================== |
| 78 | + |
| 79 | +If a GPIO is active when waking up the system, ideally you would look at the |
| 80 | +schematic to determine what device it is associated with. If the schematic |
| 81 | +is not available, another tactic is to look at the ACPI ``_EVT()`` entry |
| 82 | +to determine what device is notified when that GPIO is active. |
| 83 | + |
| 84 | +For a hypothetical example, say that GPIO 59 woke up the system. You can |
| 85 | +look at the SSDT to determine what device is notified when GPIO 59 is active. |
| 86 | + |
| 87 | +First convert the GPIO number into hex. :: |
| 88 | + |
| 89 | + $ python3 -c "print(hex(59))" |
| 90 | + 0x3b |
| 91 | + |
| 92 | +Next determine which ACPI table has the ``_EVT`` entry. For example: :: |
| 93 | + |
| 94 | + $ sudo grep EVT /sys/firmware/acpi/tables/SSDT* |
| 95 | + grep: /sys/firmware/acpi/tables/SSDT27: binary file matches |
| 96 | + |
| 97 | +Decode this table:: |
| 98 | + |
| 99 | + $ sudo cp /sys/firmware/acpi/tables/SSDT27 . |
| 100 | + $ sudo iasl -d SSDT27 |
| 101 | + |
| 102 | +Then look at the table and find the matching entry for GPIO 0x3b. :: |
| 103 | + |
| 104 | + Case (0x3B) |
| 105 | + { |
| 106 | + M000 (0x393B) |
| 107 | + M460 (" Notify (\\_SB.PCI0.GP17.XHC1, 0x02)\n", Zero, Zero, Zero, Zero, Zero, Zero) |
| 108 | + Notify (\_SB.PCI0.GP17.XHC1, 0x02) // Device Wake |
| 109 | + } |
| 110 | + |
| 111 | +You can see in this case that the device ``\_SB.PCI0.GP17.XHC1`` is notified |
| 112 | +when GPIO 59 is active. It's obvious this is an XHCI controller, but to go a |
| 113 | +step further you can figure out which XHCI controller it is by matching it to |
| 114 | +ACPI:: |
| 115 | + |
| 116 | + $ grep "PCI0.GP17.XHC1" /sys/bus/acpi/devices/*/path |
| 117 | + /sys/bus/acpi/devices/device:2d/path:\_SB_.PCI0.GP17.XHC1 |
| 118 | + /sys/bus/acpi/devices/device:2e/path:\_SB_.PCI0.GP17.XHC1.RHUB |
| 119 | + /sys/bus/acpi/devices/device:2f/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT1 |
| 120 | + /sys/bus/acpi/devices/device:30/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT1.CAM0 |
| 121 | + /sys/bus/acpi/devices/device:31/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT1.CAM1 |
| 122 | + /sys/bus/acpi/devices/device:32/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT2 |
| 123 | + /sys/bus/acpi/devices/LNXPOWER:0d/path:\_SB_.PCI0.GP17.XHC1.PWRS |
| 124 | + |
| 125 | +Here you can see it matches to ``device:2d``. Look at the ``physical_node`` |
| 126 | +to determine what PCI device that actually is. :: |
| 127 | + |
| 128 | + $ ls -l /sys/bus/acpi/devices/device:2d/physical_node |
| 129 | + lrwxrwxrwx 1 root root 0 Feb 12 13:22 /sys/bus/acpi/devices/device:2d/physical_node -> ../../../../../pci0000:00/0000:00:08.1/0000:c2:00.4 |
| 130 | + |
| 131 | +So there you have it: the PCI device associated with this GPIO wakeup was ``0000:c2:00.4``. |
| 132 | + |
| 133 | +The ``amd_s2idle.py`` script will capture most of these artifacts for you. |
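
The manual search of the decoded table described above can also be scripted. The sketch below is not how ``amd_s2idle.py`` works; ``notify_targets`` is a hypothetical helper that assumes ``iasl`` writes the GPIO as a hex literal in a ``Case (0x..)`` statement, as in the example, and that braces do not appear inside string literals.

```python
import re


def notify_targets(dsl_text, gpio):
    """Return the devices notified inside the ``Case`` block for ``gpio``."""
    for case in re.finditer(r"Case\s*\(\s*(0x[0-9A-Fa-f]+)\s*\)", dsl_text):
        if int(case.group(1), 16) != gpio:
            continue
        start = dsl_text.find("{", case.end())
        if start < 0:
            return []
        # Walk the braces to find where this Case block ends.
        depth = 0
        end = len(dsl_text)
        for pos in range(start, len(dsl_text)):
            if dsl_text[pos] == "{":
                depth += 1
            elif dsl_text[pos] == "}":
                depth -= 1
                if depth == 0:
                    end = pos + 1
                    break
        block = dsl_text[start:end]
        # Drop string literals so a Notify quoted inside M460 is not counted.
        block = re.sub(r'"(?:[^"\\]|\\.)*"', '""', block)
        return re.findall(r"Notify\s*\(\s*([^,\s]+)\s*,", block)
    return []
```

Called with the text of the decoded ``SSDT27.dsl`` and ``59``, this would return the ACPI path that can then be matched against ``/sys/bus/acpi/devices/*/path``.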
| 134 | + |
| 135 | +s2idle PM debug messages |
| 136 | +======================== |
| 137 | +During the s2idle flow on AMD systems, the ACPI LPS0 driver is responsible |
| 138 | +for checking all uPEP constraints. Failing uPEP constraints does not prevent |
| 139 | +s0i3 entry. This means that if some constraints are not met, the kernel |
| 140 | +may still attempt to enter s2idle despite known issues. |
| 141 | + |
| 142 | +To activate PM debugging, either specify the ``pm_debug_messages`` kernel |
| 143 | +command-line option at boot or write to ``/sys/power/pm_debug_messages``. |
| 144 | +Unmet constraints will be displayed in the kernel log and can be |
| 145 | +viewed with logging tools that process the kernel ring buffer, like ``dmesg`` or |
| 146 | +``journalctl``. |
| 147 | + |
| 148 | +If the system freezes on entry/exit before these messages are flushed, a |
| 149 | +useful debugging tactic is to unbind the ``amd_pmc`` driver to prevent |
| 150 | +notification to the platform to start s0i3 entry. This will stop the |
| 151 | +system from freezing on entry or exit and let you view all the failed |
| 152 | +constraints. :: |
| 153 | + |
| 154 | + cd /sys/bus/platform/drivers/amd_pmc |
| 155 | + ls | grep AMD | sudo tee unbind |
| 156 | + |
| 157 | +After doing this, run the suspend cycle and look specifically for errors around: :: |
| 158 | + |
| 159 | + ACPI: LPI: Constraint not met; min power state:%s current power state:%s |
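
Picking these lines out of a long log can be automated. The following is a minimal sketch; ``unmet_constraints`` is a hypothetical helper, and the pattern deliberately starts at ``LPI:`` because the exact prefix in front of it (timestamps, device paths) may vary between kernels.

```python
import re

# Matches the constraint message quoted above, starting at "LPI:".
CONSTRAINT_RE = re.compile(
    r"LPI: Constraint not met; "
    r"min power state:(?P<min>\S+) current power state:(?P<cur>\S+)"
)


def unmet_constraints(log_text):
    """Return (min_state, current_state) for every unmet constraint in the log."""
    return [(m.group("min"), m.group("cur"))
            for m in CONSTRAINT_RE.finditer(log_text)]
```

The input would typically be the captured output of ``dmesg`` or ``journalctl -k`` after a suspend cycle.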
| 160 | + |
| 161 | +Historical examples of s2idle issues |
| 162 | +==================================== |
| 163 | +To help understand the types of issues that can occur and how to debug them, |
| 164 | +here are some historical examples of s2idle issues that have been resolved. |
| 165 | + |
| 166 | +Core offlining |
| 167 | +-------------- |
| 168 | +An end user reported that taking a core offline would prevent the system |
| 169 | +from properly entering s0i3. This was debugged using internal AMD tools |
| 170 | +to capture and display a stream of metrics from the hardware showing what changed |
| 171 | +when a core was offlined. It was determined that the hardware didn't get |
| 172 | +notification that the offline cores were in the deepest state, and so it prevented |
| 173 | +the CPU from going into the deepest state. The issue was traced to a missing |
| 174 | +command to put cores into C3 upon offline. |
| 175 | + |
| 176 | +`commit d6b88ce2eb9d2 ("ACPI: processor idle: Allow playing dead in C3 state") <https://git.kernel.org/torvalds/c/d6b88ce2eb9d2>`_ |
| 177 | + |
| 178 | +Corruption after resume |
| 179 | +----------------------- |
| 180 | +A big problem that occurred with Rembrandt was that there was graphical |
| 181 | +corruption after resume. This happened because of a misalignment of PSP |
| 182 | +and driver responsibility. The PSP will save and restore DMCUB, but the |
| 183 | +driver assumed it needed to reset DMCUB on resume. |
| 184 | +This actually was a misalignment for earlier silicon as well, but was not |
| 185 | +observed. |
| 186 | + |
| 187 | +`commit 79d6b9351f086 ("drm/amd/display: Don't reinitialize DMCUB on s0ix resume") <https://git.kernel.org/torvalds/c/79d6b9351f086>`_ |
| 188 | + |
| 189 | +Back-to-back suspends fail |
| 190 | +-------------------------- |
| 191 | +When using a wakeup source that triggers the IRQ to wake up, a bug in the |
| 192 | +pinctrl-amd driver may capture the wrong state of the IRQ and prevent the |
| 193 | +system from going back to sleep properly. |
| 194 | + |
| 195 | +`commit b8c824a869f22 ("pinctrl: amd: Don't save/restore interrupt status and wake status bits") <https://git.kernel.org/torvalds/c/b8c824a869f22>`_ |
| 196 | + |
| 197 | +Spurious timer based wakeup after 5 minutes |
| 198 | +------------------------------------------- |
| 199 | +The HPET was being used to program the wakeup source for the system; however, |
| 200 | +this was causing a spurious wakeup after 5 minutes. The correct alarm to use |
| 201 | +was the ACPI alarm. |
| 202 | + |
| 203 | +`commit 3d762e21d5637 ("rtc: cmos: Use ACPI alarm for non-Intel x86 systems too") <https://git.kernel.org/torvalds/c/3d762e21d5637>`_ |
| 204 | + |
| 205 | +Disk disappears after resume |
| 206 | +---------------------------- |
| 207 | +After resuming from s2idle, the NVMe disk would disappear. This was due to the |
| 208 | +BIOS not specifying the ``_DSD`` StorageD3Enable property. This caused the NVMe |
| 209 | +driver not to put the disk into the expected state at suspend and to fail |
| 210 | +on resume. |
| 211 | + |
| 212 | +`commit e79a10652bbd3 ("ACPI: x86: Force StorageD3Enable on more products") <https://git.kernel.org/torvalds/c/e79a10652bbd3>`_ |
| 213 | + |
| 214 | +Spurious IRQ1 |
| 215 | +------------- |
| 216 | +A number of Renoir, Lucienne, Cezanne, & Barcelo platforms have a |
| 217 | +platform firmware bug where IRQ1 is triggered during s0i3 resume. |
| 218 | + |
| 219 | +This was fixed in the platform firmware, but a number of systems didn't |
| 220 | +receive any more platform firmware updates. |
| 221 | + |
| 222 | +`commit 8e60615e89321 ("platform/x86/amd: pmc: Disable IRQ1 wakeup for RN/CZN") <https://git.kernel.org/torvalds/c/8e60615e89321>`_ |
| 223 | + |
| 224 | +Hardware timeout |
| 225 | +---------------- |
| 226 | +The hardware performs many actions besides accepting the values from |
| 227 | +the amd-pmc driver. As the communication path with the hardware is a mailbox, |
| 228 | +it's possible that it might not respond quickly enough. |
| 229 | +This issue manifested as a failure to suspend: :: |
| 230 | + |
| 231 | + PM: dpm_run_callback(): acpi_subsys_suspend_noirq+0x0/0x50 returns -110 |
| 232 | + amd_pmc AMDI0005:00: PM: failed to suspend noirq: error -110 |
| 233 | + |
| 234 | +The timing problem was identified by comparing the values of the idle mask. |
| 235 | + |
| 236 | +`commit 3c3c8e88c8712 ("platform/x86: amd-pmc: Increase the response register timeout") <https://git.kernel.org/torvalds/c/3c3c8e88c8712>`_ |
| 237 | + |
| 238 | +Failed to reach hardware sleep state with panel on |
| 239 | +-------------------------------------------------- |
| 240 | +On some Strix systems, certain panels were observed to block the system from |
| 241 | +entering a hardware sleep state if the internal panel was on during the sequence. |
| 242 | + |
| 243 | +Even though the panel was turned off during suspend, it exposed a timing problem |
| 244 | +where an interrupt caused the display hardware to wake up and block low power |
| 245 | +state entry. |
| 246 | + |
| 247 | +`commit 40b8c14936bd2 ("drm/amd/display: Disable unneeded hpd interrupts during dm_init") <https://git.kernel.org/torvalds/c/40b8c14936bd2>`_ |
| 248 | + |
| 249 | +Runtime power consumption issues |
| 250 | +================================ |
| 251 | +Runtime power consumption is influenced by many factors, including but not |
| 252 | +limited to the configuration of the PCIe Active State Power Management (ASPM), |
| 253 | +the display brightness, the EPP policy of the CPU, and the power management |
| 254 | +of the devices. |
| 255 | + |
| 256 | +ASPM |
| 257 | +---- |
| 258 | +For the best runtime power consumption, ASPM should be programmed as intended |
| 259 | +by the BIOS from the hardware vendor. To accomplish this the Linux kernel |
| 260 | +should be compiled with ``CONFIG_PCIEASPM_DEFAULT`` set to ``y`` and the |
| 261 | +sysfs file ``/sys/module/pcie_aspm/parameters/policy`` should not be modified. |
| 262 | + |
| 263 | +Most notably, if L1.2 is not configured properly for any devices, the SoC |
| 264 | +will not be able to enter the deepest idle state. |
| 265 | + |
| 266 | +EPP Policy |
| 267 | +---------- |
| 268 | +The ``energy_performance_preference`` sysfs file can be used to bias a CPU |
| 269 | +towards efficiency or performance. This directly affects battery life: the |
| 270 | +more heavily the CPU is biased towards performance, the shorter it will be. |
| 271 | + |
| 272 | + |
| 273 | +BIOS debug messages |
| 274 | +=================== |
| 275 | +Most OEM machines don't have a serial UART for outputting kernel or BIOS |
| 276 | +debug messages. However, BIOS debug messages are useful for understanding |
| 277 | +both BIOS bugs and bugs with the Linux kernel drivers that call BIOS AML. |
| 278 | + |
| 279 | +As the BIOS on most OEM AMD systems is based on an AMD reference BIOS, |
| 280 | +the infrastructure used for exporting debugging messages is often the same |
| 281 | +as in the AMD reference BIOS. |
| 282 | + |
| 283 | +Manually Parsing |
| 284 | +---------------- |
| 285 | +There is generally an ACPI method ``\M460`` that different paths of the AML |
| 286 | +will call to emit a message to the BIOS serial log. This method takes |
| 287 | +7 arguments, with the first being a string and the rest being optional |
| 288 | +integers:: |
| 289 | + |
| 290 | + Method (M460, 7, Serialized) |
| 291 | + |
| 292 | +Here is an example of a string that BIOS AML may call out using ``\M460``:: |
| 293 | + |
| 294 | + M460 (" OEM-ASL-PCIe Address (0x%X)._REG (%d %d) PCSA = %d\n", DADR, Arg0, Arg1, PCSA, Zero, Zero) |
| 295 | + |
| 296 | +Normally, when executed, the ``\M460`` method would populate the additional |
| 297 | +arguments into the string. In order to get these messages from the Linux |
| 298 | +kernel, a hook has been added into ACPICA that can capture the *arguments* |
| 299 | +sent to ``\M460`` and print them to the kernel ring buffer. |
| 300 | +For example, the following message could be emitted into the kernel ring buffer:: |
| 301 | + |
| 302 | + extrace-0174 ex_trace_args : " OEM-ASL-PCIe Address (0x%X)._REG (%d %d) PCSA = %d\n", ec106000, 2, 1, 1, 0, 0 |
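
Substituting the captured arguments back into the format string by hand looks roughly like the sketch below. It is an illustration, not the code used by amd-debug-tools: ``render_m460`` is a hypothetical helper, it only handles the ``%d``, ``%u``, ``%x`` and ``%X`` conversions, and it assumes the trailing arguments are printed in hexadecimal.

```python
import re


def render_m460(trace_line):
    """Fill the arguments of an ex_trace_args line into its format string."""
    first = trace_line.index('"')
    last = trace_line.rindex('"')
    fmt = trace_line[first + 1:last]
    raw_args = [a.strip() for a in trace_line[last + 1:].split(",") if a.strip()]
    # Assumption: the integer arguments are printed in hexadecimal.
    args = iter([int(a, 16) for a in raw_args])

    def substitute(match):
        value = next(args, 0)  # missing arguments are treated as zero
        spec = match.group(1)
        if spec == "X":
            return "%X" % value
        if spec == "x":
            return "%x" % value
        return "%d" % value

    text = re.sub(r"%([dXxu])", substitute, fmt)
    # The log carries a literal backslash-n rather than a newline; drop it.
    return text.replace("\\n", "").strip()
```

Unused trailing arguments (the ``Zero`` padding) are simply ignored once every conversion in the string has been filled.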
| 303 | + |
| 304 | +In order to get these messages, you need to compile the kernel with |
| 305 | +``CONFIG_ACPI_DEBUG`` enabled and then turn on the following ACPICA tracing |
| 306 | +parameters. This can be done either on the kernel command line or at runtime: |
| 307 | + |
| 308 | +* ``acpi.trace_method_name=\M460`` |
| 309 | +* ``acpi.trace_state=method`` |
| 310 | + |
| 311 | +NOTE: These can be very noisy at bootup. If you turn these parameters on |
| 312 | +on the kernel command line, please also consider turning up ``CONFIG_LOG_BUF_SHIFT`` |
| 313 | +to a larger size such as 17 to avoid losing early boot messages. |
| 314 | + |
| 315 | +Tool assisted Parsing |
| 316 | +--------------------- |
| 317 | +Parsing by hand can be tedious, especially with a lot of |
| 318 | +messages. To help with this, a tool has been created at |
| 319 | +`amd-debug-tools <https://git.kernel.org/pub/scm/linux/kernel/git/superm1/amd-debug-tools.git/about/>`_ |
| 320 | +to help parse the messages. |