Commit 3463315

Merge tag 'amd-drm-next-6.10-2024-04-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.10-2024-04-13:

amdgpu:
- HDCP fixes
- ODM fixes
- RAS fixes
- Devcoredump improvements
- Misc code cleanups
- Expose VCN activity via sysfs
- SMU 13.0.x updates
- Enable fast updates on DCN 3.1.4
- Add dclk and vclk reporting on additional devices
- Add ACA RAS infrastructure
- Implement TLB flush fence
- EEPROM handling fixes
- SMUIO 14.0.2 support
- SMU 14.0.1 Updates
- Sync page table freeing with TLB flushes
- DML2 refactor
- DC debug improvements
- SR-IOV fixes
- Suspend and Resume fixes
- DCN 3.5.x Updates
- Z8 fixes
- UMSCH fixes
- GPU reset fixes
- HDP fix for second GFX pipe on GC 10.x
- Enable secondary GFX pipe on GC 10.3
- Refactor and clean up BACO/BOCO/BAMACO handling
- VCN partitioning fix
- DC DWB fixes
- VSC SDP fixes
- DCN 3.1.6 fix
- GC 11.5 fixes
- Remove invalid TTM resource start check
- DCN 1.0 fixes

amdkfd:
- MQD handling cleanup
- Preemption handling fixes for XCDs
- TLB flush fix for GC 9.4.2
- Properly clean up workqueue during module unload
- Fix memory leak on process create failure
- Range check CP bad op exception targets to avoid reporting invalid exceptions to userspace

radeon:
- Misc code cleanups

From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Dave Airlie <[email protected]>
2 parents 6e1f415 + ab956ed commit 3463315

376 files changed, with 8554 additions and 2759 deletions

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+===============
+ GPU Debugging
+===============
+
+GPUVM Debugging
+===============
+
+To aid in debugging GPU virtual memory related problems, the driver supports a
+number of optional module parameters:
+
+`vm_fault_stop` - If non-0, halt the GPU memory controller on a GPU page fault.
+
+`vm_update_mode` - If non-0, use the CPU to update GPU page tables rather than
+the GPU.
+
+
+Decoding a GPUVM Page Fault
+===========================
+
+If you see a GPU page fault in the kernel log, you can decode it to figure
+out what is going wrong in your application. A page fault in your kernel
+log may look something like this:
+
+::
+
+    [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
+      in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
+    VM_L2_PROTECTION_FAULT_STATUS:0x00301030
+        Faulty UTCL2 client ID: TCP (0x8)
+        MORE_FAULTS: 0x0
+        WALKER_ERROR: 0x0
+        PERMISSION_FAULTS: 0x3
+        MAPPING_ERROR: 0x0
+        RW: 0x0
+
+First you have the memory hub, gfxhub or mmhub. gfxhub is the memory
+hub used for graphics, compute, and sdma on some chips. mmhub is the
+memory hub used for multi-media and sdma on some chips.
+
+Next you have the vmid and pasid. If the vmid is 0, this fault was likely
+caused by the kernel driver or firmware. If the vmid is non-0, it is generally
+a fault in a user application. The pasid is used to link a vmid to a system
+process id. If the process is active when the fault happens, the process
+information will be printed.
+
+The GPU virtual address that caused the fault comes next.
+
+The client ID indicates the GPU block that caused the fault.
+Some common client IDs:
+
+- CB/DB: The color/depth backend of the graphics pipe
+- CPF: Command Processor Frontend
+- CPC: Command Processor Compute
+- CPG: Command Processor Graphics
+- TCP/SQC/SQG: Shaders
+- SDMA: SDMA engines
+- VCN: Video encode/decode engines
+- JPEG: JPEG engines
+
+PERMISSION_FAULTS describes what faults were encountered:
+
+- bit 0: the PTE was not valid
+- bit 1: the PTE read bit was not set
+- bit 2: the PTE write bit was not set
+- bit 3: the PTE execute bit was not set
+
+Finally, RW indicates whether the access was a read (0) or a write (1).
+
+In the example above, a shader (client ID = TCP) generated a read (RW = 0x0) to
+an invalid page (PERMISSION_FAULTS = 0x3) at GPU virtual address
+0x0000800102800000. The user can then inspect their shader code and resource
+descriptor state to determine what caused the GPU page fault.
+
+UMR
+===
+
+`umr <https://gitlab.freedesktop.org/tomstdenis/umr>`_ is a general purpose
+GPU debugging and diagnostics tool. Please see the umr
+`documentation <https://umr.readthedocs.io/en/main/>`_ for more information
+about its capabilities.
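
The decoding steps described in the new debugging page above can be sketched as a small userspace helper. This is an illustrative sketch only and is not part of the commit: the function names `decode_permission_faults` and `summarize_fault` are hypothetical, and the regular expressions are assumptions derived solely from the sample log in the page, so other chips or kernel versions may print these lines differently.

```python
import re

# Bit meanings for PERMISSION_FAULTS, as listed in the debugging page.
PERMISSION_FAULT_BITS = {
    0: "the PTE was not valid",
    1: "the PTE read bit was not set",
    2: "the PTE write bit was not set",
    3: "the PTE execute bit was not set",
}


def decode_permission_faults(value):
    """Return the description of every bit set in a PERMISSION_FAULTS value."""
    return [text for bit, text in PERMISSION_FAULT_BITS.items() if value & (1 << bit)]


def summarize_fault(log_text):
    """Extract the fields the page walks through from a fault log excerpt.

    Hypothetical helper: the patterns below are guesses based on the single
    sample log and are not a stable interface of the amdgpu driver.
    """
    def field(pattern):
        match = re.search(pattern, log_text)
        if match is None:
            raise ValueError(f"field not found: {pattern}")
        return match.group(1)

    vmid = int(field(r"vmid:(\d+)"))
    permission_faults = int(field(r"PERMISSION_FAULTS:\s*(0x[0-9a-fA-F]+)"), 16)
    rw = int(field(r"\bRW:\s*(0x[0-9a-fA-F]+)"), 16)
    return {
        "hub": field(r"\[(\w+)\]"),
        "vmid": vmid,
        # vmid 0 points at the kernel driver or firmware, anything else at userspace.
        "likely_source": "kernel driver or firmware" if vmid == 0 else "user application",
        "address": int(field(r"address (0x[0-9a-fA-F]+)"), 16),
        "client": field(r"client ID:\s*(\w+)"),
        "faults": decode_permission_faults(permission_faults),
        "access": "write" if rw else "read",
    }


# The sample fault from the debugging page.
SAMPLE = """\
[gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
  in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
VM_L2_PROTECTION_FAULT_STATUS:0x00301030
    Faulty UTCL2 client ID: TCP (0x8)
    MORE_FAULTS: 0x0
    WALKER_ERROR: 0x0
    PERMISSION_FAULTS: 0x3
    MAPPING_ERROR: 0x0
    RW: 0x0
"""

if __name__ == "__main__":
    for key, value in summarize_fault(SAMPLE).items():
        print(f"{key}: {value}")
```

For the sample log, the helper reports a read from the TCP client with bits 0 and 1 of PERMISSION_FAULTS set, which matches the interpretation given in the page. Keeping the bit table in a dictionary keeps it next to the documented meanings, so it is easy to compare against the list in the page.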

Documentation/gpu/amdgpu/display/display-contributing.rst

Lines changed: 1 addition & 1 deletion
@@ -135,7 +135,7 @@ Enable underlay
 ---------------
 
 AMD display has this feature called underlay (which you can read more about at
-'Documentation/GPU/amdgpu/display/mpo-overview.rst') which is intended to
+'Documentation/gpu/amdgpu/display/mpo-overview.rst') which is intended to
 save power when playing a video. The basic idea is to put a video in the
 underlay plane at the bottom and the desktop in the plane above it with a hole
 in the video area. This feature is enabled in ChromeOS, and from our data

Documentation/gpu/amdgpu/index.rst

Lines changed: 1 addition & 0 deletions
@@ -15,4 +15,5 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
    ras
    thermal
    driver-misc
+   debugging
    amdgpu-glossary

drivers/gpu/drm/amd/amdgpu/Makefile

Lines changed: 5 additions & 3 deletions
@@ -70,7 +70,8 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \
 	amdgpu_cs.o amdgpu_bios.o amdgpu_benchmark.o \
 	atombios_dp.o amdgpu_afmt.o amdgpu_trace_points.o \
 	atombios_encoders.o amdgpu_sa.o atombios_i2c.o \
-	amdgpu_dma_buf.o amdgpu_vm.o amdgpu_vm_pt.o amdgpu_ib.o amdgpu_pll.o \
+	amdgpu_dma_buf.o amdgpu_vm.o amdgpu_vm_pt.o amdgpu_vm_tlb_fence.o \
+	amdgpu_ib.o amdgpu_pll.o \
 	amdgpu_ucode.o amdgpu_bo_list.o amdgpu_ctx.o amdgpu_sync.o \
 	amdgpu_gtt_mgr.o amdgpu_preempt_mgr.o amdgpu_vram_mgr.o amdgpu_virt.o \
 	amdgpu_atomfirmware.o amdgpu_vf_error.o amdgpu_sched.o \
@@ -80,7 +81,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \
 	amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \
 	amdgpu_fw_attestation.o amdgpu_securedisplay.o \
 	amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
-	amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+	amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_dev_coredump.o
 
 amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o
 
@@ -247,7 +248,8 @@ amdgpu-y += \
 	smuio_v11_0_6.o \
 	smuio_v13_0.o \
 	smuio_v13_0_3.o \
-	smuio_v13_0_6.o
+	smuio_v13_0_6.o \
+	smuio_v14_0_2.o
 
 # add reset block
 amdgpu-y += \

drivers/gpu/drm/amd/amdgpu/amdgpu.h

Lines changed: 3 additions & 2 deletions
@@ -210,6 +210,7 @@ extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
 extern int amdgpu_discovery;
 extern int amdgpu_mes;
+extern int amdgpu_mes_log_enable;
 extern int amdgpu_mes_kiq;
 extern int amdgpu_noretry;
 extern int amdgpu_force_asic_type;
@@ -605,7 +606,7 @@ struct amdgpu_asic_funcs {
 	/* PCIe replay counter */
 	uint64_t (*get_pcie_replay_count)(struct amdgpu_device *adev);
 	/* device supports BACO */
-	bool (*supports_baco)(struct amdgpu_device *adev);
+	int (*supports_baco)(struct amdgpu_device *adev);
 	/* pre asic_init quirks */
 	void (*pre_asic_init)(struct amdgpu_device *adev);
 	/* enter/exit umd stable pstate */
@@ -1407,7 +1408,7 @@ bool amdgpu_device_supports_atpx(struct drm_device *dev);
 bool amdgpu_device_supports_px(struct drm_device *dev);
 bool amdgpu_device_supports_boco(struct drm_device *dev);
 bool amdgpu_device_supports_smart_shift(struct drm_device *dev);
-bool amdgpu_device_supports_baco(struct drm_device *dev);
+int amdgpu_device_supports_baco(struct drm_device *dev);
 bool amdgpu_device_is_peer_accessible(struct amdgpu_device *adev,
 				      struct amdgpu_device *peer_adev);
 int amdgpu_device_baco_enter(struct drm_device *dev);
