
Commit b2ec5ca

Merge tag 'amd-drm-next-6.18-2025-09-26' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.18-2025-09-26:

amdgpu:
- Misc fixes
- Misc cleanups
- SMU 13.x fixes
- MES fix
- VCN 5.0.1 reset fixes
- DCN 3.2 watermark fixes
- AVI infoframe fixes
- PSR fix
- Brightness fixes
- DCN 3.1.4 fixes
- DCN 3.1+ DTM fixes
- DCN powergating fixes
- DMUB fixes
- DCN/SMU cleanup
- DCN stutter fixes
- DCN 3.5 fixes
- GAMMA_LUT fixes
- Add UserQ documentation
- GC 9.4 reset fixes
- Enforce isolation cleanups
- UserQ fixes
- DC/non-DC common modes cleanup
- DCE6-10 fixes

amdkfd:
- Fix a race in sw_fini
- Switch partition fix

Signed-off-by: Dave Airlie <[email protected]>
From: Alex Deucher <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2 parents 62bea0e + df2ba57 commit b2ec5ca

79 files changed: +1195, -392 lines


Documentation/gpu/amdgpu/index.rst

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
    module-parameters
    gc/index
    display/index
+   userq
    flashing
    xgmi
    ras

Documentation/gpu/amdgpu/userq.rst

Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
==================
User Mode Queues
==================

Introduction
============

Similar to the KFD, GPU engine queues move into userspace. The idea is to let
user processes manage their submissions to the GPU engines directly, bypassing
IOCTL calls to the driver to submit work. This reduces overhead and also allows
the GPU to submit work to itself. Applications can set up work graphs of jobs
across multiple GPU engines without needing trips through the CPU.

UMDs directly interface with firmware via per-application shared memory areas.
The main vehicle for this is the queue. A queue is a ring buffer with a read
pointer (rptr) and a write pointer (wptr). The UMD writes IP-specific packets
into the queue and the firmware processes those packets, kicking off work on the
GPU engines. The CPU in the application (or another queue or device) updates
the wptr to tell the firmware how far into the ring buffer to process packets,
and the rptr provides feedback to the UMD on how far the firmware has progressed
in executing those packets. When the wptr and the rptr are equal, the queue is
idle.
Theory of Operation
===================

The various engines on modern AMD GPUs support multiple queues per engine, with
scheduling firmware that dynamically schedules user queues on the available
hardware queue slots. When the number of user queues exceeds the available
hardware queue slots, the scheduling firmware dynamically maps and unmaps
queues based on priority and time quanta. The state of each user queue is
managed in the kernel driver in an MQD (Memory Queue Descriptor). This is a
buffer in GPU accessible memory that stores the state of a user queue. The
scheduling firmware uses the MQD to load the queue state into an HQD (Hardware
Queue Descriptor) when a user queue is mapped. Each user queue requires a
number of additional buffers which represent the ring buffer and any metadata
needed by the engine for runtime operation. On most engines this consists of
the ring buffer itself, a rptr buffer (where the firmware will shadow the rptr
to userspace), a wptr buffer (where the application will write the wptr for the
firmware to fetch), and a doorbell. A doorbell is a piece of one of the
device's MMIO BARs which can be mapped to specific user queues. Writing to the
doorbell wakes the firmware and causes it to fetch the wptr and start
processing the packets in the queue. Each 4K page of the doorbell BAR supports
specific offset ranges for specific engines. The doorbell of a queue must be
mapped into the aperture aligned to the IP used by the queue (e.g., GFX, VCN,
SDMA, etc.). These doorbell apertures are set up via NBIO registers. Doorbells
are 32-bit or 64-bit (depending on the engine) chunks of the doorbell BAR. A
4K doorbell page provides 512 64-bit doorbells for up to 512 user queues. A
subset of each page is reserved for each IP type supported on the device. The
user can query the doorbell ranges for each IP via the INFO IOCTL. See the
IOCTL Interfaces section for more information.

When an application wants to create a user queue, it allocates the necessary
buffers for the queue (ring buffer, wptr and rptr, context save areas, etc.).
These can be separate buffers or all part of one larger buffer. The
application maps the buffer(s) into its GPUVM and uses the GPU virtual
addresses for the areas of memory it wants to use for the user queue. It also
allocates a doorbell page for the doorbells used by the user queues. The
application then populates the MQD in the USERQ IOCTL structure with the GPU
virtual addresses and doorbell index it wants to use. The user can also
specify the attributes for the user queue (priority, whether the queue is
secure for protected content, etc.). The application then calls the USERQ
CREATE IOCTL to create the queue using the specified MQD details in the IOCTL.
The kernel driver validates the MQD provided by the application and translates
it into the engine-specific MQD format for the IP. The IP-specific MQD is
allocated and the queue is added to the run list maintained by the scheduling
firmware. Once the queue has been created, the application can write packets
directly into the queue, update the wptr, and write to the doorbell offset to
kick off work in the user queue.

When the application is done with the user queue, it calls the USERQ FREE
IOCTL to destroy it. The kernel driver preempts the queue and removes it from
the scheduling firmware's run list. Then the IP-specific MQD is freed and the
user queue state is cleaned up.

Some engines may also require the aggregated doorbell if the engine does not
support doorbells from unmapped queues. The aggregated doorbell is a special
page of doorbell space which wakes the scheduler. In cases where the engine
may be oversubscribed, some queues may not be mapped. If the doorbell is rung
when the queue is not mapped, the engine firmware may miss the request. Some
scheduling firmware may work around this by polling wptr shadows when the
hardware is oversubscribed; other engines may support doorbell updates from
unmapped queues. When neither option is available, the kernel driver maps a
page of aggregated doorbell space into each GPUVM space. The UMD then updates
the doorbell and wptr as normal and writes to the aggregated doorbell as well.
Special Packets
---------------

In order to support legacy implicit synchronization, as well as mixed user and
kernel queues, we need a synchronization mechanism that is secure. Because
kernel queues and memory management tasks depend on kernel fences, we need a
way for user queues to update memory that the kernel can use for a fence and
that cannot be tampered with by a bad actor. To support this, we've added a
protected fence packet. This packet works by writing a monotonically
increasing value to a memory location that only privileged clients have write
access to; user queues only have read access. When this packet is executed,
the memory location is updated and other queues (kernel or user) can see the
result. The user application submits this packet in its command stream. The
actual packet format varies from IP to IP (GFX/Compute, SDMA, VCN, etc.), but
the behavior is the same. The packet submission is handled in userspace. The
kernel driver sets up the privileged memory used by each user queue when the
application creates it.
Memory Management
=================

It is assumed that all buffers mapped into the GPUVM space for the process are
valid when engines on the GPU are running. The kernel driver will only allow
user queues to run when all buffers are mapped. If there is a memory event
that requires buffer migration, the kernel driver will preempt the user
queues, migrate buffers to where they need to be, update the GPUVM page
tables, invalidate the TLB, and then resume the user queues.
Interaction with Kernel Queues
==============================

Depending on the IP and the scheduling firmware, kernel queues and user queues
can be enabled at the same time; however, both are limited by the available
HQD slots. Kernel queues are always mapped, so any work that goes into kernel
queues takes priority. This limits the HQD slots available to user queues.

Not all IPs will support user queues on all GPUs. As such, UMDs will need to
support both user queues and kernel queues depending on the IP. For example, a
GPU may support user queues for GFX, compute, and SDMA, but not for VCN, JPEG,
and VPE; UMDs need to support both. The kernel driver provides a way to
determine whether user queues and kernel queues are supported on a per-IP
basis. UMDs can query this information via the INFO IOCTL and decide whether
to use kernel queues or user queues for each IP.
Queue Resets
============

On most engines, queues can be reset individually; this includes GFX, compute,
and SDMA queues. When a hung queue is detected, it can be reset either via the
scheduling firmware or via MMIO. Since there are no kernel fences for most
user queues, hangs will usually only be detected when some other event
happens; e.g., a memory event which requires migration of buffers. When the
queues are preempted and a queue is hung, the preemption will fail. The driver
will then look up the queues that failed to preempt, reset them, and record
which queues are hung.

On the UMD side, we will add a USERQ QUERY_STATUS IOCTL to query the queue
status. The UMD will provide the queue id in the IOCTL and the kernel driver
will check whether it has already recorded the queue as hung (e.g., due to
failed preemption) and report back the status.
IOCTL Interfaces
================

GPU virtual addresses used for queues and related data (rptrs, wptrs, context
save areas, etc.) should be validated by the kernel mode driver to prevent the
user from specifying invalid GPU virtual addresses. If the user provides
invalid GPU virtual addresses or doorbell indices, the IOCTL should return an
error. These buffers should also be tracked in the kernel driver so that if
the user attempts to unmap the buffer(s) from the GPUVM, the unmap call
returns an error.
INFO
----
There are several new INFO queries related to user queues: the size of the
user queue metadata needed for a user queue (e.g., context save areas or
shadow buffers), whether kernel queues, user queues, or both are supported for
each IP type, and the offsets for each IP type in each doorbell page.
USERQ
-----
The USERQ IOCTL is used for creating, freeing, and querying the status of user
queues. It supports 3 opcodes:

1. CREATE - Create a user queue. The application provides an MQD-like
   structure that defines the type of queue and associated metadata and flags
   for that queue type. Returns the queue id.
2. FREE - Free a user queue.
3. QUERY_STATUS - Query the status of a queue, used to check whether the queue
   is healthy; e.g., whether it has been reset. (WIP)
USERQ_SIGNAL
------------
The USERQ_SIGNAL IOCTL is used to provide a list of sync objects to be
signaled.

USERQ_WAIT
----------
The USERQ_WAIT IOCTL is used to provide a list of sync objects to be waited
on.
Kernel and User Queues
======================

In order to properly validate and test performance, we have a driver option to
select which types of queues are enabled (kernel queues, user queues, or
both). The user_queue driver parameter allows you to enable kernel queues
only (0), user queues and kernel queues (1), or user queues only (2).
Enabling user queues only frees up static queue assignments that would
otherwise be used by kernel queues for use by the scheduling firmware. Some
kernel queues are required for kernel driver operation, and they will always
be created. When kernel queues are not enabled, they are not registered with
the drm scheduler and the CS IOCTL will reject any incoming command
submissions which target those queue types. Kernel queues only mirrors the
behavior on all existing GPUs. Enabling both allows for backwards
compatibility with old userspace while still supporting user queues.
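A quick way to experiment with the user_queue parameter described above (this assumes a system where the amdgpu module is not built in and can be loaded with parameters):

```shell
# Load amdgpu with both kernel and user queues enabled (value 1).
#   0 = kernel queues only (mirrors existing GPU behavior)
#   2 = user queues only (frees static queue slots for the scheduler firmware)
modprobe amdgpu user_queue=1

# Or persistently, via modprobe configuration:
echo "options amdgpu user_queue=1" > /etc/modprobe.d/amdgpu-userq.conf

# Check the current value after load:
cat /sys/module/amdgpu/parameters/user_queue
```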

drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c

Lines changed: 2 additions & 2 deletions
@@ -352,7 +352,7 @@ static int kgd_hqd_dump(struct amdgpu_device *adev,
 		(*dump)[i++][1] = RREG32_SOC15_IP(GC, addr); \
 	} while (0)
 
-	*dump = kmalloc(HQD_N_REGS*2*sizeof(uint32_t), GFP_KERNEL);
+	*dump = kmalloc_array(HQD_N_REGS, sizeof(**dump), GFP_KERNEL);
 	if (*dump == NULL)
 		return -ENOMEM;
 
@@ -449,7 +449,7 @@ static int kgd_hqd_sdma_dump(struct amdgpu_device *adev,
 #undef HQD_N_REGS
 #define HQD_N_REGS (19+6+7+10)
 
-	*dump = kmalloc(HQD_N_REGS*2*sizeof(uint32_t), GFP_KERNEL);
+	*dump = kmalloc_array(HQD_N_REGS, sizeof(**dump), GFP_KERNEL);
 	if (*dump == NULL)
 		return -ENOMEM;
 

drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10_3.c

Lines changed: 2 additions & 2 deletions
@@ -338,7 +338,7 @@ static int hqd_dump_v10_3(struct amdgpu_device *adev,
 		(*dump)[i++][1] = RREG32_SOC15_IP(GC, addr); \
 	} while (0)
 
-	*dump = kmalloc(HQD_N_REGS*2*sizeof(uint32_t), GFP_KERNEL);
+	*dump = kmalloc_array(HQD_N_REGS, sizeof(**dump), GFP_KERNEL);
 	if (*dump == NULL)
 		return -ENOMEM;
 
@@ -435,7 +435,7 @@ static int hqd_sdma_dump_v10_3(struct amdgpu_device *adev,
 #undef HQD_N_REGS
 #define HQD_N_REGS (19+6+7+12)
 
-	*dump = kmalloc(HQD_N_REGS*2*sizeof(uint32_t), GFP_KERNEL);
+	*dump = kmalloc_array(HQD_N_REGS, sizeof(**dump), GFP_KERNEL);
 	if (*dump == NULL)
 		return -ENOMEM;
 

drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c

Lines changed: 2 additions & 2 deletions
@@ -323,7 +323,7 @@ static int hqd_dump_v11(struct amdgpu_device *adev,
 		(*dump)[i++][1] = RREG32(addr); \
 	} while (0)
 
-	*dump = kmalloc(HQD_N_REGS*2*sizeof(uint32_t), GFP_KERNEL);
+	*dump = kmalloc_array(HQD_N_REGS, sizeof(**dump), GFP_KERNEL);
 	if (*dump == NULL)
 		return -ENOMEM;
 
@@ -420,7 +420,7 @@ static int hqd_sdma_dump_v11(struct amdgpu_device *adev,
 #undef HQD_N_REGS
 #define HQD_N_REGS (7+11+1+12+12)
 
-	*dump = kmalloc(HQD_N_REGS*2*sizeof(uint32_t), GFP_KERNEL);
+	*dump = kmalloc_array(HQD_N_REGS, sizeof(**dump), GFP_KERNEL);
 	if (*dump == NULL)
 		return -ENOMEM;
 

drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v12.c

Lines changed: 2 additions & 2 deletions
@@ -115,7 +115,7 @@ static int hqd_dump_v12(struct amdgpu_device *adev,
 		(*dump)[i++][1] = RREG32(addr); \
 	} while (0)
 
-	*dump = kmalloc(HQD_N_REGS*2*sizeof(uint32_t), GFP_KERNEL);
+	*dump = kmalloc_array(HQD_N_REGS, sizeof(**dump), GFP_KERNEL);
 	if (*dump == NULL)
 		return -ENOMEM;
 
@@ -146,7 +146,7 @@ static int hqd_sdma_dump_v12(struct amdgpu_device *adev,
 #undef HQD_N_REGS
 #define HQD_N_REGS (last_reg - first_reg + 1)
 
-	*dump = kmalloc(HQD_N_REGS*2*sizeof(uint32_t), GFP_KERNEL);
+	*dump = kmalloc_array(HQD_N_REGS, sizeof(**dump), GFP_KERNEL);
 	if (*dump == NULL)
 		return -ENOMEM;
 

drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c

Lines changed: 7 additions & 3 deletions
@@ -1089,7 +1089,7 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr,
 		return 0;
 	}
 
-	ret = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages, &range);
+	ret = amdgpu_ttm_tt_get_user_pages(bo, &range);
 	if (ret) {
 		if (ret == -EAGAIN)
 			pr_debug("Failed to get user pages, try again\n");
@@ -1103,6 +1103,9 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr,
 		pr_err("%s: Failed to reserve BO\n", __func__);
 		goto release_out;
 	}
+
+	amdgpu_ttm_tt_set_user_pages(bo->tbo.ttm, range);
+
 	amdgpu_bo_placement_from_domain(bo, mem->domain);
 	ret = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
 	if (ret)
@@ -2565,8 +2568,7 @@ static int update_invalid_user_pages(struct amdkfd_process_info *process_info,
 		}
 
 		/* Get updated user pages */
-		ret = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages,
-						   &mem->range);
+		ret = amdgpu_ttm_tt_get_user_pages(bo, &mem->range);
 		if (ret) {
 			pr_debug("Failed %d to get user pages\n", ret);
 
@@ -2595,6 +2597,8 @@ static int update_invalid_user_pages(struct amdkfd_process_info *process_info,
 			ret = 0;
 		}
 
+		amdgpu_ttm_tt_set_user_pages(bo->tbo.ttm, mem->range);
+
 		mutex_lock(&process_info->notifier_lock);
 
 		/* Mark the BO as valid unless it was invalidated

drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h

Lines changed: 0 additions & 1 deletion
@@ -38,7 +38,6 @@ struct amdgpu_bo_list_entry {
 	struct amdgpu_bo *bo;
 	struct amdgpu_bo_va *bo_va;
 	uint32_t priority;
-	struct page **user_pages;
 	struct hmm_range *range;
 	bool user_invalidated;
 };

drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c

Lines changed: 19 additions & 22 deletions
@@ -398,30 +398,28 @@ static void amdgpu_connector_add_common_modes(struct drm_encoder *encoder,
 	struct drm_display_mode *mode = NULL;
 	struct drm_display_mode *native_mode = &amdgpu_encoder->native_mode;
 	int i;
-	static const struct mode_size {
+	int n;
+	struct mode_size {
+		char name[DRM_DISPLAY_MODE_LEN];
 		int w;
 		int h;
-	} common_modes[17] = {
-		{ 640, 480},
-		{ 720, 480},
-		{ 800, 600},
-		{ 848, 480},
-		{1024, 768},
-		{1152, 768},
-		{1280, 720},
-		{1280, 800},
-		{1280, 854},
-		{1280, 960},
-		{1280, 1024},
-		{1440, 900},
-		{1400, 1050},
-		{1680, 1050},
-		{1600, 1200},
-		{1920, 1080},
-		{1920, 1200}
+	} common_modes[] = {
+		{ "640x480", 640, 480},
+		{ "800x600", 800, 600},
+		{ "1024x768", 1024, 768},
+		{ "1280x720", 1280, 720},
+		{ "1280x800", 1280, 800},
+		{"1280x1024", 1280, 1024},
+		{ "1440x900", 1440, 900},
+		{"1680x1050", 1680, 1050},
+		{"1600x1200", 1600, 1200},
+		{"1920x1080", 1920, 1080},
+		{"1920x1200", 1920, 1200}
 	};
 
-	for (i = 0; i < 17; i++) {
+	n = ARRAY_SIZE(common_modes);
+
+	for (i = 0; i < n; i++) {
 		if (amdgpu_encoder->devices & (ATOM_DEVICE_TV_SUPPORT)) {
 			if (common_modes[i].w > 1024 ||
 			    common_modes[i].h > 768)
@@ -434,12 +432,11 @@ static void amdgpu_connector_add_common_modes(struct drm_encoder *encoder,
 			    common_modes[i].h == native_mode->vdisplay))
 				continue;
 		}
-		if (common_modes[i].w < 320 || common_modes[i].h < 200)
-			continue;
 
 		mode = drm_cvt_mode(dev, common_modes[i].w, common_modes[i].h, 60, false, false, false);
 		if (!mode)
 			return;
+		strscpy(mode->name, common_modes[i].name, DRM_DISPLAY_MODE_LEN);
 
 		drm_mode_probed_add(connector, mode);
 	}
