==================
 User Mode Queues
==================

Introduction
============

Similar to the KFD, GPU engine queues move into userspace. The idea is to let
user processes manage their submissions to the GPU engines directly, bypassing
IOCTL calls to the driver to submit work. This reduces overhead and also allows
the GPU to submit work to itself. Applications can set up work graphs of jobs
across multiple GPU engines without needing trips through the CPU.

UMDs directly interface with firmware via per-application shared memory areas.
The main vehicle for this is the queue. A queue is a ring buffer with a read
pointer (rptr) and a write pointer (wptr). The UMD writes IP-specific packets
into the queue and the firmware processes those packets, kicking off work on the
GPU engines. The CPU in the application (or another queue or device) updates
the wptr to tell the firmware how far into the ring buffer to process packets,
and the rptr provides feedback to the UMD on how far the firmware has progressed
in executing those packets. When the wptr and the rptr are equal, the queue is
idle.

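The rptr/wptr relationship can be sketched as follows. This is a minimal
illustration; the struct and helper names are invented for this sketch and are
not part of any real UMD or kernel interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal sketch of user queue ring state. In a real queue the rptr is
 * shadowed to userspace by the firmware and the wptr is fetched by the
 * firmware from a shared buffer; here both are plain fields. */
struct user_queue {
    uint64_t rptr; /* how far the firmware has progressed */
    uint64_t wptr; /* how far the UMD has written packets */
};

/* The queue is idle when the firmware has consumed everything written. */
static bool user_queue_is_idle(const struct user_queue *q)
{
    return q->rptr == q->wptr;
}

/* The UMD advances the wptr after writing packets; it would then ring
 * the doorbell so the firmware fetches the new wptr. */
static void user_queue_advance_wptr(struct user_queue *q, uint64_t ndwords)
{
    q->wptr += ndwords;
}
```
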
Theory of Operation
===================

The various engines on modern AMD GPUs support multiple queues per engine, with
scheduling firmware that handles dynamically scheduling user queues on the
available hardware queue slots. When the number of user queues outnumbers the
available hardware queue slots, the scheduling firmware dynamically maps and
unmaps queues based on priority and time quanta. The state of each user queue
is managed in the kernel driver in an MQD (Memory Queue Descriptor). This is a
buffer in GPU accessible memory that stores the state of a user queue. The
scheduling firmware uses the MQD to load the queue state into an HQD (Hardware
Queue Descriptor) when a user queue is mapped. Each user queue requires a
number of additional buffers which represent the ring buffer and any metadata
needed by the engine for runtime operation. On most engines this consists of
the ring buffer itself, a rptr buffer (where the firmware will shadow the rptr
to userspace), a wptr buffer (where the application will write the wptr for the
firmware to fetch it), and a doorbell. A doorbell is a piece of one of the
device's MMIO BARs which can be mapped to specific user queues. When the
application writes to the doorbell, it signals the firmware to take some
action. Writing to the doorbell wakes the firmware and causes it to fetch the
wptr and start processing the packets in the queue. Each 4K page of the doorbell
BAR supports specific offset ranges for specific engines. The doorbell of a
queue must be mapped into the aperture aligned to the IP used by the queue
(e.g., GFX, VCN, SDMA, etc.). These doorbell apertures are set up via NBIO
registers. Doorbells are 32 bit or 64 bit (depending on the engine) chunks of
the doorbell BAR. A 4K doorbell page provides 512 64-bit doorbells for up to
512 user queues. A subset of each page is reserved for each IP type supported
on the device. The user can query the doorbell ranges for each IP via the INFO
IOCTL. See the IOCTL Interfaces section for more information.

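The doorbell page arithmetic works out as below. The constants and helper are
purely illustrative; the actual per-IP offset ranges must be queried via the
INFO IOCTL.

```c
#include <stdint.h>

/* A 4K doorbell page holds 512 64-bit doorbells (or 1024 32-bit ones,
 * depending on the engine). The byte offset of a doorbell within its
 * page is simply the index times the doorbell size. */
#define DOORBELL_PAGE_SIZE  4096u
#define DOORBELL64_COUNT    (DOORBELL_PAGE_SIZE / sizeof(uint64_t))
#define DOORBELL32_COUNT    (DOORBELL_PAGE_SIZE / sizeof(uint32_t))

/* Byte offset of 64-bit doorbell 'index' within its 4K page. */
static uint32_t doorbell64_offset(uint32_t index)
{
    return index * (uint32_t)sizeof(uint64_t);
}
```
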
When an application wants to create a user queue, it allocates the necessary
buffers for the queue (ring buffer, wptr and rptr, context save areas, etc.).
These can be separate buffers or all part of one larger buffer. The application
would map the buffer(s) into its GPUVM and use the GPU virtual addresses for
the areas of memory it wants to use for the user queue. It would also
allocate a doorbell page for the doorbells used by the user queues. The
application would then populate the MQD in the USERQ IOCTL structure with the
GPU virtual addresses and doorbell index it wants to use. The user can also
specify the attributes for the user queue (priority, whether the queue is secure
for protected content, etc.). The application would then call the USERQ
CREATE IOCTL to create the queue using the specified MQD details in the IOCTL.
The kernel driver then validates the MQD provided by the application and
translates the MQD into the engine specific MQD format for the IP. The IP
specific MQD would be allocated and the queue would be added to the run list
maintained by the scheduling firmware. Once the queue has been created, the
application can write packets directly into the queue, update the wptr, and
write to the doorbell offset to kick off work in the user queue.

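The MQD details an application hands to the USERQ CREATE IOCTL can be sketched
with a hypothetical descriptor structure. The field names and layout below are
invented for illustration and do not match the actual uAPI headers.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical MQD-like description passed to the USERQ CREATE IOCTL.
 * All names here are placeholders, not the real uAPI layout. */
struct userq_mqd_desc {
    uint32_t ip_type;        /* engine type: GFX, compute, SDMA, ... */
    uint32_t doorbell_index; /* index into the allocated doorbell page */
    uint64_t ring_va;        /* GPU VA of the ring buffer */
    uint64_t ring_size;      /* ring buffer size in bytes */
    uint64_t rptr_va;        /* GPU VA the firmware shadows the rptr to */
    uint64_t wptr_va;        /* GPU VA the firmware fetches the wptr from */
    uint32_t flags;          /* priority, secure queue, etc. */
};

/* Populate the descriptor with the GPU VAs the application mapped into
 * its GPUVM; the kernel would validate these before accepting them. */
static void userq_desc_init(struct userq_mqd_desc *d, uint32_t ip,
                            uint64_t ring_va, uint64_t ring_size,
                            uint64_t rptr_va, uint64_t wptr_va,
                            uint32_t doorbell_index)
{
    memset(d, 0, sizeof(*d));
    d->ip_type = ip;
    d->ring_va = ring_va;
    d->ring_size = ring_size;
    d->rptr_va = rptr_va;
    d->wptr_va = wptr_va;
    d->doorbell_index = doorbell_index;
}
```
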
When the application is done with the user queue, it would call the USERQ
FREE IOCTL to destroy it. The kernel driver would preempt the queue and
remove it from the scheduling firmware's run list. Then the IP specific MQD
would be freed and the user queue state would be cleaned up.

Some engines may also require an aggregated doorbell if the engine does not
support doorbells from unmapped queues. The aggregated doorbell is a special
page of doorbell space which wakes the scheduler. In cases where the engine may
be oversubscribed, some queues may not be mapped. If the doorbell is rung when
the queue is not mapped, the engine firmware may miss the request. Some
scheduling firmware may work around this by polling wptr shadows when the
hardware is oversubscribed; other engines may support doorbell updates from
unmapped queues. In the event that neither option is available, the
kernel driver will map a page of aggregated doorbell space into each GPUVM
space. The UMD will then update the doorbell and wptr as normal and then write
to the aggregated doorbell as well.

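The oversubscribed submit path can be sketched roughly as below. The pointers
stand in for the mapped wptr buffer and doorbell MMIO, and the function name is
illustrative; this is not a real UMD API.

```c
#include <stdint.h>

/* Ring a user queue when the engine may be oversubscribed: publish the
 * new wptr, ring the queue's own doorbell, then ring the aggregated
 * doorbell so the scheduler wakes even if this queue is unmapped. */
static void userq_ring(volatile uint64_t *wptr_shadow,
                       volatile uint64_t *doorbell,
                       volatile uint64_t *agg_doorbell,
                       uint64_t new_wptr)
{
    *wptr_shadow  = new_wptr; /* firmware fetches this once awake */
    *doorbell     = new_wptr; /* wakes the firmware if the queue is mapped */
    *agg_doorbell = new_wptr; /* wakes the scheduler regardless */
}
```
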
Special Packets
---------------

In order to support legacy implicit synchronization, as well as mixed user and
kernel queues, we need a synchronization mechanism that is secure. Because
kernel queues and memory management tasks depend on kernel fences, we need a way
for user queues to update memory that the kernel can use for a fence and that
can't be tampered with by a bad actor. To support this, we've added a protected
fence packet. This packet works by writing a monotonically increasing value to
a memory location that only privileged clients have write access to; user
queues only have read access. When this packet is executed, the memory location
is updated and other queues (kernel or user) can see the results. The
user application would submit this packet in its command stream. The actual
packet format varies from IP to IP (GFX/Compute, SDMA, VCN, etc.), but the
behavior is the same. The packet submission is handled in userspace. The
kernel driver sets up the privileged memory for each user queue when the
application creates it.

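The protected fence model reduces to a monotonic value in privileged memory. A
minimal sketch, with an invented helper name, of how a consumer would test a
fence point:

```c
#include <stdbool.h>
#include <stdint.h>

/* The protected fence packet writes a monotonically increasing value to
 * memory that only privileged clients can write; a fence point is
 * considered signaled once the stored value reaches its sequence
 * number. User queues can read, but not forge, this value. */
static bool protected_fence_signaled(const volatile uint64_t *fence_mem,
                                     uint64_t seqno)
{
    return *fence_mem >= seqno;
}
```
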
Memory Management
=================

It is assumed that all buffers mapped into the GPUVM space for the process are
valid when engines on the GPU are running. The kernel driver will only allow
user queues to run when all buffers are mapped. If there is a memory event that
requires buffer migration, the kernel driver will preempt the user queues,
migrate the buffers to where they need to be, update the GPUVM page tables,
invalidate the TLB, and then resume the user queues.

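The ordering of that sequence matters: queues must be fully preempted before
any buffer moves, and resumed only after the page tables and TLB are consistent
again. A sketch under that assumption, where every function name is an
illustrative stub rather than a real driver internal:

```c
/* Illustrative stand-in for a per-process driver context. */
struct process_ctx { int id; };

/* Stubs standing in for the kernel driver's internals. */
static int preempt_user_queues(struct process_ctx *p) { (void)p; return 0; }
static int migrate_buffers(struct process_ctx *p)     { (void)p; return 0; }
static int update_gpuvm_and_flush_tlb(struct process_ctx *p)
{
    (void)p;
    return 0;
}
static int resume_user_queues(struct process_ctx *p)  { (void)p; return 0; }

/* Handle a memory event: stop the queues, move the buffers, make the
 * page tables and TLB consistent, then let the queues run again. */
static int handle_memory_event(struct process_ctx *p)
{
    int r;

    if ((r = preempt_user_queues(p)))
        return r;
    if ((r = migrate_buffers(p)))
        return r;
    if ((r = update_gpuvm_and_flush_tlb(p)))
        return r;
    return resume_user_queues(p);
}
```
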
Interaction with Kernel Queues
==============================

Depending on the IP and the scheduling firmware, you can enable kernel queues
and user queues at the same time; however, you are limited by the available HQD
slots. Kernel queues are always mapped, so any work that goes into kernel
queues will take priority. This limits the available HQD slots for user queues.

Not all IPs will support user queues on all GPUs. As such, UMDs will need to
support both user queues and kernel queues depending on the IP. For example, a
GPU may support user queues for GFX, compute, and SDMA, but not for VCN, JPEG,
and VPE. The kernel driver provides a way to determine whether user queues,
kernel queues, or both are supported on a per-IP basis. UMDs can query this
information via the INFO IOCTL and determine whether to use kernel queues or
user queues for each IP.

Queue Resets
============

On most engines (GFX, compute, and SDMA), queues can be reset individually.
When a hung queue is detected, it can be reset either via the scheduling
firmware or via MMIO. Since there are no kernel fences for most user queues,
hangs will usually only be detected when some other event happens; e.g., a
memory event which requires migration of buffers. When the queues are
preempted, if a queue is hung, the preemption will fail. The driver will then
look up the queues that failed to preempt, reset them, and record which queues
are hung.

On the UMD side, we will add a USERQ QUERY_STATUS IOCTL to query the queue
status. The UMD will provide the queue id in the IOCTL and the kernel driver
will check if it has already recorded the queue as hung (e.g., due to a failed
preemption) and report back the status.

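Since QUERY_STATUS is still being defined, the exchange can only be sketched
hypothetically. The structure and flag names below are invented for
illustration and do not correspond to the actual uAPI.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status flags the kernel might report for a queue. */
#define USERQ_STATUS_HUNG  (1u << 0)
#define USERQ_STATUS_RESET (1u << 1)

/* Hypothetical QUERY_STATUS payload: the UMD supplies the queue id and
 * the kernel fills in the status flags it has recorded for that queue. */
struct userq_query_status {
    uint32_t queue_id; /* in: queue to query */
    uint32_t status;   /* out: status flags filled by the kernel */
};

/* A UMD would recreate the queue if it was hung or has been reset. */
static bool userq_needs_recreate(const struct userq_query_status *q)
{
    return (q->status & (USERQ_STATUS_HUNG | USERQ_STATUS_RESET)) != 0;
}
```
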
IOCTL Interfaces
================

GPU virtual addresses used for queues and related data (rptrs, wptrs, context
save areas, etc.) should be validated by the kernel mode driver to prevent the
user from specifying invalid GPU virtual addresses. If the user provides
invalid GPU virtual addresses or doorbell indices, the IOCTL should return an
error. These buffers should also be tracked in the kernel driver so
that if the user attempts to unmap the buffer(s) from the GPUVM, the unmap call
would return an error.

INFO
----
There are several new INFO queries related to user queues. They report the
size of the metadata needed for a user queue (e.g., context save areas or
shadow buffers), whether kernel queues, user queues, or both are supported
for each IP type, and the offsets for each IP type in each doorbell page.

USERQ
-----
The USERQ IOCTL is used for creating, freeing, and querying the status of user
queues. It supports 3 opcodes:

1. CREATE - Create a user queue. The application provides an MQD-like structure
   that defines the type of queue and associated metadata and flags for that
   queue type. Returns the queue id.
2. FREE - Free a user queue.
3. QUERY_STATUS - Query the status of a queue. Used to check whether the queue
   is healthy; e.g., whether it has been reset. (WIP)

USERQ_SIGNAL
------------
The USERQ_SIGNAL IOCTL is used to provide a list of sync objects to be signaled.

USERQ_WAIT
----------
The USERQ_WAIT IOCTL is used to provide a list of sync objects to be waited on.

Kernel and User Queues
======================

In order to properly validate and test performance, we have a driver option to
select which types of queues are enabled (kernel queues, user queues, or both).
The user_queue driver parameter allows you to enable kernel queues only (0),
user queues and kernel queues (1), or user queues only (2). Enabling user
queues only will free up static queue assignments that would otherwise be used
by kernel queues for use by the scheduling firmware. Some kernel queues are
required for kernel driver operation, and they will always be created. When
kernel queues are not enabled, they are not registered with the drm scheduler
and the CS IOCTL will reject any incoming command submissions which target those
queue types. Kernel queues only (0) mirrors the behavior of all existing GPUs.
Enabling both queue types allows for backwards compatibility with old userspace
while still supporting user queues.