# Layer: GPU Profile - Developer Documentation

This layer is used to profile Arm GPUs, providing API correlated performance
data. This page provides documentation for developers working on creating and
maintaining the layer.

## Measuring performance

Arm GPUs can run multiple workloads in parallel, if the application pipeline
barriers allow it. This is good for overall frame performance, but it makes
profiling data messy due to cross-talk between unrelated workloads.

For profiling we therefore inject serialization points between workloads to
ensure that data corresponds to a single workload. Note that we can only
serialize within our own application process, so data could still be perturbed
by other processes using the GPU.

### Sampling performance counters

This layer will sample performance counters between each workload but, because
sampling is a CPU-side operation, it must trap back to the CPU to make the
counter sample. The correct way to implement this in Vulkan is to split the
application command buffer into multiple command buffers, each containing a
single workload. However, rewriting the command stream like this is expensive
in terms of CPU overhead caused by the state tracking.

Instead we rely on an undocumented extension supported by Arm GPUs which
allows the CPU to set/wait on events in a submitted but not complete command
buffer. The layer injects a `vkCmdSetEvent(A)` and `vkCmdWaitEvents(B)` pair
between each workload, and then has the reverse `vkWaitEvent(A)` and
`vkSetEvent(B)` pair on the CPU side. The counter sample can be inserted
in between the two CPU-side operations. Note that there is no blocking wait on
an event for the CPU, so `vkWaitEvent()` is really a polling loop around
`vkGetEventStatus()`.

```mermaid
sequenceDiagram
    actor CPU
    actor GPU
    CPU->>CPU: vkGetEventStatus(A)
    Note over GPU: Run workload
    GPU->>CPU: vkCmdSetEvent(A)
    GPU->>GPU: vkCmdWaitEvents(B)
    Note over CPU: Take sample
    CPU->>GPU: vkSetEvent(B)
    Note over GPU: Start next workload
```

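As a sketch of how this handover maps onto the standard Vulkan event API
(the helper names, stage masks, and backoff policy here are assumptions, not
the layer's actual code; the host-side set/wait on a submitted command buffer
relies on the undocumented Arm behavior described above):

```c
#include <stddef.h>
#include <vulkan/vulkan.h>

// Hypothetical: the layer's counter sampling routine.
extern void sample_counters(void);

// GPU side: recorded into the command buffer between two workloads.
static void inject_gpu_handover(VkCommandBuffer cmd, VkEvent eventA, VkEvent eventB)
{
    // Tell the CPU that the preceding workload has finished.
    vkCmdSetEvent(cmd, eventA, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);

    // Park the GPU until the CPU signals B; a host-signaled event must be
    // waited on with the host pseudo-stage as its source.
    vkCmdWaitEvents(cmd, 1, &eventB,
                    VK_PIPELINE_STAGE_HOST_BIT,
                    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                    0, NULL, 0, NULL, 0, NULL);
}

// CPU side: runs on the layer's worker thread at each workload boundary.
static void cpu_handover(VkDevice device, VkEvent eventA, VkEvent eventB)
{
    // No blocking host-side wait exists, so poll until the GPU-side
    // vkCmdSetEvent(A) has executed.
    while (vkGetEventStatus(device, eventA) != VK_EVENT_SET)
    {
        // Yield or back off here to avoid burning a CPU core.
    }

    // Reset A ready for the next workload boundary.
    vkResetEvent(device, eventA);

    // Sample the performance counters while the GPU is parked on B.
    sample_counters();

    // Release the GPU-side vkCmdWaitEvents(B).
    vkSetEvent(device, eventB);
}
```
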
### Performance implications

Serializing workloads usually means that individual workloads will run with
lower completion latency, because they are no longer contending for resources.
However, loss of overlap means that overall frame latency will increase.

In addition, serializing workloads and then trapping back to the CPU to
sample performance counters will cause the GPU to go idle waiting for the CPU
to complete the counter sample. This makes the GPU appear underutilized to the
system DVFS governor, which may subsequently decide to reduce the GPU clock
frequency. On pre-production devices we recommend locking CPU, GPU and memory
clock frequencies to avoid this problem.

```mermaid
---
displayMode: compact
---
gantt
    dateFormat x
    axisFormat %Lms
    section CPU
        Sample : a1, 0, 2ms
        Sample : a2, after w1, 2ms
    section GPU
        Workload 1 : w1, after a1, 10ms
        Workload 2 : w2, after a2, 10ms
```

## Software architecture

The basic architecture for this layer is an extension of the timeline layer,
using a layer command stream (LCS) recorded alongside each command buffer to
define the software operations that the layer needs to perform.

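The real LCS encoding is defined by the timeline layer implementation; purely
as an illustration of the idea, each entry might pair an operation with the
workload it applies to (all names here are hypothetical):

```c
#include <stdint.h>

// Illustration only: the kind of software operation an LCS entry might
// record for the layer to replay when processing the command buffer.
typedef enum lcs_opcode
{
    LCS_OP_WORKLOAD_BEGIN,  // A render pass, dispatch, or transfer begins
    LCS_OP_WORKLOAD_END,    // ... and ends; a counter sample goes here
    LCS_OP_LABEL_PUSH,      // A debug label was pushed
    LCS_OP_LABEL_POP        // A debug label was popped
} lcs_opcode;

typedef struct lcs_entry
{
    lcs_opcode op;        // What the layer must do at this point
    uint64_t workloadId;  // Which workload the operation belongs to
} lcs_entry;
```
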
Unlike the timeline layer, which only performs operations synchronously at
submit time, this layer also needs to perform asynchronous sampling operations
associated with each workload after a command buffer has been submitted. To
support this approach the layer tracks the number of workloads submitted
in each command buffer and their debug labels, and hands this over to an
async handler to process as the workloads complete.

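For example, the per-command-buffer state handed to the async handler could be
as simple as the following (all names are hypothetical):

```c
#include <stdint.h>

// Hypothetical per-command-buffer tracking state. One record is appended
// per workload as it is recorded, and the whole list is handed to the
// async handler when the command buffer is submitted.
typedef struct workload_record
{
    uint64_t workloadId;     // Monotonic ID, stable from submit to sample
    const char* debugLabel;  // Innermost debug label at record time
} workload_record;

typedef struct command_buffer_state
{
    uint32_t workloadCount;      // Number of workloads recorded
    workload_record* workloads;  // One record per workload, in order
} command_buffer_state;
```
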
To ensure that the async worker gets a predictable workload stream to
instrument, all Vulkan queue submits are serialized on the GPU. As with the
support layer, queue serialization may cause an application to hang if the
application submits command buffers that rely on out-of-order execution to
unblock commands in a submitted command stream. This is only possible if
applications are using timeline semaphores, which allow earlier submits to
depend on a later submit to make forward progress.

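A minimal sketch of the serialization, assuming a single layer-owned timeline
semaphore whose value advances once per submit (the wrapper is hypothetical,
and merging the application's own semaphores is elided):

```c
#include <stddef.h>
#include <vulkan/vulkan.h>

// Layer-owned state: one timeline semaphore shared by every queue, whose
// value advances by one for each submit.
static VkSemaphore g_serializeTimeline;
static uint64_t g_submitCounter;

// Hypothetical wrapper: make submit N wait on the GPU for submit N-1.
// Merging the application's own wait/signal semaphores and pNext chain is
// elided for brevity; a real layer must preserve both.
static VkResult serialized_queue_submit(
    VkQueue queue, const VkSubmitInfo* appSubmit, VkFence fence)
{
    uint64_t waitValue = g_submitCounter;      // Previous submit's value
    uint64_t signalValue = ++g_submitCounter;  // This submit's value

    VkTimelineSemaphoreSubmitInfo timelineInfo = {
        .sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO,
        .waitSemaphoreValueCount = 1,
        .pWaitSemaphoreValues = &waitValue,
        .signalSemaphoreValueCount = 1,
        .pSignalSemaphoreValues = &signalValue,
    };

    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT;

    VkSubmitInfo submit = *appSubmit;
    submit.pNext = &timelineInfo;
    submit.waitSemaphoreCount = 1;
    submit.pWaitSemaphores = &g_serializeTimeline;
    submit.pWaitDstStageMask = &waitStage;
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores = &g_serializeTimeline;

    // A real layer would call down the dispatch chain here.
    return vkQueueSubmit(queue, 1, &submit, fence);
}
```

The first submit waits on timeline value 0, which is already satisfied, so
the chain starts without any extra bootstrapping.
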
## Event handling

To implement this functionality, the layer allocates three additional sync
primitives:

* A timeline semaphore is allocated to implement queue serialization.
* Two events are allocated to support the CPU<->GPU handover for counter
  sampling. These events are reset and reused for all counter samples to avoid
  managing many different events.

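Creating these primitives is standard Vulkan; a sketch, with error handling
elided:

```c
#include <stddef.h>
#include <vulkan/vulkan.h>

// Create the layer's sync primitives: a timeline semaphore for queue
// serialization, and two events for the CPU<->GPU handover.
static void create_layer_sync_primitives(
    VkDevice device, VkSemaphore* timeline, VkEvent* eventA, VkEvent* eventB)
{
    VkSemaphoreTypeCreateInfo typeInfo = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
        .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
        .initialValue = 0,
    };

    VkSemaphoreCreateInfo semaphoreInfo = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
        .pNext = &typeInfo,
    };

    vkCreateSemaphore(device, &semaphoreInfo, NULL, timeline);

    VkEventCreateInfo eventInfo = {
        .sType = VK_STRUCTURE_TYPE_EVENT_CREATE_INFO,
    };

    vkCreateEvent(device, &eventInfo, NULL, eventA);
    vkCreateEvent(device, &eventInfo, NULL, eventB);
}
```

The handover between the CPU and the GPU command stream then proceeds as
follows:
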
```c
CPU                       GPU
===                       ===
                          // Workload 1
                          vkCmdSetEvent(A)
// Spin test until set
vkGetEventStatus(A)
vkResetEvent(A)

// Sample counters

vkSetEvent(B)
                          // Block until set
                          vkCmdWaitEvents(B)
                          vkCmdResetEvent(B)

                          // Workload 2
```

Due to a buggy interaction between the counter sampling and power management
in some kernel driver versions, Valhall+CSF GPUs with a driver prior to r54p0
need a sleep after successfully waiting on event A and before sampling any
counters. Initial investigations suggest that the shortest reliable sleep is
3ms, so this is quite a high overhead for applications with many workloads,
and it should therefore be enabled conditionally, only for CSF GPUs with a
driver older than r54p0.

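A sketch of how the workaround might be gated, assuming the driver version has
already been queried and parsed at device creation time (the types and names
here are hypothetical):

```c
#include <stdbool.h>
#include <unistd.h>

// Hypothetical driver version record, parsed once at device creation.
typedef struct driver_version
{
    bool isCSF;   // GPU uses the CSF front-end (Valhall onwards)
    int release;  // The "r" number, e.g. 53 for an r53p0 driver
} driver_version;

// Apply the power-management workaround only where it is needed: after
// event A has been observed as set, and before sampling any counters.
static void pre_sample_workaround(const driver_version* version)
{
    if (version->isCSF && (version->release < 54))
    {
        // 3ms was the shortest reliable sleep in initial testing.
        usleep(3000);
    }
}
```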
|
- - -
_Copyright © 2025, Arm Limited and contributors._