|
| 1 | +# Layer: GPU Profile |
| 2 | + |
| 3 | +This layer is a frame profiler that can capture per-workload performance |
| 4 | +counters for selected frames running on an Arm GPU. |
| 5 | + |
| 6 | +## What devices are supported? |
| 7 | + |
| 8 | +This layer requires Vulkan 1.0 and an Arm GPU because it uses an Arm-specific |
| 9 | +counter sampling library. |
| 10 | + |
| 11 | +## What data can be collected? |
| 12 | + |
| 13 | +The layer serializes workloads for instrumented frames and injects counter |
| 14 | +samples between them, allowing the layer to measure the hardware cost of |
| 15 | +render passes, compute dispatches, transfers, etc. |
| 16 | + |
| 17 | +The serialization is very invasive to wall-clock performance, due to removal |
| 18 | +of pipeline overlap between workloads and additional GPU idle time waiting for |
| 19 | +the layer to perform each performance counter sampling operation. This will |
| 20 | +have an impact on the counter data being captured! |
| 21 | + |
| 22 | +Derived counters that show queue and functional unit utilization as a |
| 23 | +percentage of the overall "active" time of their parent block will report low |
| 24 | +because of time spent refilling and then draining the GPU pipeline between |
| 25 | +workloads. The overall _GPU Active Cycles_ counter is known to be unreliable, |
| 26 | +because the serialization means that command stream setup and teardown costs |
| 27 | +are not hidden in the shadow of surrounding work. We recommend using the |
| 28 | +individual queue active cycles counters as the main measure of performance. |
| 29 | + |
| 30 | +Note that any counters that measure direct work, such as architectural issue |
| 31 | +cycles, or workload nouns, such as primitives or threads, should be unaffected |
| 32 | +by the loss of pipelining. |
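
As a rough illustration of why percentage-style derived counters read low under serialization, the sketch below computes a utilization percentage with and without pipeline refill and drain overhead. The function name and every cycle count are hypothetical values chosen for illustration; they are not output from the layer or from libGPUCounters.

```python
def utilization_percent(unit_busy_cycles: int, parent_active_cycles: int) -> float:
    """Return unit busy time as a percentage of its parent block's active time."""
    if parent_active_cycles <= 0:
        return 0.0
    return 100.0 * unit_busy_cycles / parent_active_cycles


# Hypothetical numbers: the unit does the same 800 cycles of work in both cases.
busy = 800

# Pipelined: refill and drain overlap with neighboring workloads.
pipelined = utilization_percent(busy, 1000)

# Serialized: assume 600 extra active cycles spent refilling and draining.
serialized = utilization_percent(busy, 1000 + 600)

print(f"pipelined: {pipelined:.1f}%, serialized: {serialized:.1f}%")
```

The amount of work is the same in both cases; only the denominator grows, which is why counters that measure work directly are the safer comparison.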
| 33 | + |
| 34 | +Arm GPUs provide a wide range of performance counters covering many different |
| 35 | +aspects of hardware performance. The layer will collect a standard set of |
| 36 | +counters by default but, with source modification, can collect any of the |
| 37 | +hardware counters and derived expressions supported by the |
| 38 | +[libGPUCounters][LGC] library that Arm provides on GitHub. |
| 39 | + |
| 40 | +[LGC]: https://github.com/ARM-software/libGPUCounters |
| 41 | + |
| 42 | +### GPU clock frequency impact |
| 43 | + |
| 44 | +The GPU idle time waiting for the CPU to take a counter sample can cause the |
| 45 | +system DVFS power governor to decide that the GPU is not busy. On production |
| 46 | +devices we commonly see that the GPU will be down-clocked during the |
| 47 | +instrumented frame, which may have an impact on a subset of the available |
| 48 | +performance counters. |
| 49 | + |
| 50 | +When running on a pre-production device we recommend pinning CPU, GPU, and bus |
| 51 | +clock speeds to avoid the performance instability. |
| 52 | + |
| 53 | +## How do I use the layer? |
| 54 | + |
| 55 | +### Prerequisites |
| 56 | + |
| 57 | +Device setup steps: |
| 58 | + |
| 59 | +* Ensure your Android device is in developer mode, with `adb` support enabled |
| 60 | + in developer settings. |
| 61 | +* Ensure the Android device is connected to your development workstation, and |
| 62 | + visible to `adb` with an authorized debug connection. |
| 63 | + |
| 64 | +Application setup steps: |
| 65 | + |
| 66 | +* Build a debuggable build of your application and install it on the Android |
| 67 | + device. |
| 68 | + |
| 69 | +Tooling setup steps: |
| 70 | + |
| 71 | +* Install the Android platform tools and ensure `adb` is on your `PATH` |
| 72 | + environment variable. |
| 73 | +* Install the Android NDK and set the `ANDROID_NDK_HOME` environment variable |
| 74 | + to its installation path. |
| 75 | + |
| 76 | +### Layer build |
| 77 | + |
| 78 | +Build the Profile layer for Android using the provided build script, or using |
| 79 | +equivalent manual commands, from the `layer_gpu_profile` directory. For full |
| 80 | +instructions see the _Build an Android layer_ and _Build a Linux layer_ |
| 81 | +sections in the [Build documentation](../docs/building.md). |
| 82 | + |
| 83 | +### Running using the layer |
| 84 | + |
| 85 | +You can configure a device to run a profile by using the Android helper utility |
| 86 | +found in the root directory to configure the layer and manage the application. |
| 87 | +You must enable the profile layer, and provide a configuration file to |
| 88 | +parameterize it. |
| 89 | + |
| 90 | +```sh |
| 91 | +python3 lgl_android_install.py --layer layer_gpu_profile --config <your.json> --profile <out_dir> |
| 92 | +``` |
| 93 | + |
| 94 | +The [`layer_config.json`](layer_config.json) file in this directory is a |
| 95 | +template configuration file you can start from. It defaults to periodic |
| 96 | +sampling every 600 frames, but you can modify this to suit your needs. |
| 97 | + |
| 98 | +The `--profile` option specifies an output directory on the host to contain |
| 99 | +the CSV files written by the tool. One CSV is written for each frame, each CSV |
| 100 | +containing a table with one row per workload profiled in the frame, listed |
| 101 | +in API submit order. |
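
A minimal sketch of how the per-frame CSV files might be loaded for analysis on the host. It assumes each CSV starts with a header row and makes no assumption about the file naming scheme or the column names, which are not described here; `load_profile` is a hypothetical helper, not part of the tooling.

```python
import csv
from pathlib import Path


def load_profile(out_dir: str) -> dict:
    """Map each CSV file name in out_dir to its list of workload rows."""
    frames = {}
    for csv_path in sorted(Path(out_dir).glob("*.csv")):
        with csv_path.open(newline="") as handle:
            # DictReader takes the column names from the header row, so each
            # row becomes a dict keyed by whatever columns the file contains.
            frames[csv_path.name] = list(csv.DictReader(handle))
    return frames


def summarize(out_dir: str) -> list:
    """Return one 'file: N workloads' line per CSV file found."""
    return [f"{name}: {len(rows)} workloads"
            for name, rows in load_profile(out_dir).items()]
```

Calling `summarize("<out_dir>")` with the directory passed to `--profile` would list how many workloads were captured in each file. Rows keep the order in which they appear in the file.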
| 102 | + |
| 103 | +The Android helper utility contains many other options for configuring the |
| 104 | +application under test and the capture process. For full instructions see the |
| 105 | +[Running on Android documentation](../docs/running_android.md). |
| 106 | + |
| 107 | +## Layer configuration |
| 108 | + |
| 109 | +The current layer supports two `sampling_mode` values: |
| 110 | + |
| 111 | +* `periodic_frame`: Sample every N frames. |
| 112 | +* `frame_list`: Sample specific frames. |
| 113 | + |
| 114 | +When `sampling_mode` is `periodic_frame` the integer value of the `periodic_frame` key |
| 115 | +defines the frame sampling period. The integer value of the |
| 116 | +`periodic_min_frame` key defines the first possible frame that could be |
| 117 | +profiled, allowing profiles to skip over any loading frames. By default frame 0 |
| 118 | +is ignored. |
| 119 | + |
| 120 | +When `sampling_mode` is `frame_list` the value of the `frame_list` key defines a list |
| 121 | +of integers giving the specific frames to capture. |
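
The sketch below shows one way to generate the sampling keys described above and to reason about which frames they would select. The key names are taken from this section; check them against the `layer_config.json` template and keep any other keys the template contains. The helper functions are hypothetical, and the selection rule in `is_frame_sampled` is an assumption about how the period and minimum frame interact, not a description of the layer's implementation.

```python
import json


def make_periodic_config(period: int = 600, min_frame: int = 1) -> dict:
    """Build the sampling keys for periodic sampling."""
    return {
        "sampling_mode": "periodic_frame",
        "periodic_frame": period,
        "periodic_min_frame": min_frame,
    }


def make_frame_list_config(frames: list) -> dict:
    """Build the sampling keys for capturing specific frames."""
    return {
        "sampling_mode": "frame_list",
        "frame_list": sorted(frames),
    }


def is_frame_sampled(config: dict, frame: int) -> bool:
    """Assumed selection rule; the layer's real logic may differ."""
    if config["sampling_mode"] == "frame_list":
        return frame in config["frame_list"]
    # Assumption: frames at or after the minimum frame are sampled when the
    # frame index is a multiple of the sampling period.
    return (frame >= config["periodic_min_frame"]
            and frame % config["periodic_frame"] == 0)


# Print the keys as JSON, ready to merge into a copy of the template.
print(json.dumps(make_periodic_config(period=600, min_frame=1), indent=4))
```

Under that assumed rule, a period of 600 with a minimum frame of 1 skips frame 0 and first samples frame 600.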
| 122 | + |
| 123 | +## Layer counters |
| 124 | + |
| 125 | +The current layer uses a hard-coded set of performance counters defined in the |
| 126 | +`Device` class constructor. If you wish to collect different counters you must |
| 127 | +edit the [Device source](./source/device.cpp) and rebuild the layer. |
| 128 | + |
| 129 | +Any counters that are specified but that are not available on the current GPU |
| 130 | +will be ignored. |
| 131 | + |
| 132 | +- - - |
| 133 | + |
| 134 | +_Copyright © 2025, Arm Limited and contributors._ |