diff --git a/layer_gpu_profile/README_LAYER.md b/layer_gpu_profile/README_LAYER.md
index b78f6cc..90b74b2 100644
--- a/layer_gpu_profile/README_LAYER.md
+++ b/layer_gpu_profile/README_LAYER.md
@@ -6,30 +6,33 @@ counters for selected frames running on an Arm GPU.
 ## What devices are supported?
 
 This layer requires Vulkan 1.0 and an Arm GPU because it uses an Arm-specific
-counter sampling library.
+performance counter sampling library.
 
 ## What data can be collected?
 
 The layer serializes workloads for instrumented frames and injects counter
-samples between them, allowing the layer to measure the hardware cost of
-render passes, compute dispatches, transfers, etc.
+samples between them, allowing the layer to measure the hardware metrics for
+Vulkan render passes, compute dispatches, transfers, etc.
 
 The serialization is very invasive to wall-clock performance, due to removal of
 pipeline overlap between workloads and additional GPU idle time waiting for
-the layer to performs each performance counter sampling operation. This will
+the layer to perform each performance counter sampling operation. This will
 have an impact on the counter data being captured!
 
 Derived counters that show queue and functional unit utilization as a
-percentage of the overall "active" time of their parent block will report low
-because of time spent refilling and then draining the GPU pipeline between
-workloads. The overall _GPU Active Cycles_ counter is known to be unreliable,
-because the serialization means that command stream setup and teardown costs
-are not hidden in the shadow of surrounding work. We recommend using the
-individual queue active cycles counters as the main measure of performance.
+percentage of the overall "active" time of their parent block will report
+too low. This is because of time spent refilling and then draining the GPU
+pipeline for each serialized workload.
 
-Note that any counter that measure direct work, such as architectural issue
-cycles, or workload nouns, such as primitives or threads, should be unaffected
-by the loss of pipelining.
+The overall _GPU Active Cycles_ counter is also known to be unreliable, because
+the serialization means that command stream setup and teardown costs are not
+hidden in the shadow of surrounding work. We recommend using the individual
+queue active cycles counters as the main measure of workload cost.
+
+Note that any counters that measure direct work, such as architectural issue
+cycles, or identifiable workload nouns, such as primitives or threads, should
+be unaffected by the loss of pipelining as the workload itself is functionally
+unaffected by the addition of serialization.
 
 Arm GPUs provide a wide range of performance counters covering many different
 aspects of hardware performance. The layer will collect a standard set of
@@ -44,11 +47,14 @@ hardware counters and derived expressions supported by the
 The GPU idle time waiting for the CPU to take a counter sample can cause the
 system DVFS power governor to decide that the GPU is not busy. In production
 devices we commonly see that the GPU will be down-clocked during the
-instrumented frame, which may have an impact on a subset of the available
-performance counters.
-
-When running on a pre-production device we recommend pinning CPU, GPU, and bus
-clock speeds to avoid the performance instability.
+instrumented frame, which may have an impact on some of the available
+performance counters. For example, GPU memory latency may appear lower than
+normal if the reduction in GPU clock makes the memory system look faster in
+comparison.
+
+When running on a pre-production device you can minimize the impacts of these
+effects by pinning CPU, GPU, and bus clock frequencies. This is not usually
+possible on a production device.
 
 ## How do I use the layer?
 