Merged
42 changes: 24 additions & 18 deletions layer_gpu_profile/README_LAYER.md
@@ -6,30 +6,33 @@ counters for selected frames running on an Arm GPU.
 ## What devices are supported?

 This layer requires Vulkan 1.0 and an Arm GPU because it uses an Arm-specific
-counter sampling library.
+performance counter sampling library.

 ## What data can be collected?

 The layer serializes workloads for instrumented frames and injects counter
-samples between them, allowing the layer to measure the hardware cost of
-render passes, compute dispatches, transfers, etc.
+samples between them, allowing the layer to measure the hardware metrics for
+Vulkan render passes, compute dispatches, transfers, etc.

 The serialization is very invasive to wall-clock performance, due to removal
 of pipeline overlap between workloads and additional GPU idle time waiting for
-the layer to performs each performance counter sampling operation. This will
+the layer to perform each performance counter sampling operation. This will
 have an impact on the counter data being captured!

 Derived counters that show queue and functional unit utilization as a
-percentage of the overall "active" time of their parent block will report low
-because of time spent refilling and then draining the GPU pipeline between
-workloads. The overall _GPU Active Cycles_ counter is known to be unreliable,
-because the serialization means that command stream setup and teardown costs
-are not hidden in the shadow of surrounding work. We recommend using the
-individual queue active cycles counters as the main measure of performance.
+percentage of the overall "active" time of their parent block will report
+too low. This is because of time spent refilling and then draining the GPU
+pipeline for each serialized workload.

-Note that any counter that measure direct work, such as architectural issue
-cycles, or workload nouns, such as primitives or threads, should be unaffected
-by the loss of pipelining.
+The overall _GPU Active Cycles_ counter is also known to be unreliable, because
+the serialization means that command stream setup and teardown costs are not
+hidden in the shadow of surrounding work. We recommend using the individual
+queue active cycles counters as the main measure of workload cost.
+
+Note that any counters that measure direct work, such as architectural issue
+cycles, or identifiable workload nouns, such as primitives or threads, should
+be unaffected by the loss of pipelining as the workload itself is functionally
+unaffected by the addition of serialization.

 Arm GPUs provide a wide range of performance counters covering many different
 aspects of hardware performance. The layer will collect a standard set of
@@ -44,11 +47,14 @@ hardware counters and derived expressions supported by the
 The GPU idle time waiting for the CPU to take a counter sample can cause the
 system DVFS power governor to decide that the GPU is not busy. In production
 devices we commonly see that the GPU will be down-clocked during the
-instrumented frame, which may have an impact on a subset of the available
-performance counters.
-
-When running on a pre-production device we recommend pinning CPU, GPU, and bus
-clock speeds to avoid the performance instability.
+instrumented frame, which may have an impact on a some of the available
+performance counters. For example, GPU memory latency may appear lower than
+normal if the reduction in GPU clock makes the memory system look faster in
+comparison.
+
+When running on a pre-production device you can minimize the impacts of these
+effects by pinning CPU, GPU, and bus clock frequencies. This is not usually
+possible on a production device.

 ## How do I use the layer?
