Commit e77c6c6

Profile layer: Improve readme (#135)
1 parent 02f5b26 commit e77c6c6

1 file changed: +24 -18 lines changed

layer_gpu_profile/README_LAYER.md

Lines changed: 24 additions & 18 deletions
@@ -6,30 +6,33 @@ counters for selected frames running on an Arm GPU.
 ## What devices are supported?
 
 This layer requires Vulkan 1.0 and an Arm GPU because it uses an Arm-specific
-counter sampling library.
+performance counter sampling library.
 
 ## What data can be collected?
 
 The layer serializes workloads for instrumented frames and injects counter
-samples between them, allowing the layer to measure the hardware cost of
-render passes, compute dispatches, transfers, etc.
+samples between them, allowing the layer to measure the hardware metrics for
+Vulkan render passes, compute dispatches, transfers, etc.
 
 The serialization is very invasive to wall-clock performance, due to removal
 of pipeline overlap between workloads and additional GPU idle time waiting for
-the layer to performs each performance counter sampling operation. This will
+the layer to perform each performance counter sampling operation. This will
 have an impact on the counter data being captured!
 
 Derived counters that show queue and functional unit utilization as a
-percentage of the overall "active" time of their parent block will report low
-because of time spent refilling and then draining the GPU pipeline between
-workloads. The overall _GPU Active Cycles_ counter is known to be unreliable,
-because the serialization means that command stream setup and teardown costs
-are not hidden in the shadow of surrounding work. We recommend using the
-individual queue active cycles counters as the main measure of performance.
+percentage of the overall "active" time of their parent block will report
+too low. This is because of time spent refilling and then draining the GPU
+pipeline for each serialized workload.
 
-Note that any counter that measure direct work, such as architectural issue
-cycles, or workload nouns, such as primitives or threads, should be unaffected
-by the loss of pipelining.
+The overall _GPU Active Cycles_ counter is also known to be unreliable, because
+the serialization means that command stream setup and teardown costs are not
+hidden in the shadow of surrounding work. We recommend using the individual
+queue active cycles counters as the main measure of workload cost.
+
+Note that any counters that measure direct work, such as architectural issue
+cycles, or identifiable workload nouns, such as primitives or threads, should
+be unaffected by the loss of pipelining as the workload itself is functionally
+unaffected by the addition of serialization.
 
 Arm GPUs provide a wide range of performance counters covering many different
 aspects of hardware performance. The layer will collect a standard set of
@@ -44,11 +47,14 @@ hardware counters and derived expressions supported by the
 The GPU idle time waiting for the CPU to take a counter sample can cause the
 system DVFS power governor to decide that the GPU is not busy. In production
 devices we commonly see that the GPU will be down-clocked during the
-instrumented frame, which may have an impact on a subset of the available
-performance counters.
-
-When running on a pre-production device we recommend pinning CPU, GPU, and bus
-clock speeds to avoid the performance instability.
+instrumented frame, which may have an impact on a some of the available
+performance counters. For example, GPU memory latency may appear lower than
+normal if the reduction in GPU clock makes the memory system look faster in
+comparison.
+
+When running on a pre-production device you can minimize the impacts of these
+effects by pinning CPU, GPU, and bus clock frequencies. This is not usually
+possible on a production device.
 
 ## How do I use the layer?
 
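The two measurement distortions described in the updated README text can be illustrated with a little arithmetic. The sketch below is not part of the commit or of the layer. The function names and all cycle, latency, and clock values are hypothetical, chosen only to show the direction of each effect. It also assumes that memory latency stays constant in nanoseconds while the GPU clock changes.

```python
def utilization_pct(unit_active_cycles, block_active_cycles):
    """Derived counter: unit busy time as a percentage of parent block active time."""
    return 100.0 * unit_active_cycles / block_active_cycles


def latency_in_gpu_cycles(latency_ns, gpu_clock_mhz):
    """Convert a fixed wall-clock latency into GPU clock cycles."""
    return latency_ns * gpu_clock_mhz / 1000.0


# Hypothetical workload: the unit does the same direct work in both cases.
unit_busy = 800_000            # cycles of direct work (unchanged by serialization)
pipelined_active = 1_000_000   # parent block active cycles with pipelining intact
fill_drain_overhead = 250_000  # assumed extra active cycles to refill and drain

pipelined_util = utilization_pct(unit_busy, pipelined_active)
serialized_util = utilization_pct(unit_busy, pipelined_active + fill_drain_overhead)

# Hypothetical DVFS case: the same 200 ns memory latency costs fewer GPU
# cycles once the governor down-clocks the GPU from 800 MHz to 400 MHz.
latency_full_clock = latency_in_gpu_cycles(200.0, 800.0)
latency_down_clocked = latency_in_gpu_cycles(200.0, 400.0)
```

With these made-up inputs the utilization percentage drops even though `unit_busy` is identical in both cases, which is why the README points to direct-work counters as the ones unaffected by serialization. The latency counter, measured in GPU cycles, halves when the clock halves.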