ARM-software · solidpixel · Jul 24, 2025 · Jul 24, 2025
diff --git a/layer_gpu_profile/README_LAYER.md b/layer_gpu_profile/README_LAYER.md
@@ -6,30 +6,33 @@ counters for selected frames running on an Arm GPU.
 ## What devices are supported?
 
 This layer requires Vulkan 1.0 and an Arm GPU because it uses an Arm-specific
-counter sampling library.
+performance counter sampling library.
 
 ## What data can be collected?
 
 The layer serializes workloads for instrumented frames and injects counter
-samples between them, allowing the layer to measure the hardware cost of
-render passes, compute dispatches, transfers, etc.
+samples between them, allowing the layer to measure the hardware metrics for
+Vulkan render passes, compute dispatches, transfers, etc.
 
 The serialization is very invasive to wall-clock performance, due to removal
 of pipeline overlap between workloads and additional GPU idle time waiting for
-the layer to performs each performance counter sampling operation. This will
+the layer to perform each performance counter sampling operation. This will
 have an impact on the counter data being captured!
 
 Derived counters that show queue and functional unit utilization as a
-percentage of the overall "active" time of their parent block will report low
-because of time spent refilling and then draining the GPU pipeline between
-workloads. The overall _GPU Active Cycles_ counter is known to be unreliable,
-because the serialization means that command stream setup and teardown costs
-are not hidden in the shadow of surrounding work. We recommend using the
-individual queue active cycles counters as the main measure of performance.
+percentage of the overall "active" time of their parent block will read
+lower than normal. This is because of time spent refilling and then draining
+the GPU pipeline for each serialized workload.
 
-Note that any counter that measure direct work, such as architectural issue
-cycles, or workload nouns, such as primitives or threads, should be unaffected
-by the loss of pipelining.
+The overall _GPU Active Cycles_ counter is also known to be unreliable, because
+the serialization means that command stream setup and teardown costs are not
+hidden in the shadow of surrounding work. We recommend using the individual
+queue active cycles counters as the main measure of workload cost.
+
+Note that any counters that measure direct work, such as architectural issue
+cycles, or identifiable workload nouns, such as primitives or threads, should
+be unaffected by the loss of pipelining, because serialization does not change
+the work that the workload itself performs.
 
 Arm GPUs provide a wide range of performance counters covering many different
 aspects of hardware performance. The layer will collect a standard set of
@@ -44,11 +47,14 @@ hardware counters and derived expressions supported by the
 The GPU idle time waiting for the CPU to take a counter sample can cause the
 system DVFS power governor to decide that the GPU is not busy. In production
 devices we commonly see that the GPU will be down-clocked during the
-instrumented frame, which may have an impact on a subset of the available
-performance counters.
-
-When running on a pre-production device we recommend pinning CPU, GPU, and bus
-clock speeds to avoid the performance instability.
+instrumented frame, which may have an impact on some of the available
+performance counters. For example, GPU memory latency may appear lower than
+normal if the reduction in GPU clock makes the memory system look faster in
+comparison.
+
+When running on a pre-production device you can minimize the impact of these
+effects by pinning CPU, GPU, and bus clock frequencies. This is not usually
+possible on a production device.
 
 ## How do I use the layer?