Commit c42abea: [rocprofiler-compute] memory bandwidth bottleneck detection metrics - L2 Tests (#4137)

# Memory Bandwidth Analysis Test Suite

HIP workloads that exercise MI350 (gfx950) L1 and L2 cache/memory metrics, used to validate `rocprof-compute --membw-analysis --experimental`.

## L1 Workloads (tables 3001-3003)

| Workload | Baseline (intended effect) | Optimized (intended effect) |
|---|---|---|
| gl2_backpressure | Uncoalesced + large stride -> L2 misses -> fill LFIFO -> TCP stalls. | Shared mem, minimal L2 traffic. |
| L1_stall_microbenchmark | Many scattered loads/stores -> VMEM FIFO fills. | Shared memory reduces VMEM FIFO pressure. |
| utcl1_stall | Rapid page hopping (>32 pages) -> exceed UTCL1 entries -> in-flight stall. | Stay in one page. |
| ta_tcp_stall | Random access, TA waits for TCP. | Sequential access, TCP serves TA quickly. |

> [!NOTE]
> L1 workloads are still a work in progress; profiled results may not yet reflect the intended effects.

### L1 Target Metrics

- **gl2_backpressure**: `TCP_TCR_TCP_STALL_CYCLES / tcp_busy`
- **utcl1_stall**: `TCP_UTCL1_STALL_INFLIGHT_MAX / tcp_busy`

## L2 Workloads (tables 3007-3012)

Each workload targets specific TCC perfmon counters exposed in the new L2 metric tables added in PR 4091. All workloads size their buffers using runtime L2 cache size detection (`hipDeviceAttributeL2CacheSize`).

| Workload | Target Metrics | Baseline (intended effect) | Optimized (intended effect) |
|---|---|---|---|
| l2_hbm_read_bw_stress | L2-EA read credit stall (HBM), L2-EA read BW, HBM read fraction, L2 hit rate, L2 Cache Efficiency | Streaming reads over 4x L2 buffer. Every access misses L2 -> EA read -> HBM. Exhausts DRAM read credits. | Reads small tile fitting in L2 -> high hit rate, no credit stalls. |
| l2_hbm_write_bw_stress | L2-EA write credit stall (HBM), L2-EA write stall, TOO_MANY_EA_WRREQS_STALL, SRC_FIFO_FULL, writeback rate | Streaming writes over 4x L2 buffer. Write-allocate + eviction of dirty lines -> max EA write traffic. | Writes small region in L2 -> dirty lines stay cached, minimal writeback. |
| l2_cache_thrash | TAG_STALL, IB_STALL, LATENCY_FIFO_FULL, eviction rate, Back Pressure Indicator, Internal Resource Pressure | Scattered RMW on 1.5x L2 buffer (prime stride). Fills latency FIFO, causes tag stalls, IB stalls (backpressure). | Coalesced RMW on small working set -> all hits, no stalls. |
| l2_atomic_stress | EA0_ATOMIC, EA0_ATOMIC_LEVEL (atomic latency), TCC_ATOMIC | High-contention atomicAdd on 1024 cachelines across 2x L2 buffer. Forces EA atomic path with high per-atomic latency. | Regular RMW (no atomics) on L2-fitting tile. Each thread works on its own cacheline. |
| l2_coherence_traffic | NC_REQ, UC_REQ, CC_REQ, PROBE, PROBE_EVICT | `fg` mode: fine-grained memory (CC type) -> coherence protocol, internal probes. `nc` mode: `__builtin_nontemporal_store` -> NC traffic. | Coarse-grained memory with normal cached accesses. |
| l2_multigpu_fabric | EA0_RDREQ_GMI_CREDIT_STALL, EA0_WRREQ_GMI_CREDIT_STALL, Remote Access Pressure (IF) | GPU 0 reads/writes memory on GPU 1 via P2P -> Infinity Fabric (GMI) credit exhaustion. | N/A (requires 2 GPUs) |
| l2_io_stress | EA0_RDREQ_IO_CREDIT_STALL, EA0_WRREQ_IO_CREDIT_STALL | GPU kernel accesses host-pinned memory (hipHostMalloc) -> PCIe/IO credit exhaustion. | Same kernel on device-local memory. |
| l2_normalized_throughput | All Normalized Stall Metrics (table 30.10), All Throughput Metrics (table 30.11), Combined Credit Pressure (table 30.12) | Mixed R+W streaming over 4x L2 buffer, purely memory-bound. High stalls relative to GRBM_GUI_ACTIVE. | Compute-heavy FMA kernel with minimal memory -> near-zero normalized stalls. |

### Expected Validation Results

| Workload | Key Metric | Baseline | Optimized |
|---|---|---|---|
| l2_hbm_read_bw_stress | L2-EA read credit stall (HBM) | >10% | <1% |
| l2_hbm_read_bw_stress | L2 hit rate | <20% | >90% |
| l2_hbm_write_bw_stress | L2-EA write credit stall (HBM) | >10% | <1% |
| l2_hbm_write_bw_stress | L2 writeback rate | >30% | <5% |
| l2_cache_thrash | L2 tag stall rate | >10% | <5% |
| l2_cache_thrash | L2 Back Pressure Indicator | >10% | <5% |
| l2_atomic_stress | L2-EA atomic latency | >100 cyc | N/A (no atomics) |
| l2_coherence_traffic (fg) | Coherent cached req rate | >50% | <5% |
| l2_io_stress | L2-EA read credit stall (IO) | >5% | <1% |
| l2_normalized_throughput | Combined Credit Pressure | >10% | <1% |

## Build

```bash
# Build a single workload
hipcc -g <workload>.hip -o <workload> --offload-arch=gfx950

# Build all L2 workloads
for f in l2_*.hip; do
  hipcc -g "$f" -o "${f%.hip}" --offload-arch=gfx950
done
```

## Profiling

```bash
# Profile baseline
rocprof-compute profile -n <name>_baseline --membw-analysis --experimental --no-roof -- ./<workload>

# Profile optimized
rocprof-compute profile -n <name>_optimized --membw-analysis --experimental --no-roof -- ./<workload> opt
```

### Examples

```bash
# HBM read stress
rocprof-compute profile -n hbm_read_baseline --membw-analysis --experimental --no-roof -- ./l2_hbm_read_bw_stress
rocprof-compute profile -n hbm_read_optimized --membw-analysis --experimental --no-roof -- ./l2_hbm_read_bw_stress opt

# Cache thrash
rocprof-compute profile -n thrash_baseline --membw-analysis --experimental --no-roof -- ./l2_cache_thrash
rocprof-compute profile -n thrash_optimized --membw-analysis --experimental --no-roof -- ./l2_cache_thrash opt

# Coherence (fine-grained, nontemporal, and optimized modes)
rocprof-compute profile -n coherence_fg --membw-analysis --experimental --no-roof -- ./l2_coherence_traffic
rocprof-compute profile -n coherence_nc --membw-analysis --experimental --no-roof -- ./l2_coherence_traffic nc
rocprof-compute profile -n coherence_opt --membw-analysis --experimental --no-roof -- ./l2_coherence_traffic opt

# IO stress (host-pinned vs device-local)
rocprof-compute profile -n io_baseline --membw-analysis --experimental --no-roof -- ./l2_io_stress
rocprof-compute profile -n io_optimized --membw-analysis --experimental --no-roof -- ./l2_io_stress opt

# Multi-GPU (requires 2 GPUs)
rocprof-compute profile -n fabric_read --membw-analysis --experimental --no-roof -- ./l2_multigpu_fabric read
rocprof-compute profile -n fabric_write --membw-analysis --experimental --no-roof -- ./l2_multigpu_fabric write
```

## Analyzing

```bash
rocprof-compute analyze -p <path to profiled result> --membw-analysis --experimental
```

## Hardware Requirements

| Workload | GPUs | Notes |
|---|---|---|
| l2_hbm_read_bw_stress | 1 | Single GPU |
| l2_hbm_write_bw_stress | 1 | Single GPU |
| l2_cache_thrash | 1 | Single GPU |
| l2_atomic_stress | 1 | Single GPU |
| l2_coherence_traffic | 1 | Single GPU; fine-grained alloc may not be supported on all platforms |
| l2_multigpu_fabric | 2 | Requires P2P access between GPUs |
| l2_io_stress | 1 | Single GPU; needs sufficient host memory for 1 GB pinned alloc |
| l2_normalized_throughput | 1 | Single GPU |
