# Memory Bandwidth Analysis Test Suite

HIP workloads that exercise MI350 (gfx950) L1 and L2 cache/memory metrics, for validation with `rocprof-compute --membw-analysis --experimental`.

## L1 Workloads (tables 3001-3003)

| Workload | Baseline (intended effect) | Optimized (intended effect) |
|---|---|---|
| gl2_backpressure | Uncoalesced + large stride -> L2 misses -> fill LFIFO -> TCP stalls. | Shared mem, minimal L2 traffic. |
| L1_stall_microbenchmark | Many scattered loads/stores -> VMEM FIFO fills. | Shared memory reduces VMEM FIFO pressure. |
| utcl1_stall | Rapid page hopping (>32 pages) -> exceed UTCL1 entries -> in-flight stall. | Stay in one page. |
| ta_tcp_stall | Random access, TA waits for TCP. | Sequential access, TCP serves TA quickly. |

> [!NOTE]
> The L1 workloads are still WIP; profiled results may not yet show the intended effects.

### L1 Target Metrics

- **gl2_backpressure**: `TCP_TCR_TCP_STALL_CYCLES / tcp_busy`
- **utcl1_stall**: `TCP_UTCL1_STALL_INFLIGHT_MAX / tcp_busy`

## L2 Workloads (tables 3007-3012)

Each workload targets specific TCC perfmon counters exposed in the new L2 metric tables added in PR 4091. All workloads detect the L2 cache size at runtime via `hipDeviceAttributeL2CacheSize`.
| 25 | + |
| 26 | +| Workload | Target Metrics | Baseline (intended effect) | Optimized (intended effect) | |
| 27 | +|---|---|---|---| |
| 28 | +| l2_hbm_read_bw_stress | L2-EA read credit stall (HBM), L2-EA read BW, HBM read fraction, L2 hit rate, L2 Cache Efficiency | Streaming reads over 4x L2 buffer. Every access misses L2 -> EA read -> HBM. Exhausts DRAM read credits. | Reads small tile fitting in L2 -> high hit rate, no credit stalls. | |
| 29 | +| l2_hbm_write_bw_stress | L2-EA write credit stall (HBM), L2-EA write stall, TOO_MANY_EA_WRREQS_STALL, SRC_FIFO_FULL, writeback rate | Streaming writes over 4x L2 buffer. Write-allocate + eviction of dirty lines -> max EA write traffic. | Writes small region in L2 -> dirty lines stay cached, minimal writeback. | |
| 30 | +| l2_cache_thrash | TAG_STALL, IB_STALL, LATENCY_FIFO_FULL, eviction rate, Back Pressure Indicator, Internal Resource Pressure | Scattered RMW on 1.5x L2 buffer (prime stride). Fills latency FIFO, causes tag stalls, IB stalls (backpressure). | Coalesced RMW on small working set -> all hits, no stalls. | |
| 31 | +| l2_atomic_stress | EA0_ATOMIC, EA0_ATOMIC_LEVEL (atomic latency), TCC_ATOMIC | High-contention atomicAdd on 1024 cachelines across 2x L2 buffer. Forces EA atomic path with high per-atomic latency. | Regular RMW (no atomics) on L2-fitting tile. Each thread works on its own cacheline. | |
| 32 | +| l2_coherence_traffic | NC_REQ, UC_REQ, CC_REQ, PROBE, PROBE_EVICT | `fg` mode: fine-grained memory (CC type) -> coherence protocol, internal probes. `nc` mode: `__builtin_nontemporal_store` -> NC traffic. | Coarse-grained memory with normal cached accesses. | |
| 33 | +| l2_multigpu_fabric | EA0_RDREQ_GMI_CREDIT_STALL, EA0_WRREQ_GMI_CREDIT_STALL, Remote Access Pressure (IF) | GPU 0 reads/writes memory on GPU 1 via P2P -> Infinity Fabric (GMI) credit exhaustion. | N/A (requires 2 GPUs) | |
| 34 | +| l2_io_stress | EA0_RDREQ_IO_CREDIT_STALL, EA0_WRREQ_IO_CREDIT_STALL | GPU kernel accesses host-pinned memory (hipHostMalloc) -> PCIe/IO credit exhaustion. | Same kernel on device-local memory. | |
| 35 | +| l2_normalized_throughput | All Normalized Stall Metrics (table 30.10), All Throughput Metrics (table 30.11), Combined Credit Pressure (table 30.12) | Mixed R+W streaming over 4x L2 buffer, purely memory-bound. High stalls relative to GRBM_GUI_ACTIVE. | Compute-heavy FMA kernel with minimal memory -> near-zero normalized stalls. | |
| 36 | + |
### Expected Validation Results

| Workload | Key Metric | Baseline | Optimized |
|---|---|---|---|
| l2_hbm_read_bw_stress | L2-EA read credit stall (HBM) | >10% | <1% |
| l2_hbm_read_bw_stress | L2 hit rate | <20% | >90% |
| l2_hbm_write_bw_stress | L2-EA write credit stall (HBM) | >10% | <1% |
| l2_hbm_write_bw_stress | L2 writeback rate | >30% | <5% |
| l2_cache_thrash | L2 tag stall rate | >10% | <5% |
| l2_cache_thrash | L2 Back Pressure Indicator | >10% | <5% |
| l2_atomic_stress | L2-EA atomic latency | >100 cyc | N/A (no atomics) |
| l2_coherence_traffic (fg) | Coherent cached req rate | >50% | <5% |
| l2_io_stress | L2-EA read credit stall (IO) | >5% | <1% |
| l2_normalized_throughput | Combined Credit Pressure | >10% | <1% |

## Build

```bash
# Build a single workload
hipcc -g <workload>.hip -o <workload> --offload-arch=gfx950

# Build all L2 workloads
for f in l2_*.hip; do
  hipcc -g "$f" -o "${f%.hip}" --offload-arch=gfx950
done
```

## Profiling

```bash
# Profile the baseline variant
rocprof-compute profile -n <name>_baseline --membw-analysis --experimental --no-roof -- ./<workload>

# Profile the optimized variant
rocprof-compute profile -n <name>_optimized --membw-analysis --experimental --no-roof -- ./<workload> opt
```

### Examples

```bash
# HBM read stress
rocprof-compute profile -n hbm_read_baseline --membw-analysis --experimental --no-roof -- ./l2_hbm_read_bw_stress
rocprof-compute profile -n hbm_read_optimized --membw-analysis --experimental --no-roof -- ./l2_hbm_read_bw_stress opt

# Cache thrash
rocprof-compute profile -n thrash_baseline --membw-analysis --experimental --no-roof -- ./l2_cache_thrash
rocprof-compute profile -n thrash_optimized --membw-analysis --experimental --no-roof -- ./l2_cache_thrash opt

# Coherence (fine-grained mode)
rocprof-compute profile -n coherence_fg --membw-analysis --experimental --no-roof -- ./l2_coherence_traffic
rocprof-compute profile -n coherence_nc --membw-analysis --experimental --no-roof -- ./l2_coherence_traffic nc
rocprof-compute profile -n coherence_opt --membw-analysis --experimental --no-roof -- ./l2_coherence_traffic opt

# IO stress (host-pinned vs device-local)
rocprof-compute profile -n io_baseline --membw-analysis --experimental --no-roof -- ./l2_io_stress
rocprof-compute profile -n io_optimized --membw-analysis --experimental --no-roof -- ./l2_io_stress opt

# Multi-GPU (requires 2 GPUs)
rocprof-compute profile -n fabric_read --membw-analysis --experimental --no-roof -- ./l2_multigpu_fabric read
rocprof-compute profile -n fabric_write --membw-analysis --experimental --no-roof -- ./l2_multigpu_fabric write
```

## Analyzing

```bash
rocprof-compute analyze -p <path to profiled result> --membw-analysis --experimental
```

## Hardware Requirements

| Workload | GPUs | Notes |
|---|---|---|
| l2_hbm_read_bw_stress | 1 | Single GPU |
| l2_hbm_write_bw_stress | 1 | Single GPU |
| l2_cache_thrash | 1 | Single GPU |
| l2_atomic_stress | 1 | Single GPU |
| l2_coherence_traffic | 1 | Single GPU; fine-grained alloc may not be supported on all platforms |
| l2_multigpu_fabric | 2 | Requires P2P access between GPUs |
| l2_io_stress | 1 | Single GPU; needs sufficient host memory for a 1 GB pinned alloc |
| l2_normalized_throughput | 1 | Single GPU |