
Add SM->GPC mapping query and ubench tool #87

Open

William-An wants to merge 2 commits into accel-sim:dev from purdue-aalp:add-sm-gpc-mapping-tool

Conversation

William-An (Contributor) commented May 6, 2026

Summary

  • Refactor queryGrInfo() in hw_def/common/gpuConfig.h so the RM ioctl scaffold (/dev/nvidiactl open, NV01_ROOT_CLIENT → NV01_DEVICE_0 → NV20_SUBDEVICE_0 alloc chain, control, teardown) is extracted into a single rmSubdeviceControl() helper. Both the existing LITTER_NUM_* query and any new RM control share it.
  • Add querySmToGpcMapping(), exposing NV2080_CTRL_CMD_GR_GET_SM_TO_GPC_TPC_MAPPINGS (control id 0x2080120f, defined in open-gpu-kernel-modules/.../ctrl2080gr.h:752-776). It returns a std::vector<SmGpcTpcEntry> indexed by physical SM id (matching PTX %smid). Both pieces are sketched after this list.
  • Add NUM_GPCS to GpuConfig, populated via NV2080_CTRL_GR_INFO_INDEX_LITTER_NUM_GPCS (0x14).
  • Add a new ubench, ubench/system/sm_gpc_mapping/, that dumps the per-SM (GPC, TPC) table, captures %smid from a kernel for cross-validation, and sweeps thread-block cluster shapes (1x1x1, 1x2x1, 2x1x1, 2x2x1, 1x4x1, 4x1x1, 2x4x1, 4x2x1, 8x1x1, 1x8x1) using cudaLaunchKernelEx + cudaLaunchAttributeClusterDimension. It verifies the single-GPC-per-cluster invariant and prints both a per-cluster summary and a stable CSV.
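
For orientation, a minimal sketch of the new surface. The params layout is paraphrased from ctrl2080gr.h (the real code should use the header's verbatim definition, which also carries a grRouteInfo field and alignment attributes), and the helper signature is inferred from this description, not copied from the patch:

#include <cstdint>
#include <vector>

struct SmGpcTpcEntry {
    uint32_t gpcId;   // GPC owning this physical SM
    uint32_t tpcId;   // TPC within that GPC
};

// Sketch of NV2080_CTRL_GR_GET_SM_TO_GPC_TPC_MAPPINGS_PARAMS; the array
// bound is NV2080_CTRL_GR_GET_SM_TO_GPC_TPC_MAPPINGS_MAX_SM_COUNT in the
// header (144 here is illustrative).
struct SmToGpcTpcParams {
    uint32_t smCount;
    struct { uint32_t gpcId, tpcId; } smId[144];
};

// Shared scaffold (implemented in gpuConfig.h per this PR): open
// /dev/nvidiactl, alloc the NV01_ROOT_CLIENT -> NV01_DEVICE_0 ->
// NV20_SUBDEVICE_0 chain, issue one NV2080 control, free everything.
bool rmSubdeviceControl(uint32_t cmd, void *params, uint32_t paramsSize);

std::vector<SmGpcTpcEntry> querySmToGpcMapping() {
    SmToGpcTpcParams p = {};
    std::vector<SmGpcTpcEntry> table;
    if (!rmSubdeviceControl(0x2080120f /* ..._GET_SM_TO_GPC_TPC_MAPPINGS */,
                            &p, sizeof(p)))
        return table;  // empty vector signals failure
    for (uint32_t sm = 0; sm < p.smCount; ++sm)
        table.push_back({p.smId[sm].gpcId, p.smId[sm].tpcId});
    return table;      // index == physical SM id, i.e. PTX %smid
}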

Why

We need an SM→GPC table to translate %smid captures from NVBit-instrumented kernels into GPC ids, so the simulator can schedule SMs exactly as the hardware does. The RM ioctl path this repo already uses for FBP_COUNT / L2_BANKS is the right place to extend.
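
For illustration, the translation step would look roughly like this; capturedSmids is a hypothetical stand-in for an NVBit-side trace buffer:

#include <cstdint>
#include <vector>

std::vector<uint32_t> gpcIdsFor(const std::vector<uint32_t> &capturedSmids) {
    auto table = querySmToGpcMapping();      // index == physical SM id
    std::vector<uint32_t> gpcs;
    gpcs.reserve(capturedSmids.size());
    for (uint32_t smid : capturedSmids)
        gpcs.push_back(table[smid].gpcId);   // %smid -> GPC id
    return gpcs;
}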

Tested on

  • Driver 575.51.03 / CUDA 12.8 toolkit
  • NVIDIA H100 80GB HBM3 (sm_90)
  • Result: 132 SMs across 8 GPCs (6×16 + 2×18), all sanity invariants pass, full kernel-side %smid coverage, single-GPC-per-cluster holds for all 10 cluster shapes.

Known caveats

  • RM struct layouts mirror open-gpu-kernel-modules branch 580.95.05; verified against the 575.51.03 driver, but not yet swept across other driver versions.
  • Cluster kernels and %cluster_ctarank reads need sm_90+; the new ubench's Makefile sets NVCC_FLAGS += -arch=sm_90.
  • Hardcoded RM client handles (0xCAFE0001..3) are inherited from the existing pattern; they could be hardened to kernel-generated handles.
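
For context, the two special-register reads the ubench depends on follow the standard inline-PTX idiom. The function names here are illustrative, and the %cluster_ctarank read compiles only for sm_90+:

#include <cstdint>

__device__ uint32_t readSmid() {
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__device__ uint32_t readClusterCtarank() {  // requires sm_90 or newer
    uint32_t rank;
    asm volatile("mov.u32 %0, %%cluster_ctarank;" : "=r"(rank));
    return rank;
}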

Test plan

  • Run bin/sm_gpc_mapping on additional sm_90+ GPUs (H100 SXM5, GH200, B100/B200) and confirm the # CHECK lines pass
  • Re-run bin/system_config on each and confirm FBP_COUNT / L2_BANKS outputs unchanged
  • Cross-check SM→GPC distribution against published topology
  • Decide on Accel-Sim integration point — to be filled in before un-drafting

🤖 Generated with Claude Code

William-An and others added 2 commits May 5, 2026 20:31
Refactor queryGrInfo() in hw_def/common/gpuConfig.h to share the RM ioctl
scaffold (rmSubdeviceControl) so additional NV2080 control queries reuse
the alloc/control/free chain. Add NV2080_CTRL_CMD_GR_GET_SM_TO_GPC_TPC_MAPPINGS
and querySmToGpcMapping(), plus NUM_GPCS exposed in GpuConfig via
NV2080_CTRL_GR_INFO_INDEX_LITTER_NUM_GPCS.

Add ubench/system/sm_gpc_mapping which dumps the per-SM (gpcId, tpcId)
table, validates against runtime %smid captures, and sweeps thread-block
cluster shapes (1x1x1..8x1x1 and 1x8x1) with cudaLaunchKernelEx +
cudaLaunchAttributeClusterDimension to expose cluster->GPC rasterization
order on Hopper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add visualize.py post-processor (stdlib Python 3) that renders the
sm_gpc_mapping cluster-shape sweep as nested ASCII boxes (GPC > TPC > SM),
with each SM cell labeled by every (cluster_id, rank_in_cluster) that
landed on it in dispatch order. Supports --shape filter, --style
boxed|compact, optional --color, and reads from --input or stdin.

Also report `active SMs (touched by >=1 block) / SM_NUMBER` per shape in
sm_gpc_mapping.cu so the C++ output and Python rendering both surface the
H100 CPC-exclusion behavior (sizes >=4 only touch 120/132 SMs because
GPC 0's high CPC and GPC 6/7's 9th TPC are excluded from cluster placement).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
William-An (Contributor, Author) commented May 8, 2026

FYI for reviewers — pushed an additional commit (8b6d4a07) with a small post-processor visualize.py next to the binary that turns the cluster-sweep output into nested ASCII boxes (GPC ⊃ TPC ⊃ SM). It's stdlib-only Python 3, no extra deps.

Also added a # active SMs (touched by >=1 block) = N / SM_NUMBER line per shape in the C++ tool so the GPU coverage of each launch is visible at a glance.

Usage

# Pipe live
bin/sm_gpc_mapping | ubench/system/sm_gpc_mapping/visualize.py

# Or post-process a saved log
visualize.py --input run.log [--shape 4x2x1] [--style boxed|compact] [--color]

Example: 4x2x1 cluster shape on H100 (boxed style, default)

The CUDA programming guide guarantees all blocks of a cluster co-locate on one GPC, but it doesn't say which SMs within the GPC, and the reality on H100 is nontrivial. Here's GPCs 0-2 for the 4x2x1 sweep (a sketch of the launch pattern follows the dump):

=== Cluster shape 4x2x1 (size=8, 17 clusters, 136 blocks) ===
--- active SMs (touched by >=1 block) = 120 / 132; max occupants per SM in this launch = 2 ---

+-- GPC 0 [16 SMs] -------------------------------+
| +- TPC 0 -+ +- TPC 2 -+ +- TPC 5 -+ +- TPC 1 -+ |
| |  sm  0  | |  sm 16  | |  sm 32  | |  sm 48  | |
| |  c6:0   | |  c6:2   | |  c6:4   | |  c6:6   | |
| |  c15:0  | |  c15:2  | |  c15:4  | |  c15:6  | |
| |---------| |---------| |---------| |---------| |
| |  sm  1  | |  sm 17  | |  sm 33  | |  sm 49  | |
| |  c6:1   | |  c6:3   | |  c6:5   | |  c6:7   | |
| |  c15:1  | |  c15:3  | |  c15:5  | |  c15:7  | |
| +---------+ +---------+ +---------+ +---------+ |
| +- TPC 3 -+ +- TPC 6 -+ +- TPC 4 -+ +- TPC 7 -+ |
| |  sm124  | |  sm126  | |  sm128  | |  sm130  | |
| |    -    | |    -    | |    -    | |    -    | |
| |         | |         | |         | |         | |
| |---------| |---------| |---------| |---------| |
| |  sm125  | |  sm127  | |  sm129  | |  sm131  | |
| |    -    | |    -    | |    -    | |    -    | |
| |         | |         | |         | |         | |
| +---------+ +---------+ +---------+ +---------+ |
+-------------------------------------------------+

+-- GPC 1 [16 SMs] -------------------------------+
| +- TPC 0 -+ +- TPC 3 -+ +- TPC 6 -+ +- TPC 1 -+ |
| |  sm  2  | |  sm 18  | |  sm 34  | |  sm 50  | |
| |  c7:0   | |  c7:2   | |  c7:4   | |  c7:6   | |
| |  c16:0  | |  c16:2  | |  c16:4  | |  c16:6  | |
| |---------| |---------| |---------| |---------| |
| |  sm  3  | |  sm 19  | |  sm 35  | |  sm 51  | |
| |  c7:1   | |  c7:3   | |  c7:5   | |  c7:7   | |
| |  c16:1  | |  c16:3  | |  c16:5  | |  c16:7  | |
| +---------+ +---------+ +---------+ +---------+ |
| +- TPC 4 -+ +- TPC 7 -+ +- TPC 2 -+ +- TPC 5 -+ |
| |  sm 64  | |  sm 78  | |  sm 92  | |  sm106  | |
| |  c14:0  | |  c14:2  | |  c14:4  | |  c14:6  | |
| |         | |         | |         | |         | |
| |---------| |---------| |---------| |---------| |
| |  sm 65  | |  sm 79  | |  sm 93  | |  sm107  | |
| |  c14:1  | |  c14:3  | |  c14:5  | |  c14:7  | |
| |         | |         | |         | |         | |
| +---------+ +---------+ +---------+ +---------+ |
+-------------------------------------------------+

+-- GPC 2 [16 SMs] -------------------------------+
| +- TPC 0 -+ +- TPC 3 -+ +- TPC 6 -+ +- TPC 1 -+ |
| |  sm  4  | |  sm 20  | |  sm 36  | |  sm 52  | |
| |  c0:0   | |  c0:2   | |  c0:4   | |  c0:6   | |
| |         | |         | |         | |         | |
| |---------| |---------| |---------| |---------| |
| |  sm  5  | |  sm 21  | |  sm 37  | |  sm 53  | |
| |  c0:1   | |  c0:3   | |  c0:5   | |  c0:7   | |
| |         | |         | |         | |         | |
| +---------+ +---------+ +---------+ +---------+ |
| +- TPC 4 -+ +- TPC 7 -+ +- TPC 2 -+ +- TPC 5 -+ |
| |  sm 66  | |  sm 80  | |  sm 94  | |  sm108  | |
| |  c8:0   | |  c8:2   | |  c8:4   | |  c8:6   | |
| |         | |         | |         | |         | |
| |---------| |---------| |---------| |---------| |
| |  sm 67  | |  sm 81  | |  sm 95  | |  sm109  | |
| |  c8:1   | |  c8:3   | |  c8:5   | |  c8:7   | |
| |         | |         | |         | |         | |
| +---------+ +---------+ +---------+ +---------+ |
+-------------------------------------------------+
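
For reference, the launch pattern behind the sweep is the stock CUDA 12 cluster API (cudaLaunchKernelEx + cudaLaunchAttributeClusterDimension). A minimal sketch; the probe kernel, sweepShape(), and the 17-cluster grid sizing are illustrative stand-ins, not the tool's exact code:

#include <cstdint>

__global__ void probe(uint32_t *smidOut) {
    // One record per block: which physical SM this block landed on.
    if (threadIdx.x == 0) {
        uint32_t smid;
        asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
        uint32_t bid = blockIdx.x +
                       gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z);
        smidOut[bid] = smid;
    }
}

void sweepShape(dim3 cluster, uint32_t *smidOut) {
    cudaLaunchConfig_t cfg = {};
    // Grid dims must be divisible by cluster dims; this sizing always
    // yields 17 clusters, matching the 4x2x1 dump above (136 blocks).
    cfg.gridDim  = dim3(cluster.x * 17, cluster.y, cluster.z);
    cfg.blockDim = dim3(32, 1, 1);
    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = cluster.x;
    attr.val.clusterDim.y = cluster.y;
    attr.val.clusterDim.z = cluster.z;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;
    cudaLaunchKernelEx(&cfg, probe, smidOut);
}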

