Add SM->GPC mapping query and ubench tool #87
Open
William-An wants to merge 2 commits into
Conversation
Refactor queryGrInfo() in hw_def/common/gpuConfig.h to share the RM ioctl scaffold (rmSubdeviceControl) so additional NV2080 control queries reuse the alloc/control/free chain. Add NV2080_CTRL_CMD_GR_GET_SM_TO_GPC_TPC_MAPPINGS and querySmToGpcMapping(), plus NUM_GPCS exposed in GpuConfig via NV2080_CTRL_GR_INFO_INDEX_LITTER_NUM_GPCS. Add ubench/system/sm_gpc_mapping which dumps the per-SM (gpcId, tpcId) table, validates against runtime %smid captures, and sweeps thread-block cluster shapes (1x1x1..8x1x1 and 1x8x1) with cudaLaunchKernelEx + cudaLaunchAttributeClusterDimension to expose cluster->GPC rasterization order on Hopper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add visualize.py post-processor (stdlib Python 3) that renders the sm_gpc_mapping cluster-shape sweep as nested ASCII boxes (GPC > TPC > SM), with each SM cell labeled by every (cluster_id, rank_in_cluster) that landed on it in dispatch order. Supports --shape filter, --style boxed|compact, optional --color, and reads from --input or stdin. Also report `active SMs (touched by >=1 block) / SM_NUMBER` per shape in sm_gpc_mapping.cu so the C++ output and Python rendering both surface the H100 CPC-exclusion behavior (sizes >=4 only touch 120/132 SMs because GPC 0's high CPC and GPC 6/7's 9th TPC are excluded from cluster placement). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
William-An (Author) commented:
FYI for reviewers — pushed an additional commit (8b6d4a07) with a small post-processor. Also added a Usage section:

```shell
# Pipe live
bin/sm_gpc_mapping | ubench/system/sm_gpc_mapping/visualize.py

# Or post-process a saved log
visualize.py --input run.log [--shape 4x2x1] [--style boxed|compact] [--color]
```
Summary

- Refactor `queryGrInfo()` in `hw_def/common/gpuConfig.h` so the RM ioctl scaffold (`/dev/nvidiactl` open, `NV01_ROOT_CLIENT → NV01_DEVICE_0 → NV20_SUBDEVICE_0` alloc chain, control call, teardown) is extracted into a single `rmSubdeviceControl()` helper. Both the existing `LITTER_NUM_*` query and any new RM control share it.
- Add `querySmToGpcMapping()` exposing `NV2080_CTRL_CMD_GR_GET_SM_TO_GPC_TPC_MAPPINGS` (control id `0x2080120f`, defined in `open-gpu-kernel-modules/.../ctrl2080gr.h:752-776`). Returns a `std::vector<SmGpcTpcEntry>` indexed by physical SM id (matches PTX `%smid`).
- Add `NUM_GPCS` to `GpuConfig`, populated via `NV2080_CTRL_GR_INFO_INDEX_LITTER_NUM_GPCS` (`0x14`).
- Add `ubench/system/sm_gpc_mapping/`, which dumps the per-SM (GPC, TPC) table, captures `%smid` from a kernel for cross-validation, and sweeps thread-block cluster shapes (`1x1x1`, `1x2x1`, `2x1x1`, `2x2x1`, `1x4x1`, `4x1x1`, `2x4x1`, `4x2x1`, `8x1x1`, `1x8x1`) using `cudaLaunchKernelEx` + `cudaLaunchAttributeClusterDimension`. Verifies the single-GPC-per-cluster invariant and prints both a per-cluster summary and a stable CSV.

Why

Need an SM→GPC table to translate `%smid` captures from NVBit-instrumented kernels into GPC ids that match the hardware's exact SM scheduling. The RM ioctl path this repo already uses for `FBP_COUNT`/`L2_BANKS` is the right place to extend.

Tested on

- Full `%smid` coverage; single-GPC-per-cluster holds for all 10 cluster shapes.

Known caveats

- `%cluster_ctarank` reads need sm_90+; the new ubench's Makefile sets `NVCC_FLAGS += -arch=sm_90`.
- Fixed RM handles (`0xCAFE0001..3`) inherited from the existing pattern; could be hardened to kernel-generated handles.

Test plan

- Run `bin/sm_gpc_mapping` on additional SM_90+ GPUs (H100 SXM4, GH200, B100/B200) and confirm the `# CHECK` lines pass.
- Run `bin/system_config` on each and confirm `FBP_COUNT`/`L2_BANKS` outputs are unchanged.

🤖 Generated with Claude Code