[Multidevice] Tma bulk copy p2p runtime examples#6011
Conversation
Review updated until commit ae0c760

Description

Relevant files:
- Tests
- Enhancement
- Configuration changes
PR Reviewer Guide
Here are some key observations to aid the review process:
- 🧪 PR contains tests
- ⚡ Recommended focus areas for review: Kernel Robustness
Greptile Summary: This PR adds a Hopper TMA (cp.async.bulk) copy kernel.
Confidence Score: 5/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant Host as Host (CPU)
    participant T0 as Thread 0 (SM)
    participant SMEM as Shared Memory
    participant SRC as GMEM src
    participant DST as GMEM dst
    Host->>T0: cuLaunchKernel(tma_copy_1d, smem=num_bytes+8)
    T0->>SMEM: mbarrier.init(arrival_count=1)
    T0->>T0: fence.mbarrier_init + __syncwarp()
    T0->>SMEM: mbarrier.arrive.expect_tx(num_bytes)
    T0->>SRC: cp.async.bulk.shared::cluster.global [SMEM], [src], num_bytes, [mbar]
    SRC-->>SMEM: TMA Load (async, GMEM→SMEM)
    T0->>T0: mbarrier.try_wait.parity(parity=0) [spin until load done]
    SMEM-->>T0: mbarrier complete (phase flips 0→1)
    T0->>DST: cp.async.bulk.global.shared::cta [dst], [SMEM], num_bytes
    T0->>T0: cp.async.bulk.commit_group
    T0->>T0: cp.async.bulk.wait_group.read 0 [wait store done]
    SMEM-->>DST: TMA Store committed (SMEM→GMEM)
    T0->>SMEM: mbarrier.inval
    T0-->>Host: kernel complete
```
Reviews (3): Last reviewed commit: "move files to runtime/"
!test
They are mostly just wrappers around some PTX instructions. We could add IR nodes to the Kernel IR and still use them for simpler final codegen (…). The overall design philosophy is to generate the Kernel IR that explicitly represents the final CUDA kernel and minimize the logic necessary in …
Ok regarding codegen; however, this PR is not about codegen. The present TMA kernel is used as a "host op" to perform inter-GPU comms, similarly to a cudaMemcpyAsync. This PR provides a reference implementation, and the next one adds this transport as a possible p2p backend. I am not sure I understand -- are you ok with the PR's current implementation, or do you suggest something else?
@naoyam @wujingyue |
@naoyam organization-wise, do you prefer to move this (and alltoallv.cu) to runtime/tma_copy.cu?
Yes, since that directory is the one where we hold all runtime code.
!test
```cpp
nvrtcResult res = nvrtcCompileProgram(prog, (int)opts.size(), opts.data());
if (res != NVRTC_SUCCESS) {
  size_t logSize;
  NVFUSER_NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
  std::vector<char> log(logSize);
  NVFUSER_NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log.data()));
  NVF_ERROR(
      false,
      "NVRTC compilation of '",
      source_name,
      "' failed:\n",
      log.data());
}
```
nvrtcDestroyProgram leaked on compilation error
When nvrtcCompileProgram fails, the error path reads the log and then calls NVF_ERROR which throws. nvrtcDestroyProgram(&prog) is never called on this path, leaking the NVRTC program object. While NVRTC programs are small and this only triggers on failure, a guard ensures clean teardown:
```cpp
nvrtcResult res = nvrtcCompileProgram(prog, (int)opts.size(), opts.data());
if (res != NVRTC_SUCCESS) {
  size_t logSize;
  NVFUSER_NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
  std::vector<char> log(logSize);
  NVFUSER_NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log.data()));
  nvrtcDestroyProgram(&prog);
  NVF_ERROR(
      false,
      "NVRTC compilation of '",
      source_name,
      "' failed:\n",
      log.data());
}
```

```cpp
void launchTmaCopy1D(
    void* dst,
    const void* src,
    int num_bytes,
    CUstream stream = nullptr) {
  NVF_CHECK(num_bytes > 0 && num_bytes % 16 == 0);
  CUfunction tma_kernel = getTmaCopy1dKernel();
  int smem_size = num_bytes + static_cast<int>(sizeof(uint64_t));
  void* args[] = {&dst, &src, &num_bytes};
  NVFUSER_CUDA_SAFE_CALL(cuLaunchKernel(
      tma_kernel, 1, 1, 1, 32, 1, 1, smem_size, stream, args, nullptr));
}
```
Missing GMEM pointer alignment check
cp.async.bulk (both load and store forms) requires the global memory address to be 16-byte aligned. The function checks num_bytes % 16 == 0 but neither src nor dst alignment is verified. In the current tests all pointers come from PyTorch/VMM allocations that are always aligned, but an explicit assertion would guard against future callers:
```diff
 void launchTmaCopy1D(
     void* dst,
     const void* src,
     int num_bytes,
     CUstream stream = nullptr) {
   NVF_CHECK(num_bytes > 0 && num_bytes % 16 == 0);
+  NVF_CHECK(
+      reinterpret_cast<uintptr_t>(src) % 16 == 0 &&
+          reinterpret_cast<uintptr_t>(dst) % 16 == 0,
+      "TMA cp.async.bulk requires 16-byte aligned GMEM addresses");
   CUfunction tma_kernel = getTmaCopy1dKernel();
   int smem_size = num_bytes + static_cast<int>(sizeof(uint64_t));
   void* args[] = {&dst, &src, &num_bytes};
   NVFUSER_CUDA_SAFE_CALL(cuLaunchKernel(
       tma_kernel, 1, 1, 1, 32, 1, 1, smem_size, stream, args, nullptr));
 }
```
What
Add a Hopper TMA (cp.async.bulk) copy kernel in csrc/multidevice/tma_copy.cu and validate it across three memory source/destination types.

These behaviors are demonstrated through three unit tests in tests/cpp/test_multidevice_tma.cpp. The tests reuse the SymmetricTensor abstraction for VMM allocation, IPC handle exchange, and multicast setup, keeping the test bodies focused on the TMA transfer itself.

Why
The CUDA backend for multi-device communication (csrc/multidevice/cuda_p2p.cpp) currently uses SM-based copies (regular threads load/store or multimem.st) and copy-engine copies (cudaMemcpyAsync/cudaMemcpyBatchAsync). TMA offers a third transport option that is GPU-initiated, lightweight (single-thread issue), fully asynchronous, and frees SM resources for overlapping compute. This transport is leveraged by DeepEP for intra-node MoE dispatch. This PR validates that TMA works correctly on the memory types used by nvFuser's multi-device infrastructure.

This lays the groundwork for a follow-up PR that integrates TMA as a transport option for P2P and multicast communications alongside the existing SM-based copies and copy-engine transports.
How
The kernel lives in csrc/multidevice/tma_copy.cu. It is a single-warp kernel where thread 0 performs a two-phase TMA transfer through shared memory (GMEM(src) --[TMA load]--> SMEM --[TMA store]--> GMEM(dst)), using mbarrier for async completion tracking. TMA is a GMEM-SMEM engine: there is no GMEM-to-GMEM variant, so shared memory staging is inherent to the hardware.

The kernel is compiled with NVRTC (like the alltoallv.cu and multicast.cu kernels in cuda_p2p.cpp, and other kernels in runtime/) and stringified at build time through the existing NVFUSER_RUNTIME_FILES pipeline.