Commit 617fa07
[Multidevice] Tma bulk copy p2p runtime examples (#6011)
## What
Add a Hopper TMA (`cp.async.bulk`) copy kernel in
`csrc/multidevice/tma_copy.cu` and validate it across three memory
source/destination types:
- local GMEM
- peer symmetric memory: TMA can write from local shared memory
to a remote GPU's global memory.
- NVLS multicast pointers: using the multicast pointer as the
destination of the TMA request broadcasts data to the whole
NVLink domain in one shot at line rate. Note, however, that this usage
is not officially supported according to the CUDA documentation.
These behaviors are demonstrated by three unit tests in
`tests/cpp/test_multidevice_tma.cpp`. The tests reuse the
`SymmetricTensor` abstraction for VMM allocation, IPC handle exchange,
and multicast setup, keeping the test bodies focused on the TMA transfer
itself.
## Why
The CUDA backend for multi-device communication
(`csrc/multidevice/cuda_p2p.cpp`) currently uses SM-based copies
(regular thread loads/stores or `multimem.st`) and copy-engine copies
(`cudaMemcpyAsync` / `cudaMemcpyBatchAsync`). TMA offers a third
transport option that is GPU-initiated, lightweight (single-thread
issue), fully asynchronous, and frees SM resources for overlapping
compute. This transport is leveraged by DeepEP for intra-node MoE
dispatch. This PR validates that TMA works correctly on the memory types
used by nvFuser's multi-device infrastructure.
This lays the groundwork for a follow-up PR that integrates TMA as a
transport option for P2P and multicast communications alongside the
existing SM-based copies and copy-engine transports.
## How
- The kernel is implemented in `csrc/multidevice/tma_copy.cu`. It is a
single-warp kernel where thread 0 performs a two-phase TMA transfer
through shared memory (`GMEM(src) --[TMA load]--> SMEM --[TMA store]-->
GMEM(dst)`), using `mbarrier` for async completion tracking. TMA only
moves data between GMEM and SMEM; there is no GMEM-to-GMEM variant, so
shared memory staging is inherent to the hardware.
- The kernel is compiled at runtime via NVRTC (same pattern as the
existing `alltoallv.cu`, `multicast.cu` kernels in `cuda_p2p.cpp`, and
other kernels in `runtime/`) and stringified at build time through the
existing `NVFUSER_RUNTIME_FILES` pipeline.

1 parent fe948b6 commit 617fa07
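
The two-phase transfer described under "How" can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `tma_copy.cu`: the names `tmaCopySketch`, `smemAddr`, and `tmaBulkCopyArgsOk` are hypothetical, and the sketch assumes the whole buffer fits in a single chunk of dynamic shared memory. The device part is guarded by `__CUDACC__` so that the host-side precondition check (16-byte alignment and a size that is a multiple of 16 bytes, as `cp.async.bulk` requires) can be compiled and tested without a GPU toolchain.

```cpp
#include <cstddef>
#include <cstdint>

// cp.async.bulk needs 16-byte aligned addresses and a size that is a multiple
// of 16 bytes. This hypothetical helper also rejects empty copies.
constexpr bool tmaBulkCopyArgsOk(std::uintptr_t src, std::uintptr_t dst,
                                 std::size_t bytes) {
  return bytes != 0 && bytes % 16 == 0 && src % 16 == 0 && dst % 16 == 0;
}

#if defined(__CUDACC__)  // device part: needs a CUDA compiler targeting sm_90+

// Convert a generic pointer to a 32-bit shared-memory address for PTX operands.
__device__ inline uint32_t smemAddr(const void* p) {
  return static_cast<uint32_t>(__cvta_generic_to_shared(p));
}

// Hypothetical kernel. Assumed launch: one warp and `bytes` of dynamic shared
// memory, e.g. tmaCopySketch<<<1, 32, bytes>>>(dst, src, bytes).
extern "C" __global__ void tmaCopySketch(void* dst, const void* src,
                                         uint32_t bytes) {
  extern __shared__ __align__(128) unsigned char stage[];
  __shared__ __align__(8) unsigned long long mbar;

  if (threadIdx.x != 0) return;  // a single thread issues both TMA operations

  const uint32_t bar = smemAddr(&mbar);
  const uint32_t buf = smemAddr(stage);

  // Initialize the mbarrier for one arriving thread and make the
  // initialization visible to the async proxy.
  asm volatile("mbarrier.init.shared::cta.b64 [%0], %1;"
               :: "r"(bar), "r"(1) : "memory");
  asm volatile("fence.proxy.async.shared::cta;" ::: "memory");

  // Phase 1: GMEM(src) -> SMEM. Announce the expected byte count, then issue
  // the bulk load, which signals the mbarrier as bytes land.
  asm volatile("mbarrier.arrive.expect_tx.shared::cta.b64 _, [%0], %1;"
               :: "r"(bar), "r"(bytes) : "memory");
  asm volatile(
      "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes "
      "[%0], [%1], %2, [%3];"
      :: "r"(buf), "l"(src), "r"(bytes), "r"(bar) : "memory");

  // Spin until the first barrier phase (parity 0) completes.
  uint32_t done = 0;
  while (!done) {
    asm volatile(
        "{\n"
        ".reg .pred p;\n"
        "mbarrier.try_wait.parity.shared::cta.b64 p, [%1], %2;\n"
        "selp.u32 %0, 1, 0, p;\n"
        "}"
        : "=r"(done) : "r"(bar), "r"(0) : "memory");
  }

  // Conservative fence before the async proxy reads the staging buffer.
  asm volatile("fence.proxy.async.shared::cta;" ::: "memory");

  // Phase 2: SMEM -> GMEM(dst). Bulk stores complete through bulk groups
  // rather than an mbarrier, so commit the group and wait for it.
  asm volatile("cp.async.bulk.global.shared::cta.bulk_group [%0], [%1], %2;"
               :: "l"(dst), "r"(buf), "r"(bytes) : "memory");
  asm volatile("cp.async.bulk.commit_group;" ::: "memory");
  asm volatile("cp.async.bulk.wait_group 0;" ::: "memory");
}

#endif  // __CUDACC__
```

In this sketch `dst` may be any pointer the issuing GPU can address as global memory. A real implementation would loop over chunks when the payload exceeds the available shared memory and would check `tmaBulkCopyArgsOk` (or an equivalent) on the host before launching.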
4 files changed (+378, -1 lines) in `runtime` and `tests/cpp`.
File renamed without changes.