
[Multidevice] Tma bulk copy p2p runtime examples#6011

Merged
samnordmann merged 7 commits into main from tma_p2p
Mar 23, 2026

Conversation

@samnordmann
Collaborator

@samnordmann samnordmann commented Feb 25, 2026

What

Add a Hopper TMA (cp.async.bulk) copy kernel in csrc/multidevice/tma_copy.cu and validate it across three memory source/destination types:

  • local GMEM
  • peer symmetric memory, meaning TMA can write from local shared memory to remote global memory.
  • NVLS multicast pointers, meaning that with the multicast pointer as the destination of the TMA request, data can be broadcast to the whole NVL domain in one shot at line rate. Note, however, that this is not officially supported according to the CUDA documentation.

These behaviors are demonstrated through three unit tests in tests/cpp/test_multidevice_tma.cpp. The tests reuse the SymmetricTensor abstraction for VMM allocation, IPC handle exchange, and multicast setup, keeping the test bodies focused on the TMA transfer itself.

Why

The CUDA backend for multi-device communication (csrc/multidevice/cuda_p2p.cpp) currently uses SM-based copies (regular thread loads/stores or multimem.st) and copy-engine copies (cudaMemcpyAsync / cudaMemcpyBatchAsync). TMA offers a third transport option that is GPU-initiated, lightweight (single-thread issue), fully asynchronous, and frees SM resources for overlapping compute. This transport is leveraged by DeepEP for intra-node MoE dispatch. This PR validates that TMA works correctly on the memory types used by nvFuser's multi-device infrastructure.

This lays the groundwork for a follow-up PR that integrates TMA as a transport option for P2P and multicast communications alongside the existing SM-based copies and copy-engine transports.

How

  • The kernel is implemented in csrc/multidevice/tma_copy.cu. It is a single-warp kernel where thread 0 performs a two-phase TMA transfer through shared memory (GMEM(src) --[TMA load]--> SMEM --[TMA store]--> GMEM(dst)), using mbarrier for async completion tracking. TMA is a GMEM-SMEM engine — there is no GMEM-to-GMEM variant, so shared memory staging is inherent to the hardware.
  • The kernel is compiled at runtime via NVRTC (same pattern as the existing alltoallv.cu, multicast.cu kernels in cuda_p2p.cpp, and other kernels in runtime/) and stringified at build time through the existing NVFUSER_RUNTIME_FILES pipeline.
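The staging pattern described above can be sketched, purely as a CPU analogy (not code from this PR), as a bounce-buffer copy; the chunking loop is illustrative and shows why the transfer size is bounded by the staging-buffer size, whereas the PR's kernel issues a single chunk that must fit in shared memory:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// CPU analogy of the kernel's staged copy: data moves src -> staging buffer
// -> dst, mirroring GMEM --[TMA load]--> SMEM --[TMA store]--> GMEM.
void staged_copy(unsigned char* dst, const unsigned char* src,
                 std::size_t num_bytes, std::size_t stage_bytes) {
  std::vector<unsigned char> stage(stage_bytes); // plays the role of SMEM
  for (std::size_t off = 0; off < num_bytes; off += stage_bytes) {
    std::size_t n = std::min(stage_bytes, num_bytes - off);
    std::memcpy(stage.data(), src + off, n); // "TMA load"
    std::memcpy(dst + off, stage.data(), n); // "TMA store"
  }
}
```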

@github-actions

github-actions bot commented Feb 25, 2026

Review updated until commit ae0c760

Description

  • Add Hopper TMA copy kernel using cp.async.bulk for GMEM->SMEM->GMEM transfers

  • Implement three test scenarios: local GMEM, peer symmetric memory, and NVLS multicast

  • Use NVRTC runtime compilation with dynamic shared memory and mbarrier synchronization

  • Integrate with existing SymmetricTensor infrastructure for VMM and multicast setup

Changes walkthrough

Relevant files:

Tests: tests/cpp/test_multidevice_tma.cpp (+271/-0)
TMA copy kernel tests for multidevice scenarios

  • Add comprehensive TMA copy tests with NVRTC runtime compilation
  • Test local GMEM copy, peer device memory, and NVLS multicast scenarios
  • Include SM90+ capability checks and proper error handling
  • Reuse SymmetricTensor abstraction for VMM and multicast setup

Enhancement: csrc/multidevice/tma_copy.cu (+101/-0)
Hopper TMA bulk copy kernel implementation

  • Implement single-warp TMA kernel with thread 0 driving transfers
  • Use two-phase GMEM->SMEM->GMEM copy with mbarrier synchronization
  • Handle dynamic shared memory allocation and alignment
  • Include inline PTX assembly for TMA operations and barriers

Configuration changes: CMakeLists.txt (+5/-0)
Build configuration for TMA tests and resources

  • Add test_multidevice_tma.cpp to multidevice test sources
  • Include nvfuser_rt_tma_copy dependency for runtime resources
  • Add binary directory include path for generated headers

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Kernel Robustness

    The kernel assumes num_bytes > 0 and divisible by 16, but lacks runtime validation. While the test validates this, the kernel itself could be more defensive against invalid inputs to prevent undefined behavior.

    extern "C" __global__ void __launch_bounds__(32, 1) tma_copy_1d(
        void* __restrict__ dst,
        const void* __restrict__ src,
        int num_bytes) {
      extern __shared__ __align__(128) unsigned char smem[];
    
      unsigned long long* mbar =
          reinterpret_cast<unsigned long long*>(smem + num_bytes);
      unsigned int smem_addr =
          static_cast<unsigned int>(__cvta_generic_to_shared(smem));
      unsigned int mbar_addr =
          static_cast<unsigned int>(__cvta_generic_to_shared(mbar));
    
      if (threadIdx.x == 0) {
        asm volatile(
            "mbarrier.init.shared::cta.b64 [%0], %1;" ::"r"(mbar_addr), "r"(1));
        asm volatile("fence.mbarrier_init.release.cluster;" :::);
      }
      __syncwarp();
    
      if (threadIdx.x == 0) {
        // Announce expected transaction bytes on the mbarrier
        asm volatile(
            "mbarrier.arrive.expect_tx.shared::cta.b64 _, [%0], %1;" ::"r"(
                mbar_addr),
            "r"(num_bytes));
    
        // TMA Load: GMEM -> SMEM (async, completed via mbarrier)
        asm volatile(
            "cp.async.bulk.shared::cluster.global"
            ".mbarrier::complete_tx::bytes"
            " [%0], [%1], %2, [%3];\n" ::"r"(smem_addr),
            "l"(src),
            "r"(num_bytes),
            "r"(mbar_addr)
            : "memory");
    
        // Block until the mbarrier phase flips (TMA load completed)
        asm volatile(
            "{\n"
            ".reg .pred P1;\n"
            "TMA_COPY_WAIT_LOAD:\n"
            "mbarrier.try_wait.parity.shared::cta.b64"
            " P1, [%0], %1;\n"
            "@P1 bra TMA_COPY_LOAD_DONE;\n"
            "bra TMA_COPY_WAIT_LOAD;\n"
            "TMA_COPY_LOAD_DONE:\n"
            "}" ::"r"(mbar_addr),
            "r"(0));
    
        // TMA Store: SMEM -> GMEM
        asm volatile(
            "cp.async.bulk.global.shared::cta.bulk_group"
            " [%0], [%1], %2;\n" ::"l"(dst),
            "r"(smem_addr),
            "r"(num_bytes)
            : "memory");
        asm volatile("cp.async.bulk.commit_group;");
        asm volatile("cp.async.bulk.wait_group.read 0;" ::: "memory");
    
        asm volatile("mbarrier.inval.shared::cta.b64 [%0];" ::"r"(mbar_addr));
      }
    }
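As a reading aid (not code from the PR), the dynamic shared-memory footprint the kernel assumes can be written down as a small host-side helper; `tma_smem_size` is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Dynamic SMEM layout assumed by tma_copy_1d:
//   [ data buffer: num_bytes, 128-byte aligned ][ mbarrier: 8 bytes ]
// The mbarrier lives at smem + num_bytes; because num_bytes is a multiple
// of 16 and smem is 128-byte aligned, its 8-byte alignment is guaranteed.
std::size_t tma_smem_size(int num_bytes) {
  assert(num_bytes > 0 && num_bytes % 16 == 0); // cp.async.bulk granularity
  return static_cast<std::size_t>(num_bytes) + sizeof(uint64_t);
}
```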
    Inline Assembly Safety

    Heavy reliance on inline PTX assembly for TMA operations. While functionally correct, this approach lacks compile-time safety and could benefit from additional validation or abstraction to prevent potential issues with register allocation or instruction encoding.

      asm volatile(
          "mbarrier.init.shared::cta.b64 [%0], %1;" ::"r"(mbar_addr), "r"(1));
      asm volatile("fence.mbarrier_init.release.cluster;" :::);
    }
    __syncwarp();
    
    if (threadIdx.x == 0) {
      // Announce expected transaction bytes on the mbarrier
      asm volatile(
          "mbarrier.arrive.expect_tx.shared::cta.b64 _, [%0], %1;" ::"r"(
              mbar_addr),
          "r"(num_bytes));
    
      // TMA Load: GMEM -> SMEM (async, completed via mbarrier)
      asm volatile(
          "cp.async.bulk.shared::cluster.global"
          ".mbarrier::complete_tx::bytes"
          " [%0], [%1], %2, [%3];\n" ::"r"(smem_addr),
          "l"(src),
          "r"(num_bytes),
          "r"(mbar_addr)
          : "memory");
    
      // Block until the mbarrier phase flips (TMA load completed)
      asm volatile(
          "{\n"
          ".reg .pred P1;\n"
          "TMA_COPY_WAIT_LOAD:\n"
          "mbarrier.try_wait.parity.shared::cta.b64"
          " P1, [%0], %1;\n"
          "@P1 bra TMA_COPY_LOAD_DONE;\n"
          "bra TMA_COPY_WAIT_LOAD;\n"
          "TMA_COPY_LOAD_DONE:\n"
          "}" ::"r"(mbar_addr),
          "r"(0));
    
      // TMA Store: SMEM -> GMEM
      asm volatile(
          "cp.async.bulk.global.shared::cta.bulk_group"
          " [%0], [%1], %2;\n" ::"l"(dst),
          "r"(smem_addr),
          "r"(num_bytes)
          : "memory");
      asm volatile("cp.async.bulk.commit_group;");
      asm volatile("cp.async.bulk.wait_group.read 0;" ::: "memory");
    
      asm volatile("mbarrier.inval.shared::cta.b64 [%0];" ::"r"(mbar_addr));
    Test Coverage Limitations

    The multicast test is gated on CUDA version (>= 13000), which may limit test coverage. Consider whether there are alternative ways to validate multicast functionality or whether this limitation is acceptable for the current scope.

    #if (CUDA_VERSION >= 13000)
    
    // Verify TMA 1D bulk copy writing TO an NVLS multicast pointer.
    // Root uses TMA to write data to the MC pointer, which broadcasts
    // via NVLS hardware. All ranks then verify the data arrived by
    // reading from their local UC view with a normal copy.
    TEST_F(TmaTest, TmaMulticastWrite) {
      if (communicator_->size() == 1) {
        GTEST_SKIP() << "Skipping test for single device";
      }
    
      const int64_t rank = communicator_->deviceId();
      const int64_t local_rank = communicator_->local_rank();
    
      int major;
      NVFUSER_CUDA_RT_SAFE_CALL(cudaDeviceGetAttribute(
          &major, cudaDevAttrComputeCapabilityMajor, local_rank));
      if (major < 9) {
        GTEST_SKIP() << "Requires Hopper (SM90+)";
      }
    
      int is_multicast_supported;
      NVFUSER_CUDA_SAFE_CALL(cuDeviceGetAttribute(
          &is_multicast_supported,
          CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED,
          local_rank));
      if (is_multicast_supported == 0) {
        GTEST_SKIP() << "Device does not support Multicast Objects; skipping.";
      }
    
      constexpr int64_t kNumElems = 524288; // 2 MB / sizeof(int32_t)
      constexpr int64_t root = 0;
    
      // cp.async.bulk transfer size is limited by shared memory,
      // so we broadcast a 4 KB slice via TMA.
      constexpr int kTmaBytes = 4096;
      static_assert(kTmaBytes % 16 == 0);
      constexpr int kTmaElems = kTmaBytes / sizeof(int32_t);
    
      at::Tensor local =
          SymmetricTensor::allocate({kNumElems}, at::kInt, communicator_->device());
      local.zero_();
      SymmetricTensor sym(local);
      sym.setupMulticast(root, "tma_mcast");
    
      auto opts = at::TensorOptions().dtype(at::kInt).device(at::kCUDA, local_rank);
    
      // Root: TMA-write source data to MC pointer (NVLS broadcasts it)
      if (rank == root) {
        at::Tensor src = at::arange(kTmaElems, opts);
        launchTmaCopy1D(sym.multicastPtr(), src.data_ptr(), kTmaBytes);
        NVFUSER_CUDA_RT_SAFE_CALL(cudaDeviceSynchronize());
      }
    
      communicator_->barrier();
    
      // All ranks: verify data arrived via normal read of local UC tensor
      at::Tensor readback = sym.localTensor().slice(0, 0, kTmaElems).clone();
      at::Tensor expected = at::arange(kTmaElems, opts);
      EXPECT_TRUE(readback.equal(expected))
          << "Rank " << rank << " did not receive multicast data written by TMA";
    }
    
    #endif // CUDA_VERSION >= 13000

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Feb 25, 2026

    Greptile Summary

    This PR adds a Hopper TMA (cp.async.bulk) bulk copy kernel and three multi-device unit tests that validate it against local GMEM, VMM-mapped peer (P2P) memory, and NVLS multicast pointers. It also moves alltoallv.cu from csrc/multidevice/ into runtime/ for consistency, and wires tma_copy.cu into the existing NVRTC stringification pipeline.

    Key changes:

    • runtime/tma_copy.cu: New single-warp kernel performing a two-phase TMA copy (GMEM→SMEM via cp.async.bulk + mbarrier, then SMEM→GMEM via cp.async.bulk.global). PTX logic, mbarrier parity, and shared memory layout are all correct.
    • tests/cpp/test_multidevice_tma.cpp: NVRTC helper compiles and caches the kernel at runtime; three tests cover local, inter-device P2P, and multicast write scenarios. Minor: nvrtcProgram is leaked if compilation fails (before nvrtcDestroyProgram can be called), and launchTmaCopy1D does not assert 16-byte GMEM pointer alignment required by TMA.
    • CMakeLists.txt: Correct integration of the new test source and runtime file into the build system, matching the existing pattern used for multicast.cu and alltoallv.cu.

    Confidence Score: 5/5

    • Safe to merge — kernel logic and synchronization are correct, build integration is consistent with existing patterns, and the two minor issues are non-blocking cleanup items.
    • The TMA kernel PTX is correct (mbarrier parity, smem alignment, two-phase copy protocol). The build changes follow established patterns. The two P2 comments (nvrtcDestroyProgram on error path, missing GMEM alignment assertion) are quality-of-life improvements that do not affect correctness in the current tests. The PR is experimental/validation code with no production path yet, consistent with the stated intent.
    • No files require special attention — the only items are optional cleanups in tests/cpp/test_multidevice_tma.cpp.

    Important Files Changed

    Filename Overview
    runtime/tma_copy.cu New TMA 1D bulk copy kernel: implements the two-phase GMEM→SMEM→GMEM copy using cp.async.bulk with mbarrier synchronization. The PTX is correct: parity=0 in try_wait.parity is the right initial phase, the smem layout (128-byte aligned buffer + 8-byte mbarrier at smem+num_bytes) satisfies all alignment requirements, and the warp-level synchronization around the mbarrier.init is sound.
    tests/cpp/test_multidevice_tma.cpp Three multidevice tests for TMA copy (local, P2P, multicast). The NVRTC helper leaks nvrtcProgram on the error path, and launchTmaCopy1D does not assert 16-byte GMEM pointer alignment required by TMA. Both are minor; barrier/synchronization ordering across tests is correct.
    CMakeLists.txt Adds test_multidevice_tma.cpp to the multidevice test target, registers runtime/tma_copy.cu in NVFUSER_RUNTIME_FILES (stringification pipeline), moves alltoallv.cu from csrc/multidevice/ to runtime/ to be consistent with other runtime files, and adds a CMAKE_BINARY_DIR/include include path + nvfuser_rt_tma_copy dependency for the generated header. All changes look correct and consistent with the existing build patterns.
    runtime/alltoallv.cu Pure rename from csrc/multidevice/alltoallv.cu to runtime/alltoallv.cu with no content changes; aligns with the convention of keeping NVRTC-compiled kernel sources under runtime/.
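The parity=0 claim in the review above can be illustrated with a small host-side model of an mbarrier's phase bit; this is a simplified sketch of the semantics the kernel relies on, not full CUDA mbarrier behavior, and all names are illustrative:

```cpp
// Minimal model of an mbarrier with one expected arrival plus a transaction
// byte count, as used by tma_copy_1d: the phase bit flips once the arrival
// and all expected transaction bytes have completed.
struct MbarrierModel {
  int phase = 0;        // current phase parity, starts at 0 after init
  int pending_tx = 0;   // outstanding transaction bytes
  bool arrived = false;

  void arrive_expect_tx(int bytes) { arrived = true; pending_tx += bytes; }

  void complete_tx(int bytes) { // e.g. TMA load delivering bytes to SMEM
    pending_tx -= bytes;
    if (arrived && pending_tx == 0) { phase ^= 1; arrived = false; }
  }

  // try_wait.parity P succeeds once the barrier has moved past phase P,
  // which is why waiting on parity 0 observes the first completion.
  bool try_wait_parity(int p) const { return phase != p; }
};
```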

    Sequence Diagram

    sequenceDiagram
        participant Host as Host (CPU)
        participant T0 as Thread 0 (SM)
        participant SMEM as Shared Memory
        participant SRC as GMEM src
        participant DST as GMEM dst
    
        Host->>T0: cuLaunchKernel(tma_copy_1d, smem=num_bytes+8)
        T0->>SMEM: mbarrier.init(arrival_count=1)
        T0->>T0: fence.mbarrier_init + __syncwarp()
        T0->>SMEM: mbarrier.arrive.expect_tx(num_bytes)
        T0->>SRC: cp.async.bulk.shared::cluster.global [SMEM], [src], num_bytes, [mbar]
        SRC-->>SMEM: TMA Load (async, GMEM→SMEM)
        T0->>T0: mbarrier.try_wait.parity(parity=0) [spin until load done]
        SMEM-->>T0: mbarrier complete (phase flips 0→1)
        T0->>DST: cp.async.bulk.global.shared::cta [dst], [SMEM], num_bytes
        T0->>T0: cp.async.bulk.commit_group
        T0->>T0: cp.async.bulk.wait_group.read 0 [wait store done]
        SMEM-->>DST: TMA Store committed (SMEM→GMEM)
        T0->>SMEM: mbarrier.inval
        T0-->>Host: kernel complete
    

    Reviews (3): Last reviewed commit: "move files to runtime/"

    @greptile-apps greptile-apps bot left a comment

    3 files reviewed, no comments


    @samnordmann
    Collaborator Author

    !test

    @wujingyue wujingyue requested a review from naoyam February 25, 2026 19:57
    Collaborator

    @wujingyue wujingyue left a comment


    @naoyam I noticed

    Fuser/runtime/memory.cu

    Lines 86 to 96 in 005f7e3

    // References:
    //
    // TMA:
    // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
    // https://github.com/NVIDIA/cutlass/blob/main/include/cute/arch/copy_sm90_tma.hpp
    //
    // Tensor map:
    // https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
    // 1D TMA load:
    // https://github.com/NVIDIA/cutlass/blob/main/include/cute/arch/copy_sm90_tma.hpp#L1400
    for codegen. Would you recommend using those building blocks or adding some so it's easier for nvFuser to generate fused comm/gemm in the future?

    @naoyam
    Copy link
    Collaborator

    naoyam commented Feb 26, 2026

    @naoyam I noticed

    Fuser/runtime/memory.cu

    Lines 86 to 96 in 005f7e3

    // References:
    //
    // TMA:
    // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
    // https://github.com/NVIDIA/cutlass/blob/main/include/cute/arch/copy_sm90_tma.hpp
    //
    // Tensor map:
    // https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
    // 1D TMA load:
    // https://github.com/NVIDIA/cutlass/blob/main/include/cute/arch/copy_sm90_tma.hpp#L1400

    for codegen. Would you recommend using those building blocks or adding some so it's easier for nvFuser to generate fused comm/gemm in the future?

    They are mostly just wrappers around some PTX instructions. We could add IR nodes to the Kernel IR and still use them for simpler final codegen (CudaKernelGenerator). For example, there's an IR node for mbarrier.

    The overall design philosophy is to generate Kernel IR that explicitly represents the final CUDA kernel and to minimize the logic needed in CudaKernelGenerator. That is unfortunately not the case in a not-so-small number of cases, but I'd vote for adding explicit TMA nodes to the Kernel IR to make it explicit.

    @samnordmann
    Collaborator Author

    Ok regarding codegen; however, this PR is not about codegen. The present TMA kernel is used as a "host op" to perform inter-GPU comms, similarly to a cudaMemcpyAsync. This PR provides a reference implementation, and the next one adds this transport as a possible p2p backend.

    I am not sure I understand -- are you ok with the PR's current implementation, or do you suggest something else?

    @samnordmann
    Collaborator Author

    @naoyam @wujingyue
    Gentle reminder: do you request any changes?

    Collaborator

    @wujingyue wujingyue left a comment


    however, this pr is not about code gen

    I understood that. I may have overread your intention to reuse csrc/multidevice/tma_copy.cu for codegen as well. For that purpose, I agree with @naoyam on kernel IR.

    Collaborator


    @naoyam organization-wise, do you prefer to move this (and alltoallv.cu) to runtime/tma_copy.cu?

    Collaborator


    Yes, since that directory is the one where we hold all runtime code.

    @samnordmann
    Collaborator Author

    !test

    Comment on lines +64 to +76
    nvrtcResult res = nvrtcCompileProgram(prog, (int)opts.size(), opts.data());
    if (res != NVRTC_SUCCESS) {
      size_t logSize;
      NVFUSER_NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
      std::vector<char> log(logSize);
      NVFUSER_NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log.data()));
      NVF_ERROR(
          false,
          "NVRTC compilation of '",
          source_name,
          "' failed:\n",
          log.data());
    }
    Contributor


    P2 nvrtcDestroyProgram leaked on compilation error

    When nvrtcCompileProgram fails, the error path reads the log and then calls NVF_ERROR which throws. nvrtcDestroyProgram(&prog) is never called on this path, leaking the NVRTC program object. While NVRTC programs are small and this only triggers on failure, a guard ensures clean teardown:

      nvrtcResult res = nvrtcCompileProgram(prog, (int)opts.size(), opts.data());
      if (res != NVRTC_SUCCESS) {
        size_t logSize;
        NVFUSER_NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
        std::vector<char> log(logSize);
        NVFUSER_NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log.data()));
        nvrtcDestroyProgram(&prog);
        NVF_ERROR(
            false,
            "NVRTC compilation of '",
            source_name,
            "' failed:\n",
            log.data());
      }

    Comment on lines +107 to +118
    void launchTmaCopy1D(
        void* dst,
        const void* src,
        int num_bytes,
        CUstream stream = nullptr) {
      NVF_CHECK(num_bytes > 0 && num_bytes % 16 == 0);
      CUfunction tma_kernel = getTmaCopy1dKernel();
      int smem_size = num_bytes + static_cast<int>(sizeof(uint64_t));
      void* args[] = {&dst, &src, &num_bytes};
      NVFUSER_CUDA_SAFE_CALL(cuLaunchKernel(
          tma_kernel, 1, 1, 1, 32, 1, 1, smem_size, stream, args, nullptr));
    }
    Contributor


    P2 Missing GMEM pointer alignment check

    cp.async.bulk (both load and store forms) requires the global memory address to be 16-byte aligned. The function checks num_bytes % 16 == 0 but neither src nor dst alignment is verified. In the current tests all pointers come from PyTorch/VMM allocations that are always aligned, but an explicit assertion would guard against future callers:

    Suggested change

    void launchTmaCopy1D(
        void* dst,
        const void* src,
        int num_bytes,
        CUstream stream = nullptr) {
      NVF_CHECK(num_bytes > 0 && num_bytes % 16 == 0);
      CUfunction tma_kernel = getTmaCopy1dKernel();
      int smem_size = num_bytes + static_cast<int>(sizeof(uint64_t));
      void* args[] = {&dst, &src, &num_bytes};
      NVFUSER_CUDA_SAFE_CALL(cuLaunchKernel(
          tma_kernel, 1, 1, 1, 32, 1, 1, smem_size, stream, args, nullptr));
    }

    void launchTmaCopy1D(
        void* dst,
        const void* src,
        int num_bytes,
        CUstream stream = nullptr) {
      NVF_CHECK(num_bytes > 0 && num_bytes % 16 == 0);
      NVF_CHECK(
          reinterpret_cast<uintptr_t>(src) % 16 == 0 &&
              reinterpret_cast<uintptr_t>(dst) % 16 == 0,
          "TMA cp.async.bulk requires 16-byte aligned GMEM addresses");
      CUfunction tma_kernel = getTmaCopy1dKernel();
      int smem_size = num_bytes + static_cast<int>(sizeof(uint64_t));
      void* args[] = {&dst, &src, &num_bytes};
      NVFUSER_CUDA_SAFE_CALL(cuLaunchKernel(
          tma_kernel, 1, 1, 1, 32, 1, 1, smem_size, stream, args, nullptr));
    }

    @samnordmann samnordmann merged commit 617fa07 into main Mar 23, 2026
    50 of 51 checks passed
    @samnordmann samnordmann deleted the tma_p2p branch March 23, 2026 17:02