
[Multidevice] Tma bulk copy p2p runtime examples#6011

Merged
samnordmann merged 7 commits into main from tma_p2p
Mar 23, 2026

Conversation

@samnordmann
Collaborator

@samnordmann samnordmann commented Feb 25, 2026

What

Add a Hopper TMA (cp.async.bulk) copy kernel in csrc/multidevice/tma_copy.cu and validate it across three memory source/destination types:

  • local GMEM
  • peer symmetric memory, meaning TMA can write from local shared memory to remote global memory.
  • NVLS multicast pointers, meaning that with the multicast pointer as the destination of the TMA request, data can be broadcast to the whole NVL domain in one shot at line rate. Note, however, that this is not officially supported according to the CUDA documentation.

These behaviors are demonstrated through three unit tests in tests/cpp/test_multidevice_tma.cpp. The tests reuse the SymmetricTensor abstraction for VMM allocation, IPC handle exchange, and multicast setup, keeping the test bodies focused on the TMA transfer itself.

Why

The CUDA backend for multi-device communication (csrc/multidevice/cuda_p2p.cpp) currently uses SM-based copies (regular thread loads/stores or multimem.st) and copy-engine copies (cudaMemcpyAsync / cudaMemcpyBatchAsync). TMA offers a third transport option that is GPU-initiated, lightweight (single-thread issue), fully asynchronous, and frees SM resources for overlapping compute. This transport is leveraged by DeepEP for intra-node MoE dispatch. This PR validates that TMA works correctly on the memory types used by nvFuser's multi-device infrastructure.

This lays the groundwork for a follow-up PR that integrates TMA as a transport option for P2P and multicast communications alongside the existing SM-based copies and copy-engine transports.

How

  • The kernel is implemented in csrc/multidevice/tma_copy.cu. It is a single-warp kernel where thread 0 performs a two-phase TMA transfer through shared memory (GMEM(src) --[TMA load]--> SMEM --[TMA store]--> GMEM(dst)), using mbarrier for async completion tracking. TMA is a GMEM-SMEM engine — there is no GMEM-to-GMEM variant, so shared memory staging is inherent to the hardware.
  • The kernel is compiled at runtime via NVRTC (same pattern as the existing alltoallv.cu, multicast.cu kernels in cuda_p2p.cpp, and other kernels in runtime/) and stringified at build time through the existing NVFUSER_RUNTIME_FILES pipeline.
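The staging pattern described above can be sketched, purely as a CPU analogy (not code from this PR), as a bounce-buffer copy; the chunking loop is illustrative and shows why the transfer size is bounded by the staging-buffer size, whereas the PR's kernel issues a single chunk that must fit in shared memory:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// CPU analogy of the kernel's staged copy: data moves src -> staging buffer
// -> dst, mirroring GMEM --[TMA load]--> SMEM --[TMA store]--> GMEM.
void staged_copy(unsigned char* dst, const unsigned char* src,
                 std::size_t num_bytes, std::size_t stage_bytes) {
  std::vector<unsigned char> stage(stage_bytes); // plays the role of SMEM
  for (std::size_t off = 0; off < num_bytes; off += stage_bytes) {
    std::size_t n = std::min(stage_bytes, num_bytes - off);
    std::memcpy(stage.data(), src + off, n); // "TMA load"
    std::memcpy(dst + off, stage.data(), n); // "TMA store"
  }
}
```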

@github-actions

github-actions bot commented Feb 25, 2026

Review updated until commit ae0c760

Description

  • Add Hopper TMA copy kernel using cp.async.bulk for GMEM->SMEM->GMEM transfers

  • Implement three test scenarios: local GMEM, peer symmetric memory, and NVLS multicast

  • Use NVRTC runtime compilation with dynamic shared memory and mbarrier synchronization

  • Integrate with existing SymmetricTensor infrastructure for VMM and multicast setup

Changes walkthrough

Relevant files:

Tests: tests/cpp/test_multidevice_tma.cpp (+271/-0)
TMA copy kernel tests for multidevice scenarios

  • Add comprehensive TMA copy tests with NVRTC runtime compilation
  • Test local GMEM copy, peer device memory, and NVLS multicast scenarios
  • Include SM90+ capability checks and proper error handling
  • Reuse SymmetricTensor abstraction for VMM and multicast setup

Enhancement: csrc/multidevice/tma_copy.cu (+101/-0)
Hopper TMA bulk copy kernel implementation

  • Implement single-warp TMA kernel with thread 0 driving transfers
  • Use two-phase GMEM->SMEM->GMEM copy with mbarrier synchronization
  • Handle dynamic shared memory allocation and alignment
  • Include inline PTX assembly for TMA operations and barriers

Configuration changes: CMakeLists.txt (+5/-0)
Build configuration for TMA tests and resources

  • Add test_multidevice_tma.cpp to multidevice test sources
  • Include nvfuser_rt_tma_copy dependency for runtime resources
  • Add binary directory include path for generated headers

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Kernel Robustness

    The kernel assumes num_bytes > 0 and divisible by 16, but lacks runtime validation. While the test validates this, the kernel itself could be more defensive against invalid inputs to prevent undefined behavior.

    extern "C" __global__ void __launch_bounds__(32, 1) tma_copy_1d(
        void* __restrict__ dst,
        const void* __restrict__ src,
        int num_bytes) {
      extern __shared__ __align__(128) unsigned char smem[];
    
      unsigned long long* mbar =
          reinterpret_cast<unsigned long long*>(smem + num_bytes);
      unsigned int smem_addr =
          static_cast<unsigned int>(__cvta_generic_to_shared(smem));
      unsigned int mbar_addr =
          static_cast<unsigned int>(__cvta_generic_to_shared(mbar));
    
      if (threadIdx.x == 0) {
        asm volatile(
            "mbarrier.init.shared::cta.b64 [%0], %1;" ::"r"(mbar_addr), "r"(1));
        asm volatile("fence.mbarrier_init.release.cluster;" :::);
      }
      __syncwarp();
    
      if (threadIdx.x == 0) {
        // Announce expected transaction bytes on the mbarrier
        asm volatile(
            "mbarrier.arrive.expect_tx.shared::cta.b64 _, [%0], %1;" ::"r"(
                mbar_addr),
            "r"(num_bytes));
    
        // TMA Load: GMEM -> SMEM (async, completed via mbarrier)
        asm volatile(
            "cp.async.bulk.shared::cluster.global"
            ".mbarrier::complete_tx::bytes"
            " [%0], [%1], %2, [%3];\n" ::"r"(smem_addr),
            "l"(src),
            "r"(num_bytes),
            "r"(mbar_addr)
            : "memory");
    
        // Block until the mbarrier phase flips (TMA load completed)
        asm volatile(
            "{\n"
            ".reg .pred P1;\n"
            "TMA_COPY_WAIT_LOAD:\n"
            "mbarrier.try_wait.parity.shared::cta.b64"
            " P1, [%0], %1;\n"
            "@P1 bra TMA_COPY_LOAD_DONE;\n"
            "bra TMA_COPY_WAIT_LOAD;\n"
            "TMA_COPY_LOAD_DONE:\n"
            "}" ::"r"(mbar_addr),
            "r"(0));
    
        // TMA Store: SMEM -> GMEM
        asm volatile(
            "cp.async.bulk.global.shared::cta.bulk_group"
            " [%0], [%1], %2;\n" ::"l"(dst),
            "r"(smem_addr),
            "r"(num_bytes)
            : "memory");
        asm volatile("cp.async.bulk.commit_group;");
        asm volatile("cp.async.bulk.wait_group.read 0;" ::: "memory");
    
        asm volatile("mbarrier.inval.shared::cta.b64 [%0];" ::"r"(mbar_addr));
      }
    }
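As a reading aid (not code from the PR), the dynamic shared-memory footprint the kernel assumes can be written down as a small host-side helper; `tma_smem_size` is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Dynamic SMEM layout assumed by tma_copy_1d:
//   [ data buffer: num_bytes, 128-byte aligned ][ mbarrier: 8 bytes ]
// The mbarrier lives at smem + num_bytes; because num_bytes is a multiple
// of 16 and smem is 128-byte aligned, its 8-byte alignment is guaranteed.
std::size_t tma_smem_size(int num_bytes) {
  assert(num_bytes > 0 && num_bytes % 16 == 0); // cp.async.bulk granularity
  return static_cast<std::size_t>(num_bytes) + sizeof(uint64_t);
}
```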
    Inline Assembly Safety

    Heavy reliance on inline PTX assembly for TMA operations. While functionally correct, this approach lacks compile-time safety and could benefit from additional validation or abstraction to prevent potential issues with register allocation or instruction encoding.

      asm volatile(
          "mbarrier.init.shared::cta.b64 [%0], %1;" ::"r"(mbar_addr), "r"(1));
      asm volatile("fence.mbarrier_init.release.cluster;" :::);
    }
    __syncwarp();
    
    if (threadIdx.x == 0) {
      // Announce expected transaction bytes on the mbarrier
      asm volatile(
          "mbarrier.arrive.expect_tx.shared::cta.b64 _, [%0], %1;" ::"r"(
              mbar_addr),
          "r"(num_bytes));
    
      // TMA Load: GMEM -> SMEM (async, completed via mbarrier)
      asm volatile(
          "cp.async.bulk.shared::cluster.global"
          ".mbarrier::complete_tx::bytes"
          " [%0], [%1], %2, [%3];\n" ::"r"(smem_addr),
          "l"(src),
          "r"(num_bytes),
          "r"(mbar_addr)
          : "memory");
    
      // Block until the mbarrier phase flips (TMA load completed)
      asm volatile(
          "{\n"
          ".reg .pred P1;\n"
          "TMA_COPY_WAIT_LOAD:\n"
          "mbarrier.try_wait.parity.shared::cta.b64"
          " P1, [%0], %1;\n"
          "@P1 bra TMA_COPY_LOAD_DONE;\n"
          "bra TMA_COPY_WAIT_LOAD;\n"
          "TMA_COPY_LOAD_DONE:\n"
          "}" ::"r"(mbar_addr),
          "r"(0));
    
      // TMA Store: SMEM -> GMEM
      asm volatile(
          "cp.async.bulk.global.shared::cta.bulk_group"
          " [%0], [%1], %2;\n" ::"l"(dst),
          "r"(smem_addr),
          "r"(num_bytes)
          : "memory");
      asm volatile("cp.async.bulk.commit_group;");
      asm volatile("cp.async.bulk.wait_group.read 0;" ::: "memory");
    
      asm volatile("mbarrier.inval.shared::cta.b64 [%0];" ::"r"(mbar_addr));
    Test Coverage Limitations

    The multicast test is gated on CUDA version (>= 13000), which may limit test coverage. Consider whether there are alternative ways to validate multicast functionality or whether this limitation is acceptable for the current scope.

    #if (CUDA_VERSION >= 13000)
    
    // Verify TMA 1D bulk copy writing TO an NVLS multicast pointer.
    // Root uses TMA to write data to the MC pointer, which broadcasts
    // via NVLS hardware. All ranks then verify the data arrived by
    // reading from their local UC view with a normal copy.
    TEST_F(TmaTest, TmaMulticastWrite) {
      if (communicator_->size() == 1) {
        GTEST_SKIP() << "Skipping test for single device";
      }
    
      const int64_t rank = communicator_->deviceId();
      const int64_t local_rank = communicator_->local_rank();
    
      int major;
      NVFUSER_CUDA_RT_SAFE_CALL(cudaDeviceGetAttribute(
          &major, cudaDevAttrComputeCapabilityMajor, local_rank));
      if (major < 9) {
        GTEST_SKIP() << "Requires Hopper (SM90+)";
      }
    
      int is_multicast_supported;
      NVFUSER_CUDA_SAFE_CALL(cuDeviceGetAttribute(
          &is_multicast_supported,
          CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED,
          local_rank));
      if (is_multicast_supported == 0) {
        GTEST_SKIP() << "Device does not support Multicast Objects; skipping.";
      }
    
      constexpr int64_t kNumElems = 524288; // 2 MB / sizeof(int32_t)
      constexpr int64_t root = 0;
    
      // cp.async.bulk transfer size is limited by shared memory,
      // so we broadcast a 4 KB slice via TMA.
      constexpr int kTmaBytes = 4096;
      static_assert(kTmaBytes % 16 == 0);
      constexpr int kTmaElems = kTmaBytes / sizeof(int32_t);
    
      at::Tensor local =
          SymmetricTensor::allocate({kNumElems}, at::kInt, communicator_->device());
      local.zero_();
      SymmetricTensor sym(local);
      sym.setupMulticast(root, "tma_mcast");
    
      auto opts = at::TensorOptions().dtype(at::kInt).device(at::kCUDA, local_rank);
    
      // Root: TMA-write source data to MC pointer (NVLS broadcasts it)
      if (rank == root) {
        at::Tensor src = at::arange(kTmaElems, opts);
        launchTmaCopy1D(sym.multicastPtr(), src.data_ptr(), kTmaBytes);
        NVFUSER_CUDA_RT_SAFE_CALL(cudaDeviceSynchronize());
      }
    
      communicator_->barrier();
    
      // All ranks: verify data arrived via normal read of local UC tensor
      at::Tensor readback = sym.localTensor().slice(0, 0, kTmaElems).clone();
      at::Tensor expected = at::arange(kTmaElems, opts);
      EXPECT_TRUE(readback.equal(expected))
          << "Rank " << rank << " did not receive multicast data written by TMA";
    }
    
    #endif // CUDA_VERSION >= 13000

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Feb 25, 2026

    Greptile Summary

    This PR adds a Hopper TMA (cp.async.bulk) bulk copy kernel and three multi-device unit tests that validate it against local GMEM, VMM-mapped peer (P2P) memory, and NVLS multicast pointers. It also moves alltoallv.cu from csrc/multidevice/ into runtime/ for consistency, and wires tma_copy.cu into the existing NVRTC stringification pipeline.

    Key changes:

    • runtime/tma_copy.cu: New single-warp kernel performing a two-phase TMA copy (GMEM→SMEM via cp.async.bulk + mbarrier, then SMEM→GMEM via cp.async.bulk.global). PTX logic, mbarrier parity, and shared memory layout are all correct.
    • tests/cpp/test_multidevice_tma.cpp: NVRTC helper compiles and caches the kernel at runtime; three tests cover local, inter-device P2P, and multicast write scenarios. Minor: nvrtcProgram is leaked if compilation fails (before nvrtcDestroyProgram can be called), and launchTmaCopy1D does not assert 16-byte GMEM pointer alignment required by TMA.
    • CMakeLists.txt: Correct integration of the new test source and runtime file into the build system, matching the existing pattern used for multicast.cu and alltoallv.cu.

    Confidence Score: 5/5

    • Safe to merge — kernel logic and synchronization are correct, build integration is consistent with existing patterns, and the two minor issues are non-blocking cleanup items.
    • The TMA kernel PTX is correct (mbarrier parity, smem alignment, two-phase copy protocol). The build changes follow established patterns. The two P2 comments (nvrtcDestroyProgram on error path, missing GMEM alignment assertion) are quality-of-life improvements that do not affect correctness in the current tests. The PR is experimental/validation code with no production path yet, consistent with the stated intent.
    • No files require special attention — the only items are optional cleanups in tests/cpp/test_multidevice_tma.cpp.

    Important Files Changed

    Filename Overview
    runtime/tma_copy.cu New TMA 1D bulk copy kernel: implements the two-phase GMEM→SMEM→GMEM copy using cp.async.bulk with mbarrier synchronization. The PTX is correct: parity=0 in try_wait.parity is the right initial phase, the smem layout (128-byte aligned buffer + 8-byte mbarrier at smem+num_bytes) satisfies all alignment requirements, and the warp-level synchronization around the mbarrier.init is sound.
    tests/cpp/test_multidevice_tma.cpp Three multidevice tests for TMA copy (local, P2P, multicast). The NVRTC helper leaks nvrtcProgram on the error path, and launchTmaCopy1D does not assert 16-byte GMEM pointer alignment required by TMA. Both are minor; barrier/synchronization ordering across tests is correct.
    CMakeLists.txt Adds test_multidevice_tma.cpp to the multidevice test target, registers runtime/tma_copy.cu in NVFUSER_RUNTIME_FILES (stringification pipeline), moves alltoallv.cu from csrc/multidevice/ to runtime/ to be consistent with other runtime files, and adds a CMAKE_BINARY_DIR/include include path + nvfuser_rt_tma_copy dependency for the generated header. All changes look correct and consistent with the existing build patterns.
    runtime/alltoallv.cu Pure rename from csrc/multidevice/alltoallv.cu to runtime/alltoallv.cu with no content changes; aligns with the convention of keeping NVRTC-compiled kernel sources under runtime/.
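The parity=0 claim in the review above can be illustrated with a small host-side model of an mbarrier's phase bit; this is a simplified sketch of the semantics the kernel relies on, not full CUDA mbarrier behavior, and all names are illustrative:

```cpp
// Minimal model of an mbarrier with one expected arrival plus a transaction
// byte count, as used by tma_copy_1d: the phase bit flips once the arrival
// and all expected transaction bytes have completed.
struct MbarrierModel {
  int phase = 0;        // current phase parity, starts at 0 after init
  int pending_tx = 0;   // outstanding transaction bytes
  bool arrived = false;

  void arrive_expect_tx(int bytes) { arrived = true; pending_tx += bytes; }

  void complete_tx(int bytes) { // e.g. TMA load delivering bytes to SMEM
    pending_tx -= bytes;
    if (arrived && pending_tx == 0) { phase ^= 1; arrived = false; }
  }

  // try_wait.parity P succeeds once the barrier has moved past phase P,
  // which is why waiting on parity 0 observes the first completion.
  bool try_wait_parity(int p) const { return phase != p; }
};
```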

    Sequence Diagram

    sequenceDiagram
        participant Host as Host (CPU)
        participant T0 as Thread 0 (SM)
        participant SMEM as Shared Memory
        participant SRC as GMEM src
        participant DST as GMEM dst
    
        Host->>T0: cuLaunchKernel(tma_copy_1d, smem=num_bytes+8)
        T0->>SMEM: mbarrier.init(arrival_count=1)
        T0->>T0: fence.mbarrier_init + __syncwarp()
        T0->>SMEM: mbarrier.arrive.expect_tx(num_bytes)
        T0->>SRC: cp.async.bulk.shared::cluster.global [SMEM], [src], num_bytes, [mbar]
        SRC-->>SMEM: TMA Load (async, GMEM→SMEM)
        T0->>T0: mbarrier.try_wait.parity(parity=0) [spin until load done]
        SMEM-->>T0: mbarrier complete (phase flips 0→1)
        T0->>DST: cp.async.bulk.global.shared::cta [dst], [SMEM], num_bytes
        T0->>T0: cp.async.bulk.commit_group
        T0->>T0: cp.async.bulk.wait_group.read 0 [wait store done]
        SMEM-->>DST: TMA Store committed (SMEM→GMEM)
        T0->>SMEM: mbarrier.inval
        T0-->>Host: kernel complete
    

    Reviews (3): Last reviewed commit: "move files to runtime/"

    @greptile-apps greptile-apps bot left a comment

    3 files reviewed, no comments


    @samnordmann
    Collaborator Author

    !test

    @wujingyue wujingyue requested a review from naoyam February 25, 2026 19:57
    Collaborator

    @wujingyue wujingyue left a comment


    @naoyam I noticed

    Fuser/runtime/memory.cu

    Lines 86 to 96 in 005f7e3

    // References:
    //
    // TMA:
    // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
    // https://github.com/NVIDIA/cutlass/blob/main/include/cute/arch/copy_sm90_tma.hpp
    //
    // Tensor map:
    // https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
    // 1D TMA load:
    // https://github.com/NVIDIA/cutlass/blob/main/include/cute/arch/copy_sm90_tma.hpp#L1400
    for codegen. Would you recommend using those building blocks or adding some so it's easier for nvFuser to generate fused comm/gemm in the future?

    @naoyam
    Copy link
    Collaborator

    naoyam commented Feb 26, 2026

    @naoyam I noticed

    Fuser/runtime/memory.cu

    Lines 86 to 96 in 005f7e3

    // References:
    //
    // TMA:
    // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
    // https://github.com/NVIDIA/cutlass/blob/main/include/cute/arch/copy_sm90_tma.hpp
    //
    // Tensor map:
    // https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
    // 1D TMA load:
    // https://github.com/NVIDIA/cutlass/blob/main/include/cute/arch/copy_sm90_tma.hpp#L1400

    for codegen. Would you recommend using those building blocks or adding some so it's easier for nvFuser to generate fused comm/gemm in the future?

    They are mostly just wrappers around some PTX instructions. We could add IR nodes to the Kernel IR and still use them for simpler final codegen (CudaKernelGenerator). For example, there's an IR node for mbarrier.

    The overall design philosophy is to generate Kernel IR that explicitly represents the final CUDA kernel and to minimize the logic needed in CudaKernelGenerator. That is unfortunately not the case in a not-so-small number of cases, but I'd vote for adding explicit TMA nodes to the Kernel IR to make it explicit.

    @samnordmann
    Collaborator Author

    Ok regarding codegen; however, this PR is not about codegen. The present TMA kernel is used as a "host op" to perform inter-GPU comms, similarly to a cudaMemcpyAsync. This PR provides a reference implementation, and the next one adds this transport as a possible p2p backend.

    I am not sure I understand -- are you ok with the PR's current implementation, or do you suggest something else?

    @samnordmann
    Collaborator Author

    @naoyam @wujingyue
    Gentle reminder: do you request any changes?

    Collaborator

    @wujingyue wujingyue left a comment


    however, this pr is not about code gen

    I understood that. I may have overread your intention to reuse csrc/multidevice/tma_copy.cu for codegen as well. For that purpose, I agree with @naoyam on kernel IR.

    Collaborator


    @naoyam organization-wise, do you prefer to move this (and alltoallv.cu) to runtime/tma_copy.cu?

    Collaborator


    Yes, since that directory is the one where we hold all runtime code.

    @samnordmann
    Collaborator Author

    !test

    Comment on lines +64 to +76
    nvrtcResult res = nvrtcCompileProgram(prog, (int)opts.size(), opts.data());
    if (res != NVRTC_SUCCESS) {
      size_t logSize;
      NVFUSER_NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
      std::vector<char> log(logSize);
      NVFUSER_NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log.data()));
      NVF_ERROR(
          false,
          "NVRTC compilation of '",
          source_name,
          "' failed:\n",
          log.data());
    }
    Contributor


    P2 nvrtcDestroyProgram leaked on compilation error

    When nvrtcCompileProgram fails, the error path reads the log and then calls NVF_ERROR which throws. nvrtcDestroyProgram(&prog) is never called on this path, leaking the NVRTC program object. While NVRTC programs are small and this only triggers on failure, a guard ensures clean teardown:

      nvrtcResult res = nvrtcCompileProgram(prog, (int)opts.size(), opts.data());
      if (res != NVRTC_SUCCESS) {
        size_t logSize;
        NVFUSER_NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
        std::vector<char> log(logSize);
        NVFUSER_NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log.data()));
        nvrtcDestroyProgram(&prog);
        NVF_ERROR(
            false,
            "NVRTC compilation of '",
            source_name,
            "' failed:\n",
            log.data());
      }

    Comment on lines +107 to +118
    void launchTmaCopy1D(
        void* dst,
        const void* src,
        int num_bytes,
        CUstream stream = nullptr) {
      NVF_CHECK(num_bytes > 0 && num_bytes % 16 == 0);
      CUfunction tma_kernel = getTmaCopy1dKernel();
      int smem_size = num_bytes + static_cast<int>(sizeof(uint64_t));
      void* args[] = {&dst, &src, &num_bytes};
      NVFUSER_CUDA_SAFE_CALL(cuLaunchKernel(
          tma_kernel, 1, 1, 1, 32, 1, 1, smem_size, stream, args, nullptr));
    }
    Contributor


    P2 Missing GMEM pointer alignment check

    cp.async.bulk (both load and store forms) requires the global memory address to be 16-byte aligned. The function checks num_bytes % 16 == 0 but neither src nor dst alignment is verified. In the current tests all pointers come from PyTorch/VMM allocations that are always aligned, but an explicit assertion would guard against future callers:

    Suggested change

    void launchTmaCopy1D(
        void* dst,
        const void* src,
        int num_bytes,
        CUstream stream = nullptr) {
      NVF_CHECK(num_bytes > 0 && num_bytes % 16 == 0);
      CUfunction tma_kernel = getTmaCopy1dKernel();
      int smem_size = num_bytes + static_cast<int>(sizeof(uint64_t));
      void* args[] = {&dst, &src, &num_bytes};
      NVFUSER_CUDA_SAFE_CALL(cuLaunchKernel(
          tma_kernel, 1, 1, 1, 32, 1, 1, smem_size, stream, args, nullptr));
    }

    void launchTmaCopy1D(
        void* dst,
        const void* src,
        int num_bytes,
        CUstream stream = nullptr) {
      NVF_CHECK(num_bytes > 0 && num_bytes % 16 == 0);
      NVF_CHECK(
          reinterpret_cast<uintptr_t>(src) % 16 == 0 &&
              reinterpret_cast<uintptr_t>(dst) % 16 == 0,
          "TMA cp.async.bulk requires 16-byte aligned GMEM addresses");
      CUfunction tma_kernel = getTmaCopy1dKernel();
      int smem_size = num_bytes + static_cast<int>(sizeof(uint64_t));
      void* args[] = {&dst, &src, &num_bytes};
      NVFUSER_CUDA_SAFE_CALL(cuLaunchKernel(
          tma_kernel, 1, 1, 1, 32, 1, 1, smem_size, stream, args, nullptr));
    }

    @samnordmann samnordmann merged commit 617fa07 into main Mar 23, 2026
    50 of 51 checks passed
    @samnordmann samnordmann deleted the tma_p2p branch March 23, 2026 17:02