
Symmetric memory pytorch backends #6023

Open — saivishal1999 wants to merge 17 commits into main from symmetric-memory-pytorch-backends

Conversation

@saivishal1999 (Collaborator)

No description provided.

github-actions bot commented Mar 2, 2026

Review updated until commit 6996d05

Description

  • Add PyTorch symmetric memory backends (NCCL, NVSHMEM, CUDA) as alternatives to native VMM

  • Implement getSymmetricMemoryBackend() to select backend via NVFUSER_ENABLE=symmetric_memory_backend option

  • Integrate PyTorch's c10d::symmetric_memory for allocation, rendezvous, and remote tensor access

  • Add Communicator methods to expose Store and Backend for PyTorch symmetric memory integration

Changes walkthrough

Relevant files:

Enhancement (6 files)
  ipc_utils.h: Add SymmetricMemoryBackend enum and getter (+13/-0)
  ipc_utils.cpp: Implement getSymmetricMemoryBackend option parsing (+18/-0)
  symmetric_tensor.h: Add PyTorch symmetric memory handle member (+15/-6)
  symmetric_tensor.cpp: Implement PyTorch backend allocation and remote access (+162/-1)
  communicator.h: Declare getStore and getWorldBackendIntrusivePtr (+13/-0)
  communicator.cpp: Implement getStore and getWorldBackendIntrusivePtr (+16/-0)

Configuration changes (2 files)
  options.h: Add SymmetricMemoryBackend to EnableOption enum (+2/-0)
  options.cpp: Register symmetric_memory_backend enable option (+1/-0)

Tests (1 file)
  test_multidevice_symmetric_tensor.cpp: Add tests for symmetric memory backend selection (+108/-0)

Miscellaneous (1 file)
  fbuild.sh: Add build script for development (+24/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Silent fallback to Native backend

When an invalid argument is passed to the symmetric_memory_backend option (e.g., "pytorch_invalid"),
getSymmetricMemoryBackend() silently falls back to Native instead of reporting an error.
This could mask user configuration mistakes. Consider adding validation that warns or errors
on unknown backend arguments.

SymmetricMemoryBackend getSymmetricMemoryBackend() {
  if (isOptionEnabled(EnableOption::SymmetricMemoryBackend)) {
    if (hasEnableOptionArgument(
            EnableOption::SymmetricMemoryBackend, "pytorch_nccl")) {
      return SymmetricMemoryBackend::PyTorchNccl;
    }
    if (hasEnableOptionArgument(
            EnableOption::SymmetricMemoryBackend, "pytorch_nvshmem")) {
      return SymmetricMemoryBackend::PyTorchNvshmem;
    }
    if (hasEnableOptionArgument(
            EnableOption::SymmetricMemoryBackend, "pytorch_cuda")) {
      return SymmetricMemoryBackend::PyTorchCuda;
    }
  }
  return SymmetricMemoryBackend::Native;
}
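One possible shape for such validation, as a standalone sketch: the enum and accepted strings mirror the snippet above, but the function name and error type are illustrative; the real implementation would presumably route through NVF_CHECK and the enable-option helpers instead.

```cpp
#include <stdexcept>
#include <string>

enum class SymmetricMemoryBackend { Native, PyTorchNccl, PyTorchNvshmem, PyTorchCuda };

// Maps the option argument to a backend, throwing on unrecognized strings
// instead of silently falling back to Native.
SymmetricMemoryBackend parseSymmetricMemoryBackend(const std::string& arg) {
  if (arg == "pytorch_nccl") {
    return SymmetricMemoryBackend::PyTorchNccl;
  }
  if (arg == "pytorch_nvshmem") {
    return SymmetricMemoryBackend::PyTorchNvshmem;
  }
  if (arg == "pytorch_cuda") {
    return SymmetricMemoryBackend::PyTorchCuda;
  }
  if (arg == "native" || arg.empty()) {
    return SymmetricMemoryBackend::Native;
  }
  throw std::invalid_argument(
      "Unknown symmetric_memory_backend argument: " + arg);
}
```

With this shape, a typo like "pytorch_invalid" fails loudly at option-parsing time rather than silently selecting the native path.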
PyTorch backend tests commented out

The test PyTorchBackend_RemoteAccessCorrectness (lines 125-163) is commented out. Since this
PR introduces PyTorch symmetric memory backends, having at least one active test for the
non-native paths would be valuable to ensure correctness. Consider enabling or adding an
alternative test for the PyTorch backend path.

// TEST_F(SymmetricTensorTest, PyTorchBackend_RemoteAccessCorrectness) {
//   if (communicator_->size() == 1) {
//     GTEST_SKIP() << "Skipping test for single device";
//   }
//   SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
//   if (backend == SymmetricMemoryBackend::Native) {
//     GTEST_SKIP()
//         << "PyTorch backend not selected; set NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl) to run";
//   }

//   const int64_t rank = communicator_->deviceId();
//   const int64_t world_size = communicator_->size();

//   at::Tensor local_tensor = SymmetricTensor::allocate(
//       {256, 512}, at::ScalarType::Float, communicator_->device());
//   SymmetricTensor sym_tensor(local_tensor);

//   EXPECT_TRUE(local_tensor.is_cuda());
//   EXPECT_EQ(local_tensor.numel(), 256 * 512);

//   float local_value = static_cast<float>(rank + 200);
//   local_tensor.fill_(local_value);

//   sym_tensor.setupRemoteHandles();

//   for (int64_t peer_rank = 0; peer_rank < world_size; ++peer_rank) {
//     void* peer_ptr = sym_tensor.remoteTensor(peer_rank).data_ptr();
//     EXPECT_NE(peer_ptr, nullptr);

//     float peer_value;
//     NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
//         &peer_value, peer_ptr, sizeof(float), cudaMemcpyDeviceToHost));

//     float expected_value = static_cast<float>(peer_rank + 200);
//     EXPECT_FLOAT_EQ(peer_value, expected_value)
//         << "Rank " << rank << " reading from rank " << peer_rank
//         << " (PyTorch backend)";
//   }
// }
Unnecessary build script added

A new file fbuild.sh was added which appears to be a local development/build script with
hardcoded paths (e.g., /opt/hpcx/ucc). This should likely be removed from the PR as it's
not part of the feature implementation and contains machine-specific configuration.

#!/bin/bash

export CC=clang-20
export CXX=clang++-20
export LDFLAGS="-fuse-ld=mold"

export NVFUSER_BUILD_ENABLE_PCH

export UCC_HOME="/opt/hpcx/ucc"
export UCC_DIR="/opt/hpcx/ucc/lib/cmake/ucc"
export UCX_HOME="/opt/hpcx/ucx"
export UCX_DIR="/opt/hpcx/ucx/lib/cmake/ucx"

# export TORCH_CUDA_ARCH_LIST="9.0"

export NVFUSER_BUILD_WITH_UCC=1
export NVFUSER_BUILD_INSTALL_DIR=$BUILD_DIRECTORY/nvfuser
export NVFUSER_BUILD_DIR=$BUILD_DIRECTORY

# Enable debug mode, leave empty for non-debug compilation
export NVFUSER_BUILD_BUILD_TYPE=Debug
export RUN_CMAKE=""

pip install -v -e ./python --no-build-isolation

greptile-apps bot commented Mar 2, 2026 (Contributor)

Greptile Summary

This PR adds PyTorch's symmetric memory layer (torch.distributed._symmetric_memory) as an optional backend for SymmetricTensor, selectable via NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl|pytorch_nvshmem|pytorch_cuda). The native CUDA VMM + IPC path remains the default. Along the way, it adds several correctness fixes to the existing native path (unsigned-loop UB, uninitialized struct members, nullptr-instead-of-NULL cleanups).

Key changes:

  • New SymmetricMemoryBackend enum and getSymmetricMemoryBackend() parser in ipc_utils
  • SymmetricTensor::allocate / constructor / setupRemoteHandles / setupMulticast all branch on the active backend
  • Communicator::getBackendForTeam now additionally wraps the NCCL Backend in a c10d::ProcessGroup and registers it in c10d's global group registry so PyTorch's symmetric-memory rendezvous can resolve it
  • New getSymmMemGroupKey helper exposes the registered key to ensurePyTorchSymmMemBackend

Issues still requiring attention:

  • comm.barrier() inside ensurePyTorchSymmMemBackend executes on every call (not just first-time setup), adding an NCCL barrier to every allocate() and setupRemoteHandles() invocation after initialization is complete — see comment at line 63
  • ProcessGroup wrapper registration in getBackendForTeam is inside the backends_.find(team_key) == backends_.end() guard, meaning it is silently skipped if the world backend was already created (e.g., via an earlier comm.barrier()) before the first PyTorch symmetric memory operation — see comment at line 415
  • process_groups_ is declared under #if NVFUSER_DISTRIBUTED && USE_DISTRIBUTED but the cleanup loop iterates it under the broader #if NVFUSER_DISTRIBUTED, producing a compile error when NVFUSER_DISTRIBUTED is set without USE_DISTRIBUTED
  • The "0" group alias registered in ensurePyTorchSymmMemBackend is never cleaned up in Communicator::cleanup(), leaving a stale registration across test teardowns
  • strides.back() is called without an empty-sizes guard in both the PyTorch and native allocation paths, yielding UB for 0-dimensional tensors
  • The static std::once_flag once in ensurePyTorchSymmMemBackend permanently locks in whichever backend is used first; a second call with a different backend silently proceeds with the wrong set_backend state
  • The PyTorch backend paths (pytorch_nccl, pytorch_nvshmem, pytorch_cuda) lack dedicated end-to-end tests — existing tests provide only allocation and basic remote-access coverage

Confidence Score: 2/5

  • Not ready to merge — contains a potential compile error in certain build configurations and multiple functional correctness issues in the new PyTorch backend paths.
  • Several previous-round concerns have been addressed (debug print removed, unsigned-loop UB fixed with explicit cast, NVF_THROW signature corrected, register_process_group now called via getBackendForTeam, group-key prefix mismatch resolved, redundant getSymmetricMemoryBackend double-call eliminated). However, a number of blocking issues remain: the process_groups_ cleanup guard mismatch is a hard compile error in NVFUSER_DISTRIBUTED-without-USE_DISTRIBUTED builds; the ProcessGroup creation being gated inside the first-backend-creation path means a prior barrier() call silently breaks symmetric memory rendezvous; the barrier-on-every-call concern adds unnecessary overhead and asymmetric-call risk; and the "0" alias is never unregistered. The PyTorch backend also lacks dedicated CI-visible tests.
  • csrc/multidevice/symmetric_tensor.cpp and csrc/multidevice/communicator.cpp require the most attention — the former for the repeated barrier and the latter for the ProcessGroup registration ordering issue and the #if guard mismatch in cleanup().

Important Files Changed

Filename Overview
csrc/multidevice/symmetric_tensor.cpp Core of the PR: adds PyTorch symmetric memory backends (NCCL, NVSHMEM, CUDA) alongside the native VMM path. Several issues remain: redundant barrier on every ensurePyTorchSymmMemBackend call, strides.back() UB for 0-dim tensors, once_flag binding to first-called backend only, and "0" alias registration that leaks across test teardowns.
csrc/multidevice/communicator.cpp Adds ProcessGroup wrapper creation/registration inside getBackendForTeam and the new getSymmMemGroupKey helper. Key issue: ProcessGroup is only created during the first backend creation, so any prior getWorld()/barrier() call silently prevents it from being registered. Also the process_groups_ cleanup loop is under #if NVFUSER_DISTRIBUTED while the member is declared under #if NVFUSER_DISTRIBUTED && USE_DISTRIBUTED causing a compile error in some configs.
csrc/multidevice/communicator.h Adds getSymmMemGroupKey declaration and the process_groups_ unordered_map member (guarded by NVFUSER_DISTRIBUTED && USE_DISTRIBUTED). The include guard was correctly tightened from NVFUSER_DISTRIBUTED to NVFUSER_DISTRIBUTED && USE_DISTRIBUTED.
tests/cpp/test_multidevice_symmetric_tensor.cpp Adds ContiguousView guard to skip for non-native backends, and new SmallAllocation/SmallAllocationMulticast tests. The new PyTorch-specific backends (pytorch_nccl, pytorch_nvshmem, pytorch_cuda) still have no dedicated end-to-end test; the existing tests run for all backends but only exercise allocation/remote-access, not the rendezvous and group-registration paths.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant SymTensor as SymmetricTensor
    participant ensurePT as ensurePyTorchSymmMemBackend
    participant Comm as Communicator
    participant PyTorch as c10d::symmetric_memory

    Note over Caller,PyTorch: Allocation path (PyTorch backend)
    Caller->>SymTensor: allocate(sizes, dtype, device)
    SymTensor->>ensurePT: ensurePyTorchSymmMemBackend(backend)
    ensurePT->>PyTorch: set_backend("NCCL"|"NVSHMEM") [once]
    ensurePT->>Comm: getSymmMemGroupKey(kNccl)
    Comm->>Comm: getBackendForTeam(all_ranks, kNccl)
    Note right of Comm: Creates ProcessGroupNCCL + <br/>c10d::ProcessGroup wrapper,<br/>registers under team_key
    Comm-->>ensurePT: group_name = team_key
    ensurePT->>PyTorch: resolve_process_group("0") [once, fallback]
    ensurePT->>Comm: barrier(kNccl) ⚠️ every call
    ensurePT-->>SymTensor: group_name
    SymTensor->>PyTorch: empty_strided_p2p(sizes, strides, dtype, device, alloc_group_name)
    PyTorch-->>SymTensor: local_tensor (PyTorch-managed)
    SymTensor-->>Caller: local_tensor

    Note over Caller,PyTorch: Rendezvous path
    Caller->>SymTensor: setupRemoteHandles(tag)
    SymTensor->>ensurePT: ensurePyTorchSymmMemBackend(backend)
    ensurePT->>Comm: barrier(kNccl) ⚠️ redundant on 2nd+ call
    ensurePT-->>SymTensor: group_name
    SymTensor->>PyTorch: rendezvous(local_tensor, group_name)
    PyTorch-->>SymTensor: torch_symm_handle_
    Note right of SymTensor: Sets are_remote_tensors_setup_=true,<br/>optionally is_multicast_setup_=true

    Caller->>SymTensor: remoteTensor(rank)
    SymTensor->>PyTorch: torch_symm_handle_->get_remote_tensor(rank, ...)
    PyTorch-->>Caller: remote_tensor

Reviews (13): Last reviewed commit: "Merge branch 'main' into symmetric-memor..."

@greptile-apps bot left a comment (Contributor)

10 files reviewed, 4 comments

fbuild.sh Outdated
Comment on lines +1 to +24

Personal developer build script committed to repository

This script contains machine-specific, hardcoded toolchain paths that are unlikely to work anywhere except the author's development machine:

  • clang-20 and clang++-20 — not a standard compiler version available broadly
  • -fuse-ld=mold — requires the mold linker to be installed
  • /opt/hpcx/ucc and /opt/hpcx/ucx — HPC-X installation path specific to the author's environment
  • $BUILD_DIRECTORY is used but never validated; if it is unset, NVFUSER_BUILD_INSTALL_DIR and NVFUSER_BUILD_DIR will silently be empty strings, likely breaking the build

This kind of personal convenience script should live outside version control (e.g., in a .gitignore-d directory or in the author's home directory). Committing it to the main repo risks confusing other contributors and cluttering the root directory.
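If a script like this is ever kept in-tree, the unset-variable hazard is easy to close with bash's `${VAR:?word}` expansion, which aborts with a clear message instead of silently expanding to an empty string. A minimal sketch (the helper name and paths are illustrative):

```shell
#!/bin/bash
# Abort with a clear message when BUILD_DIRECTORY is unset or empty,
# instead of silently deriving empty install/build paths.
check_build_dir() {
  : "${BUILD_DIRECTORY:?BUILD_DIRECTORY must be set before building}"
}

# Unset: the subshell aborts before any paths are derived.
( unset BUILD_DIRECTORY; check_build_dir 2>/dev/null ) \
  && echo "unexpected" || echo "caught unset BUILD_DIRECTORY"

# Set: derived paths expand as intended.
BUILD_DIRECTORY=/tmp/nvfuser-build
check_build_dir && echo "NVFUSER_BUILD_DIR=$BUILD_DIRECTORY"
```

The `: "${VAR:?message}"` idiom is POSIX, so it works even if the script is later run under a plain sh.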

Comment on lines +46 to +72
void ensurePyTorchSymmMemBackend(SymmetricMemoryBackend backend) {
  static std::once_flag once;
  std::call_once(once, [backend]() {
    const char* name = nullptr;
    switch (backend) {
      case SymmetricMemoryBackend::PyTorchNccl:
        name = "NCCL";
        break;
      case SymmetricMemoryBackend::PyTorchNvshmem:
        name = "NVSHMEM";
        break;
      case SymmetricMemoryBackend::PyTorchCuda:
        name = "CUDA";
        break;
      default:
        NVF_ERROR(false, "Unexpected PyTorch symmetric memory backend");
    }
    c10d::symmetric_memory::set_backend(name);
    Communicator& comm = Communicator::getInstance();
    NVF_CHECK(comm.is_available(), "Communicator not available for symmetric memory");
    c10d::symmetric_memory::set_group_info(
        kPyTorchSymmMemGroupName,
        static_cast<int>(comm.deviceId()),
        static_cast<int>(comm.size()),
        comm.getStore());
  });
}

NCCL backend initialization is incomplete — register_process_group is never called

ensurePyTorchSymmMemBackend calls set_group_info but never calls c10d::register_process_group. According to the comment added to communicator.h for getWorldBackendIntrusivePtr:

Returns the world backend as an intrusive_ptr so it can be registered with c10d::register_process_group (e.g. for PyTorch symmetric memory NCCL rendezvous, which resolves the group by name).

getWorldBackendIntrusivePtr was clearly introduced to supply the backend for this registration, yet the call to c10d::register_process_group is absent from ensurePyTorchSymmMemBackend. PyTorch's NCCL symmetric-memory rendezvous resolves the process group by name at the point it is called; without a prior register_process_group(kPyTorchSymmMemGroupName, ...), the NCCL backend path will fail to locate the group and throw at rendezvous time.

The missing call should be something like:

// After set_group_info, for NCCL backend:
c10d::register_process_group(
    kPyTorchSymmMemGroupName,
    comm.getWorldBackendIntrusivePtr(CommunicatorBackend::kNccl));

The fact that getWorldBackendIntrusivePtr was added in this exact PR but is never invoked strongly suggests this step was accidentally left out.

Comment on lines +125 to +163

Entire PyTorch backend correctness test is commented out

PyTorchBackend_RemoteAccessCorrectness is the only test that exercises the new PyTorch backend path end-to-end (allocation → rendezvous → remote access). Leaving it commented out means the three new backend variants (pytorch_nccl, pytorch_nvshmem, pytorch_cuda) have zero test coverage in CI.

The comment says it should be run manually with NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl), but that means regressions in the PyTorch path will go undetected in normal CI runs.

If the test can't pass yet (e.g., because the NCCL register_process_group call is missing), that's a strong signal to fix the underlying issue rather than suppress the test. If the test is intentionally deferred, consider converting it into a proper GTEST_SKIP with an explanatory message so the intent is visible to reviewers and CI.

Comment on lines +150 to +152
std::vector<int64_t> strides(sizes.size());
strides.back() = 1;
for (int64_t i = (int64_t)strides.size() - 2; i >= 0; --i) {

Undefined behavior when sizes is empty (0-dim tensor)

std::vector<int64_t> strides(sizes.size());
strides.back() = 1;   // UB if sizes is empty

std::vector::back() on an empty vector is undefined behaviour. The same guard-free pattern also exists in the native path further down in the same function (~line 225). While allocating a 0-dimensional symmetric tensor is unusual, the PyTorch path that was just added adds a new callsite where callers may pass {} as sizes. A simple check is sufficient:

NVF_CHECK(!sizes.empty(), "Cannot allocate a 0-dim symmetric tensor");

or initialise strides defensively (matching the standard row-major convention for 0-dim tensors, which is an empty strides vector) and skip the loop entirely when sizes is empty.
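A guarded version of that stride computation might look like the following standalone sketch, following the standard row-major convention (the helper name is illustrative, not the PR's actual code):

```cpp
#include <cstdint>
#include <vector>

// Row-major (contiguous) strides with an explicit empty-sizes guard:
// a 0-dim tensor gets an empty strides vector by convention, instead of
// calling strides.back() on an empty container (undefined behaviour).
std::vector<int64_t> contiguousStrides(const std::vector<int64_t>& sizes) {
  std::vector<int64_t> strides(sizes.size());
  if (sizes.empty()) {
    return strides;  // 0-dim: empty strides, loop skipped entirely
  }
  strides.back() = 1;
  for (int64_t i = static_cast<int64_t>(strides.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * sizes[i + 1];
  }
  return strides;
}
```

For example, sizes {256, 512} yield strides {512, 1}, and sizes {} yield an empty strides vector rather than UB.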

nsarka (Member) commented Mar 3, 2026

Sorry! I accidentally hit the button to merge main into the branch. Hopefully it's ok.

Comment on lines +46 to +72

std::call_once exception-safety leaves set_backend in a permanently broken state on retry

std::call_once resets its once_flag if the callable exits via an exception, allowing a subsequent call to retry. However, the callable here calls set_backend(name) before set_group_info(...). If set_backend succeeds but set_group_info subsequently throws (e.g., because the store is unavailable), once_flag is reset and the next allocate() call will attempt set_backend(name) a second time. PyTorch's symmetric memory layer is likely to throw on that second set_backend call (backend already configured), making it impossible to recover without restarting the process.

A straightforward mitigation is to separate the two calls into distinct phases or to wrap set_backend in its own protection:

// Separate once-flags for each idempotent step, or catch and suppress
// the "already set" error from set_backend on retry:
try {
  c10d::symmetric_memory::set_backend(name);
} catch (const std::exception& e) {
  // If the backend is already set to the correct name, treat as success.
  // Re-throw otherwise.
}
c10d::symmetric_memory::set_group_info(
    kPyTorchSymmMemGroupName, ...);

Alternatively, split the once_flag so set_backend has its own dedicated guard that truly runs at most once, while set_group_info can retry on failure.
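The retry semantics can be shown in isolation. This standalone sketch models set_backend and set_group_info with plain counters (purely illustrative, no PyTorch calls): the non-idempotent step sits under its own once_flag and truly runs at most once, while the fallible step's flag is reset on exception and may retry.

```cpp
#include <mutex>
#include <stdexcept>

// std::call_once does NOT consume the once_flag when the callable throws;
// the next call retries. Splitting the flags keeps the non-idempotent step
// (modeled by backend_sets) at-most-once while the fallible step
// (modeled by group_info_attempts) can safely retry.
int backend_sets = 0;
int group_info_attempts = 0;

void ensureOnce(bool fail_group_info) {
  static std::once_flag backend_once;
  static std::once_flag group_once;
  std::call_once(backend_once, [] { ++backend_sets; });  // runs at most once
  std::call_once(group_once, [fail_group_info] {
    ++group_info_attempts;
    if (fail_group_info) {
      // Exception propagates; group_once stays unconsumed, so a later
      // call re-enters this lambda without re-running the backend step.
      throw std::runtime_error("store unavailable");
    }
  });
}
```

After one failing call and one succeeding call, the backend step has run exactly once while the group-info step has run twice, which is the recovery behaviour the single-flag version cannot provide.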

Comment on lines +504 to +511
void* SymmetricTensor::multicastPtr() const {
#ifdef NVFUSER_DISTRIBUTED
  if (py_symm_handle_) {
    return py_symm_handle_->has_multicast_support()
        ? py_symm_handle_->get_multicast_ptr()
        : nullptr;
  }
#endif

multicastPtr() silently returns nullptr for PyTorch backend when multicast is not supported, which is inconsistent with the native path (which calls NVF_CHECK(is_multicast_setup_, "Multicast not setup")).

Any caller that does not check for nullptr before using the pointer will trigger a null pointer dereference / silent GPU fault rather than a clear diagnostic error.

Consider throwing or at least asserting instead of silently returning nullptr:

Suggested change
void* SymmetricTensor::multicastPtr() const {
#ifdef NVFUSER_DISTRIBUTED
  if (py_symm_handle_) {
    return py_symm_handle_->has_multicast_support()
        ? py_symm_handle_->get_multicast_ptr()
        : nullptr;
  }
#endif
void* SymmetricTensor::multicastPtr() const {
#ifdef NVFUSER_DISTRIBUTED
  if (py_symm_handle_) {
    NVF_CHECK(
        py_symm_handle_->has_multicast_support(),
        "Multicast not supported by the selected PyTorch symmetric memory backend.");
    return py_symm_handle_->get_multicast_ptr();
  }
#endif
  NVF_CHECK(is_multicast_setup_, "Multicast not setup");
  return mc_ptr_;
}

This brings the error contract in line with the native path, where multicastPtr() always either returns a valid pointer or throws.

Comment on lines +398 to +399
if (getSymmetricMemoryBackend() != SymmetricMemoryBackend::Native) {
ensurePyTorchSymmMemBackend(getSymmetricMemoryBackend());

getSymmetricMemoryBackend() is invoked twice in back-to-back lines, which redundantly re-parses the option string on each call. A single local variable should be used:

Suggested change
if (getSymmetricMemoryBackend() != SymmetricMemoryBackend::Native) {
  ensurePyTorchSymmMemBackend(getSymmetricMemoryBackend());
SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
if (backend != SymmetricMemoryBackend::Native) {
  ensurePyTorchSymmMemBackend(backend);

Comment on lines +20 to +28
TEST_F(SymmetricTensorTest, GetSymmetricMemoryBackend_ReturnsValidBackend) {
  SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
  EXPECT_TRUE(
      backend == SymmetricMemoryBackend::Native ||
      backend == SymmetricMemoryBackend::PyTorchNccl ||
      backend == SymmetricMemoryBackend::PyTorchNvshmem ||
      backend == SymmetricMemoryBackend::PyTorchCuda)
      << "getSymmetricMemoryBackend() returned an invalid backend value";
}

GetSymmetricMemoryBackend_ReturnsValidBackend test is trivially true. Every branch of getSymmetricMemoryBackend() explicitly returns one of the four enum values listed in the EXPECT_TRUE condition, so there is no code path that could return a fifth value. This test can never fail and provides no meaningful coverage.

If the intent is to document the valid values, a static assertion in ipc_utils.cpp would be more appropriate. If the intent is to test that the env-var parsing correctly maps strings to enum values, the test should set up specific NVFUSER_ENABLE strings and assert the exact expected enum variant (e.g., set pytorch_nccl and assert PyTorchNccl).

@samnordmann (Collaborator) left a comment

Thank you! Some minor comments.
Please add a test, fix the linter, and run the CI with the !test command (comment directly on the PR).

- name: Run lintrunner

// Symmetric memory backend and option tests
// -----------------------------------------------------------------------------

TEST_F(SymmetricTensorTest, GetSymmetricMemoryBackend_ReturnsValidBackend) {

not a useful test

}
}

// Same remote-access correctness as BasicAllocation but only runs when

This is the only test but it is commented out. Either remove it or un-comment it. An idea would be to reuse the pre-existing tests but parametrize them with the new backends.

// - Native (default): Fuser's own CUDA VMM + IPC implementation; maintained.
// - PyTorch (Nccl, Nvshmem, Cuda): Use PyTorch's symmetric memory
// (torch.distributed._symmetric_memory) with the chosen transport backend.
// Select via NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl|pytorch_nvshmem|pytorch_cuda).

the selection should also be about native and contain it as an option

// further fragment the memory. On the other hand, having our own implementation
// allows us to experiment with more advanced features like contiguous view creation.
// Backends (see SymmetricMemoryBackend in ipc_utils.h):
// - Native (default): Fuser's own CUDA VMM + IPC implementation; maintained.

Suggested change
// - Native (default): Fuser's own CUDA VMM + IPC implementation; maintained.
// - Native (default): Fuser's own CUDA VMM + IPC implementation.

Comment on lines +88 to +89
// When set, remote/multicast APIs delegate to PyTorch symmetric memory.
c10::intrusive_ptr<c10d::symmetric_memory::SymmetricMemory> py_symm_handle_;

py_ prefix wrongly suggests python.
I am not sure I understand the comment.

Suggested change
// When set, remote/multicast APIs delegate to PyTorch symmetric memory.
c10::intrusive_ptr<c10d::symmetric_memory::SymmetricMemory> py_symm_handle_;
c10::intrusive_ptr<c10d::symmetric_memory::SymmetricMemory> symm_handle_;

#ifdef NVFUSER_DISTRIBUTED
// PyTorch backend: perform rendezvous here (lazy, on first setupRemoteHandles).
if (getSymmetricMemoryBackend() != SymmetricMemoryBackend::Native) {
ensurePyTorchSymmMemBackend(getSymmetricMemoryBackend());

has already been called in the constructor

Comment on lines +522 to +523
NVF_ERROR(
false,

Suggested change
NVF_ERROR(
false,
NVF_THROW(

return store_.get();
}

#ifdef NVFUSER_DISTRIBUTED

why do we need guard here?


#ifdef NVFUSER_DISTRIBUTED
#include <torch/csrc/distributed/c10d/Backend.hpp>
#include <torch/csrc/distributed/c10d/Store.hpp>

not needed

Comment on lines +129 to +137
// Returns the store as an intrusive_ptr for use with PyTorch symmetric
// memory (c10d::symmetric_memory::set_group_info).
c10::intrusive_ptr<c10d::Store> getStore() const;

// Returns the world backend as an intrusive_ptr so it can be registered with
// c10d::register_process_group (e.g. for PyTorch symmetric memory NCCL
// rendezvous, which resolves the group by name).
c10::intrusive_ptr<c10d::Backend> getWorldBackendIntrusivePtr(
std::optional<CommunicatorBackend> backend = std::nullopt);

rather, change the signature of the existing getter method to return intrusive_ptr instead of raw pointer

Comment on lines +461 to +468
std::string Communicator::getSymmMemGroupKey(
    std::optional<CommunicatorBackend> backend) {
  std::vector<RankType> all_ranks(size_);
  std::iota(all_ranks.begin(), all_ranks.end(), 0);
  CommunicatorBackend b = backend.value_or(default_backend_);
  (void)getBackendForTeam(all_ranks, b, "symm_mem_");
  return getTeamKey(all_ranks, b);
}
Contributor:

getSymmMemGroupKey returns key without "symm_mem_" prefix — mismatch with registered process group

getBackendForTeam(all_ranks, b, "symm_mem_") registers the process group under the key "symm_mem_" + getTeamKey(all_ranks, b) (see the register_process_group call in that function). However, getSymmMemGroupKey then returns just getTeamKey(all_ranks, b) — without the "symm_mem_" prefix.

The returned key is subsequently used in ensurePyTorchSymmMemBackend as the group_name passed to both set_group_info and rendezvous. Newer NCCL builds resolve the process group by name at rendezvous time; they will look for a process group registered as "nccl0,1,..." but only "symm_mem_nccl0,1,..." exists, causing rendezvous to fail.

The current workaround that registers under "0" papers over this for older NCCL, but the mismatch will surface as soon as the TODO comment is resolved and older-NCCL special-casing is removed.

The return statement should return the full team_key including the prefix:

Suggested change
std::string Communicator::getSymmMemGroupKey(
std::optional<CommunicatorBackend> backend) {
std::vector<RankType> all_ranks(size_);
std::iota(all_ranks.begin(), all_ranks.end(), 0);
CommunicatorBackend b = backend.value_or(default_backend_);
(void)getBackendForTeam(all_ranks, b, "symm_mem_");
return getTeamKey(all_ranks, b);
}
std::string Communicator::getSymmMemGroupKey(
std::optional<CommunicatorBackend> backend) {
std::vector<RankType> all_ranks(size_);
std::iota(all_ranks.begin(), all_ranks.end(), 0);
CommunicatorBackend b = backend.value_or(default_backend_);
const std::string prefix = "symm_mem_";
(void)getBackendForTeam(all_ranks, b, prefix);
return prefix + getTeamKey(all_ranks, b);
}

Collaborator:

seems relevant, wdyt ?

Comment on lines +142 to +144
c10::intrusive_ptr<c10d::Store> getStore() const {
return c10::intrusive_ptr<c10d::Store>(store_);
}
Contributor:

getStore() uses non-idiomatic intrusive_ptr construction

c10::intrusive_ptr<c10d::Store>(store_) passes the raw TCPStore* obtained from store_ (via the implicit operator T* of intrusive_ptr) to a new intrusive_ptr<Store>. This calls the unsafe intrusive_ptr<T>(T*, bool) constructor that takes an already-retained raw pointer — but store_ is managed and this path risks a ref-count imbalance.

The idiomatic way is to let the intrusive_ptr copy-conversion handle it:

Suggested change
c10::intrusive_ptr<c10d::Store> getStore() const {
return c10::intrusive_ptr<c10d::Store>(store_);
}
c10::intrusive_ptr<c10d::Store> getStore() const {
return store_;
}

Comment on lines +405 to +418
if(is_multicast_setup_==false) {
SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
if (backend != SymmetricMemoryBackend::Native) {
const std::string group_name = ensurePyTorchSymmMemBackend(backend);
torch_symm_handle_ = c10d::symmetric_memory::rendezvous(
local_tensor_, group_name);
are_remote_tensors_setup_ = true;
if (torch_symm_handle_->has_multicast_support()) {
is_multicast_setup_ = true;
mc_ptr_ = torch_symm_handle_->get_multicast_ptr();
}
return;
}
}
Contributor:

P1 if(is_multicast_setup_==false) guard is dead code for PyTorch backend

is_multicast_setup_ is never set to true before setupRemoteHandles is called on the PyTorch path: setupMulticast returns unconditionally at line ~615 when torch_symm_handle_ is set, so is_multicast_setup_ remains false. The outer guard is therefore always true and provides no real protection.

The effect is that the rendezvous code is unreachable if any caller were to set is_multicast_setup_ = true first (e.g., through a future code path). The intent—"skip rendezvous if multicast is already fully set up"—is actually achieved by the are_remote_tensors_setup_ early-return at the top of the function, not by this inner guard.

Consider removing this redundant outer condition to make the control flow clearer:

#ifdef NVFUSER_DISTRIBUTED
  // PyTorch backend: perform rendezvous here (lazy, on first setupRemoteHandles).
  SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
  if (backend != SymmetricMemoryBackend::Native) {
    const std::string group_name = ensurePyTorchSymmMemBackend(backend);
    torch_symm_handle_ = c10d::symmetric_memory::rendezvous(
        local_tensor_, group_name);
    are_remote_tensors_setup_ = true;
    if (torch_symm_handle_->has_multicast_support()) {
      is_multicast_setup_ = true;
      mc_ptr_ = torch_symm_handle_->get_multicast_ptr();
    }
    return;
  }
#endif

Comment on lines +537 to +541
NVF_THROW(
false,
"Contiguous view is not yet supported for PyTorch symmetric memory backend. "
"Use native backend for SymmetricContiguousView.");
}
Contributor:

P1 NVF_THROW with false as first argument produces a garbled error message

NVF_THROW(...) is an unconditional throw whose variadic arguments are all concatenated into the error message via to_str(__VA_ARGS__). Passing false as the first argument does not act as a condition — it is serialised as part of the message (e.g. "0Contiguous view is not yet...") by to_str. This makes the resulting error message confusing and hard to read in diagnostics.

The same pattern is used again in getContiguousView (line 607–611).

Use NVF_THROW with only the message string, or use the established NVF_ERROR(false, "msg") pattern that is already used elsewhere in this file (e.g. line 74):

Suggested change
NVF_THROW(
false,
"Contiguous view is not yet supported for PyTorch symmetric memory backend. "
"Use native backend for SymmetricContiguousView.");
}
NVF_THROW(
"Contiguous view is not yet supported for PyTorch symmetric memory backend. "
"Use native backend for SymmetricContiguousView.");
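The garbled-message behaviour described above is easy to reproduce with a minimal stand-in for the variadic concatenation; `to_str` below is an illustrative sketch, not nvFuser's actual implementation:

```cpp
#include <sstream>
#include <string>

// Illustrative stand-in (not nvFuser's actual to_str): every variadic
// argument is streamed into the message, so a leading `false` is
// serialized as "0" and prepended to the text.
template <typename... Args>
std::string to_str(const Args&... args) {
  std::ostringstream oss;
  (oss << ... << args);  // C++17 fold expression over operator<<
  return oss.str();
}
```

Passing `false` as the first argument yields a message beginning with `"0"`, which is why dropping the boolean (or switching to the `NVF_ERROR(false, ...)` pattern) is the right fix.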

Collaborator:

relevant comment

Comment on lines +41 to +43
case SymmetricMemoryBackend::PyTorchCuda:
name = "CUDA";
break;
Contributor:

P1 set_backend is never called for the PyTorchCuda backend

For PyTorchNccl and PyTorchNvshmem, c10d::symmetric_memory::set_backend(name) is called inside the call_once lambda. For PyTorchCuda, name is assigned "CUDA" but set_backend is never invoked. If PyTorch's symmetric-memory layer requires an explicit set_backend call before allocating with a CUDA transport, every empty_strided_p2p call on the CUDA path will either use whatever backend was previously configured (potentially NCCL or NVSHMEM) or fail silently at rendezvous time.

If PyTorchCuda truly requires no set_backend call (e.g., because "CUDA" is the implicit default), please add a comment explaining this so future maintainers don't perceive it as an oversight. Otherwise, add the missing call:

case SymmetricMemoryBackend::PyTorchCuda:
  name = "CUDA";
  c10d::symmetric_memory::set_backend(name);
  break;

Collaborator:

relevant

if (communicator_->size() == 1) {
GTEST_SKIP() << "Skipping test for single device";
}
std::cout << "Vishal chishta" << std::endl;
Contributor:

P1 Debug print statement must be removed

std::cout << "Vishal chishta" << std::endl; is an accidental debug line that will pollute test output for all CI runs of SmallAllocation. This should be removed before merging.

Suggested change
std::cout << "Vishal chishta" << std::endl;

Comment on lines +53 to +56
if (backend != SymmetricMemoryBackend::Native) {
NVF_CHECK(
comm.isBackendAvailable(CommunicatorBackend::kNccl),
"NCCL backend is required for symmetric_memory_backend(nccl)");
Contributor:

P1 NCCL availability check incorrectly required for all PyTorch backends

isBackendAvailable(CommunicatorBackend::kNccl) is checked unconditionally for every non-Native backend — including PyTorchNvshmem and PyTorchCuda. If those backends don't actually require an NCCL process group (e.g., NVSHMEM uses its own transport), this check will spuriously reject them on systems where NCCL is unavailable.

Additionally, the error message hardcodes "(nccl)" even when the active backend is NVSHMEM or CUDA, which will confuse users:

"NCCL backend is required for symmetric_memory_backend(nccl)"
// fired even when NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nvshmem)

Consider guarding the NCCL check only for PyTorchNccl, and adjusting the error message dynamically:

if (backend == SymmetricMemoryBackend::PyTorchNccl) {
  NVF_CHECK(
      comm.isBackendAvailable(CommunicatorBackend::kNccl),
      "NCCL backend is required for symmetric_memory_backend(pytorch_nccl)");
}

Comment on lines +29 to +47
static std::once_flag once;
std::call_once(once, [backend]() {
const char* name = nullptr;
switch (backend) {
case SymmetricMemoryBackend::PyTorchNccl:
name = "NCCL";
c10d::symmetric_memory::set_backend(name);
break;
case SymmetricMemoryBackend::PyTorchNvshmem:
name = "NVSHMEM";
c10d::symmetric_memory::set_backend(name);
break;
case SymmetricMemoryBackend::PyTorchCuda:
name = "CUDA";
break;
default:
NVF_ERROR(false, "Unexpected PyTorch symmetric memory backend");
}
});
Contributor:

P1 Static once_flag binds to whichever backend is passed first — silently ignores later backends

once is a static std::once_flag, so set_backend(name) is called exactly once for the lifetime of the process. If the flag fires on the first call (e.g., PyTorchCuda), a later call with PyTorchNccl won't call set_backend("NCCL") at all — the wrong (or absent) backend will silently remain active.

In practice a single process shouldn't mix backends, but the current structure provides no error if it does. The typical guard is to also capture the name into a static and assert consistency on subsequent calls:

static std::string configured_name;
std::call_once(once, [backend, &configured_name]() {
  // ... set backend and populate configured_name
});
NVF_CHECK(
    configured_name == expected_name,
    "symmetric memory backend already configured as '", configured_name,
    "', cannot reconfigure to '", expected_name, "'");

Or, at minimum, document that mixing backends within a process is undefined behaviour.
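A self-contained sketch of that consistency guard (names are illustrative; the real code would wrap `c10d::symmetric_memory::set_backend` rather than a plain string assignment):

```cpp
#include <mutex>
#include <stdexcept>
#include <string>

// Illustrative sketch: the first caller wins, and any later caller that
// requests a different backend gets an error instead of being silently
// ignored by the already-fired once_flag.
std::string& configuredBackend() {
  static std::string name;
  return name;
}

void set_backend_once(const std::string& requested) {
  static std::once_flag once;
  std::call_once(once, [&]() { configuredBackend() = requested; });
  if (configuredBackend() != requested) {
    throw std::runtime_error(
        "symmetric memory backend already configured as '" +
        configuredBackend() + "', cannot reconfigure to '" + requested + "'");
  }
}
```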

Comment on lines +461 to +468
std::string Communicator::getSymmMemGroupKey(
std::optional<CommunicatorBackend> backend) {
std::vector<RankType> all_ranks(size_);
std::iota(all_ranks.begin(), all_ranks.end(), 0);
CommunicatorBackend b = backend.value_or(default_backend_);
(void)getBackendForTeam(all_ranks, b);
return getTeamKey(all_ranks, b);
}
Contributor:

P2 getSymmMemGroupKey body is incorrectly indented and missing trailing newline

The function body uses column-0 indentation, inconsistent with every other member function in this file. All statements should be indented at the standard 2-space level. Additionally, the file is missing a trailing newline (shown by \ No newline at end of file in the diff).

Suggested change
std::string Communicator::getSymmMemGroupKey(
std::optional<CommunicatorBackend> backend) {
std::vector<RankType> all_ranks(size_);
std::iota(all_ranks.begin(), all_ranks.end(), 0);
CommunicatorBackend b = backend.value_or(default_backend_);
(void)getBackendForTeam(all_ranks, b);
return getTeamKey(all_ranks, b);
}
std::string Communicator::getSymmMemGroupKey(
    std::optional<CommunicatorBackend> backend) {
  std::vector<RankType> all_ranks(size_);
  std::iota(all_ranks.begin(), all_ranks.end(), 0);
  CommunicatorBackend b = backend.value_or(default_backend_);
  (void)getBackendForTeam(all_ranks, b);
  return getTeamKey(all_ranks, b);
}

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@samnordmann (Collaborator) left a comment:

LGTM overall!
Please clean up, fix the CI, and address the minor issues

Collaborator:

remove this file

#include <c10/util/intrusive_ptr.h>

#if defined(NVFUSER_DISTRIBUTED) && \
__has_include(<torch/csrc/distributed/c10d/GroupRegistry.hpp>) && \
Collaborator:

what is the rationale behind defining NVFUSER_CAN_REGISTER_C10D_PROCESS_GROUP? In what scenario can the header be missing?

#ifdef NVFUSER_DISTRIBUTED
#include <torch/csrc/distributed/c10d/Backend.hpp>
#if NVFUSER_CAN_REGISTER_C10D_PROCESS_GROUP
#include <torch/csrc/distributed/c10d/ProcessGroup.hpp>
Collaborator:

this header should always be present, no?

comm.isBackendAvailable(CommunicatorBackend::kNccl),
"NCCL backend is required for symmetric_memory_backend(nccl)");

const std::string group_name = comm.getSymmMemGroupKey(CommunicatorBackend::kNccl);
Collaborator:

Imo it would better/simpler to simply here define the string, compared to having a new method in Communicator. Wdyt?

if (backend != SymmetricMemoryBackend::Native) {
NVF_CHECK(
comm.isBackendAvailable(CommunicatorBackend::kNccl),
"NCCL backend is required for symmetric_memory_backend(nccl)");
Collaborator:

Suggested change
"NCCL backend is required for symmetric_memory_backend(nccl)");
"NCCL backend is required for non-native symmetric memory backend: ", backend);

Comment on lines +608 to +610
NVF_THROW(
false,
"Contiguous view is not yet supported for PyTorch symmetric memory backend.");
Collaborator:

Suggested change
NVF_THROW(
false,
"Contiguous view is not yet supported for PyTorch symmetric memory backend.");
NVF_THROW(
"Contiguous view is not yet supported for PyTorch symmetric memory backend.");

@samnordmann (Collaborator):

Also, please write a PR description

Comment on lines +63 to +71
static std::once_flag pg0_once;
std::call_once(pg0_once, [&]() {
try {
(void)c10d::resolve_process_group("0");
} catch (const c10::Error&) {
auto pg = c10d::resolve_process_group(group_name);
c10d::register_process_group("0", pg);
}
});
Contributor:

P1 "0" alias registered but never unregistered on cleanup

c10d::register_process_group("0", pg) is called inside a static std::once_flag lambda that lives in ensurePyTorchSymmMemBackend. The "0" key is never added to process_groups_ in Communicator, so Communicator::cleanup() will not unregister it:

for (const auto& entry : process_groups_) {
    c10d::unregister_process_group(entry.first); // only unregisters team_key, never "0"
}

In test environments that tear down and re-create a Communicator, the stale "0" registration persists across test cases. On the next call to ensurePyTorchSymmMemBackend, pg0_once is permanently fired, so c10d::resolve_process_group("0") succeeds with the old, destroyed process group — and symm-mem rendezvous will silently use it.

The fix is to track the "0" alias and unregister it during cleanup(), or unconditionally overwrite the "0" registration rather than checking first.

Comment on lines +63 to +72
static std::once_flag pg0_once;
std::call_once(pg0_once, [&]() {
try {
(void)c10d::resolve_process_group("0");
} catch (const std::exception&) {
// resolve_process_group throws c10::Error
// (derives from std::exception)
auto pg = c10d::resolve_process_group(group_name);
c10d::register_process_group("0", pg);
}
Contributor:

P1 resolve_process_group("0") may silently use an unrelated process group

The try/catch idiom checks whether "0" is already registered; if resolve_process_group("0") succeeds, the code assumes the existing registration is the one it created and uses it for rendezvous. However, "0" is a key that Python's torch.distributed commonly registers for the default process group (e.g., after init_process_group). If Python code is running in the same process and has already registered a process group under "0", the catch block is never entered, and symmetric-memory rendezvous will silently proceed against the wrong process group, likely causing a hang or incorrect buffer mapping.

Since the PR already tracks process groups in process_groups_ and unregisters them in cleanup(), a safer pattern would be to always overwrite the "0" alias unconditionally, or to store the mapping and check it explicitly:

// Always overwrite to ensure we use the correct group:
auto pg = c10d::resolve_process_group(group_name);
c10d::unregister_process_group("0"); // no-op if not registered
c10d::register_process_group("0", pg);

This eliminates the ambiguity and avoids the fragile try/catch.
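The overwrite-alias pattern can be illustrated with a minimal registry sketch; `ProcessGroup`, `register_group`, and `resolve_group` below are dummy stand-ins for the c10d types and functions:

```cpp
#include <map>
#include <memory>
#include <string>

// Dummy stand-in for c10d::ProcessGroup.
struct ProcessGroup {
  std::string name;
};

std::map<std::string, std::shared_ptr<ProcessGroup>>& registry() {
  static std::map<std::string, std::shared_ptr<ProcessGroup>> r;
  return r;
}

// Unconditional overwrite: re-registering a key replaces any stale entry,
// so an alias like "0" can never keep pointing at a destroyed group.
void register_group(const std::string& key, std::shared_ptr<ProcessGroup> pg) {
  registry()[key] = std::move(pg);  // no try/catch probe needed
}

std::shared_ptr<ProcessGroup> resolve_group(const std::string& key) {
  auto it = registry().find(key);
  return it == registry().end() ? nullptr : it->second;
}
```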

Comment on lines 231 to 233
for (size_t i = strides.size() - 2; i >= 0; --i) {
strides[i] = strides[i + 1] * sizes[i + 1];
}
Contributor:

P1 Infinite loop / UB: size_t loop variable wraps past zero

The loop variable was changed from int64_t to size_t, but the termination condition i >= 0 is always true for an unsigned type. After the last legitimate iteration (when i == 0), --i wraps to SIZE_MAX, and the condition SIZE_MAX >= 0 holds unconditionally — causing the loop to read strides[SIZE_MAX], an immediate out-of-bounds access and undefined behavior.

For 1D tensors the damage is worse: strides.size() - 2 (as size_t) wraps to SIZE_MAX before the first iteration, so the very first access is already OOB.

This regresses the native path for every tensor with more than one dimension. The PyTorch-backend path directly above (line ~162) correctly casts to int64_t before the subtraction — the same fix is needed here.

Suggested change
for (size_t i = strides.size() - 2; i >= 0; --i) {
strides[i] = strides[i + 1] * sizes[i + 1];
}
for (int64_t i = static_cast<int64_t>(strides.size()) - 2; i >= 0; --i) {
strides[i] = strides[i + 1] * sizes[i + 1];
}
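The corrected loop can be checked in isolation; `rowMajorStrides` below is a minimal sketch of the stride computation, not the actual nvFuser code:

```cpp
#include <cstdint>
#include <vector>

// Row-major stride computation with the signed index fix suggested above.
// With `size_t i`, the condition `i >= 0` is always true and `--i` wraps
// past zero; casting the start index to int64_t makes the loop terminate,
// including for 1-D tensors where size() - 2 would otherwise wrap.
std::vector<int64_t> rowMajorStrides(const std::vector<int64_t>& sizes) {
  std::vector<int64_t> strides(sizes.size(), 1);
  for (int64_t i = static_cast<int64_t>(strides.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * sizes[i + 1];
  }
  return strides;
}
```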

Comment on lines +161 to +163
// c10d process-group wrappers registered for symmetric-memory rendezvous.
std::unordered_map<std::string, c10::intrusive_ptr<c10d::ProcessGroup>>
process_groups_;
Contributor:

P1 Non-distributed build compilation failure: c10d::ProcessGroup not in mock

c10d::ProcessGroup is declared as the value type of process_groups_ without a #ifdef NVFUSER_DISTRIBUTED guard. In a non-distributed build, communicator.h falls through to c10d_mock.h, which defines c10d::Backend, c10d::TCPStore, and the options structs — but not c10d::ProcessGroup. The class declaration therefore fails to compile.

The same issue propagates into communicator.cpp:

  • cleanup() calls c10d::unregister_process_group(...) (not in mock) outside any #ifdef guard.
  • getBackendForTeam() uses c10d::ProcessGroup::BackendType, c10d::ProcessGroup, and c10d::register_process_group after the #endif that closes the distributed block.

The process_groups_ field and all code that touches it needs to be wrapped in #ifdef NVFUSER_DISTRIBUTED / #endif, or a minimal ProcessGroup stub must be added to c10d_mock.h.

// communicator.h — wrap the new field:
#ifdef NVFUSER_DISTRIBUTED
  std::unordered_map<std::string, c10::intrusive_ptr<c10d::ProcessGroup>>
      process_groups_;
#endif

And similarly guard the cleanup() loop and the new block in getBackendForTeam() in communicator.cpp.

@samnordmann (Collaborator):

!test

@saivishal1999 saivishal1999 requested a review from wujingyue March 25, 2026 16:47
Comment on lines +370 to +373
for (const auto& entry : process_groups_) {
c10d::unregister_process_group(entry.first);
}
process_groups_.clear();
Contributor:

P0 process_groups_ cleanup guard mismatch — compile error when NVFUSER_DISTRIBUTED is set without USE_DISTRIBUTED

process_groups_ is declared in communicator.h under #if defined(NVFUSER_DISTRIBUTED) && defined(USE_DISTRIBUTED), but the cleanup loop here lives under the broader #if defined(NVFUSER_DISTRIBUTED) (without the USE_DISTRIBUTED guard). When a build defines NVFUSER_DISTRIBUTED but not USE_DISTRIBUTED, process_groups_ does not exist as a member, yet this code tries to iterate over it — a hard compile error.

c10d::unregister_process_group (from GroupRegistry.hpp) is already included under #ifdef NVFUSER_DISTRIBUTED, so fixing just the guard on these lines is sufficient:

Suggested change
for (const auto& entry : process_groups_) {
c10d::unregister_process_group(entry.first);
}
process_groups_.clear();
#if defined(USE_DISTRIBUTED)
for (const auto& entry : process_groups_) {
c10d::unregister_process_group(entry.first);
}
process_groups_.clear();
#endif

(The surrounding #if defined(NVFUSER_DISTRIBUTED) / #endif already provides the outer distributed guard.)

@wujingyue wujingyue requested a review from Priya2698 March 25, 2026 17:16
Comment on lines +63 to +76
static std::once_flag pg0_once;
std::call_once(pg0_once, [&]() {
try {
(void)c10d::resolve_process_group("0");
} catch (const std::exception&) {
// resolve_process_group throws c10d Error
// (derives from std::exception)
auto pg = c10d::resolve_process_group(group_name);
c10d::register_process_group("0", pg);
}
});

comm.barrier(CommunicatorBackend::kNccl);
return group_name;
Contributor:

P1 Barrier fires on every call, not just during initial setup

comm.barrier(CommunicatorBackend::kNccl) is placed outside the pg0_once lambda, so it executes on every invocation of ensurePyTorchSymmMemBackend — including every subsequent call to allocate() and setupRemoteHandles() after initialization has already completed. This is both a performance concern (unnecessary NCCL barrier per allocation/rendezvous) and a correctness risk: if any caller ever invokes allocate() or setupRemoteHandles() asymmetrically across ranks (different call counts), these extra barriers will hang.

The barrier is only needed once — after pg0_once — to ensure all ranks have completed the group alias registration before any rank proceeds to use it. Moving it inside the pg0_once lambda would fix both concerns:

static std::once_flag pg0_once;
std::call_once(pg0_once, [&]() {
  try {
    (void)c10d::resolve_process_group("0");
  } catch (const std::exception&) {
    auto pg = c10d::resolve_process_group(group_name);
    c10d::register_process_group("0", pg);
  }
  comm.barrier(CommunicatorBackend::kNccl); // moved inside once-flag
});
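A minimal sketch of that moved-barrier structure, with counters standing in for the real alias registration and NCCL barrier (names are illustrative):

```cpp
#include <mutex>

// Counters stand in for the real side effects so the once-only behaviour
// is observable.
int g_register_calls = 0;
int g_barrier_calls = 0;

void ensureBackendOnce() {
  static std::once_flag once;
  std::call_once(once, []() {
    ++g_register_calls;  // register the "0" alias
    ++g_barrier_calls;   // barrier moved inside the once-flag
  });
  // No per-call barrier here: repeated calls are cheap and cannot hang
  // when ranks invoke this a different number of times.
}
```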

Comment on lines +415 to +435
#if defined(NVFUSER_DISTRIBUTED) && defined(USE_DISTRIBUTED)
std::optional<c10d::ProcessGroup::BackendType> pg_backend =
(b == CommunicatorBackend::kNccl)
? std::optional<c10d::ProcessGroup::BackendType>(
c10d::ProcessGroup::BackendType::NCCL)
: std::nullopt;
if (backends_[team_key] != nullptr && pg_backend.has_value()) {
auto rank_it = std::ranges::find(team.begin(), team.end(), deviceId());
RankType team_rank = std::distance(team.begin(), rank_it);

auto pg = c10::make_intrusive<c10d::ProcessGroup>(
c10::make_intrusive<c10d::PrefixStore>(team_key, store_),
team_rank,
static_cast<int>(team.size()));
pg->setBackend(c10::DeviceType::CUDA, *pg_backend, backends_[team_key]);
pg->setDefaultBackend(*pg_backend);
pg->setGroupName(team_key);

c10d::register_process_group(team_key, pg);
process_groups_[team_key] = std::move(pg);
}
Contributor:

P1 ProcessGroup wrapper only created on first backend creation — silently missing after getWorld()

The ProcessGroup wrapper is registered inside the backends_.find(team_key) == backends_.end() guard, meaning it is only created the first time getBackendForTeam is called for a given team_key. If getWorld() (or any early comm.barrier()) is called before the first PyTorch symmetric memory operation, the NCCL backend gets created and cached with no ProcessGroup wrapper. When getSymmMemGroupKey subsequently calls getBackendForTeam, it finds team_key already in backends_ and returns early — no ProcessGroup is created and c10d::resolve_process_group(group_name) inside ensurePyTorchSymmMemBackend will throw.

The ProcessGroup registration should not be gated solely on first-time backend creation. Consider also checking process_groups_.count(team_key) == 0 so the wrapper is created even when the backend already exists:

if (backends_[team_key] != nullptr && pg_backend.has_value()
    && process_groups_.count(team_key) == 0) {
  // ... create and register ProcessGroup
}

@@ -144,7 +246,10 @@ at::Tensor SymmetricTensor::allocate(
}

std::string SymmetricTensor::validate(at::Tensor tensor) {
Collaborator:

Suggested change
std::string SymmetricTensor::validate(at::Tensor tensor) {
/*static*/ std::string SymmetricTensor::validate(at::Tensor tensor) {

Collaborator:

Related: I understood that this function returns an error message, but why don't we simply NVF_CHECK/NVF_ERROR when validation fails? This way, errors are reported close to where they occur. The current uses of validate all seem to be followed by NVF_CHECK, so I've yet to see a benefit for delaying an error.

Collaborator:

the idea is that validate is part of the API, along with the allocator that returns a mere at::Tensor.
IOW, if the user has an at::Tensor coming from the framework, they can validate it before feeding it to the SymmetricTensor constructor.

Does it make sense?

bool is_contiguous_view_setup_ = false;
at::Tensor contiguous_view_;
#if defined(NVFUSER_DISTRIBUTED) && defined(USE_DISTRIBUTED)
c10::intrusive_ptr<c10d::symmetric_memory::SymmetricMemory>
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a bare minimum example of using c10d::symmetric_memory::SymmetricMemory and at::Tensor without any nvFuser? I ask this because I feel this class has lots of fields that are irrelevant for c10d SymmetricMemory, but I could be terribly wrong.

Collaborator Author:

Here are some tests using c10d::symmetric_memory. https://github.com/pytorch/pytorch/blob/main/test/distributed/test_symmetric_memory.py
Does this help?
Yes; except for one or two fields like local_tensor_ and mc_ptr_, no other fields are used by c10d::symmetric_memory.

@saivishal1999 (Collaborator Author):

!test
