Add NIXL backend #6016

Open
x41lakazam wants to merge 43 commits into main from
dispatch_combine/nixl_backend

Conversation

@x41lakazam
Collaborator

No description provided.

@github-actions

github-actions bot commented Feb 26, 2026

Review updated until commit 7283aa8

Description

  • Add NIXL backend for GPU-to-GPU RDMA transfers in multi-device communication

  • Implement tensor registration, metadata exchange, and transfer preparation/post/wait APIs

  • Add NIXL build option (NVFUSER_BUILD_WITH_NIXL) to CMake and Python build system

  • Include comprehensive tests for transfer handles, validation, and end-to-end transfers

Changes walkthrough

Relevant files

Enhancement (5 files)
  nixl.h                     Define NixlBackend and NixlTransferHandle classes                +221/-0
  nixl.cpp                   Implement NIXL backend with UCX for GPU transfers                +474/-0
  multidevice.h              Add kNixl to CommunicatorBackend enum                            +1/-1
  communicator.h             Add nixl_available_ flag and backend check                       +7/-1
  communicator.cpp           Initialize nixl_available_ and add NIXL case to output           +9/-1

Configuration changes (2 files)
  CMakeLists.txt             Add NVFUSER_STANDALONE_BUILD_WITH_NIXL option and configuration  +39/-0
  utils.py                   Add build_with_nixl config and cmake flag                        +4/-0

Documentation (1 file)
  setup.py                   Document NVFUSER_BUILD_WITH_NIXL build option                    +3/-0

Tests (1 file)
  test_multidevice_nixl.cpp  Add tests for NIXL backend functionality                         +289/-0

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Spin loop performance concern

The waitTransfer() function at line 385-399 uses a busy-wait spin loop to poll transfer status.
This can cause high CPU usage and may not be optimal for latency-sensitive workloads.
Consider if this should yield to other threads or use a more efficient waiting mechanism.

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
  NVF_ERROR(handle.isValid(), "Cannot wait on an invalid handle");
  NVF_ERROR(handle.impl_->posted, "Transfer has not been posted yet");

  // TODO - check this spin loop
  NixlXferStatus xfer_status;
  do {
    xfer_status = getTransferStatus(handle);
    NVF_ERROR(
        xfer_status != NixlXferStatus::kError,
        "NIXL transfer completed with an error");
  } while (xfer_status == NixlXferStatus::kInProgress);

  handle.impl_->posted = false;
}
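One way to soften the busy-wait without restructuring the API is to yield for the first few polls and then back off with capped sleeps. The sketch below is illustrative only, not the PR's implementation; the `XferStatus` enum and `waitWithBackoff` name are stand-ins for the real `NixlXferStatus` and `waitTransfer` internals:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <thread>

// Stand-in for NixlXferStatus.
enum class XferStatus { kDone, kInProgress, kError };

// Poll with escalating backoff: cheap yields first, then sleeps whose
// duration doubles up to a cap. `poll` is any callable returning XferStatus.
template <typename Poll>
XferStatus waitWithBackoff(Poll&& poll) {
  using namespace std::chrono;
  constexpr auto kMaxSleep = microseconds(256);
  auto sleep_us = microseconds(1);
  int spins = 0;
  while (true) {
    XferStatus s = poll();
    if (s != XferStatus::kInProgress) {
      return s;  // done or error; caller checks which
    }
    if (spins++ < 64) {
      std::this_thread::yield();  // low-latency path for short transfers
    } else {
      std::this_thread::sleep_for(sleep_us);  // stop burning a core
      sleep_us = std::min(sleep_us * 2, kMaxSleep);
    }
  }
}
```

This keeps latency low for fast transfers while bounding CPU usage for slow ones; it still does not address the CUDA-graph-capture concern raised later in the review.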
Metadata exchange scalability

The exchangeMetadata() function performs O(world_size) iterations to fetch metadata from all peers
and uses a barrier. This may not scale well to large distributed configurations.
Consider if there's a more efficient approach for metadata exchange.

void NixlBackend::Impl::exchangeMetadata() {
  nixl_blob_t local_md;
  nixl_status_t md_status = agent_->getLocalMD(local_md);
  NVF_ERROR(
      md_status == NIXL_SUCCESS,
      "NIXL getLocalMD failed with status ",
      static_cast<int>(md_status));

  auto* store = communicator_.getTcpStore();
  const auto my_rank = communicator_.deviceId();
  const auto world_size = communicator_.size();

  std::string md_key_prefix = "nixl_agent_md_rank_";
  store->set(
      md_key_prefix + std::to_string(my_rank),
      std::vector<uint8_t>(local_md.begin(), local_md.end()));

  for (int64_t rank = 0; rank < world_size; ++rank) {
    if (rank == my_rank) {
      continue;
    }
    // Fetch & load MD
    auto bytes = store->get(md_key_prefix + std::to_string(rank));
    nixl_blob_t remote_md(bytes.begin(), bytes.end());
    std::string remote_agent_name;
    nixl_status_t status = agent_->loadRemoteMD(remote_md, remote_agent_name);
    NVF_ERROR(
        status == NIXL_SUCCESS,
        "NIXL loadRemoteMD failed for rank ",
        rank,
        " with status ",
        static_cast<int>(status));
  }

  // Barrier before deleting keys so no rank reads a deleted key.
  communicator_.barrier();

  store->deleteKey(md_key_prefix + std::to_string(my_rank));
  metadata_exchanged_ = true;
}
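The TCP store traffics in raw byte vectors, which is also how the PR's descriptor helpers (storeTensorDescs / fetchTensorDescs) must move tensor metadata between ranks. A minimal round-trip sketch under the assumption that descriptors are trivially copyable PODs and all ranks share an architecture (so padding and endianness match); the struct layout and helper names here are hypothetical, not the PR's exact code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical POD mirroring the PR's TensorDesc fields.
struct TensorDesc {
  std::uintptr_t addr;  // device pointer as integer
  std::size_t size;     // bytes
  std::uint32_t dev;    // CUDA device index
  std::uint32_t rank;   // communicator rank
};

// Flatten descriptors into the byte blob a TCPStore-style KV store expects.
std::vector<std::uint8_t> serializeDescs(const std::vector<TensorDesc>& descs) {
  std::vector<std::uint8_t> bytes(descs.size() * sizeof(TensorDesc));
  std::memcpy(bytes.data(), descs.data(), bytes.size());
  return bytes;
}

// Inverse: reinterpret the blob as a descriptor array.
std::vector<TensorDesc> deserializeDescs(const std::vector<std::uint8_t>& bytes) {
  std::vector<TensorDesc> descs(bytes.size() / sizeof(TensorDesc));
  std::memcpy(descs.data(), bytes.data(), descs.size() * sizeof(TensorDesc));
  return descs;
}
```

memcpy-based serialization is only safe because every rank runs the same binary on the same architecture; a portable format would need explicit field encoding.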
Probe mechanism validation

The UCX CUDA support probe (lines 168-210) is a good defensive addition. However, verify that
the probe correctly handles all edge cases where UCX might claim VRAM support but actually
misclassify memory. Consider adding logging when the probe fails.

// Probe: verify that VRAM (CUDA GPU memory) is actually usable with
// the UCX backend. Some UCX installations lack CUDA support, causing
// registerMem to silently misclassify VRAM as host memory. We detect
// this by registering a small buffer and asking NIXL to prepare a
// local descriptor list for VRAM -- if no backend claims VRAM, the
// probe fails and we mark the backend as unavailable.
{
  constexpr int64_t kProbeBytes = 1;
  auto probe = at::empty(
      {kProbeBytes},
      at::TensorOptions().dtype(at::kByte).device(
          at::kCUDA, communicator.deviceId()));
  size_t nbytes = static_cast<size_t>(probe.nbytes());
  uintptr_t addr = reinterpret_cast<uintptr_t>(probe.data_ptr());
  uint32_t dev_idx = static_cast<uint32_t>(probe.device().index());

  NVF_ERROR(nbytes > 0, "NIXL probe: unexpected zero-byte tensor");
  NVF_ERROR(addr != 0, "NIXL probe: null data pointer");

  nixl_reg_dlist_t reg_dlist(VRAM_SEG);
  reg_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixl_status_t reg_status = impl->agent_->registerMem(reg_dlist);
  if (reg_status != NIXL_SUCCESS) {
    return nullptr;
  }

  nixl_xfer_dlist_t xfer_dlist(VRAM_SEG);
  xfer_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixlDlistH* dlist_handle = nullptr;
  nixl_status_t prep_status =
      impl->agent_->prepXferDlist(NIXL_INIT_AGENT, xfer_dlist, dlist_handle);

  if (dlist_handle) {
    impl->agent_->releasedDlistH(dlist_handle);
  }
  impl->agent_->deregisterMem(reg_dlist);

  if (prep_status != NIXL_SUCCESS) {
    return nullptr;
  }
}

Collaborator

@samnordmann samnordmann left a comment

Thank you very much! This looks great.
Here are some comments requesting some minor changes or explanations. The only point I'm a bit worried about is that we need a way to make the "wait" non-blocking for the CPU.

#endif
}

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
Collaborator

This wait function is CPU-blocking, so in practice it is more or less unusable in our context. Do you have an idea how to make this non-blocking for the CPU -- and ideally CUDA-graph capturable?

Collaborator Author

Definitely important. I suggest we leave it for another PR, to keep this one simple and not too big.

@samnordmann
Collaborator

@x41lakazam Can you provide instructions on how to build nixl, say, from the pjnl docker image? We'll probably need to think about how to add the library to the base image and/or the CI, unless it is already shipped in some DLFW package

@samnordmann
Collaborator

samnordmann commented Feb 26, 2026

unless it is already shipped in some DLFW package

https://nvidia.slack.com/archives/C08KL9MNQ3U/p1771951941351029

@samnordmann
Collaborator

Note the build error in the CI

  /home/runner/work/Fuser/Fuser/csrc/multidevice/nixl.cpp:144:17: error: private field 'communicator_' is not used [-Werror,-Wunused-private-field]
    144 |   Communicator& communicator_;
        |                 ^
  /home/runner/work/Fuser/Fuser/csrc/multidevice/nixl.cpp:146:8: error: private field 'metadata_exchanged_' is not used [-Werror,-Wunused-private-field]
    146 |   bool metadata_exchanged_ = false;
        |        ^

@x41lakazam x41lakazam marked this pull request as ready for review March 2, 2026 16:23
@greptile-apps
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

This PR introduces a new NIXL backend for GPU-to-GPU RDMA transfers in nvFuser's multi-device framework. It adds a NixlBackend singleton wrapping a UCX-backed nixlAgent, with tensor registration, metadata exchange via TCPStore, and one-sided read/write transfer primitives (prepareTransfer / postTransfer / waitTransfer). The build system integration is opt-in (NVFUSER_BUILD_WITH_NIXL=OFF by default) and nixl.cpp is unconditionally compiled with a full no-op stub for non-NIXL builds. A CI install script (tools/install-nixl.sh) handles both pip-wheel and from-source UCX+NIXL installation modes.

Key items from this review iteration:

  • P1 logic bug: exchangeMetadata() calls communicator_.getTcpStore() and immediately dereferences the result without a null check. Because Impl::create() only probes hardware (not the distributed store), isAvailable() can return true even when the communicator has no TCPStore — leading to a null pointer dereference on the first registerTensors() call in a non-distributed environment. The helper utilities in nixl.h (storeTensorDescs, fetchTensorDescs) correctly guard against this with NVF_CHECK(communicator.is_available(), ...), but exchangeMetadata() does not.
  • P2 (cmake): CUDA version detection in handle_nixl.cmake accesses the private nixl._pkg internal attribute, which is fragile across packaging tools and NIXL versions; failures are silently swallowed.
  • P2 (tooling): install-nixl.sh uses an unquoted $(find ...) in a for loop (word-splitting hazard), and NIXL is cloned from HEAD without a pinned commit SHA.
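The word-splitting hazard called out for install-nixl.sh is easy to demonstrate; the fix is a NUL-delimited find piped through a while-read loop. The directory and file names below are made up for the demo and are not from the PR's script:

```shell
# Create a layout with a space in a path, like the hazard described above.
dir="$(mktemp -d)"
mkdir -p "$dir/a b"
touch "$dir/a b/lib one.so" "$dir/libtwo.so"

# Hazard: an unquoted command substitution word-splits on whitespace:
#   for f in $(find "$dir" -name '*.so'); do ...; done   # breaks on "lib one.so"

# Safe: -print0 plus read -d '' handles any filename; bash process
# substitution keeps the loop in the current shell so variables survive.
count=0
while IFS= read -r -d '' f; do
  count=$((count + 1))
done < <(find "$dir" -name '*.so' -print0)
echo "$count"   # prints 2

rm -rf "$dir"
```

The same pattern also avoids surprises from filenames containing glob characters or leading dashes. Pinning the NIXL clone to a commit SHA is a separate, one-line fix in the script.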

Several previously raised concerns have been addressed in the current revision: the #else stub now has all required method stubs (fixing the compile-without-NIXL path), <thread> is included, and TensorDesc now correctly separates dev (CUDA device index) from rank (communicator rank).

Confidence Score: 3/5

  • PR is close to merge-ready but has one unfixed P1 bug (null TCPStore dereference in exchangeMetadata) that can crash in non-distributed single-process usage when NIXL hardware is present.
  • The core transfer logic is sound and several earlier P0/P1 issues have been resolved. The remaining P1 (missing null guard in exchangeMetadata) is a one-line fix. The P2s are tooling/CI concerns that don't affect production correctness. Score reflects one concrete remaining bug that could cause a runtime crash in a plausible (if edge-case) scenario.
  • csrc/multidevice/nixl.cpp (exchangeMetadata null-TCPStore path) and cmake/deps/handle_nixl.cmake / tools/install-nixl.sh (CI reliability).

Important Files Changed

Filename Overview
csrc/multidevice/nixl.cpp Core NIXL backend implementation: UCX-backed RDMA transfers for GPU tensors. Several previously-flagged issues addressed (stub compile errors, <thread> include, dev vs rank fields). New concern: exchangeMetadata() dereferences TCPStore pointer without a null guard, while the public helper utilities in nixl.h do guard against unavailable communicators.
csrc/multidevice/nixl.h Public NIXL API header: TensorDesc now correctly carries both dev (CUDA device index) and rank (communicator rank) fields; serialization helpers guard against unavailable communicator. Clean design.
cmake/deps/handle_nixl.cmake New CMake handler for NIXL discovery and CUDA version compatibility check. The CUDA version detection relies on the private nixl._pkg Python attribute, which is fragile and may silently produce incorrect results across different NIXL installations or packaging methods.
tools/install-nixl.sh CI install script supporting pip and source build modes for NIXL + UCX. Two concerns: find output is used in an unquoted for loop (word-splitting hazard), and NIXL is cloned without a pinned commit SHA making builds non-reproducible.
tests/cpp/test_multidevice_nixl.cpp Comprehensive test coverage: singleton access, input validation guards, end-to-end read/write ring transfers, and register/deregister round-trips. Tests correctly skip when NIXL is unavailable and use GTEST_SKIP appropriately.
csrc/multidevice/communicator.h Adds kNixl to the CommunicatorBackend enum (now strongly typed as uint8_t) and nixl_available_ member. isBackendAvailable extended cleanly. getTcpStore() public accessor needed by NIXL metadata exchange.
csrc/multidevice/communicator.cpp Wires in nixl_available_ initialization via compile-time USE_NIXL flag, adds kNixl case to the stream operator, and adopts std::ranges::sort / std::ranges::find modernization. Minor cosmetic improvements (std::endl → '\n').
CMakeLists.txt Adds NVFUSER_BUILD_WITH_NIXL option, wires in the new CMake handler, unconditionally appends nixl.cpp to sources (safe because the file has a complete #else stub), and registers the test binary.
python/tools/prereqs/requirements/nixl.py Python dependency reporter for NIXL with CUDA constraint validation display. Correctly handles match/mismatch/not_available states and provides actionable install guidance.

Sequence Diagram

sequenceDiagram
    participant User
    participant NixlBackend
    participant NixlImpl as NixlBackend::Impl
    participant nixlAgent
    participant TCPStore

    User->>NixlBackend: getInstance()
    NixlBackend->>NixlImpl: Impl::create(communicator)
    NixlImpl->>nixlAgent: new nixlAgent(agent_name, cfg)
    NixlImpl->>nixlAgent: createBackend("UCX", ...)
    NixlImpl->>nixlAgent: registerMem(probe) [VRAM probe]
    NixlImpl->>nixlAgent: prepXferDlist(probe) [verify VRAM usable]
    NixlImpl->>nixlAgent: deregisterMem(probe)
    NixlImpl-->>NixlBackend: impl_ (or nullptr if probe failed)

    User->>NixlBackend: registerTensors({t1, t2})
    NixlBackend->>NixlImpl: registerTensors(tensors)
    NixlImpl->>nixlAgent: registerMem(dlist)
    NixlImpl->>NixlImpl: exchangeMetadata() [collective]
    NixlImpl->>nixlAgent: getLocalMD(local_md)
    NixlImpl->>TCPStore: set("nixl_agent_md_rank_N", local_md)
    loop for each peer rank
        NixlImpl->>TCPStore: get("nixl_agent_md_rank_P") [blocks until ready]
        NixlImpl->>nixlAgent: loadRemoteMD(remote_md)
    end
    NixlImpl->>NixlImpl: communicator.barrier()
    NixlImpl->>TCPStore: deleteKey("nixl_agent_md_rank_N")

    User->>NixlBackend: prepareTransfer(local_descs, remote_descs, op)
    NixlBackend->>NixlImpl: prepareTransfer(...)
    NixlImpl->>nixlAgent: createXferReq(op, local_dlist, remote_dlist, agent_name)
    NixlImpl-->>User: NixlTransferHandle

    User->>NixlBackend: postTransfer(handle)
    NixlBackend->>NixlImpl: postTransfer(handle)
    NixlImpl->>nixlAgent: postXferReq(xfer_handle)

    User->>NixlBackend: waitTransfer(handle)
    NixlBackend->>NixlImpl: waitTransfer(handle)
    loop until done
        NixlImpl->>nixlAgent: getXferStatus(xfer_handle)
        NixlImpl->>NixlImpl: this_thread::yield() [if in progress]
    end

    User->>NixlBackend: deregisterTensors({t1, t2})
    NixlBackend->>NixlImpl: deregisterTensors(tensors)
    NixlImpl->>nixlAgent: deregisterMem(dlist)
    NixlImpl->>NixlImpl: exchangeMetadata() [collective]

Comments Outside Diff (1)

  1. csrc/multidevice/nixl.cpp, line 255-262 (link)

    exchangeMetadata crashes with null TCPStore when communicator is unavailable

    communicator_.getTcpStore() returns store_.get(), which is nullptr when the Communicator was not initialized with a distributed environment (i.e., is_available() is false). There is no null guard before calling store->set(...).

    This can be triggered in practice: Impl::create() runs a hardware probe that only requires a CUDA device — it does not check communicator.is_available(). So on a machine with NIXL hardware but no distributed environment variables set, Impl::create() succeeds, isAvailable() returns true, and a call to registerTensors() will crash here with a null pointer dereference.

    The inline helpers in nixl.h (storeTensorDescs, fetchTensorDescs) both guard against this with NVF_CHECK(communicator.is_available(), ...), but exchangeMetadata does not. Consider adding the same guard at the start of exchangeMetadata:

    void NixlBackend::Impl::exchangeMetadata() {
      NVF_ERROR(
          communicator_.getTcpStore() != nullptr,
          "TCPStore is not initialized – communicator is not in distributed mode");
      // ...

    Or alternatively, Impl::create() could check communicator.is_available() and return nullptr early if the communicator has no store, so isAvailable() itself reflects whether the full NIXL pipeline (including metadata exchange) can function.

Reviews (18): Last reviewed commit: "Merge branch 'main' into dispatch_combin..."

Contributor

@greptile-apps greptile-apps bot left a comment

9 files reviewed, 1 comment


Collaborator

@mdavis36 mdavis36 left a comment

Thank you for addressing the CMake changes so quickly; just a few more comments, but overall this looks good. When you get a chance, would you be able to add screenshots / examples of the pretty report output for a few NIXL cases, e.g. success, wrong CUDA version, and library not found at all?

@x41lakazam
Collaborator Author

!test

@samnordmann
Collaborator

!test

@xwang233 could you please add permission for @x41lakazam to launch CI? Or indicate how to do so?
Thanks!

@samnordmann
Collaborator

!test

@samnordmann
Collaborator

!test

@x41lakazam
Collaborator Author

!build

@x41lakazam
Collaborator Author

!test

@x41lakazam
Collaborator Author

!test

@x41lakazam
Collaborator Author

!test

@x41lakazam
Collaborator Author

!test

Collaborator

@samnordmann samnordmann left a comment

LGTM! Thank you

@samnordmann
Collaborator

!test

@samnordmann samnordmann requested a review from wujingyue March 23, 2026 13:55
Comment on lines +1308 to +1312
message(STATUS " NIXL_FOUND : ${NIXL_FOUND}")
if(NIXL_FOUND)
message(STATUS " NIXL_INCLUDE_DIR: ${NIXL_INCLUDE_DIR}")
message(STATUS " NIXL_LIBRARY : ${NIXL_LIBRARY}")
endif()
Collaborator

Please report this in cmake/deps/handle_nixl.cmake

Collaborator

@wujingyue wujingyue left a comment

Thanks for adding the new backend! I'll do another round after my initial comments are addressed. I'll defer the build system changes to @mdavis36

struct TensorDesc {
uintptr_t addr;
size_t size;
uint32_t dev; // CUDA device index (tensor.device().index())
Collaborator

Suggested change
uint32_t dev; // CUDA device index (tensor.device().index())
uint32_t local_rank; // CUDA device index (tensor.device().index())

// Helper functions for serializing and deserializing tensors descriptors for
// TCP store
struct TensorDesc {
uintptr_t addr;
Collaborator

Suggested change
uintptr_t addr;
void* addr;

unless we do a lot of pointer arithmetic on this. I haven't seen that yet in this PR.

// TCP store
struct TensorDesc {
uintptr_t addr;
size_t size;
Collaborator

Suggested change
size_t size;
int64_t size;

https://google.github.io/styleguide/cppguide.html#Integer_Types => "On Unsigned Integers"

Comment on lines +199 to +200
const std::vector<TensorDesc>& local_descs,
const std::vector<TensorDesc>& remote_descs,
Collaborator

Can you code-comment what these arguments mean?


4 participants