Conversation
Review updated until commit 7283aa8

| Relevant files | |
|---|---|
| Enhancement | 5 files |
| Configuration changes | |
| Documentation | 1 file |
| Tests | 1 file |
PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 PR contains tests

⚡ Recommended focus areas for review

Spin loop performance concern: the busy-wait loop can cause high CPU usage and may not be optimal for latency-sensitive workloads. Consider whether it should yield to other threads or use a more efficient waiting mechanism.
samnordmann
left a comment
Thank you very much! This looks great.
Here are some comments requesting minor changes or explanation. The only point I'm a bit worried about is that we need a way to make the "wait" non-blocking for the CPU.
```cpp
#endif
}

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
```
This wait function is CPU-blocking, so in practice it is more or less unusable in our context. Do you have an idea how to make this non-blocking for the CPU, and ideally CUDA-graph capturable?
Definitely important. I suggest we leave it for another PR, to keep this one simple and not too big.
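Until a fully non-blocking design lands, a common middle ground is to replace a pure spin with spin-then-backoff: yield for the first iterations, then sleep with growing delays. A minimal sketch under that assumption; `is_done` is a hypothetical stand-in for polling `nixlAgent::getXferStatus`, not the PR's actual API:

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Spin-then-sleep wait: cheap polling at first so short transfers stay
// low-latency, then progressively longer sleeps so a long-running
// transfer does not pin a CPU core at 100%.
inline void backoffWait(const std::function<bool()>& is_done) {
  using namespace std::chrono;
  int spins = 0;
  auto delay = microseconds(1);
  const auto max_delay = microseconds(1000);
  while (!is_done()) {
    if (spins++ < 64) {
      std::this_thread::yield();  // fast path for short transfers
    } else {
      std::this_thread::sleep_for(delay);  // back off for long ones
      delay = std::min(delay * 2, max_delay);  // exponential, capped
    }
  }
}
```

This only addresses CPU usage; making the wait CUDA-graph capturable would still need a device-side mechanism and is out of scope for this sketch.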
@x41lakazam Can you provide instructions on how to build nixl, say, from the pjnl docker image? We'll probably need to think about how to add the library to the base image and/or the CI, unless it is already shipped in some DLFW package.

https://nvidia.slack.com/archives/C08KL9MNQ3U/p1771951941351029

Note the build error in the CI.
Greptile Summary

This PR introduces a new NIXL backend for GPU-to-GPU RDMA transfers in nvFuser's multi-device framework. Several previously raised concerns have been addressed in the current revision.

Confidence Score: 3/5

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant NixlBackend
    participant NixlImpl as NixlBackend::Impl
    participant nixlAgent
    participant TCPStore

    User->>NixlBackend: getInstance()
    NixlBackend->>NixlImpl: Impl::create(communicator)
    NixlImpl->>nixlAgent: new nixlAgent(agent_name, cfg)
    NixlImpl->>nixlAgent: createBackend("UCX", ...)
    NixlImpl->>nixlAgent: registerMem(probe) [VRAM probe]
    NixlImpl->>nixlAgent: prepXferDlist(probe) [verify VRAM usable]
    NixlImpl->>nixlAgent: deregisterMem(probe)
    NixlImpl-->>NixlBackend: impl_ (or nullptr if probe failed)

    User->>NixlBackend: registerTensors({t1, t2})
    NixlBackend->>NixlImpl: registerTensors(tensors)
    NixlImpl->>nixlAgent: registerMem(dlist)
    NixlImpl->>NixlImpl: exchangeMetadata() [collective]
    NixlImpl->>nixlAgent: getLocalMD(local_md)
    NixlImpl->>TCPStore: set("nixl_agent_md_rank_N", local_md)
    loop for each peer rank
        NixlImpl->>TCPStore: get("nixl_agent_md_rank_P") [blocks until ready]
        NixlImpl->>nixlAgent: loadRemoteMD(remote_md)
    end
    NixlImpl->>NixlImpl: communicator.barrier()
    NixlImpl->>TCPStore: deleteKey("nixl_agent_md_rank_N")

    User->>NixlBackend: prepareTransfer(local_descs, remote_descs, op)
    NixlBackend->>NixlImpl: prepareTransfer(...)
    NixlImpl->>nixlAgent: createXferReq(op, local_dlist, remote_dlist, agent_name)
    NixlImpl-->>User: NixlTransferHandle

    User->>NixlBackend: postTransfer(handle)
    NixlBackend->>NixlImpl: postTransfer(handle)
    NixlImpl->>nixlAgent: postXferReq(xfer_handle)

    User->>NixlBackend: waitTransfer(handle)
    NixlBackend->>NixlImpl: waitTransfer(handle)
    loop until done
        NixlImpl->>nixlAgent: getXferStatus(xfer_handle)
        NixlImpl->>NixlImpl: this_thread::yield() [if in progress]
    end

    User->>NixlBackend: deregisterTensors({t1, t2})
    NixlBackend->>NixlImpl: deregisterTensors(tensors)
    NixlImpl->>nixlAgent: deregisterMem(dlist)
    NixlImpl->>NixlImpl: exchangeMetadata() [collective]
```
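The registration-time metadata rendezvous in the diagram boils down to a set/get/barrier pattern keyed by rank. A minimal single-process sketch of that key-naming scheme, with a plain `std::map` standing in for the distributed TCP store; all names here are illustrative, not nvFuser's actual helpers:

```cpp
#include <map>
#include <string>
#include <vector>

// Toy key/value store standing in for the distributed TCP store.
using Store = std::map<std::string, std::string>;

// Each rank publishes its serialized agent metadata under a
// rank-specific key, then collects every peer's blob.
std::vector<std::string> exchangeMetadata(
    Store& store, int my_rank, int world_size, const std::string& md) {
  store["nixl_agent_md_rank_" + std::to_string(my_rank)] = md;  // set()
  std::vector<std::string> remote;
  for (int p = 0; p < world_size; ++p) {
    if (p == my_rank) continue;
    // In the real backend this get() blocks until the peer has set its key.
    remote.push_back(store.at("nixl_agent_md_rank_" + std::to_string(p)));
  }
  // After a barrier, each rank deletes its own key to keep the store clean.
  store.erase("nixl_agent_md_rank_" + std::to_string(my_rank));
  return remote;
}
```

The barrier before `deleteKey` matters: without it, a rank could delete its key before a slow peer has fetched it.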
mdavis36
left a comment
Thank you for addressing the CMake changes so quickly; just a few more comments, but overall this looks good. When you get a chance, could you add screenshots / examples of the pretty report output for a few NIXL cases, e.g. success, wrong CUDA version, and library not found at all?
!test

@xwang233 could you please add permission to @x41lakazam to launch CI? Or indicate how to do so?

!test

!test

!build

!test

!test

!test

!test

!test
```cmake
message(STATUS " NIXL_FOUND : ${NIXL_FOUND}")
if(NIXL_FOUND)
  message(STATUS " NIXL_INCLUDE_DIR: ${NIXL_INCLUDE_DIR}")
  message(STATUS " NIXL_LIBRARY : ${NIXL_LIBRARY}")
endif()
```
Please report this in cmake/deps/handle_nixl.cmake
```cpp
struct TensorDesc {
  uintptr_t addr;
  size_t size;
  uint32_t dev; // CUDA device index (tensor.device().index())
```

Suggested change:

```diff
- uint32_t dev; // CUDA device index (tensor.device().index())
+ uint32_t local_rank; // CUDA device index (tensor.device().index())
```
```cpp
// Helper functions for serializing and deserializing tensors descriptors for
// TCP store
struct TensorDesc {
  uintptr_t addr;
```

Suggested change:

```diff
- uintptr_t addr;
+ void* addr;
```

unless we do a lot of pointer arithmetic on this. I haven't seen that just yet in this PR.
```cpp
// TCP store
struct TensorDesc {
  uintptr_t addr;
  size_t size;
```

Suggested change:

```diff
- size_t size;
+ int64_t size;
```

https://google.github.io/styleguide/cppguide.html#Integer_Types => "On Unsigned Integers"
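Either field representation works as long as (de)serialization for the store is explicit. A minimal sketch of one such encoding, keeping `uintptr_t` for the address (it round-trips through a byte buffer without casts) and switching the size to `int64_t` per the style-guide note above; the helper names are illustrative and may differ from the PR's actual helpers:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

struct TensorDesc {
  uintptr_t addr;  // device pointer as an integer, trivially copyable
  int64_t size;    // signed size, per the Google style-guide note
  uint32_t dev;    // CUDA device index (tensor.device().index())
};

// Pack the descriptor into a byte string for the TCP store. Copying the
// whole struct (including padding) assumes identical struct layout on
// every rank, which holds for a homogeneous cluster.
inline std::string serialize(const TensorDesc& d) {
  std::string out(sizeof(TensorDesc), '\0');
  std::memcpy(&out[0], &d, sizeof(TensorDesc));
  return out;
}

inline TensorDesc deserialize(const std::string& s) {
  TensorDesc d;
  std::memcpy(&d, s.data(), sizeof(TensorDesc));
  return d;
}
```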
```cpp
const std::vector<TensorDesc>& local_descs,
const std::vector<TensorDesc>& remote_descs,
```

Can you add code comments explaining what these arguments mean?

No description provided.