Remove extra copy by assuming reduce tensor is symmetric #6040

nsarka wants to merge 2 commits into NVIDIA:main
Conversation
Greptile Summary

This PR removes an unnecessary device-to-device copy in the CUDA backend's Reduce collective by requiring the input tensor to already reside in symmetric (VMM-backed) memory, so the input no longer needs to be staged into a separate symmetric buffer before the reduction.
Confidence Score: 4/5

Not safe to merge as-is: the root rank will crash at the Wait phase due to a missing cache-key fix in the wait handler. The core optimization in cuda_p2p.cpp and ipc_handle.cpp is correct and clean. The evaluator post path is also correctly updated. However, the wait handler in evaluator.cpp was not updated to mirror the post path's Reduce-specific cache-key logic, causing the root rank to look up a different cache entry during Wait and triggering a runtime NVF_CHECK failure. The score of 4 reflects one clear P1 defect that must be resolved before merging: csrc/host_ir/evaluator.cpp — the Wait handler (around line 527) needs the same Reduce-specific cache-key logic as the post path.
Sequence Diagram

```mermaid
sequenceDiagram
    participant E as HostIrEvaluator
    participant C as HandleCache
    participant R as SymMemForReduce
    participant K as CUDA Reduce Kernel
    note over E: Post phase (handle Communication)
    E->>C: get(input_tensor, comm, root)
    C->>R: construct with input_tensor (symmetric)
    R-->>C: handle created
    C-->>E: reduce_handle ptr
    note over K: Before this PR: cudaMemcpyAsync input to sym_buf (removed)
    E->>K: postReduceWithCudaBackend(output, handle, stream, root)
    K->>K: All ranks signal kInProgress
    K->>K: Root waits for all non-roots
    K->>K: launchMulticastReduceKernel via mc_ptr to output
    K->>K: Root signals kIdle to non-roots
    note over E: Wait phase (handle Wait) - BUG on root rank
    E->>C: get(output_tensor, comm, root) - mismatched key on root
    C->>R: construct with output_tensor (NOT symmetric)
    R-->>C: SymmetricTensor validate FAILS - crash
```
!test
```cpp
NVF_CHECK(
    input.scalar_type() == at::kFloat &&
        reduce_handle->inputBuffer().scalar_type() == at::kFloat &&
        (!output.defined() || output.scalar_type() == at::kFloat),
```
For better error reporting, can you split this into two NVF_CHECKs? The first one may use NVF_CHECK_EQ.
```cpp
// For Broadcast and Reduce, non-roots may have no output; use input for
// cache key
at::Tensor cache_buffer =
    output_tensor.defined() ? output_tensor : input_tensor;
```
I'm trying to figure out the contract here since the code is getting a bit too if-elsy.
- When can input_tensor/output_tensor be undefined? When and only when the rank is not in the team?
- For the kCuda backend, broadcast's output has to be symmetric and reduce's input has to be symmetric?
(I'm trying to confirm my understanding before proposing any cleanups -- thanks!)
Force-pushed from b57ce84 to e8d7f80.
For the CUDA backend. I will update the PR shortly to do the same for allreduce.