Commit f62d1d7

libfabric: Use desc-specific target offset (#883)
This fixes a bug in multi-descriptor transfers where descriptors point to different offsets within the same registered memory region. Without this fix, RDMA reads always target offset 0 of the region; each descriptor's specific target address should be used instead.

Also impacted: block-based transfers (for example, iteration N would read blocks from iteration 0) and partial buffer updates.

Signed-off-by: Tushar Gohad <[email protected]>
1 parent bb0b873 commit f62d1d7

1 file changed, +6 -2 lines changed


src/plugins/libfabric/libfabric_backend.cpp

Lines changed: 6 additions & 2 deletions
@@ -1035,7 +1035,8 @@ nixlLibfabricEngine::postXfer(const nixl_xfer_op_t &operation,
         int gpu_id = local[desc_idx].devId;
 
         NIXL_DEBUG << "Processing descriptor " << desc_idx << " GPU " << gpu_id
-                   << " addr: " << transfer_addr << " size: " << transfer_size;
+                   << " local_addr: " << transfer_addr << " size: " << transfer_size
+                   << " remote_addr: " << (void *)remote[desc_idx].addr;
 
         NIXL_DEBUG << "DEBUG: remote_agent='" << remote_agent << "' localAgent='" << localAgent
                    << "'";
@@ -1071,11 +1072,14 @@ nixlLibfabricEngine::postXfer(const nixl_xfer_op_t &operation,
         }
 
         // Prepare and submit transfer for remote agents
+        // Use descriptor's specific target address
+        uint64_t remote_target_addr = remote[desc_idx].addr;
+
         nixl_status_t status = rail_manager.prepareAndSubmitTransfer(
             op_type,
             transfer_addr,
             transfer_size,
-            remote_md->remote_buf_addr_,
+            remote_target_addr,
             local_md->selected_rails_,
             local_md->rail_mr_list_,
             remote_md->rail_remote_key_list_,
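
The failure mode described in the commit message can be reproduced in a small standalone sketch. This is not the NIXL or libfabric code: `Desc`, `fake_rdma_read`, and `read_blocks` are hypothetical names introduced here for illustration, and a plain `memcpy` stands in for the RDMA read. The sketch only assumes what the message states, namely that several descriptors point at different offsets within one registered region.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical descriptor: a start address and a length in bytes.
struct Desc {
    std::uintptr_t addr;
    std::size_t len;
};

// Stand-in for an RDMA read (assumption: a real backend would post a
// fabric operation here instead of calling memcpy).
static void fake_rdma_read(std::uintptr_t local_addr, std::uintptr_t remote_addr,
                           std::size_t len) {
    std::memcpy(reinterpret_cast<void *>(local_addr),
                reinterpret_cast<const void *>(remote_addr), len);
}

// Reads three 4-byte blocks from a simulated remote region into a local one.
// use_desc_addr == false mimics the pre-fix behavior (always the region base);
// use_desc_addr == true mimics the fix (each descriptor's own address).
static std::vector<unsigned char> read_blocks(bool use_desc_addr) {
    constexpr std::size_t kBlock = 4;
    constexpr std::size_t kBlocks = 3;

    // Remote region: block b is filled with the byte value b + 1.
    std::vector<unsigned char> remote_region(kBlock * kBlocks);
    for (std::size_t i = 0; i < remote_region.size(); ++i) {
        remote_region[i] = static_cast<unsigned char>(i / kBlock + 1);
    }
    std::vector<unsigned char> local_region(kBlock * kBlocks, 0);

    const auto remote_base = reinterpret_cast<std::uintptr_t>(remote_region.data());
    const auto local_base = reinterpret_cast<std::uintptr_t>(local_region.data());

    // One descriptor per block, each at a different offset in the same region.
    std::vector<Desc> local;
    std::vector<Desc> remote;
    for (std::size_t b = 0; b < kBlocks; ++b) {
        local.push_back({local_base + b * kBlock, kBlock});
        remote.push_back({remote_base + b * kBlock, kBlock});
    }

    for (std::size_t i = 0; i < local.size(); ++i) {
        assert(local[i].len == remote[i].len);
        const std::uintptr_t target = use_desc_addr ? remote[i].addr : remote_base;
        fake_rdma_read(local[i].addr, target, local[i].len);
    }
    return local_region;
}
```

With `read_blocks(false)`, every local block ends up holding a copy of remote block 0, which matches the "iteration N would read blocks from iteration 0" symptom. With `read_blocks(true)`, each local block receives the data from its matching remote offset.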
