Commit 047cf33

rdma: fix uninitialized value on fi_cq_readerr call
In commit 35264d2, we changed the error processing code so that `fi_cq_err_entry` was left uninitialized (instead of zero-initialized) on call to `fi_cq_readerr`. However, per the Libfabric spec, the `err_data_size` field of `fi_cq_err_entry` needs to be set to zero (or a valid buffer provided):

> If err_data_size is 0 on input, or the fabric was opened with release
> < 1.5, then any buffer referenced by err_data will be ignored on input.

ref: https://ofiwg.github.io/libfabric/v2.1.0/man/fi_cq.3.html

Fix by zeroing out the whole struct, as we did before. Add explanatory comments to this line in both protocols.

Signed-off-by: Eric Raut <[email protected]>
1 parent ccc7528 commit 047cf33

2 files changed: 11 additions & 1 deletion

src/nccl_ofi_rdma.cpp

Lines changed: 6 additions & 1 deletion
@@ -2020,7 +2020,12 @@ static int ofi_process_cq_rail(nccl_net_ofi_rdma_ep_t *ep, nccl_net_ofi_ep_rail_
 			if (OFI_UNLIKELY(ret != 0))
 				goto exit;
 		} else if (OFI_UNLIKELY(rc == -FI_EAVAIL)) {
-			struct fi_cq_err_entry err_entry;
+			/*
+			 * On call to fi_cq_readerr, Libfabric requires some members of
+			 * err_entry to be zero-initialized or point to valid data. For
+			 * simplicity, just zero out the whole struct.
+			 */
+			struct fi_cq_err_entry err_entry = { };
 
 			ret = fi_cq_readerr(rail->cq, &err_entry, 0);
 			if (OFI_UNLIKELY(ret == -FI_EAGAIN)) {

src/nccl_ofi_sendrecv.cpp

Lines changed: 5 additions & 0 deletions
@@ -295,6 +295,11 @@ static int sendrecv_cq_process(struct fid_cq *cq)
 {
 	ssize_t rc = 0;
 	int ret = 0;
+	/*
+	 * On call to fi_cq_readerr, Libfabric requires some members of
+	 * err_entry to be zero-initialized or point to valid data. For
+	 * simplicity, just zero out the whole struct.
+	 */
 	struct fi_cq_err_entry err_buffer = {};
 	struct fi_cq_tagged_entry cqe_tagged_buffers[cq_read_count];
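
To illustrate why the zero-initialization matters, here is a minimal, self-contained sketch. It does not use Libfabric: `mock_cq_err_entry` and `mock_cq_readerr` are hypothetical stand-ins that model only the input rule quoted in the commit message (a buffer in `err_data` is used only when `err_data_size` is nonzero on input). The field subset, the error code, and the detail string are assumptions made for the example, not the real `fi_cq_err_entry` layout or provider behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a subset of struct fi_cq_err_entry.
// The real struct has more members; only the ones relevant here are modeled.
struct mock_cq_err_entry {
	void *op_context;
	int err;
	int prov_errno;
	void *err_data;
	std::size_t err_data_size;
};

// Hypothetical stand-in for fi_cq_readerr(). It touches err_data only when
// err_data_size is nonzero on input, and stores the number of bytes copied
// back into err_data_size on output.
static std::size_t mock_cq_readerr(mock_cq_err_entry *entry)
{
	static const char detail[] = "link down"; // illustrative error detail
	std::size_t n = entry->err_data_size;
	if (n > sizeof(detail))
		n = sizeof(detail);
	if (n > 0) {
		// With an uninitialized struct, err_data and err_data_size would
		// both be garbage here, and this copy could write anywhere.
		assert(entry->err_data != nullptr);
		std::memcpy(entry->err_data, detail, n);
	}
	entry->err = 5; // arbitrary illustrative error code
	entry->err_data_size = n;
	return n;
}

// Option 1 (the approach taken in this commit): value-initialize the whole
// struct, so err_data_size is 0 and any err_data pointer is ignored.
static std::size_t read_error_zeroed()
{
	mock_cq_err_entry entry = {};
	return mock_cq_readerr(&entry);
}

// Option 2: supply a valid buffer together with its size.
static std::size_t read_error_with_buffer(char *buf, std::size_t cap)
{
	mock_cq_err_entry entry = {};
	entry.err_data = buf;
	entry.err_data_size = cap;
	return mock_cq_readerr(&entry);
}
```

In this sketch, `read_error_zeroed()` never writes through `err_data`, while `read_error_with_buffer()` copies at most `cap` bytes into the caller's buffer. Declaring the entry without an initializer would leave both fields indeterminate, which is the situation the commit removes.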
