Commit a50eedb

sendrecv: fix random system crashes during multi-threaded comm abort scenario
The `err_buffer` must be zero-initialized before it is passed to libfabric's `fi_cq_readerr`. Earlier, the buffer was allocated outside the while() loop and was not reset to zero before each `fi_cq_readerr` call inside the loop, which caused random memory corruption. The fix is either (1) keep the `err_buffer` allocation outside the loop and zero-initialize it before every `fi_cq_readerr` call, or (2) move the allocation and zero-initialization inside the while loop. This commit implements option (2): the `err_buffer` allocation and zero-initialization now happen inside the while loop.

Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
1 parent 7a252ef commit a50eedb

1 file changed: 6 additions, 7 deletions

src/nccl_ofi_sendrecv.cpp

Lines changed: 6 additions & 7 deletions
@@ -277,12 +277,6 @@ static int sendrecv_cq_process(struct fid_cq *cq)
 {
 	ssize_t rc = 0;
 	int ret = 0;
-	/*
-	 * On call to fi_cq_readerr, Libfabric requires some members of
-	 * err_entry to be zero-initialized or point to valid data. For
-	 * simplicity, just zero out the whole struct.
-	 */
-	struct fi_cq_err_entry err_buffer = {};
 	struct fi_cq_tagged_entry cqe_tagged_buffers[cq_read_count];
 
 	while (true) {
@@ -296,7 +290,12 @@ static int sendrecv_cq_process(struct fid_cq *cq)
 		}
 		else if (OFI_UNLIKELY(rc == -FI_EAVAIL)) {
 			nccl_net_ofi_context_t *ctx;
-
+			/*
+			 * On call to fi_cq_readerr, Libfabric requires some members of
+			 * err_entry to be zero-initialized or point to valid data. For
+			 * simplicity, just zero out the whole struct.
+			 */
+			struct fi_cq_err_entry err_buffer = {};
 			rc = fi_cq_readerr(cq, &err_buffer, 0);
 
 			if (OFI_UNLIKELY(rc == -FI_EAGAIN)) {
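
The two options named in the commit message can be compared with a small standalone sketch. It does not use libfabric: `mock_err_entry` and `mock_readerr` are hypothetical stand-ins for `fi_cq_err_entry` and `fi_cq_readerr`, and the behavior modeled here (the callee treating a non-zero `err_data_size` left over from an earlier call as a caller-supplied buffer) is an assumption about why stale fields are harmful. The commit itself states only that the struct must be zeroed before every call.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the fields of fi_cq_err_entry that matter here.
struct mock_err_entry {
	int err;
	void *err_data;
	std::size_t err_data_size;
};

// Storage the mock "provider" hands back when the caller supplies no buffer.
static char provider_storage[16];

// Hypothetical stand-in for fi_cq_readerr. Returns true when the callee
// would have trusted err_data/err_data_size as caller-provided input,
// which is only safe if the caller set those fields deliberately.
static bool mock_readerr(mock_err_entry *entry)
{
	assert(entry != nullptr);
	bool trusted_caller_fields = (entry->err_data_size != 0);
	entry->err = 1;
	if (!trusted_caller_fields) {
		// Zeroed input: the callee fills in its own buffer and size.
		entry->err_data = provider_storage;
		entry->err_data_size = sizeof(provider_storage);
	}
	return trusted_caller_fields;
}

// Pre-fix pattern: one buffer for the whole loop, never reset.
int stale_calls_hoisted_no_reset(int calls)
{
	int stale = 0;
	mock_err_entry err_buffer = {};
	for (int i = 0; i < calls; ++i) {
		if (mock_readerr(&err_buffer)) ++stale;
	}
	return stale;
}

// Option (1): buffer stays outside the loop but is zeroed before each call.
int stale_calls_hoisted_with_reset(int calls)
{
	int stale = 0;
	mock_err_entry err_buffer = {};
	for (int i = 0; i < calls; ++i) {
		err_buffer = {};
		if (mock_readerr(&err_buffer)) ++stale;
	}
	return stale;
}

// Option (2), the one this commit takes: declare and zero inside the loop.
int stale_calls_per_iteration(int calls)
{
	int stale = 0;
	for (int i = 0; i < calls; ++i) {
		mock_err_entry err_buffer = {};
		if (mock_readerr(&err_buffer)) ++stale;
	}
	return stale;
}
```

In this model, every call after the first sees leftover fields in the pre-fix pattern, while both options hand the callee a zeroed struct each time. Option (2) has the practical advantage that the zeroing cannot be forgotten when the loop body changes later.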
