perftest: Fix double frees during RDMA CM connection retry #368
When running perftest with RDMA CM enabled in an environment where the server is under heavy load and likely to reject CM connections issued by the client, a rejected connection request (RDMA_CM_EVENT_REJECTED) sends the client into a retry loop in rdma_cm_client_connection. However, the previous retry logic contained multiple flaws that caused segmentation faults, double frees, and heap corruption.
The error output looked like this:
RDMA CM event error:
Event: RDMA_CM_EVENT_REJECTED; error: 8.
ERRNO:Operation not supported.
Failed to handle RDMA CM event.
ERRNO: Operation not supported.
Failed to connect RDMA CM events.
ERRNO:Operation not supported.
Failed to resolve RDMA CM address.
ERRNO: Bad file descriptor.
Failed to destroy RDMA CM ID number 0.
ERRNO: Bad file descriptor.
Failed to destroy RDMA CM contexts.
ERRNO: Bad file descriptor.
free(): double free detected in tcache 2
The backtrace from the resulting core dump (an abort raised by glibc's double-free check) looked like this:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fcb90783db5 in __GI_abort () at abort.c:79
#2 0x00007fcb907dc4e7 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fcb908ebaae "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007fcb907e35ec in malloc_printerr (str=str@entry=0x7fcb908ed6d8 "free(): double free detected in tcache 2") at malloc.c:5374
#4 0x00007fcb907e535d in _int_free (av=0x7fcb90b21bc0 <main_arena>, p=0xf459c0, have_lock=) at malloc.c:4213
#5 0x000000000040dd86 in create_rdma_cm_connection (ctx=0x7ffcf6f18410, user_param=0x7ffcf6f17fe0, comm=0x7ffcf6f17fc0, my_dest=0xf41590, rem_dest=0xf41a00) at src/perftest_communication.c:2949
#6 0x0000000000404b83 in main (argc=26, argv=0x7ffcf6f186f8) at src/send_bw.c:273
The following specific issues were identified and fixed:
Double Free of Event Channel: rdma_destroy_event_channel() was called inside rdma_cm_destroy_cma() during internal cleanup, but the pointer was not cleared. The caller, create_rdma_cm_connection(), would then attempt to destroy the same channel again in its error path.
Fix: Set the channel pointer to NULL after destruction and check for NULL before attempting to destroy it. The event channel and CM nodes are reallocated at the start of the next retry attempt.
Heap Corruption via Index Overflow: ctx->cma_master.connection_index was incremented on every connection attempt but was never reset upon failure. During retries, this index would exceed the bounds of the nodes array, leading to out-of-bounds writes and heap metadata corruption. Other fields of cma_master were left stale in the same way.
Fix: Fully reset the fields of cma_master in rdma_cm_destroy_cma().
Context Corruption and Leaks:
rdma_cm_route_handler() unconditionally called ctx_init() and create_qp_main() on every retry attempt. This overwrote existing pointers (PD, MR, buffers) in the context structure without releasing the old resources, causing memory leaks and "Bad file descriptor" errors during final cleanup. Recreating an already existing QP also produced a QP creation error, which made the retry fail.
Fix: Add a check in rdma_cm_route_handler() to ensure ctx_init() and create_qp_main() are only called if the context has not been initialized yet.
This fix was tested in the same multi-node environment for 10+ hours, and no further segfaults were observed.
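
The three fixes described above can be illustrated with a minimal, self-contained sketch. It is not the perftest code: every type and function name here (mock_ctx, mock_cma_master, mock_cma_destroy, mock_connect_with_retry, NUM_NODES, resources_ready) is hypothetical, and malloc/free stand in for the librdmacm create/destroy calls. It only shows the pattern: an idempotent teardown that clears pointers and resets the index, plus a guard around one-time initialization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NUM_NODES 2  /* hypothetical: number of CM nodes per attempt */

/* Stand-ins for the real state; names are illustrative only. */
struct mock_cma_master {
    void *channel;          /* stands in for the RDMA CM event channel */
    int  *nodes;            /* stands in for the per-connection node array */
    int   connection_index; /* next slot in nodes[] to fill */
};

struct mock_ctx {
    struct mock_cma_master cma_master;
    bool resources_ready;   /* set once PD/MR/buffers/QPs would exist */
    int  init_calls;        /* counts how often one-time init actually ran */
};

/* Fixes 1 and 2: teardown that is safe to call twice and leaves no stale state. */
void mock_cma_destroy(struct mock_ctx *ctx)
{
    struct mock_cma_master *m = &ctx->cma_master;

    if (m->channel) {           /* NULL check before destroying */
        free(m->channel);
        m->channel = NULL;      /* clear so a second call cannot double free */
    }
    free(m->nodes);             /* free(NULL) is a no-op */
    m->nodes = NULL;
    m->connection_index = 0;    /* reset so retries start at slot 0 */
}

/* Reallocate the channel and node array at the start of each attempt. */
static int mock_cma_setup(struct mock_ctx *ctx)
{
    struct mock_cma_master *m = &ctx->cma_master;

    m->channel = malloc(1);
    m->nodes = calloc(NUM_NODES, sizeof(*m->nodes));
    if (!m->channel || !m->nodes) {
        mock_cma_destroy(ctx);
        return -1;
    }
    return 0;
}

/* Fix 3: run the one-time initialization only if it has not run yet. */
static void mock_route_handler(struct mock_ctx *ctx)
{
    if (!ctx->resources_ready) {
        ctx->init_calls++;      /* stands in for ctx_init() + create_qp_main() */
        ctx->resources_ready = true;
    }
}

/* One attempt; 'reject' simulates an RDMA_CM_EVENT_REJECTED outcome. */
static int mock_connect_once(struct mock_ctx *ctx, bool reject)
{
    struct mock_cma_master *m = &ctx->cma_master;

    if (mock_cma_setup(ctx) != 0)
        return -1;
    for (int i = 0; i < NUM_NODES; i++) {
        /* Without the index reset, this would fail on the second attempt. */
        assert(m->connection_index < NUM_NODES);
        m->nodes[m->connection_index++] = i;
    }
    mock_route_handler(ctx);
    if (reject) {
        mock_cma_destroy(ctx);  /* internal cleanup on rejection */
        return -1;
    }
    return 0;
}

/* Retry loop: returns the attempt number that succeeded, or -1. */
int mock_connect_with_retry(struct mock_ctx *ctx, int rejections, int max_tries)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (mock_connect_once(ctx, attempt <= rejections) == 0)
            return attempt;
    }
    mock_cma_destroy(ctx);      /* caller's error path: a harmless second call */
    return -1;
}
```

With a zero-initialized struct mock_ctx, mock_connect_with_retry(&ctx, 2, 5) survives two simulated rejections and succeeds on the third attempt, and the one-time initialization runs exactly once.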