perftest: Fix double frees during RDMA CM connection retry #368

SherrinZhou · 2025-12-09T03:31:06Z

When perftest runs with RDMA CM enabled and the server is under heavy load, the server is likely to reject CM connection requests issued by the client. When a connection request is rejected (RDMA_CM_EVENT_REJECTED), the client enters a retry loop in rdma_cm_client_connection.

However, the previous retry logic contained multiple flaws causing segmentation faults, double frees, and heap corruption.

The error output looked like this:
RDMA CM event error:
Event: RDMA_CM_EVENT_REJECTED; error: 8.
ERRNO:Operation not supported.
Failed to handle RDMA CM event.
ERRNO: Operation not supported.
Failed to connect RDMA CM events.
ERRNO:Operation not supported.
Failed to resolve RDMA CM address.
ERRNO: Bad file descriptor.
Failed to destroy RDMA CM ID number 0.
ERRNO: Bad file descriptor.
Failed to destroy RDMA CM contexts.
ERRNO: Bad file descriptor.
free(): double free detected in tcache 2

The backtrace from the resulting core dump (an abort raised by glibc's double-free check) looked like this:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fcb90783db5 in __GI_abort () at abort.c:79
#2 0x00007fcb907dc4e7 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fcb908ebaae "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007fcb907e35ec in malloc_printerr (str=str@entry=0x7fcb908ed6d8 "free(): double free detected in tcache 2") at malloc.c:5374
#4 0x00007fcb907e535d in _int_free (av=0x7fcb90b21bc0 <main_arena>, p=0xf459c0, have_lock=) at malloc.c:4213
#5 0x000000000040dd86 in create_rdma_cm_connection (ctx=0x7ffcf6f18410, user_param=0x7ffcf6f17fe0, comm=0x7ffcf6f17fc0, my_dest=0xf41590, rem_dest=0xf41a00) at src/perftest_communication.c:2949
#6 0x0000000000404b83 in main (argc=26, argv=0x7ffcf6f186f8) at src/send_bw.c:273

The following specific issues were identified and fixed:

1. Double Free of Event Channel: rdma_destroy_event_channel() was called inside rdma_cm_destroy_cma() during internal cleanup, but the pointer was not cleared. The caller, create_rdma_cm_connection(), would then attempt to destroy the same channel again in its error path.
   Fix: Set the channel pointer to NULL after destruction and check for NULL before attempting to destroy it. The event channel and CM nodes are reallocated when the next retry attempt begins.
2. Heap Corruption via Index Overflow: ctx->cma_master.connection_index was incremented on every connection attempt but was never reset on failure. During retries, the index would exceed the bounds of the nodes array, leading to out-of-bounds writes and heap metadata corruption. Other fields of cma_master had similar stale-state problems.
   Fix: Reset all fields of cma_master in rdma_cm_destroy_cma().
3. Context Corruption and Leaks: rdma_cm_route_handler() unconditionally called ctx_init() and create_qp_main() on every retry attempt. This overwrote existing pointers (PD, MR, buffers) in the context structure without releasing the old resources, causing memory leaks and "Bad file descriptor" errors during final cleanup. Recreating an already existing QP also caused a QP creation error, so the retry itself failed.
   Fix: Add a check in rdma_cm_route_handler() so that ctx_init() and create_qp_main() are called only if the context has not been initialized yet.
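
The three fixes above follow a common pattern: make teardown idempotent, reset per-attempt counters, and guard one-time initialization. The following is a minimal sketch of that pattern, not the actual perftest code. All `demo_*` names, the struct layouts, and the `initialized` flag are hypothetical stand-ins, and plain `malloc`/`free` replace the real librdmacm and verbs calls so the example compiles without RDMA libraries.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for perftest's structures; the real layouts differ. */
struct demo_channel { int fd; };

struct demo_cma_master {
    struct demo_channel *channel; /* stands in for the event channel pointer */
    int *nodes;                   /* stands in for the per-connection nodes array */
    int num_of_nodes;
    int connection_index;
};

struct demo_ctx {
    struct demo_cma_master cma_master;
    void *buf;        /* stands in for PD/MR/buffers set up by ctx_init() */
    int initialized;  /* guard flag; an assumption of this sketch */
    int init_calls;   /* counts how often one-time setup really ran */
};

/* Allocate the channel and node array for one connection attempt. */
static int demo_cma_create(struct demo_ctx *ctx, int num_nodes)
{
    struct demo_cma_master *m = &ctx->cma_master;

    m->channel = malloc(sizeof(*m->channel));
    m->nodes = calloc((size_t)num_nodes, sizeof(*m->nodes));
    if (!m->channel || !m->nodes) {
        free(m->channel);
        free(m->nodes);
        m->channel = NULL;
        m->nodes = NULL;
        return -1;
    }
    m->channel->fd = 3;
    m->num_of_nodes = num_nodes;
    m->connection_index = 0;
    return 0;
}

/* Fixes 1 and 2: idempotent teardown that clears pointers and resets counters. */
static void demo_cma_destroy(struct demo_ctx *ctx)
{
    struct demo_cma_master *m = &ctx->cma_master;

    if (m->channel) {          /* NULL check prevents a second destroy */
        free(m->channel);
        m->channel = NULL;     /* cleared so the caller's error path is safe */
    }
    free(m->nodes);            /* free(NULL) is a no-op */
    m->nodes = NULL;
    m->num_of_nodes = 0;
    m->connection_index = 0;   /* reset so a retry starts from slot 0 */
}

/* Fix 3: one-time setup runs only if the context is not initialized yet. */
static int demo_route_handler(struct demo_ctx *ctx)
{
    struct demo_cma_master *m = &ctx->cma_master;

    if (!ctx->initialized) {
        ctx->buf = malloc(64);
        if (!ctx->buf)
            return -1;
        ctx->initialized = 1;
        ctx->init_calls++;
    }
    if (m->connection_index >= m->num_of_nodes)
        return -1;             /* bounds check instead of writing past the array */
    m->nodes[m->connection_index++] = 1;
    return 0;
}

/* Simulate `attempts` rejected connections, then the caller's error-path cleanup. */
static int demo_run_retries(struct demo_ctx *ctx, int attempts)
{
    for (int i = 0; i < attempts; i++) {
        if (demo_cma_create(ctx, 1) != 0)
            return -1;
        if (demo_route_handler(ctx) != 0) {
            demo_cma_destroy(ctx);
            return -1;
        }
        demo_cma_destroy(ctx); /* attempt "rejected": internal cleanup */
    }
    demo_cma_destroy(ctx);     /* caller's error path: second destroy is harmless */
    assert(ctx->cma_master.channel == NULL);
    return 0;
}
```

Calling `demo_run_retries(&ctx, 5)` on a zero-initialized `struct demo_ctx` exercises five rejected attempts followed by a redundant teardown. Because the destroy function clears its pointers and the handler checks the guard flag, the one-time setup runs once and no pointer is freed twice. The caller remains responsible for releasing `ctx.buf` at final cleanup.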

This fix was tested in the same multi-node environment for more than 10 hours, and no segfaults were observed.

When running perftest with RDMA CM enabled (-R), if the connection request is rejected (RDMA_CM_EVENT_REJECTED), the client enters a retry loop in `rdma_cm_client_connection`. However, the previous retry logic contained multiple flaws causing segmentation faults, double frees, and heap corruption.

The following specific issues were identified and fixed:

1. Double Free of Event Channel: The `rdma_destroy_event_channel()` was called inside `rdma_cm_destroy_cma()` during internal cleanup, but the pointer was not cleared. The caller function `create_rdma_cm_connection()` would then attempt to destroy the same channel again in its error path.
   Fix: Set the channel pointer to NULL after destruction and check for NULL before attempting to destroy it. The event channel and cm nodes will be reallocated when entering another retry attempt.
2. Heap Corruption via Index Overflow: The `ctx->cma_master.connection_index` was incremented on every connection attempt but was never reset upon failure. During retries, this index would exceed the bounds of the `nodes` array, leading to out-of-bound writes and heap metadata corruption. Similar things would happen for other fields of cma_master.
   Fix: Complete reset for fields of cma_master in `rdma_cm_destroy_cma()`.
3. Context Corruption and Leaks: `rdma_cm_route_handler()` unconditionally called `ctx_init()` and `create_qp_main()` on every retry attempt. This overwrote existing pointers (PD, MR, Buffers) in the context structure without releasing the old resources, causing memory leaks and "Bad file descriptor" errors during final cleanup. Recreating old qp would cause qp creation error causing retry to fail.
   Fix: Add a check in `rdma_cm_route_handler()` to ensure `ctx_init()` and `create_qp_main` are only called if the context has not been initialized yet.
4. Similar issues happened on server side retry in `rdma_cm_connection_request_handler`. Apply same fix as in `rdma_cm_route_handler()`.

Signed-off-by: Ruizhe Zhou <[email protected]>

SherrinZhou force-pushed the fix/cm_retry_resource_leak branch from e9714a2 to 4b90310 Compare December 10, 2025 09:04
