@SherrinZhou SherrinZhou commented Dec 9, 2025

When perftest runs with RDMA CM enabled and the server is under heavy load, the server is likely to reject CM connection requests from the client. If a request is rejected (RDMA_CM_EVENT_REJECTED), the client enters a retry loop in rdma_cm_client_connection.

However, the previous retry logic contained multiple flaws causing segmentation faults, double frees, and heap corruption.

The error output looked like this:
RDMA CM event error:
Event: RDMA_CM_EVENT_REJECTED; error: 8.
ERRNO:Operation not supported.
Failed to handle RDMA CM event.
ERRNO: Operation not supported.
Failed to connect RDMA CM events.
ERRNO:Operation not supported.
Failed to resolve RDMA CM address.
ERRNO: Bad file descriptor.
Failed to destroy RDMA CM ID number 0.
ERRNO: Bad file descriptor.
Failed to destroy RDMA CM contexts.
ERRNO: Bad file descriptor.
free(): double free detected in tcache 2

The backtrace from the resulting core dump (an abort raised by the double-free check in free()) looked like this:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fcb90783db5 in __GI_abort () at abort.c:79
#2 0x00007fcb907dc4e7 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fcb908ebaae "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007fcb907e35ec in malloc_printerr (str=str@entry=0x7fcb908ed6d8 "free(): double free detected in tcache 2") at malloc.c:5374
#4 0x00007fcb907e535d in _int_free (av=0x7fcb90b21bc0 <main_arena>, p=0xf459c0, have_lock=) at malloc.c:4213
#5 0x000000000040dd86 in create_rdma_cm_connection (ctx=0x7ffcf6f18410, user_param=0x7ffcf6f17fe0, comm=0x7ffcf6f17fc0, my_dest=0xf41590, rem_dest=0xf41a00) at src/perftest_communication.c:2949
#6 0x0000000000404b83 in main (argc=26, argv=0x7ffcf6f186f8) at src/send_bw.c:273

The following specific issues were identified and fixed:

  1. Double Free of Event Channel: rdma_destroy_event_channel() was called inside rdma_cm_destroy_cma() during internal cleanup, but the channel pointer was not cleared. The caller, create_rdma_cm_connection(), then tried to destroy the same channel again in its error path.
    Fix: Set the channel pointer to NULL after destruction and check for NULL before attempting to destroy it. The event channel and CM nodes are reallocated on the next retry attempt.

  2. Heap Corruption via Index Overflow: ctx->cma_master.connection_index was incremented on every connection attempt but never reset on failure. During retries, the index exceeded the bounds of the nodes array, leading to out-of-bounds writes and heap metadata corruption. Other fields of cma_master had similar problems.
    Fix: Reset all fields of cma_master in rdma_cm_destroy_cma().

  3. Context Corruption and Leaks: rdma_cm_route_handler() unconditionally called ctx_init() and create_qp_main() on every retry attempt. This overwrote existing pointers (PD, MR, buffers) in the context structure without releasing the old resources, causing memory leaks and "Bad file descriptor" errors during final cleanup. Attempting to re-create the existing QP also produced a QP creation error, which made the retry fail.
    Fix: Add a check in rdma_cm_route_handler() so that ctx_init() and create_qp_main() are called only if the context has not been initialized yet.
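
Fixes 1 and 2 can be illustrated with a minimal sketch. It does not use librdmacm or the perftest sources: `fake_channel`, `fake_cma_master`, and every `fake_*` function are hypothetical stand-ins, and the real `cma_master` has more fields than shown. The sketch only demonstrates the pattern described above: a teardown that clears the channel pointer and resets all bookkeeping, so a second teardown from the caller's error path is harmless and each retry starts with the index at zero.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the real perftest or librdmacm types. */
struct fake_channel { int fd; };

struct fake_cma_master {
    struct fake_channel *channel; /* stands in for the event channel pointer */
    int *nodes;                   /* stands in for the nodes array */
    int num_of_nodes;
    int connection_index;
};

static int destroy_calls = 0;     /* counts real channel destructions */

static void fake_destroy_channel(struct fake_channel *ch)
{
    destroy_calls++;
    free(ch);
}

/* Allocate fresh resources for one connection attempt. Returns 0 on success. */
static int fake_cma_setup(struct fake_cma_master *m, int num_of_nodes)
{
    *m = (struct fake_cma_master){0};
    m->channel = malloc(sizeof(*m->channel));
    m->nodes = calloc((size_t)num_of_nodes, sizeof(*m->nodes));
    if (!m->channel || !m->nodes) {
        free(m->channel);
        free(m->nodes);
        *m = (struct fake_cma_master){0};
        return -1;
    }
    m->channel->fd = -1;
    m->num_of_nodes = num_of_nodes;
    return 0;
}

/* Idempotent teardown: destroy the channel only if it is still set,
 * then reset every field so no stale pointer or index survives. */
static void fake_cma_destroy(struct fake_cma_master *m)
{
    if (m->channel)
        fake_destroy_channel(m->channel);
    free(m->nodes);               /* free(NULL) is a no-op */
    *m = (struct fake_cma_master){0};
}

/* One simulated attempt that fills every node slot and is then rejected. */
static int fake_rejected_attempt(struct fake_cma_master *m, int num_of_nodes)
{
    if (fake_cma_setup(m, num_of_nodes) != 0)
        return -1;
    for (int i = 0; i < num_of_nodes; i++) {
        /* A stale index carried over from a previous attempt would fail here. */
        assert(m->connection_index < m->num_of_nodes);
        m->nodes[m->connection_index++] = i;
    }
    fake_cma_destroy(m);          /* internal cleanup after the rejection */
    return 1;                     /* tell the caller to retry */
}
```

A caller can run `fake_rejected_attempt()` several times and then call `fake_cma_destroy()` once more from its own error path. The extra call finds a NULL channel and does nothing, which is the behavior the double-free fix relies on.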

This fix was tested in the same multi-node environment for more than 10 hours, and no further segfaults were observed.

When running perftest with RDMA CM enabled (-R), if the connection
request is rejected (RDMA_CM_EVENT_REJECTED), the client enters a retry
loop in `rdma_cm_client_connection`.
However, the previous retry logic contained multiple flaws causing
segmentation faults, double frees, and heap corruption.

The following specific issues were identified and fixed:

1. Double Free of Event Channel:
   `rdma_destroy_event_channel()` was called inside `rdma_cm_destroy_cma()`
during internal cleanup, but the pointer was not cleared. The caller function
`create_rdma_cm_connection()` would then attempt to destroy the same channel
again in its error path.
   Fix: Set the channel pointer to NULL after destruction and check for
NULL before attempting to destroy it. The event channel and cm nodes
will be reallocated when entering another retry attempt.

2. Heap Corruption via Index Overflow:
   `ctx->cma_master.connection_index` was incremented on every
connection attempt but never reset on failure. During retries,
the index exceeded the bounds of the `nodes` array, leading to
out-of-bounds writes and heap metadata corruption. Other fields of
`cma_master` had similar problems.
   Fix: Reset all fields of `cma_master` in
`rdma_cm_destroy_cma()`.

3. Context Corruption and Leaks:
   `rdma_cm_route_handler()` unconditionally called `ctx_init()` and
`create_qp_main()` on every retry attempt. This overwrote existing
pointers (PD, MR, buffers) in the context structure without releasing
the old resources, causing memory leaks and "Bad file descriptor"
errors during final cleanup. Attempting to re-create the existing QP
also produced a QP creation error, which made the retry fail.
   Fix: Add a check in `rdma_cm_route_handler()` so that `ctx_init()`
and `create_qp_main()` are called only if the context has not been
initialized yet.

4. Similar issues occurred in the server-side retry path in
`rdma_cm_connection_request_handler()`. The same fix as in
`rdma_cm_route_handler()` is applied there.

Signed-off-by: Ruizhe Zhou <[email protected]>
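
The guard described in items 3 and 4 of the commit message can be sketched as a shared "initialize once" helper. This is an illustration under assumptions, not the patch itself: the `initialized` flag, `fake_ctx`, and all `fake_*` functions are hypothetical, and the commit message does not say how the real code detects an already initialized context.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical context; the real perftest context holds PD, MR, buffers, QPs. */
struct fake_ctx {
    bool initialized; /* assumed flag, set after the first successful setup */
    int init_calls;   /* how often the stand-in for ctx_init() ran */
    int qp_calls;     /* how often the stand-in for create_qp_main() ran */
};

/* Stand-in for ctx_init(): would allocate PD, MR and buffers. */
static int fake_ctx_init(struct fake_ctx *ctx)
{
    ctx->init_calls++;
    return 0;
}

/* Stand-in for create_qp_main(): would create the QP. */
static int fake_create_qp(struct fake_ctx *ctx)
{
    ctx->qp_calls++;
    return 0;
}

/* Run the one-time setup only if it has not completed before, so a retry
 * cannot overwrite live resource pointers or re-create an existing QP. */
static int fake_ensure_ctx_ready(struct fake_ctx *ctx)
{
    if (ctx->initialized)
        return 0;
    if (fake_ctx_init(ctx) != 0)
        return -1;
    if (fake_create_qp(ctx) != 0)
        return -1;
    ctx->initialized = true;
    return 0;
}

/* Client-side and server-side handlers share the same guard. */
static int fake_route_handler(struct fake_ctx *ctx)
{
    return fake_ensure_ctx_ready(ctx);
}

static int fake_connection_request_handler(struct fake_ctx *ctx)
{
    return fake_ensure_ctx_ready(ctx);
}
```

Calling either handler repeatedly on the same context runs the setup exactly once, regardless of how many retries occur.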
@SherrinZhou SherrinZhou force-pushed the fix/cm_retry_resource_leak branch from e9714a2 to 4b90310 Compare December 10, 2025 09:04