
UCT/CUDA_IPC: Support VMM with multiple memory allocations#11283

Open
tomerg-nvidia wants to merge 1 commit into openucx:master from tomerg-nvidia:multichunk-vmm-in-unpack

Conversation

Contributor

@tomerg-nvidia tomerg-nvidia commented Mar 23, 2026

What?

Add support for CUDA VMM allocations that span multiple underlying physical allocations (cuMemCreate chunks) in the cuda_ipc transport.

Why?

A user-visible VA range may be backed by several independently created and mapped VMM chunks. Currently, cuda_ipc only exports a single fabric handle per region, which covers just the first chunk. This means transfers involving multi-chunk VMM regions silently access only the first chunk's memory, producing incorrect results for offsets beyond the first chunk boundary.

How?

A new handle type UCT_CUDA_IPC_KEY_HANDLE_TYPE_VMM_MULTI is introduced. During mkey_pack, all VMM chunks in the VA range are discovered and their fabric handles are written into a GPU-resident metadata buffer. The metadata buffer's own fabric handle, chunk count, and allocation size are packed into the rkey (reusing the buffer_id union, so the wire format size is unchanged).
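The packing step above boils down to laying out one descriptor per chunk in the metadata buffer. A minimal host-side sketch of that layout and sizing (illustrative names, not the actual UCX structures; the 64-byte handle field mirrors the size of an opaque `CUmemFabricHandle`):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the GPU-resident metadata buffer: one descriptor
 * per VMM chunk. Field names are illustrative, not the UCX ones. */
typedef struct {
    uint8_t  fabric_handle[64]; /* CUmemFabricHandle is an opaque 64-byte blob */
    uint64_t chunk_size;        /* size of this physical allocation */
} vmm_chunk_desc_t;

/* Size of the metadata buffer the sender must allocate, fill, and export. */
static size_t vmm_meta_size(unsigned num_chunks)
{
    return num_chunks * sizeof(vmm_chunk_desc_t);
}
```

The receiver only needs the metadata buffer's own fabric handle plus the chunk count to recover every per-chunk handle, which is why the rkey wire format can stay fixed-size.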

During rkey_unpack, the receiver imports the metadata buffer, reads the chunk descriptors, and creates a persistent contiguous local VA mapping by importing and mapping each chunk at the correct offset. This mapping lives on the unpacked rkey and is used directly for address translation on every put/get, bypassing the per-operation cache lookup. Cleanup happens at rkey_release.
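Placing each imported chunk "at the correct offset" is cumulative-offset arithmetic over the chunk sizes read from the metadata buffer. A host-side sketch (`vmm_chunk_offsets` is a hypothetical helper; in the real flow each offset would feed a `cuMemMap` of that chunk into the reserved VA range):

```c
#include <stdint.h>

/* Given the chunk sizes read from the metadata buffer, compute the offset
 * at which each imported chunk must be mapped so that the local VA range
 * mirrors the sender's contiguous layout. */
static void vmm_chunk_offsets(const uint64_t *sizes, unsigned n,
                              uint64_t *offsets)
{
    uint64_t off = 0;
    unsigned i;

    for (i = 0; i < n; i++) {
        offsets[i] = off; /* chunk i starts where chunk i-1 ended */
        off       += sizes[i];
    }
}
```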

tomerg-nvidia force-pushed the multichunk-vmm-in-unpack branch from 6dfc62d to 4920050 on March 23, 2026 07:57
Handle CUDA VMM allocations spanning multiple cuMemCreate chunks by
discovering all chunks, exporting their fabric handles into a GPU
metadata buffer, and sharing that buffer's fabric handle via the rkey.

On the receiver, the metadata is fetched and a persistent contiguous
VA mapping is created by importing each chunk individually. Address
translation for put/get uses this mapping directly.
tomerg-nvidia force-pushed the multichunk-vmm-in-unpack branch from 4920050 to 250e476 on March 23, 2026 09:48
tomerg-nvidia marked this pull request as ready for review on March 24, 2026 07:21
tomerg-nvidia requested a review from brminich on March 24, 2026 07:21
brminich requested review from rakhmets and tvegas1 on March 25, 2026 15:16
}

typedef struct {
CUmemGenericAllocationHandle *handles;

use CUmemGenericAllocationHandle handles[]; at the end of the struct?
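A minimal sketch of what this suggestion would look like (illustrative struct; `unsigned long long` stands in for `CUmemGenericAllocationHandle`, which is a 64-bit handle type):

```c
#include <stddef.h>
#include <stdlib.h>

/* Reviewer's suggestion: a C99 flexible array member instead of a pointer,
 * so the handles live in the same allocation as the header. */
typedef struct {
    unsigned           num_handles;
    unsigned long long handles[]; /* flexible array member, must be last */
} vmm_handles_t;

static vmm_handles_t *vmm_handles_alloc(unsigned n)
{
    /* one malloc covers both the header and the n trailing handles */
    vmm_handles_t *h = malloc(sizeof(*h) + n * sizeof(h->handles[0]));

    if (h != NULL) {
        h->num_handles = n;
    }
    return h;
}
```

This saves one allocation and one pointer dereference per access, at the cost of requiring the handle count to be known when the struct is allocated.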

size_t b_len; /* Allocation size */
ucs_list_link_t link;
#if HAVE_CUDA_FABRIC
CUdeviceptr vmm_meta_dev_ptr; /* GPU metadata buffer VA */

align members

uct_cuda_ipc_extended_rkey_t super;
int stream_id;
#if HAVE_CUDA_FABRIC
uct_cuda_ipc_vmm_chunk_desc_t *chunks;

align members

uct_cuda_ipc_event_desc_t);
ucs_status_t status;

if (cuda_ipc_event->mapped_addr == NULL) {

Can we integrate VMM_MULTI in the cuda_ipc cache, releasing some refcnt here? This could enable LRU as implemented in #11245, to avoid VA exhaustion on some platforms?

return status;
}

#if HAVE_CUDA_FABRIC

wdyt about moving the VMM_MULTI-related functions to something like cuda_ipc/cuda_ipc_{multi,aggreg,vmm}.c?

ucs_array_init_dynamic(&chunks);

pos = va_base;
while (pos < va_base + va_len) {

use for(){}
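For illustration, the suggested `for` form, reduced to host arithmetic with a fixed chunk size (in the real code the step would be the size of the chunk discovered at `pos`, and the body would record its fabric handle):

```c
#include <stdint.h>

/* Count the chunks covering [va_base, va_base + va_len), advancing by a
 * fixed chunk size. Keeping init/condition/step together in the for
 * header is the form the reviewer suggests. */
static unsigned count_chunks(uint64_t va_base, uint64_t va_len,
                             uint64_t chunk_size)
{
    unsigned n = 0;
    uint64_t pos;

    for (pos = va_base; pos < va_base + va_len; pos += chunk_size) {
        n++; /* one descriptor per discovered chunk */
    }
    return n;
}
```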

}

status = UCT_CUDADRV_FUNC_LOG_ERR(
cuMemcpyHtoD(meta_dev_ptr, host_chunks, meta_size));

do we need to use one of our internal stream?

#if HAVE_CUDA_FABRIC
if (unpacked->super.super.ph.handle_type ==
UCT_CUDA_IPC_KEY_HANDLE_TYPE_VMM_MULTI) {
uct_cuda_ipc_vmm_contig_cleanup(unpacked);

I have doubts about the performance of this solution.
We do not unmap other types of handles in rkey release for this reason: opening a handle exported from another process is a time-consuming operation. We should probably do the same for this new handle type.

goto err_pop_ctx;
}

status = uct_cuda_ipc_create_contig_mapping(unpacked,

Is it possible to unify the flow with the existing types?
For those, the mapping is done in the get/put functions and unmapping is done via the event callback.

@guy-ealey-morag
Contributor

Make sure to update the copyright years at the top of all the changed files
