
UCT/CUDA_IPC: Support VMM with multiple memory allocations#11283

Open
tomerg-nvidia wants to merge 1 commit into openucx:master from tomerg-nvidia:multichunk-vmm-in-unpack

Conversation

Contributor

@tomerg-nvidia tomerg-nvidia commented Mar 23, 2026

What?

Add support for CUDA VMM allocations that span multiple underlying physical allocations (cuMemCreate chunks) in the cuda_ipc transport.

Why?

A user-visible VA range may be backed by several independently created and mapped VMM chunks. Currently, cuda_ipc only exports a single fabric handle per region, which covers just the first chunk. This means transfers involving multi-chunk VMM regions silently access only the first chunk's memory, producing incorrect results for offsets beyond the first chunk boundary.

How?

A new handle type UCT_CUDA_IPC_KEY_HANDLE_TYPE_VMM_MULTI is introduced. During mkey_pack, all VMM chunks in the VA range are discovered and their fabric handles are written into a GPU-resident metadata buffer. The metadata buffer's own fabric handle, chunk count, and allocation size are packed into the rkey (reusing the buffer_id union, so the wire format size is unchanged).
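The packing step above boils down to laying out one descriptor per chunk in the metadata buffer. A minimal host-side sketch of that layout and sizing (illustrative names, not the actual UCX structures; the 64-byte handle field mirrors the size of an opaque `CUmemFabricHandle`):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the GPU-resident metadata buffer: one descriptor
 * per VMM chunk. Field names are illustrative, not the UCX ones. */
typedef struct {
    uint8_t  fabric_handle[64]; /* CUmemFabricHandle is an opaque 64-byte blob */
    uint64_t chunk_size;        /* size of this physical allocation */
} vmm_chunk_desc_t;

/* Size of the metadata buffer the sender must allocate, fill, and export. */
static size_t vmm_meta_size(unsigned num_chunks)
{
    return num_chunks * sizeof(vmm_chunk_desc_t);
}
```

The receiver only needs the metadata buffer's own fabric handle plus the chunk count to recover every per-chunk handle, which is why the rkey wire format can stay fixed-size.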

During rkey_unpack, the receiver imports the metadata buffer, reads the chunk descriptors, and creates a persistent contiguous local VA mapping by importing and mapping each chunk at the correct offset. This mapping lives on the unpacked rkey and is used directly for address translation on every put/get, bypassing the per-operation cache lookup. Cleanup happens at rkey_release.
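Placing each imported chunk "at the correct offset" is cumulative-offset arithmetic over the chunk sizes read from the metadata buffer. A host-side sketch (`vmm_chunk_offsets` is a hypothetical helper; in the real flow each offset would feed a `cuMemMap` of that chunk into the reserved VA range):

```c
#include <stdint.h>

/* Given the chunk sizes read from the metadata buffer, compute the offset
 * at which each imported chunk must be mapped so that the local VA range
 * mirrors the sender's contiguous layout. */
static void vmm_chunk_offsets(const uint64_t *sizes, unsigned n,
                              uint64_t *offsets)
{
    uint64_t off = 0;
    unsigned i;

    for (i = 0; i < n; i++) {
        offsets[i] = off; /* chunk i starts where chunk i-1 ended */
        off       += sizes[i];
    }
}
```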

tomerg-nvidia force-pushed the multichunk-vmm-in-unpack branch from 6dfc62d to 4920050 on March 23, 2026 07:57
Handle CUDA VMM allocations spanning multiple cuMemCreate chunks by
discovering all chunks, exporting their fabric handles into a GPU
metadata buffer, and sharing that buffer's fabric handle via the rkey.

On the receiver, the metadata is fetched and a persistent contiguous
VA mapping is created by importing each chunk individually. Address
translation for put/get uses this mapping directly.
tomerg-nvidia force-pushed the multichunk-vmm-in-unpack branch from 4920050 to 250e476 on March 23, 2026 09:48
tomerg-nvidia marked this pull request as ready for review on March 24, 2026 07:21
tomerg-nvidia requested a review from brminich on March 24, 2026 07:21
brminich requested review from rakhmets and tvegas1 on March 25, 2026 15:16
}

typedef struct {
CUmemGenericAllocationHandle *handles;

use CUmemGenericAllocationHandle handles[]; at the end of the struct?
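A minimal sketch of what this suggestion would look like (illustrative struct; `unsigned long long` stands in for `CUmemGenericAllocationHandle`, which is a 64-bit handle type):

```c
#include <stddef.h>
#include <stdlib.h>

/* Reviewer's suggestion: a C99 flexible array member instead of a pointer,
 * so the handles live in the same allocation as the header. */
typedef struct {
    unsigned           num_handles;
    unsigned long long handles[]; /* flexible array member, must be last */
} vmm_handles_t;

static vmm_handles_t *vmm_handles_alloc(unsigned n)
{
    /* one malloc covers both the header and the n trailing handles */
    vmm_handles_t *h = malloc(sizeof(*h) + n * sizeof(h->handles[0]));

    if (h != NULL) {
        h->num_handles = n;
    }
    return h;
}
```

This saves one allocation and one pointer dereference per access, at the cost of requiring the handle count to be known when the struct is allocated.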

size_t b_len; /* Allocation size */
ucs_list_link_t link;
#if HAVE_CUDA_FABRIC
CUdeviceptr vmm_meta_dev_ptr; /* GPU metadata buffer VA */

align members

uct_cuda_ipc_extended_rkey_t super;
int stream_id;
#if HAVE_CUDA_FABRIC
uct_cuda_ipc_vmm_chunk_desc_t *chunks;

align members

uct_cuda_ipc_event_desc_t);
ucs_status_t status;

if (cuda_ipc_event->mapped_addr == NULL) {

Can we integrate VMM_MULTI in the cuda_ipc cache, releasing some refcnt here? This could enable LRU as implemented in #11245, to avoid VA exhaustion on some platforms?

return status;
}

#if HAVE_CUDA_FABRIC

wdyt about moving the VMM_MULTI-related functions to something like cuda_ipc/cuda_ipc_{multi,aggreg,vmm}.c?

ucs_array_init_dynamic(&chunks);

pos = va_base;
while (pos < va_base + va_len) {

use for(){}
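For illustration, the suggested `for` form, reduced to host arithmetic with a fixed chunk size (in the real code the step would be the size of the chunk discovered at `pos`, and the body would record its fabric handle):

```c
#include <stdint.h>

/* Count the chunks covering [va_base, va_base + va_len), advancing by a
 * fixed chunk size. Keeping init/condition/step together in the for
 * header is the form the reviewer suggests. */
static unsigned count_chunks(uint64_t va_base, uint64_t va_len,
                             uint64_t chunk_size)
{
    unsigned n = 0;
    uint64_t pos;

    for (pos = va_base; pos < va_base + va_len; pos += chunk_size) {
        n++; /* one descriptor per discovered chunk */
    }
    return n;
}
```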

}

status = UCT_CUDADRV_FUNC_LOG_ERR(
cuMemcpyHtoD(meta_dev_ptr, host_chunks, meta_size));

do we need to use one of our internal stream?

#if HAVE_CUDA_FABRIC
if (unpacked->super.super.ph.handle_type ==
UCT_CUDA_IPC_KEY_HANDLE_TYPE_VMM_MULTI) {
uct_cuda_ipc_vmm_contig_cleanup(unpacked);

I have doubts about the performance of this solution.
We do not unmap other types of handles in rkey release for this reason: opening a handle exported from another process is a time-consuming operation. We should probably do the same for this new handle type.

goto err_pop_ctx;
}

status = uct_cuda_ipc_create_contig_mapping(unpacked,

Is it possible to unify the flow with the existing types?
For those, the mapping is done in the get/put functions and unmapping is done via the event callback.

@guy-ealey-morag
Contributor

Make sure to update the copyright years at the top of all the changed files
