UCT/CUDA_IPC: Support VMM with multiple memory allocations #11283
tomerg-nvidia wants to merge 1 commit into openucx:master
Conversation
Force-pushed from 6dfc62d to 4920050
Handle CUDA VMM allocations spanning multiple cuMemCreate chunks by discovering all chunks, exporting their fabric handles into a GPU metadata buffer, and sharing that buffer's fabric handle via the rkey. On the receiver, the metadata is fetched and a persistent contiguous VA mapping is created by importing each chunk individually. Address translation for put/get uses this mapping directly.
Force-pushed from 4920050 to 250e476
}

typedef struct {
    CUmemGenericAllocationHandle *handles;
use CUmemGenericAllocationHandle handles[]; at the end of the struct?
    size_t          b_len;            /* Allocation size */
    ucs_list_link_t link;
#if HAVE_CUDA_FABRIC
    CUdeviceptr     vmm_meta_dev_ptr; /* GPU metadata buffer VA */

    uct_cuda_ipc_extended_rkey_t super;
    int                          stream_id;
#if HAVE_CUDA_FABRIC
    uct_cuda_ipc_vmm_chunk_desc_t *chunks;
    uct_cuda_ipc_event_desc_t);
    ucs_status_t status;

    if (cuda_ipc_event->mapped_addr == NULL) {
Can we integrate VMM_MULTI in the cuda_ipc cache, releasing some refcnt here? This could enable LRU as implemented in #11245, to avoid VA exhaustion on some platforms?
    return status;
}

#if HAVE_CUDA_FABRIC
wdyt to move VMM_MULTI related functions to something like cuda_ipc/cuda_ipc_{multi,aggreg,vmm}.c?
    ucs_array_init_dynamic(&chunks);

    pos = va_base;
    while (pos < va_base + va_len) {
    }

    status = UCT_CUDADRV_FUNC_LOG_ERR(
        cuMemcpyHtoD(meta_dev_ptr, host_chunks, meta_size));
do we need to use one of our internal streams?
#if HAVE_CUDA_FABRIC
    if (unpacked->super.super.ph.handle_type ==
        UCT_CUDA_IPC_KEY_HANDLE_TYPE_VMM_MULTI) {
        uct_cuda_ipc_vmm_contig_cleanup(unpacked);
I have doubts about the performance of this solution.
We do not unmap other types of handles in rkey release for exactly this reason: opening a handle exported from another process is a time-consuming operation. We should probably handle this new handle type the same way.
        goto err_pop_ctx;
    }

    status = uct_cuda_ipc_create_contig_mapping(unpacked,
Is it possible to unify the flow with the existing handle types? For those, the mapping is done in the get/put function and unmapping is done via the event callback.
Make sure to update the copyright years at the top of all the changed files.
What?
Add support for CUDA VMM allocations that span multiple underlying physical allocations (cuMemCreate chunks) in the cuda_ipc transport.
Why?
A user-visible VA range may be backed by several independently created and mapped VMM chunks. Currently, cuda_ipc only exports a single fabric handle per region, which covers just the first chunk. This means transfers involving multi-chunk VMM regions silently access only the first chunk's memory, producing incorrect results for offsets beyond the first chunk boundary.
How?
A new handle type UCT_CUDA_IPC_KEY_HANDLE_TYPE_VMM_MULTI is introduced. During mkey_pack, all VMM chunks in the VA range are discovered and their fabric handles are written into a GPU-resident metadata buffer. The metadata buffer's own fabric handle, chunk count, and allocation size are packed into the rkey (reusing the buffer_id union, so the wire format size is unchanged).
During rkey_unpack, the receiver imports the metadata buffer, reads the chunk descriptors, and creates a persistent contiguous local VA mapping by importing and mapping each chunk at the correct offset. This mapping lives on the unpacked rkey and is used directly for address translation on every put/get, bypassing the per-operation cache lookup. Cleanup happens at rkey_release.