UCT/CUDA/CUDA_IPC: Fixed cuda_ipc_cache cleanup at process termination. #11275

Draft
rakhmets wants to merge 1 commit into openucx:master from rakhmets:topic/cuda-ipc-cache-destroy

@rakhmets
Contributor

@rakhmets rakhmets commented Mar 19, 2026

What?

Fixed cuda_ipc_cache cleanup at process termination to close opened CUDA IPC memory handles.

Why?

UCS_STATIC_CLEANUP runs in a destructor when the UCX library is unloaded. The unload order is undefined, so the CUDA driver may already be deinitialized by then. In that case, the call chain uct_cuda_ipc_destroy_cache -> uct_cuda_ipc_cache_purge -> uct_cuda_ipc_primary_ctx_retain_and_push reaches cuDevicePrimaryCtxGetState, which (like any other CUDA Driver API call) returns CUDA_ERROR_DEINITIALIZED.
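
To illustrate the failure mode described above, here is a minimal sketch that models it without linking against CUDA. Everything in it is hypothetical (the `mock_*` names, the status enum, and the integer "handles"); it is not the UCX code or the CUDA Driver API. It only shows that a purge attempted after the driver has been torn down cannot close anything.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for driver status codes; not the CUDA Driver API. */
typedef enum {
    MOCK_SUCCESS = 0,
    MOCK_ERROR_DEINITIALIZED
} mock_status_t;

/* 1 while the mock driver is usable, 0 after it has been torn down. */
static int mock_driver_alive = 1;

/* Models driver teardown, which may happen before the library destructor. */
void mock_driver_deinit(void)
{
    mock_driver_alive = 0;
}

/* Like every call into the mock driver, this fails once it is torn down. */
mock_status_t mock_ipc_close_mem_handle(int *handle_open)
{
    assert(handle_open != NULL);
    if (!mock_driver_alive) {
        return MOCK_ERROR_DEINITIALIZED;
    }
    *handle_open = 0;
    return MOCK_SUCCESS;
}

/* Hypothetical cache purge: tries to close every open handle and returns
 * how many were actually closed. */
size_t mock_cache_purge(int *handles, size_t count)
{
    size_t closed = 0;
    size_t i;

    for (i = 0; i < count; ++i) {
        if (handles[i] &&
            (mock_ipc_close_mem_handle(&handles[i]) == MOCK_SUCCESS)) {
            ++closed;
        }
    }
    return closed;
}
```

Calling `mock_cache_purge` before `mock_driver_deinit` closes every open handle. Calling it afterwards closes none and leaves the entries marked open, which is the situation a destructor-time cleanup can run into.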

@rakhmets rakhmets force-pushed the topic/cuda-ipc-cache-destroy branch 2 times, most recently from 0266978 to f8a1d9f Compare March 19, 2026 13:13
@rakhmets rakhmets changed the title UCT/CUDA/CUDA_IPC: Fixed cuda_ipc_cache cleanup. UCT/CUDA/CUDA_IPC: Fixed cuda_ipc_cache cleanup at process termination. Mar 19, 2026
@rakhmets rakhmets force-pushed the topic/cuda-ipc-cache-destroy branch from f8a1d9f to 3bf315c Compare March 19, 2026 15:27
@yosefe
Contributor

yosefe commented Mar 19, 2026

So the atexit handler would run before the CUDA driver is deinitialized?
And if the CUDA driver is deinitialized before we destroy the mem handles, does it cause a leak?

@rakhmets
Contributor Author

> So the atexit handler would run before the CUDA driver is deinitialized? And if the CUDA driver is deinitialized before we destroy the mem handles, does it cause a leak?

In most cases, yes. There are still some corner cases, e.g. if the first call to cuInit comes from cuda_md.c's UCS_STATIC_INIT and happens after atexit has been called from cuda_ipc_cache.c's UCS_STATIC_INIT. It doesn't cause a leak, as the driver closes the mem handles at process exit. This is a best-effort cleanup.
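
The corner case above follows from the C rule that functions registered with atexit are called in reverse order of registration. Below is a small model of that rule. It rests on an assumption taken from the comment above rather than from any CUDA documentation: that the driver's teardown is ordered like an exit handler registered at the first cuInit. All names (`model_atexit`, `model_exit`, `driver_teardown`, `cache_cleanup`, `simulate`) are hypothetical and are not UCX or CUDA symbols.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLERS 8

typedef void (*exit_handler_t)(void);

/* Tiny model of atexit(): handlers are stored in registration order and
 * later run in reverse order. */
static exit_handler_t handlers[MAX_HANDLERS];
static size_t         num_handlers;

static int driver_alive;         /* 1 until the modeled driver is torn down */
static int handles_closed_by_us; /* 1 if our cleanup ran with a live driver */

int model_atexit(exit_handler_t fn)
{
    assert(fn != NULL);
    if (num_handlers == MAX_HANDLERS) {
        return -1;
    }
    handlers[num_handlers++] = fn;
    return 0;
}

/* Runs the registered handlers last-registered-first, as exit() does. */
void model_exit(void)
{
    while (num_handlers > 0) {
        handlers[--num_handlers]();
    }
}

/* Assumption: driver teardown behaves like a handler registered at init. */
void driver_teardown(void)
{
    driver_alive = 0;
}

/* Our cleanup can close handles only while the driver is still alive. */
void cache_cleanup(void)
{
    handles_closed_by_us = driver_alive;
}

/* Returns 1 if the cache cleanup ran before the driver teardown. */
int simulate(int driver_init_first)
{
    num_handlers         = 0;
    driver_alive         = 1;
    handles_closed_by_us = 0;

    if (driver_init_first) {
        model_atexit(driver_teardown);
        model_atexit(cache_cleanup);
    } else {
        model_atexit(cache_cleanup);
        model_atexit(driver_teardown);
    }

    model_exit();
    return handles_closed_by_us;
}
```

In this model, `simulate(1)` corresponds to the common case, where the driver is initialized first and the cache cleanup therefore runs before the teardown. `simulate(0)` corresponds to the corner case, where the cleanup handler was registered first and so runs last, after the driver is already gone.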

@rakhmets rakhmets marked this pull request as draft March 20, 2026 08:28