
ucp: allow eager inline sends for host memory when CUDA MDs are present#11306

Open
ndg8743 wants to merge 1 commit into openucx:master from ndg8743:fix/inline-send-with-cuda

Conversation


@ndg8743 ndg8743 commented Mar 30, 2026

Summary

Allow inline (am_short) eager sends for small host-memory buffers on systems where CUDA/ROCm memory domains are loaded, by performing a memtype cache lookup to positively identify host memory instead of conservatively disabling inline sends whenever the cache is non-empty.
Test plan

  • Verify inline eager sends are used for small host-memory tag messages on systems with CUDA installed
  • Run existing UCP tag send tests

@ndg8743 ndg8743 force-pushed the fix/inline-send-with-cuda branch 2 times, most recently from c914d16 to ed6648b on March 31, 2026 at 02:16

ndg8743 commented Apr 2, 2026

CI failures are all infrastructure flakes on RoCE workers, unrelated to the inline send change:

  • Tests roce worker 0: ud_mlx5/uct_p2p_am_test.am_short_keep_data/3 — UD endpoint unhandled timeout error (ud_ep.c:296)
  • Tests roce worker 1: shm_gga/test_ucp_wireup_1sided.one_sided_wireup_rndv/5 — UMR CQ poll timeout on mlx5_1, failed to pack remote key, rndv assertion failure
  • Tests roce worker 3: rc_mlx5/test_uct_perf.envelope/1 — connection timed out after ~15min
  • ASAN roce worker 2: shm_ib/test_ucp_sockaddr_with_wakeup.wakeup_s2c/2 — request stuck, connection timed out, job canceled
  • ASAN roce worker 0: rcx/test_ucp_sockaddr.concurrent_disconnect_s2c/10 — request stuck, connection timed out, job canceled

All are pre-existing IB/RoCE timeout patterns on swx-rain03. Retriggering CI.

@ndg8743 ndg8743 force-pushed the fix/inline-send-with-cuda branch from 9a39731 to ed6648b on April 3, 2026 at 04:44
When CUDA memory domains are loaded, the memory type cache becomes
non-empty after any GPU allocation. Previously, ucp_proto_is_inline()
would conservatively disable inline (am_short) sends for all buffers
when the cache was non-empty, unless the user explicitly set the
memory type to HOST. This caused a performance regression for host
memory buffers on systems with CUDA/ROCm installed.

Fix by performing a memtype cache lookup when the cache is non-empty
to positively identify whether the buffer is host memory. If the
address is not found in the cache, it is host memory and inline send
is safe to use.

Fixes openucx#4275
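The lookup logic described in the commit message can be sketched as follows. This is an illustrative model only, not the real UCX implementation: `memtype_cache_t`, `gpu_range_t`, and `is_inline_eligible` are hypothetical names, and the actual code lives in `ucp_proto_is_inline()` and the UCS memtype cache. The key idea it demonstrates is that a cache miss positively identifies host memory, so inline sends stay enabled even after GPU allocations have populated the cache.

```c
/* Hypothetical sketch of the memtype-cache lookup behind the fix.
 * Type and function names are illustrative, not the UCX API. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t start;
    uintptr_t end; /* exclusive */
} gpu_range_t;

typedef struct {
    const gpu_range_t *ranges; /* address ranges of known GPU allocations */
    size_t             count;  /* 0 => no GPU memory observed yet */
} memtype_cache_t;

/* Return 1 if [addr, addr+size) overlaps no cached GPU range.
 * An address absent from the cache is host memory, so an inline
 * (am_short) eager send is safe; a hit means GPU memory, so the
 * inline fast path must be skipped. */
static int is_inline_eligible(const memtype_cache_t *cache,
                              const void *addr, size_t size)
{
    uintptr_t lo = (uintptr_t)addr;
    uintptr_t hi = lo + size;
    size_t    i;

    if (cache->count == 0) {
        return 1; /* empty cache: every buffer is host memory */
    }

    for (i = 0; i < cache->count; ++i) {
        if ((lo < cache->ranges[i].end) && (hi > cache->ranges[i].start)) {
            return 0; /* found in cache: GPU memory */
        }
    }
    return 1; /* lookup miss: host memory, inline send allowed */
}
```

Before the fix, the sketch above would effectively have returned 0 for every buffer once `count > 0`; the change is that a miss now re-enables the inline path for host buffers.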
@ndg8743 ndg8743 force-pushed the fix/inline-send-with-cuda branch from ed6648b to 39939fb on April 4, 2026 at 16:44


Development

Successfully merging this pull request may close these issues.

Eager inline tag send (ucp_tag_send_inline) is not called on systems with CUDA/ROCm by default
