@tvegas1
Contributor

@tvegas1 tvegas1 commented Oct 15, 2025

What?

Fix perftest failure with cuda_ipc but without rc_gda on device tests.

Why?

Workaround: allow self endpoints to lack a device lane when UCP_FEATURE_DEVICE is requested. This can happen with UCX_TLS=^rc_gda, because cuda_ipc does not support a same-process lane.

How?

ucx_perftest uses a self endpoint to copy the sequence number (SN) on non-host memory (for instance, in ucp_put_lat). This self endpoint shares the worker that carries the device feature request, which triggers the failure. Using a separate context and worker without UCP_FEATURE_DEVICE would require registering the memory on both context and context_self, which does not seem impossible.
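
For illustration only, the workaround can be pictured with a small toy model. This is a sketch under assumptions, not the UCX implementation: `TOY_FEATURE_DEVICE`, `toy_context_t`, `toy_ep_t` and `toy_ep_lanes_ok` are hypothetical names that only mimic the idea that a context requesting the device feature normally expects every endpoint to have a device lane, and that the workaround exempts self endpoints.

```c
#include <stdbool.h>

/* Hypothetical stand-in for UCP_FEATURE_DEVICE; not the real UCX value. */
#define TOY_FEATURE_DEVICE (1u << 0)

typedef struct {
    unsigned features;     /* features requested on the shared context */
} toy_context_t;

typedef struct {
    bool is_self;          /* endpoint connects the worker to itself */
    bool has_device_lane;  /* a device-capable lane was selected */
} toy_ep_t;

/*
 * Returns true if the endpoint would be accepted in this toy model.
 * allow_self_miss models the workaround: a self endpoint may lack a
 * device lane even though the device feature was requested.
 */
static bool toy_ep_lanes_ok(const toy_context_t *ctx, const toy_ep_t *ep,
                            bool allow_self_miss)
{
    if (!(ctx->features & TOY_FEATURE_DEVICE)) {
        return true; /* device feature not requested, nothing to enforce */
    }

    if (ep->has_device_lane) {
        return true;
    }

    /* No device lane: tolerated only for self endpoints under the workaround */
    return allow_self_miss && ep->is_self;
}
```

In this model, a self endpoint without a device lane is rejected when `allow_self_miss` is false and accepted when it is true, while a non-self endpoint without a device lane is rejected in both cases.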

Repro:

UCX_NET_DEVICES=ens10f0 UCX_TLS=tcp,cuda ucx_perftest \
  -m cuda -a cuda:0 -t ucp_put_single_bw -w 1000 -n 100000 -s 8 -T 32 localhost

@brminich
Contributor

looks like #10933 should fix this problem properly

@tvegas1
Contributor Author

tvegas1 commented Oct 16, 2025

> looks like #10933 should fix this problem properly

It seems that it currently prevents cuda_ipc completely within the same process. I commented on that PR and tested the fix below, which makes it work:

diff --git a/src/uct/cuda/cuda_ipc/cuda_ipc_iface.c b/src/uct/cuda/cuda_ipc/cuda_ipc_iface.c
index 85228487e..5d6b35b0c 100644
--- a/src/uct/cuda/cuda_ipc/cuda_ipc_iface.c
+++ b/src/uct/cuda/cuda_ipc/cuda_ipc_iface.c
@@ -143,8 +143,7 @@ uct_cuda_ipc_iface_is_reachable_v2(const uct_iface_h tl_iface,
     same_uuid    = (ucs_get_system_id() == dev_addr->system_uuid);

     if ((getpid() == *(pid_t*)params->iface_addr) && same_uuid) {
-        uct_iface_fill_info_str_buf(params, "same process");
-        return 0;
+        return uct_iface_scope_is_reachable(tl_iface, params);
     }

     if (same_uuid ||
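
The behavioral change in the diff above can be sketched with a standalone toy model. This is not UCX code: `toy_reach_params_t`, `toy_scope_check_t` and the `toy_*` functions are hypothetical, the callback merely stands in for `uct_iface_scope_is_reachable` (whose actual policy is not modeled here), and `other_checks` is a placeholder for the remainder of the function, which the diff does not show.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parameters; the real code reads them from params/dev_addr. */
typedef struct {
    int      local_pid;
    int      remote_pid;
    uint64_t local_system_id;
    uint64_t remote_system_id;
} toy_reach_params_t;

/* Stand-in for the scope check that the patched branch delegates to. */
typedef bool (*toy_scope_check_t)(const toy_reach_params_t *params);

static bool toy_same_process(const toy_reach_params_t *p)
{
    return (p->local_pid == p->remote_pid) &&
           (p->local_system_id == p->remote_system_id);
}

/* Before the patch: the same-process case is always unreachable. */
static bool toy_is_reachable_before(const toy_reach_params_t *p,
                                    bool other_checks)
{
    if (toy_same_process(p)) {
        return false;
    }
    return other_checks; /* placeholder for the checks not shown in the diff */
}

/* After the patch: the same-process case defers to the scope check. */
static bool toy_is_reachable_after(const toy_reach_params_t *p,
                                   toy_scope_check_t scope_check,
                                   bool other_checks)
{
    if (toy_same_process(p)) {
        return scope_check(p);
    }
    return other_checks;
}

/* Two demo policies, used only to exercise the model. */
static bool toy_scope_accept(const toy_reach_params_t *p) { (void)p; return true; }
static bool toy_scope_reject(const toy_reach_params_t *p) { (void)p; return false; }
```

With matching PIDs and system IDs, `toy_is_reachable_before` always returns false, whereas `toy_is_reachable_after` returns whatever the scope policy decides. For a different PID the two variants behave identically.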

@ofirfarjun7
Contributor

> looks like #10933 should fix this problem properly

It indeed solves the issue for the self ep, so I will close this PR.
We still need to make it best effort for the general case.
