
UCP/API: add estimated bandwidth query to ucp_ep_query#11308

Open
ndg8743 wants to merge 2 commits into openucx:master from ndg8743:feat/ep-bandwidth-query

ndg8743 commented Mar 30, 2026

Summary

  • Adds UCP_EP_ATTR_FIELD_ESTIMATED_BW field to ucp_ep_attr_t allowing users to query the estimated aggregate bandwidth (bytes/sec) for a given local/remote memory type pair on an endpoint.
  • The caller sets local_mem_type and remote_mem_type before calling ucp_ep_query(), and the implementation returns the sum of bandwidths across data lanes whose memory domain supports the requested memory type.

Closes #6254
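
The aggregation described in the summary can be sketched in plain C without any UCX headers. This is a minimal illustration, not the code in this PR: `mem_type_t`, `lane_info_t`, `lane_supports()`, `estimate_ep_bandwidth()` and `example_lanes` are hypothetical stand-ins, and the per-lane numbers are made up. The summary does not say how the remote memory type enters the calculation, so the sketch filters on a single memory type only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; these are NOT UCX types. */
typedef enum {
    MEM_TYPE_HOST = 0,
    MEM_TYPE_CUDA = 1,
    MEM_TYPE_LAST
} mem_type_t;

typedef struct {
    double   bandwidth;    /* estimated lane bandwidth, bytes/sec */
    uint64_t md_mem_types; /* bitmask of memory types the lane's memory
                              domain supports */
    int      is_data_lane; /* nonzero if the lane carries data */
} lane_info_t;

/* Returns nonzero if the lane's memory domain supports mem_type. */
static int lane_supports(const lane_info_t *lane, mem_type_t mem_type)
{
    return (lane->md_mem_types & (UINT64_C(1) << mem_type)) != 0;
}

/*
 * Sum the bandwidth of every data lane whose memory domain supports the
 * requested memory type. Non-data lanes and lanes without support for
 * mem_type contribute nothing.
 */
static double estimate_ep_bandwidth(const lane_info_t *lanes, size_t num_lanes,
                                    mem_type_t mem_type)
{
    double total = 0.0;
    size_t i;

    assert((lanes != NULL) || (num_lanes == 0));

    for (i = 0; i < num_lanes; ++i) {
        if (lanes[i].is_data_lane && lane_supports(&lanes[i], mem_type)) {
            total += lanes[i].bandwidth;
        }
    }

    return total;
}

/* Made-up lane table: two data lanes and one non-data lane. */
static const lane_info_t example_lanes[] = {
    {12.5e9, (UINT64_C(1) << MEM_TYPE_HOST) | (UINT64_C(1) << MEM_TYPE_CUDA), 1},
    {25.0e9, (UINT64_C(1) << MEM_TYPE_HOST), 1},
    {1.0e9,  (UINT64_C(1) << MEM_TYPE_HOST), 0}
};
```

With this table, a host-memory query adds the two data lanes, while a CUDA query counts only the first lane; the non-data lane is skipped in both cases.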

Add UCP_EP_ATTR_FIELD_ESTIMATED_BW field to ucp_ep_attr_t that allows
querying the estimated aggregate bandwidth (bytes/sec) for a given
local/remote memory type pair on an endpoint. The caller sets
local_mem_type and remote_mem_type and the implementation returns the
sum of bandwidths across all data lanes whose memory domain supports
the requested memory type.

Closes openucx#6254
ndg8743 force-pushed the feat/ep-bandwidth-query branch from 226769b to 502bc7e on March 31, 2026 02:16

ndg8743 commented Apr 2, 2026

CI failures are infrastructure flakes on RoCE workers, unrelated to the bandwidth query API change:

  • Tests roce worker 0: dcx/test_ucp_tag_match_rndv_align.recv_align/1 hung in DC transport rendezvous, and the connection timed out after ~16 min
  • ASAN roce worker 1: in shm_ib_ipc/test_ucp_atomic32.post/2, ibv_poll_cq(UMR CQ) timed out on mlx5_1, with an rcache refcount assertion failure during cleanup
  • Tests roce workers 1, 2, 3: canceled as a consequence of the worker 0 failure

None of these tests exercise the ucp_ep_query path. Retriggering CI.

ndg8743 force-pushed the feat/ep-bandwidth-query branch from 80d014a to 502bc7e on April 3, 2026 04:44
Use @a instead of @ref for local_mem_type, remote_mem_type, and
bandwidth since they are fields inside a nested anonymous struct and
cannot be resolved by Doxygen as cross-references.
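
As a rough illustration of the documentation change, the snippet below shows a Doxygen comment that refers to fields of a nested unnamed struct with @a, which only italicizes the following word and does not try to resolve a link. The type name `example_ep_attr_t` and the member name `estimated_bw` are hypothetical and are not taken from this PR; only the three field names come from the commit message.

```c
#include <stdint.h>

/**
 * Hypothetical attribute structure, for illustration only.
 */
typedef struct example_ep_attr {
    /** Mask of valid fields in this structure. */
    uint64_t field_mask;

    /**
     * Estimated bandwidth query. The caller fills @a local_mem_type and
     * @a remote_mem_type before the query, and the query writes
     * @a bandwidth in bytes/sec.
     *
     * The fields are named with @a rather than @ref because they live in
     * an unnamed nested struct type, so there is no target for Doxygen to
     * link to.
     */
    struct {
        int    local_mem_type;  /* memory type on the local side */
        int    remote_mem_type; /* memory type on the remote side */
        double bandwidth;       /* estimated bandwidth, bytes/sec */
    } estimated_bw;
} example_ep_attr_t;
```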
ndg8743 force-pushed the feat/ep-bandwidth-query branch from 502bc7e to ed3d69e on April 4, 2026 16:44
Development

Successfully merging this pull request may close these issues.

It would be nice to have an API to query expected bandwidth for a peer + memory pair combination
