@michal-shalev
Contributor

What?

Fixes UCX GPU Device API detection.

Why?

The UCX GPU-side compilation check was failing because:

  1. DOCA GPUNETIO dependency detection was broken (nvcc_cmd.version() returned "2005" instead of "12.9")
  2. UCX headers with GDAKI include DOCA GPUNETIO headers (doca_gpunetio_dev_verbs_qp.cuh), but the compilation check wasn't passing the DOCA dependency

Without these fixes, HAVE_UCX_GPU_DEVICE_API was not defined even when all requirements were met.

How?

  • Changed line 119 to use nvcc.version() instead of nvcc_cmd.version() for reliable version detection
  • Added doca_gpunetio_dep to the UCX GPU-side compilation check dependencies
  • Added DOCA GPUNETIO status to build summary for better visibility
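
For reviewers, here is a minimal sketch of what the three changes amount to. It is not the exact diff: the project name, `ucx_dep`, `have_gpu_api`, the probed header path, and the summary section name are assumptions for illustration. Only `nvcc`, `doca_gpunetio_dep`, the version bounds, and `HAVE_UCX_GPU_DEVICE_API` come from this PR. The "2005" value presumably came from the copyright year in the `nvcc --version` banner, which is also an assumption.

```meson
# Illustrative sketch, not the PR's actual meson.build.
project('gpu-api-check', 'cpp', 'cuda')

# Compiler object for CUDA; its version() is the toolkit version (e.g. '12.9'),
# unlike find_program('nvcc').version(), which returned '2005' here.
nvcc = meson.get_compiler('cuda')

# DOCA GPUNETIO is only looked up for CUDA >= 12.8 and < 13.0.
if nvcc.version().version_compare('>=12.8') and nvcc.version().version_compare('<13.0')
    doca_gpunetio_dep = dependency('doca-gpunetio', required : false)
else
    warning('CUDA version = ' + nvcc.version() + ', GPUNETIO plugin will be disabled')
    doca_gpunetio_dep = disabler()
endif

# Assumed name for the UCX dependency object.
ucx_dep = dependency('ucx', required : false)

have_gpu_api = false
if ucx_dep.found() and doca_gpunetio_dep.found()
    # The UCX device headers include doca_gpunetio_dev_verbs_qp.cuh, so the
    # DOCA dependency has to be passed to the check as well.
    # The header path below is an assumed placeholder.
    have_gpu_api = nvcc.compiles('''
        #include <ucp/api/device/ucp_device_impl.h>
        int main() { return 0; }
        ''',
        name : 'UCX GPU device API',
        dependencies : [ucx_dep, doca_gpunetio_dep])
endif

if have_gpu_api
    add_project_arguments('-DHAVE_UCX_GPU_DEVICE_API', language : ['cpp', 'cuda'])
endif

# Report DOCA GPUNETIO status in the build summary.
summary({'DOCA GPUNETIO' : doca_gpunetio_dep.found()},
        section : 'Dependencies', bool_yn : true)
```

The `found()` guard is there because passing a `disabler()` object straight into `compiles()` would disable the whole call instead of producing a plain boolean.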

@github-actions

github-actions bot commented Nov 4, 2025

👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@michal-shalev
Contributor Author

/build

@michal-shalev
Contributor Author

/build

aranadive previously approved these changes Nov 4, 2025
@aranadive
Contributor

/build

meson.build Outdated
Comment on lines 125 to 126
nvcc_cmd = find_program('nvcc', required: false)
if nvcc_cmd.found()
Contributor

why do we still need it?
on line 107 we use the same version comparison without nvcc_cmd

Contributor Author

removed

Signed-off-by: Michal Shalev <[email protected]>
meson.build Outdated
Comment on lines 124 to 130
# DOCA GPUNETIO
if nvcc.version().version_compare('>=12.8') and nvcc.version().version_compare('<13.0')
    doca_gpunetio_dep = dependency('doca-gpunetio', required : false)
else
    warning('CUDA version = ' + nvcc.version() + ', GPUNETIO plugin will be disabled')
    doca_gpunetio_dep = disabler()
endif
Contributor

can we now delete it, since it looks exactly the same as lines 107-112?

Contributor Author

yes, removed

Signed-off-by: Michal Shalev <[email protected]>
@michal-shalev
Contributor Author

/build

@brminich
Contributor

failure looks related

[2025-11-12T21:00:59.699Z] Mems:
[2025-11-12T21:00:59.699Z]   DRAM_SEG
[2025-11-12T21:00:59.699Z]   VRAM_SEG
[2025-11-12T21:00:59.699Z] [1762981259.635177] [swx-nixl01:123977:0]          select.c:2504 UCX  ERROR   could not find device lanes
[2025-11-12T21:00:59.699Z] [1762981259.635296] [swx-nixl01:123977:0]          select.c:2504 UCX  ERROR   could not find device lanes
[2025-11-12T21:00:59.699Z] W1112 21:00:59.635302  123977 ucx_utils.cpp:70] Unexpected UCX error: Destination is unreachable
[2025-11-12T21:00:59.699Z] E1112 21:00:59.637612  123977 nixl_agent.cpp:374] createBackend: backend 'UCX' encountered error during intra-agent transfer setup with status NIXL_ERR_BACKEND
[2025-11-12T21:00:59.699Z] W1112 21:00:59.694323  123977 ucx_utils.cpp:70] Unexpected UCX error: Destination is unreachable
[2025-11-12T21:00:59.699Z] [1762981259.694202] [swx-nixl01:123977:0]          select.c:2504 UCX  ERROR   could not find device lanes
[2025-11-12T21:00:59.699Z] [1762981259.694319] [swx-nixl01:123977:0]          select.c:2504 UCX  ERROR   could not find device lanes
[2025-11-12T21:00:59.965Z] E1112 21:00:59.696541  123977 nixl_agent.cpp:374] createBackend: backend 'UCX' encountered error during intra-agent transfer setup with status NIXL_ERR_BACKEND
[2025-11-12T21:00:59.965Z] E1112 21:00:59.696578  123977 test_utils.cpp:26] Failed to create UCX backend for agent Agent001: NIXL_ERR_BACKEND [NIXL_ERR_BACKEND]
[2025-11-12T21:01:02.236Z] + docker rm -f nixl-ci-test-1690-1
[2025-11-12T21:01:12.317Z] nixl-ci-test-1690-1
[2025-11-12T21:01:12.317Z] + docker image rm -f nixl-ci-test-1690-1
[2025-11-12T21:01:12.317Z] Untagged: nixl-ci-test-1690-1:latest
Test CPP failed with msg: Step Test CPP failed with exit code=1
[2025-11-12T20:48:55.642Z] + [ -x /bin/bash ]
