Description
Describe the bug
Starting with UCX 1.20.0, the ucx-cuda DEB package declares Recommends: libnvidia-compute | libnvidia-ml1. Since apt installs Recommends by default, this causes NVIDIA driver userspace libraries to be pulled in automatically when installing UCX — even in environments that already have a working GPU driver.
When the version of the recommended libnvidia-compute package (resolved from the apt repository) does not match the kernel driver already installed on the host, this results in a driver/library version mismatch that breaks GPU functionality:
$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 595.45
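One way to confirm this kind of mismatch is to compare the version reported by the loaded kernel module with the version of the userspace libraries. The sketch below assumes the kernel module version can be read from /proc/driver/nvidia/version. The function names are hypothetical, and the parsing is a best-effort guess at that file's layout.

```shell
# Extract the first dotted version number (e.g. 590.44.01) from text on stdin.
# Assumed usage: kernel_module_version < /proc/driver/nvidia/version
kernel_module_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Compare two version strings; return non-zero when they differ.
versions_match() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: kernel module $1, userspace library $2"
        return 1
    fi
}

# Example on a GPU host (the second argument is the library version
# you want to check against):
#   versions_match "$(kernel_module_version < /proc/driver/nvidia/version)" 595.45.04
```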
In UCX 1.19.x and earlier, the ucx-cuda package had no Recommends field, so installing UCX was harmless to the system's existing driver setup.
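The difference between the two releases can be checked directly from the package metadata. The following is a minimal sketch: the helper name recommends_field is hypothetical, and the .deb file name in the usage comment is an assumption about what the release tarball contains.

```shell
# Print the value of the Recommends field from Debian control text on stdin.
# Prints nothing when the field is absent.
recommends_field() {
    sed -n 's/^Recommends: *//p'
}

# Usage against a package file (file name is an assumption):
#   dpkg-deb -f ucx-cuda-1.20.0.deb | recommends_field
# dpkg-deb can also print a single field on its own:
#   dpkg-deb -f ucx-cuda-1.20.0.deb Recommends
```

An empty result would correspond to the 1.19.x behavior described above.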
Steps to Reproduce
- Start with a system or container that has a working NVIDIA GPU driver (e.g., kernel driver 590.44.01)
- Install UCX 1.20.0 DEB packages:
  wget https://github.com/openucx/ucx/releases/download/v1.20.0/ucx-1.20.0-ubuntu22.04-mofed5-cuda12-x86_64.tar.bz2
  tar -xvf ucx-1.20.0-ubuntu22.04-mofed5-cuda12-x86_64.tar.bz2
  apt install -y ./*.deb
- Observe that apt automatically installs additional NVIDIA packages as recommended dependencies:
  The following additional packages will be installed:
    libnvidia-cfg1 libnvidia-common libnvidia-compute libnvidia-decode libnvidia-gpucomp nvidia-persistenced
- Run nvidia-smi; it fails with "Driver/library version mismatch".
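Until the packaging changes, a possible workaround is to preview the install with apt's simulation mode and then install with --no-install-recommends. The sketch below assumes the simulation output marks planned installs with lines beginning "Inst"; the helper name planned_nvidia_packages is hypothetical.

```shell
# List NVIDIA packages that a simulated install would add.
# Reads `apt-get -s install ...` output on stdin and prints the package
# name from every "Inst <package> ..." line whose name contains "nvidia".
planned_nvidia_packages() {
    awk '$1 == "Inst" && $2 ~ /nvidia/ { print $2 }'
}

# Usage, from the directory holding the extracted .deb files:
#   apt-get -s install ./*.deb | planned_nvidia_packages
#   apt-get -s install --no-install-recommends ./*.deb | planned_nvidia_packages
# If the second command prints nothing, install without Recommends:
#   apt-get install -y --no-install-recommends ./*.deb
```

This only avoids the soft dependency on the installing side; it does not change what the package declares.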
Expected behavior
UCX should not pull in driver packages, even as soft dependencies. UCX uses the CUDA Driver API via forward-compatible libcuda.so, which is designed to work across driver versions. The driver is a system-level component managed independently of UCX.
Setup and versions
- UCX version: 1.20.0
- OS: Ubuntu 22.04
- Package: ucx-1.20.0-ubuntu22.04-mofed5-cuda12-x86_64.tar.bz2
- Host driver: 590.44.01
- Pulled driver: 595.45.04 (from NVIDIA CUDA apt repository)
Additional information
- Downstream issue: NVIDIA/spark-rapids#14055 ([FEA] Upgrade to UCX 1.20)
- Downstream PR: NVIDIA/spark-rapids#14383 (Upgrade ucx to 1.20)