CI: Switch from PyTorch to cuda-dl-base images for unification (#924)
* CI: Switch from PyTorch to cuda-dl-base for unification
Signed-off-by: Alexey Rivkin <[email protected]>
* Handle Meson update in build.sh
The Meson update requires Python, which is installed in build.sh.
The previous base image had Python pre-installed, but cuda-dl-base does not.
Signed-off-by: Alexey Rivkin <[email protected]>
* Limit ninja parallelism to fix OOM in Ubuntu22 build
Added -j${NPROC} to ninja commands to prevent out-of-memory compiler kills.
Signed-off-by: Alexey Rivkin <[email protected]>
* Align Python version with other install procedures
Signed-off-by: Alexey Rivkin <[email protected]>
* Switch to cuda-dl-base images with pip upgrade for Ubuntu 22.04
cuda-dl-base Ubuntu 22.04 ships pip 22.0.2 without --break-system-packages
support. Upgrade pip to 24.x to match PyTorch image behavior.
Signed-off-by: Alexey Rivkin <[email protected]>
* Add ~/.local/bin to PATH for user pip installs
Fixes "pytest: command not found" when pip defaults to user installation.
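
A minimal sketch of the PATH change, assuming the directory is prepended and that a duplicate entry should be avoided; the actual script may simply append it:

```shell
# Illustrative only, not the actual CI script.
# Add the per-user script directory to PATH unless it is already present.
case ":${PATH}:" in
    *":${HOME}/.local/bin:"*) ;;                       # already on PATH
    *) export PATH="${HOME}/.local/bin:${PATH}" ;;     # assumption: prepend
esac
```

With this in place, console scripts such as pytest that pip puts under ~/.local/bin during a user installation resolve by name.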
Signed-off-by: Alexey Rivkin <[email protected]>
* Update to CUDA12.9
Signed-off-by: Alexey Rivkin <[email protected]>
* Use latest cuda-dl-base image for CUDA12.8
Signed-off-by: Alexey Rivkin <[email protected]>
* Set CUDA_HOME in the build script
Signed-off-by: Alexey Rivkin <[email protected]>
* Fix the Permission denied error on DOCA download
Use /tmp to avoid Permission denied in non-writable directories
Also add cleanup for the DOCA install package
Signed-off-by: Alexey Rivkin <[email protected]>
* Make /workspace writable to resolve fs access failures
Signed-off-by: Alexey Rivkin <[email protected]>
* Use cuda-dl-base 25.06 to match rock32 node driver version
The image comes with CUDA 12.9; verified with Ovidiu that it is supported.
Resolves error 803 (cudaErrorSystemDriverMismatch) by using cuda-dl-base:25.06
which includes compat driver 575.57.08, matching the H100 nodes' driver version.
Previous 25.03 image had driver 570.124.06 causing version mismatch.
Signed-off-by: Alexey Rivkin <[email protected]>
* Control ninja parallelism in test_python and increase timeout
cuda-dl-base is missing large Python packages that
come pre-installed with PyTorch images. Installing them
caused frequent OOMs and/or timeouts on Ubuntu22.
Signed-off-by: Alexey Rivkin <[email protected]>
* UCX/BACKEND: Add worker_id selection support (#938)
Signed-off-by: Michal Shalev <[email protected]>
* libfabric: Use desc-specific target offset (#883)
This fixes a bug in multi-descriptor transfers where descriptors
point to different offsets within the same registered memory region.
Without this fix, RDMA reads always target offset 0. The code should
extract each descriptor's specific target address instead.
Also impacted: block-based transfers (iteration N would read blocks
from iteration 0, etc.), partial buffer updates, etc.
Signed-off-by: Tushar Gohad <[email protected]>
* Parallelism Control for pip install
Signed-off-by: Alexey Rivkin <[email protected]>
* Reorder Python and CPP test stages
The Python stage has a higher failure probability,
so it is better to fail fast.
Signed-off-by: Alexey Rivkin <[email protected]>
* Fix log message when env var not defined (#914)
Signed-off-by: Ovidiu Mara <[email protected]>
Co-authored-by: Mikhail Brinskiy <[email protected]>
* Minor cleanup
Signed-off-by: Alexey Rivkin <[email protected]>
* Reorder Python and CPP test stages
Signed-off-by: Alexey Rivkin <[email protected]>
* Unify to the latest Docker tag
Signed-off-by: Alexey Rivkin <[email protected]>
* Revert the timeout extension
The expectation was longer build times due to
switching to a base image with no Python.
In practice, no test runs for more than 10 minutes,
so the old 30-minute timeout is still valid.
Signed-off-by: Alexey Rivkin <[email protected]>
* Move /workspace chmod to the Dockerfile
That chmod is only needed for CI use cases.
Move it to the CI-specific Dockerfiles so it does
not affect other cases.
Signed-off-by: Alexey Rivkin <[email protected]>
* Set NPROC in common.sh and reuse
Reduce the occurrences where NPROC is set with the default fallback
Signed-off-by: Alexey Rivkin <[email protected]>
* Improve NPROC and CUDA_HOME handling in common.sh
- Move CUDA_HOME setup to common.sh before UCX build check
- Calculate NPROC based on container memory limits (1 proc/GB, max 16)
- Detect containers via /.dockerenv, /run/.containerenv, or KUBERNETES_SERVICE_HOST
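
The NPROC logic listed above can be sketched roughly as follows. This is an illustration, not the actual common.sh code: the helper names in_container and calc_nproc are hypothetical, "GB" is treated as GiB, the limit is assumed to come from the cgroup v2 file memory.max (cgroup v1 is not handled), and the lower bound of one job and the CPU-count fallback are assumptions.

```shell
# Hypothetical sketch, not the actual common.sh code.

# Succeeds when running inside a container, using the three signals
# named in the list above.
in_container() {
    [ -f /.dockerenv ] || [ -f /run/.containerenv ] || [ -n "${KUBERNETES_SERVICE_HOST:-}" ]
}

# $1: memory limit in bytes. Prints a job count: 1 per GiB, capped at 16.
calc_nproc() (
    n=$(( $1 / 1073741824 ))
    if [ "$n" -gt 16 ]; then n=16; fi
    if [ "$n" -lt 1 ]; then n=1; fi    # assumption: never drop below one job
    echo "$n"
)

# Assumption: the limit is read from the cgroup v2 file memory.max, which
# holds either a byte count or the literal string "max" (no limit).
limit_file=/sys/fs/cgroup/memory.max
if in_container && [ -r "$limit_file" ] && [ "$(cat "$limit_file")" != "max" ]; then
    NPROC=$(calc_nproc "$(cat "$limit_file")")
else
    # Assumption: otherwise keep a preset NPROC or use the online CPU count.
    NPROC=${NPROC:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}
fi
export NPROC
```

Build steps can then pass the value on, for example as -j"${NPROC}" to ninja.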
Signed-off-by: Alexey Rivkin <[email protected]>
* Remove hardcoded NPROC from pipelines
NPROC is now set dynamically by common.sh instead
Signed-off-by: Alexey Rivkin <[email protected]>
* Limit CPU parallelism on bare metal nodes
Docker containers see all host CPUs, so parallelism must be limited on bare-metal nodes
Signed-off-by: Alexey Rivkin <[email protected]>
---------
Signed-off-by: Alexey Rivkin <[email protected]>
Signed-off-by: Michal Shalev <[email protected]>
Signed-off-by: Tushar Gohad <[email protected]>
Signed-off-by: Ovidiu Mara <[email protected]>
Signed-off-by: ovidiusm <[email protected]>
Co-authored-by: Michal Shalev <[email protected]>
Co-authored-by: Tushar Gohad <[email protected]>
Co-authored-by: ovidiusm <[email protected]>
Co-authored-by: Mikhail Brinskiy <[email protected]>