
Commit e4ba569

CI: Switch from PyTorch to cuda-dl-base images for unification (#924)
* CI: Switch from PyTorch to cuda-dl-base for unification
  Signed-off-by: Alexey Rivkin <[email protected]>
* Handle Meson update in build.sh
  The Meson update requires Python, which is installed in build.sh. The previous base image had Python pre-installed, but cuda-dl-base does not.
  Signed-off-by: Alexey Rivkin <[email protected]>
* Limit ninja parallelism to fix OOM in the Ubuntu 22.04 build
  Added -j${NPROC} to ninja commands to prevent out-of-memory compiler kills.
  Signed-off-by: Alexey Rivkin <[email protected]>
* Align Python version with other install procedures
  Signed-off-by: Alexey Rivkin <[email protected]>
* Switch to cuda-dl-base images with pip upgrade for Ubuntu 22.04
  The cuda-dl-base Ubuntu 22.04 image ships pip 22.0.2 without --break-system-packages support. Upgrade pip to 24.x to match the PyTorch image behavior.
  Signed-off-by: Alexey Rivkin <[email protected]>
* Add ~/.local/bin to PATH for user pip installs
  Fixes "pytest: command not found" when pip defaults to a user installation.
  Signed-off-by: Alexey Rivkin <[email protected]>
* Update to CUDA 12.9
  Signed-off-by: Alexey Rivkin <[email protected]>
* Use the latest cuda-dl-base image for CUDA 12.8
  Signed-off-by: Alexey Rivkin <[email protected]>
* Set CUDA_HOME in the build script
  Signed-off-by: Alexey Rivkin <[email protected]>
* Fix the "Permission denied" error on DOCA download
  Use /tmp to avoid "Permission denied" in non-writable directories. Also add cleanup for the DOCA install package.
  Signed-off-by: Alexey Rivkin <[email protected]>
* Make /workspace writable to resolve filesystem access failures
  Signed-off-by: Alexey Rivkin <[email protected]>
* Use cuda-dl-base 25.06 to match the rock32 node driver version
  The image comes with CUDA 12.9; verified with Ovidiu that it is supported. Resolves error 803 (cudaErrorSystemDriverMismatch) by using cuda-dl-base:25.06, which includes compat driver 575.57.08, matching the H100 nodes' driver version. The previous 25.03 image had driver 570.124.06, causing a version mismatch.
  Signed-off-by: Alexey Rivkin <[email protected]>
* Control ninja parallelism in test_python and increase timeout
  cuda-dl-base is missing large Python packages that come pre-installed with the PyTorch images. The install caused frequent OOM and/or timeouts on Ubuntu 22.04.
  Signed-off-by: Alexey Rivkin <[email protected]>
* UCX/BACKEND: Add worker_id selection support (#938)
  Signed-off-by: Michal Shalev <[email protected]>
* libfabric: Use desc-specific target offset (#883)
  This fixes a bug in multi-descriptor transfers where descriptors point to different offsets within the same registered memory region. Without this fix, RDMA reads always target offset 0; each descriptor's specific target address should be extracted instead. Also impacted: block-based transfers (iteration N would read blocks from iteration 0, etc.), partial buffer updates, etc.
  Signed-off-by: Tushar Gohad <[email protected]>
* Parallelism control for pip install
  Signed-off-by: Alexey Rivkin <[email protected]>
* Reorder Python and CPP test stages
  The Python stage has a higher failure probability, so it is better to fail fast.
  Signed-off-by: Alexey Rivkin <[email protected]>
* Fix log message when env var not defined (#914)
  Signed-off-by: Ovidiu Mara <[email protected]>
  Co-authored-by: Mikhail Brinskiy <[email protected]>
* Minor cleanup
  Signed-off-by: Alexey Rivkin <[email protected]>
* Reorder Python and CPP test stages
  Signed-off-by: Alexey Rivkin <[email protected]>
* Unify to the latest Docker tag
  Signed-off-by: Alexey Rivkin <[email protected]>
* Revert the timeout extension
  The expectation was longer build times due to switching to a base image with no Python. In practice, no test runs for more than 10 minutes, so the old 30-minute timeout is still valid.
  Signed-off-by: Alexey Rivkin <[email protected]>
* Move the /workspace chmod to the Dockerfile
  That chmod is only needed for CI use cases. Moving it to the CI-specific Dockerfiles so it does not affect other cases.
  Signed-off-by: Alexey Rivkin <[email protected]>
* Set NPROC in common.sh and reuse
  Reduce the number of places NPROC is set by providing a default fallback.
  Signed-off-by: Alexey Rivkin <[email protected]>
* Improve NPROC and CUDA_HOME handling in common.sh
  - Move CUDA_HOME setup to common.sh before the UCX build check
  - Calculate NPROC based on container memory limits (1 proc/GB, max 16)
  - Detect containers via /.dockerenv, /run/.containerenv, or KUBERNETES_SERVICE_HOST
  Signed-off-by: Alexey Rivkin <[email protected]>
* Remove hardcoded NPROC from pipelines
  NPROC is now set dynamically by common.sh instead.
  Signed-off-by: Alexey Rivkin <[email protected]>
* Limit CPU parallelism on bare-metal nodes
  Docker containers see all host CPUs, so parallelism needs to be limited on bare metal.
  Signed-off-by: Alexey Rivkin <[email protected]>

---------

Signed-off-by: Alexey Rivkin <[email protected]>
Signed-off-by: Michal Shalev <[email protected]>
Signed-off-by: Tushar Gohad <[email protected]>
Signed-off-by: Ovidiu Mara <[email protected]>
Signed-off-by: ovidiusm <[email protected]>
Co-authored-by: Michal Shalev <[email protected]>
Co-authored-by: Tushar Gohad <[email protected]>
Co-authored-by: ovidiusm <[email protected]>
Co-authored-by: Mikhail Brinskiy <[email protected]>
1 parent b21955c commit e4ba569

File tree: 17 files changed (+74, -33 lines)


.ci/dockerfiles/Dockerfile.gpu_test

Lines changed: 7 additions & 4 deletions
@@ -3,7 +3,7 @@
 # This Dockerfile creates a GPU-enabled test environment for NIXL (NVIDIA I/O eXchange Layer)
 # development and testing. It provides a containerized environment with:
 #
-# - NVIDIA PyTorch base image with CUDA support
+# - NVIDIA cuda-dl-base image with CUDA support
 # - Non-root user setup for security
 # - Sudo access for package installation and system configuration
 # - Optimized for CI/CD pipeline testing
@@ -13,7 +13,7 @@
 #   docker run --gpus all --privileged -it nixl-gpu-test
 #
 # Build arguments:
-#   BASE_IMAGE: Base NVIDIA PyTorch image (default: nvcr.io/nvidia/pytorch:25.02-py3)
+#   BASE_IMAGE: Base NVIDIA cuda-dl-base image (default: nvcr.io/nvidia/cuda-dl-base:25.06-cuda12.9-devel-ubuntu24.04)
 #   _UID: User ID for the non-root user (default: 148069)
 #   _GID: Group ID for the user (default: 30)
 #   _LOGIN: Username (default: svc-nixl)
@@ -22,7 +22,7 @@
 #   WORKSPACE: Workspace directory path
 #

-ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:25.02-py3
+ARG BASE_IMAGE=nvcr.io/nvidia/cuda-dl-base:25.06-cuda12.9-devel-ubuntu24.04

 FROM ${BASE_IMAGE}

@@ -41,7 +41,7 @@ LABEL version="1.0"

 # Update package list and install required packages in one layer
 RUN apt-get update && \
-    apt-get install -y sudo \
+    apt-get install -y sudo python3 python3-pip \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/*

@@ -59,6 +59,9 @@ RUN mkdir -p /etc/sudoers.d && \
     chmod 440 /etc/sudoers.d/${_LOGIN} && \
     chown root:root /etc/sudoers.d/${_LOGIN}

+# Create and set permissions for workspace directory
+RUN mkdir -p ${WORKSPACE} && chmod 777 ${WORKSPACE}
+
 # Copy workspace into container (workaround for files disappearing from workspace)
 COPY --chown="${_UID}":"${_GID}" . ${WORKSPACE}
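For reference, a minimal sketch of how this image might be built and run with the new default base, using only the build arguments and `docker run` flags documented in the Dockerfile header above; the explicit --build-arg is optional here since the cuda-dl-base tag shown is already the default.

# Assumes the repo root as the build context; values mirror the Dockerfile defaults.
docker build \
    --build-arg BASE_IMAGE=nvcr.io/nvidia/cuda-dl-base:25.06-cuda12.9-devel-ubuntu24.04 \
    -f .ci/dockerfiles/Dockerfile.gpu_test \
    -t nixl-gpu-test .
docker run --gpus all --privileged -it nixl-gpu-test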

.ci/docs/setup_nvidia_gpu_with_rdma_support_on_ubuntu.md

Lines changed: 1 addition & 1 deletion
@@ -137,7 +137,7 @@ sudo nvidia-ctk runtime configure --runtime=docker
 sudo systemctl restart docker
 ```

-Verify GPU access in containers using `docker run --gpus all nvcr.io/nvidia/pytorch:25.02-py3 nvidia-smi`[^1_3].
+Verify GPU access in containers using `docker run --gpus all nvcr.io/nvidia/cuda-dl-base:25.06-cuda12.9-devel-ubuntu24.04 nvidia-smi`[^1_3].

 ### 9. **Validation and Troubleshooting**

.ci/jenkins/lib/build-container-matrix.yaml

Lines changed: 0 additions & 1 deletion
@@ -31,7 +31,6 @@ env:
  REGISTRY_REPO: "sw-nbu-swx-nixl-docker-local/verification"
  LOCAL_TAG_BASE: "nixl-ci:build-"
  MAIL_FROM: "[email protected]"
- NPROC: "16"

 taskName: "${BUILD_TARGET}/${arch}/${axis_index}"

.ci/jenkins/lib/build-matrix.yaml

Lines changed: 3 additions & 8 deletions
@@ -6,7 +6,7 @@
 # Key Components:
 # - Job Configuration: Defines timeout, failure behavior, and Kubernetes resources
 # - Docker Images: Specifies the container images used for different build stages
-#   - PyTorch images (24.10 and 25.02) for building and testing
+#   - cuda-dl-base images (25.06 for Ubuntu 24.04, 24.10 for Ubuntu 22.04) for building and testing
 #   - Podman image for container builds
 # - Matrix Axes: Defines build variations (currently x86_64 architecture)
 # - Build Steps: Sequential steps for building, testing, and container creation
@@ -34,8 +34,8 @@ kubernetes:
    requests: "{memory: 8Gi, cpu: 8000m}"

 runs_on_dockers:
-  - { name: "ubuntu24.04-pytorch", url: "nvcr.io/nvidia/pytorch:25.02-py3" }
-  - { name: "ubuntu22.04-pytorch", url: "nvcr.io/nvidia/pytorch:24.10-py3" }
+  - { name: "ubuntu24.04-cuda-dl-base", url: "nvcr.io/nvidia/cuda-dl-base:25.06-cuda12.9-devel-ubuntu24.04" }
+  - { name: "ubuntu22.04-cuda-dl-base", url: "nvcr.io/nvidia/cuda-dl-base:24.10-cuda12.6-devel-ubuntu22.04" }
   - { name: "podman-v5.0.2", url: "quay.io/podman/stable:v5.0.2", category: 'tool', privileged: true }

 matrix:
@@ -47,17 +47,12 @@ matrix:
 env:
  NIXL_INSTALL_DIR: /opt/nixl
  TEST_TIMEOUT: 30
- NPROC: "16"
  UCX_TLS: "^shm"

 steps:
  - name: Build
    parallel: false
    run: |
-      if [[ "${name}" == *"ubuntu22.04"* ]]; then
-        # distro's meson version is too old project requires >= 0.64.0
-        pip3 install meson
-      fi
      .gitlab/build.sh ${NIXL_INSTALL_DIR}

  - name: Test CPP
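As a usage note, a hedged sketch of reproducing the Ubuntu 24.04 build stage locally with the new base image; the image URL, the NIXL_INSTALL_DIR and UCX_TLS values, and the `.gitlab/build.sh ${NIXL_INSTALL_DIR}` invocation come from the matrix above, while the bind mount and working directory are assumptions about a local checkout.

# Sketch only: run the CI build step inside the new base image.
docker run --rm -it \
    -v "$(pwd)":/workspace -w /workspace \
    -e NIXL_INSTALL_DIR=/opt/nixl -e UCX_TLS="^shm" \
    nvcr.io/nvidia/cuda-dl-base:25.06-cuda12.9-devel-ubuntu24.04 \
    bash -c '.gitlab/build.sh "${NIXL_INSTALL_DIR}"'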

.ci/jenkins/lib/test-matrix.yaml

Lines changed: 3 additions & 2 deletions
@@ -30,7 +30,7 @@ runs_on_agents:
 matrix:
   axes:
     image:
-      - nvcr.io/nvidia/pytorch:25.02-py3
+      - nvcr.io/nvidia/cuda-dl-base:25.06-cuda12.9-devel-ubuntu24.04
     arch:
       - x86_64
     ucx_version:
@@ -42,9 +42,10 @@ taskName: "${name}/${arch}/ucx-${ucx_version}/${axis_index}"
 env:
  CONTAINER_WORKSPACE: /workspace
  INSTALL_DIR: ${CONTAINER_WORKSPACE}/nixl_install
- NPROC: "16"
  # Manual timeout - ci-demo doesn't handle docker exec
  TEST_TIMEOUT: 30
+ # NPROC for bare-metal: containers see all host CPUs, need to limit parallelism
+ NPROC: 16

 steps:
  - name: Get Environment Info

.ci/jenkins/pipeline/proj-jjb.yaml

Lines changed: 2 additions & 2 deletions
@@ -280,7 +280,7 @@
        description: "Base Docker image for the container build"
    - string:
        name: "BASE_IMAGE_TAG"
-       default: "25.03-cuda12.8-devel-ubuntu24.04"
+       default: "25.06-cuda12.9-devel-ubuntu24.04"
        description: "Tag for the base Docker image"
    - string:
        name: "TAG_SUFFIX"
@@ -294,7 +294,7 @@
        description: >
          Update the latest tag for this architecture.<br/>
          When enabled, also creates: <code>&lt;base-image-tag&gt;-&lt;arch&gt;-latest</code><br/>
-         Example: <code>25.03-cuda12.8-devel-ubuntu24.04-aarch64-latest</code><br/>
+         Example: <code>25.06-cuda12.9-devel-ubuntu24.04-aarch64-latest</code><br/>
    - string:
        name: "MAIL_TO"
        default: "[email protected]"

.ci/scripts/common.sh

Lines changed: 26 additions & 0 deletions
@@ -78,6 +78,11 @@ max_gtest_port=$((tcp_port_max + gtest_offset))
 # Check if a GPU is present
 nvidia-smi -L | grep -q '^GPU' && HAS_GPU=true || HAS_GPU=false

+# Ensure CUDA_HOME is set if CUDA is installed (cuda-dl-base images don't set it by default)
+if [ -d "/usr/local/cuda" ] && [ -z "$CUDA_HOME" ]; then
+    export CUDA_HOME=/usr/local/cuda
+fi
+
 if $HAS_GPU && test -d "$CUDA_HOME"
 then
     UCX_CUDA_BUILD_ARGS="--with-cuda=${CUDA_HOME}"
@@ -89,3 +94,24 @@ fi

 # Default to false, unless TEST_LIBFABRIC is set. AWS EFA tests must set it to true.
 export TEST_LIBFABRIC=${TEST_LIBFABRIC:-false}
+
+# Set default parallelism for make/ninja (can be overridden by NPROC env var)
+if [ -z "$NPROC" ]; then
+    # In containers, calculate based on memory limits to avoid OOM
+    if [[ -f /.dockerenv || -f /run/.containerenv || -n "${KUBERNETES_SERVICE_HOST}" ]]; then
+        if [ -f /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
+            limit=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
+        elif [ -f /sys/fs/cgroup/memory.max ]; then
+            limit=$(cat /sys/fs/cgroup/memory.max)
+        else
+            limit=$((4 * 1024 * 1024 * 1024))
+        fi
+        # Use 1 process per GB of memory, max 16
+        nproc=$((limit / (1024 * 1024 * 1024)))
+        nproc=$((nproc > 16 ? 16 : nproc))
+        nproc=$((nproc < 1 ? 1 : nproc))
+    else
+        nproc=$(nproc --all)
+    fi
+    export NPROC=$nproc
+fi
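A short usage sketch of how downstream scripts consume this: anything that sources common.sh can rely on NPROC being exported, and an explicit NPROC from the pipeline still takes precedence over the memory-based default. The echo line is illustrative only.

# Sketch: NPROC falls back to the memory-based value only when unset.
source .ci/scripts/common.sh
echo "Building with ${NPROC} jobs (CUDA_HOME=${CUDA_HOME:-unset})"
make -j"$NPROC"                       # same pattern build.sh now uses
NPROC=4 .gitlab/build.sh /opt/nixl    # explicit override still wins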

.gitlab/build.sh

Lines changed: 18 additions & 5 deletions
@@ -57,7 +57,9 @@ ARCH=$(uname -m)
 $SUDO rm -rf /usr/lib/cmake/grpc /usr/lib/cmake/protobuf

 $SUDO apt-get -qq update
-$SUDO apt-get -qq install -y curl \
+$SUDO apt-get -qq install -y python3-dev \
+    python3-pip \
+    curl \
     wget \
     libnuma-dev \
     numactl \
@@ -101,6 +103,17 @@ $SUDO apt-get -qq install -y curl \
     libhwloc-dev \
     libcurl4-openssl-dev zlib1g-dev # aws-sdk-cpp dependencies

+# Ubuntu 22.04 specific setup
+if grep -q "Ubuntu 22.04" /etc/os-release 2>/dev/null; then
+    # Upgrade pip for '--break-system-packages' support
+    $SUDO pip3 install --upgrade pip
+
+    # Upgrade meson (distro version 0.61.2 is too old, project requires >= 0.64.0)
+    $SUDO pip3 install --upgrade meson
+    # Ensure pip3's meson takes precedence over apt's version
+    export PATH="$HOME/.local/bin:/usr/local/bin:$PATH"
+fi
+
 # Add DOCA repository and install packages
 ARCH_SUFFIX=$(if [ "${ARCH}" = "aarch64" ]; then echo "arm64"; else echo "amd64"; fi)
 MELLANOX_OS="$(. /etc/lsb-release; echo ${DISTRIB_ID}${DISTRIB_RELEASE} | tr A-Z a-z | tr -d .)"
@@ -172,7 +185,7 @@ rm "libfabric-${LIBFABRIC_VERSION#v}.tar.bz2"
     cd etcd-cpp-apiv3 && \
     mkdir build && cd build && \
     cmake .. && \
-    make -j"${NPROC:-$(nproc)}" && \
+    make -j"$NPROC" && \
     $SUDO make install && \
     $SUDO ldconfig \
 )
@@ -183,7 +196,7 @@
     mkdir aws_sdk_build && \
     cd aws_sdk_build && \
     cmake ../aws-sdk-cpp/ -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3" -DENABLE_TESTING=OFF -DCMAKE_INSTALL_PREFIX=/usr/local && \
-    make -j"${NPROC:-$(nproc)}" && \
+    make -j"$NPROC" && \
     $SUDO make install
 )

@@ -215,12 +228,12 @@ export UCX_TLS=^cuda_ipc

 # shellcheck disable=SC2086
 meson setup nixl_build --prefix=${INSTALL_DIR} -Ducx_path=${UCX_INSTALL_DIR} -Dbuild_docs=true -Drust=false ${EXTRA_BUILD_ARGS} -Dlibfabric_path="${LIBFABRIC_INSTALL_DIR}"
-ninja -C nixl_build && ninja -C nixl_build install
+ninja -j"$NPROC" -C nixl_build && ninja -j"$NPROC" -C nixl_build install
 mkdir -p dist && cp nixl_build/src/bindings/python/nixl-meta/nixl-*.whl dist/

 # TODO(kapila): Copy the nixl.pc file to the install directory if needed.
 # cp ${BUILD_DIR}/nixl.pc ${INSTALL_DIR}/lib/pkgconfig/nixl.pc

 cd benchmark/nixlbench
 meson setup nixlbench_build -Dnixl_path=${INSTALL_DIR} -Dprefix=${INSTALL_DIR}
-ninja -C nixlbench_build && ninja -C nixlbench_build install
+ninja -j"$NPROC" -C nixlbench_build && ninja -j"$NPROC" -C nixlbench_build install
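A quick, hedged way to check that the Ubuntu 22.04 branch above did what it intends (a pip new enough for --break-system-packages, meson >= 0.64.0, and the pip-installed meson ahead of apt's copy on PATH); the exact version strings will vary.

# Sanity-check sketch after running build.sh on Ubuntu 22.04:
pip3 --version      # expected: an upgraded pip (24.x), not the stock 22.0.2
meson --version     # expected: >= 0.64.0, meeting the project requirement
command -v meson    # expected: ~/.local/bin or /usr/local/bin rather than /usr/bin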

.gitlab/test_python.sh

Lines changed: 5 additions & 1 deletion
@@ -40,12 +40,16 @@ export NIXL_PREFIX=${INSTALL_DIR}
 # Raise exceptions for logging errors
 export NIXL_DEBUG_LOGGING=yes

-pip3 install --break-system-packages .
+# Control ninja parallelism during pip build to prevent OOM (NPROC from common.sh)
+pip3 install --break-system-packages --config-settings=compile-args="-j${NPROC}" .
 pip3 install --break-system-packages dist/nixl-*none-any.whl
 pip3 install --break-system-packages pytest
 pip3 install --break-system-packages pytest-timeout
 pip3 install --break-system-packages zmq

+# Add user pip packages to PATH
+export PATH="$HOME/.local/bin:$PATH"
+
 echo "==== Running ETCD server ===="
 etcd_port=$(get_next_tcp_port)
 etcd_peer_port=$(get_next_tcp_port)
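A minimal sketch of why the PATH export matters, assuming pip fell back to a user-level install under ~/.local (the scenario behind the earlier "pytest: command not found" failure); the resolved path in the last comment is illustrative.

# Without ~/.local/bin on PATH, a user-level pytest install is not found:
command -v pytest || echo "pytest: command not found"
# After the export added above, it resolves from the user install:
export PATH="$HOME/.local/bin:$PATH"
command -v pytest   # e.g. ~/.local/bin/pytest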

benchmark/nixlbench/README.md

Lines changed: 1 addition & 1 deletion
@@ -172,7 +172,7 @@ cd nixl/benchmark/nixlbench/contrib
 | `--ucx <path>` | Path to custom UCX source (optional) | Uses base image UCX |
 | `--build-type <type>` | Build type: `debug` or `release` | `release` |
 | `--base-image <image>` | Base Docker image | `nvcr.io/nvidia/cuda-dl-base` |
-| `--base-image-tag <tag>` | Base image tag | `25.03-cuda12.8-devel-ubuntu24.04` |
+| `--base-image-tag <tag>` | Base image tag | `25.06-cuda12.9-devel-ubuntu24.04` |
 | `--arch <arch>` | Target architecture: `x86_64` or `aarch64` | Auto-detected |
 | `--python-versions <versions>` | Python versions (comma-separated) | `3.12` |
 | `--tag <tag>` | Custom Docker image tag | Auto-generated |
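For context, a hedged example of passing the new default tag explicitly when building the nixlbench container; the script name `./build.sh` under the contrib directory is an assumption for illustration, while the option names and values are taken from the table above.

cd nixl/benchmark/nixlbench/contrib
# Hypothetical invocation; script name assumed, options from the table above.
./build.sh --base-image nvcr.io/nvidia/cuda-dl-base \
           --base-image-tag 25.06-cuda12.9-devel-ubuntu24.04 \
           --arch x86_64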
