wheels CI: stricter torch index selection, test oldest versions of dependencies#413
jameslamb wants to merge 17 commits into rapidsai:main
Conversation
# (useful in CI scripts where we want to tightly control which indices 'pip' uses).
- matrix:
    include_torch_extra_index: "false"
  packages:
rapids-dependency-file-generator uses the first matching matrix (see https://github.com/rapidsai/dependency-file-generator?tab=readme-ov-file#how-dependency-lists-are-merged).
This will only affect cases where include_torch_extra_index=false is passed (as in CI here). Other cases (like RAPIDS devcontainers) will fall through to other groups that pull in --extra-index-url lines.
So this should not break any other uses of this file.
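As a rough illustration of that first-match behavior (the entry names and packages here are hypothetical, not copied from this repo's dependencies.yaml), the fallback structure looks something like:

```yaml
# illustrative sketch only: rapids-dependency-file-generator picks the first
# matrix entry that matches the values passed via --matrix
specific:
  - output_types: [requirements]
    matrices:
      # matched when CI passes include_torch_extra_index=false:
      # no --extra-index-url line is emitted
      - matrix:
          include_torch_extra_index: "false"
        packages:
          - torch>=2.3
      # all other cases (e.g. devcontainers) fall through to this entry,
      # which pulls in the PyTorch extra index
      - matrix: null
        packages:
          - --extra-index-url=https://download.pytorch.org/whl/cu126
          - torch>=2.3
```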
build_type: pull-request
script: ci/test_wheel_pylibwholegraph.sh
matrix_filter: map(select(.ARCH == "amd64"))
matrix_type: 'nightly'
matrix_type: 'nightly'
TODO: revert this. Just added here for testing, to confirm this will fix the issues we've been seeing in nightlies.
build_type: pull-request
script: ci/test_wheel_cugraph-pyg.sh
matrix_filter: map(select(.ARCH == "amd64"))
matrix_type: 'nightly'
matrix_type: 'nightly'
Revert before merging.
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
- output_types: [conda]
  packages:
    - torchdata
    - pydantic
Just moving this here so depends_on_pytorch only ever contains torch / pytorch.
This test_python_common group is used everywhere that depends_on_pytorch is.
dependencies.yaml
# 2.6.0 is the oldest version on https://download.pytorch.org/whl/cu126 with CUDA wheels
- torch==2.6.0
@alexbarghi-nv see this note.
There aren't CUDA 12 wheels available for PyTorch older than 2.6.0.
pip download \
--isolated \
--no-deps \
--index-url=https://download.pytorch.org/whl/cu126 \
'torch==2.3.0'
# ERROR: Could not find a version that satisfies the requirement torch==2.3.0
# (from versions: 2.6.0+cu126, 2.7.0+cu126, 2.7.1+cu126, 2.8.0+cu126, 2.9.0+cu126, 2.9.1+cu126, 2.10.0+cu126)

Do you want to bump the floor in dependency metadata here to >=2.6.0? Or leave it at >=2.3, so that these libraries are still installable alongside older PyTorch releases (for example, if people build PyTorch 2.4 from source)?
Your call.
I would vote for bumping the floor to >=2.6.0. It's a little over a year old at this point. https://github.com/pytorch/pytorch/releases/tag/v2.6.0
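For intuition, the proposed floor can be sketched with a minimal version check (hypothetical helper, not part of this PR; pip applies the full PEP 440 rules, this naive tuple comparison ignores pre-releases and zero-padding details):

```python
# Naive sketch of the proposed floor bump (torch>=2.6.0). Not PEP 440-complete:
# it ignores pre-releases and other edge cases that pip handles properly.
def satisfies_floor(version: str, floor: str = "2.6.0") -> bool:
    release = version.split("+")[0]  # drop a local tag like '+cu126'

    def as_tuple(v: str):
        return tuple(int(part) for part in v.split("."))

    return as_tuple(release) >= as_tuple(floor)

print(satisfies_floor("2.3.0"))        # below the proposed floor
print(satisfies_floor("2.6.0+cu126"))  # oldest CUDA 12 wheel on the cu126 index
```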
--output requirements \
--file-key "test_cugraph_pyg" \
--matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION};dependencies=${RAPIDS_DEPENDENCIES};include_torch_extra_index=false" \
| tee "${PIP_CONSTRAINT}"
This is a new one for me 😭
× Dependency resolution exceeded maximum depth
╰─> Pip cannot resolve the current dependencies as the dependency graph is too complex for pip to solve efficiently.
hint: Try adding lower bounds to constrain your dependencies, for example: 'package>=2.0.0' instead of just 'package'.
All cugraph-pyg wheel tests are failing like this, not only the oldest dependencies one.
Example constraints file (not including all the requirements of all these packages):
--extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple
--extra-index-url=https://pypi.nvidia.com/
cudf==26.4.*,>=0.0.0a0
cugraph==26.4.*,>=0.0.0a0
cuml==26.4.*,>=0.0.0a0
ogb
pylibwholegraph==26.4.*,>=0.0.0a0
pytest-benchmark
pytest-cov
pytest-xdist
pytest<9.0.0
sentence-transformers
torch>=2.9.0
I'll try that advice from the error message, let's see if it'll help us get a little farther.
Getting further with local testing, adding more pins to force out some solver errors.
Test code:
docker run \
--rm \
--gpus all \
--env GH_TOKEN=$(gh auth token) \
--env RAPIDS_BUILD_TYPE="pull-request" \
--env RAPIDS_REPOSITORY="rapidsai/cugraph-gnn" \
-v $(pwd):/opt/work \
-w /opt/work \
-it rapidsai/citestwheel:26.04-cuda12.9.1-rockylinux8-py3.11 \
bash
source rapids-init-pip
package_name="cugraph-pyg"
RAPIDS_PY_CUDA_SUFFIX="$(rapids-wheel-ctk-name-gen ${RAPIDS_CUDA_VERSION})"
# Download the libwholegraph, pylibwholegraph, and cugraph-pyg built in the previous step
COMMIT_ID=843296e5e99ebb017e3a4a63b046abfc672ce279
LIBWHOLEGRAPH_WHEELHOUSE=$(
RAPIDS_PY_WHEEL_NAME="libwholegraph_${RAPIDS_PY_CUDA_SUFFIX}" rapids-get-pr-artifact cugraph-gnn 413 cpp wheel "${COMMIT_ID}"
)
PYLIBWHOLEGRAPH_WHEELHOUSE=$(
rapids-get-pr-artifact cugraph-gnn 413 python wheel --pkg_name pylibwholegraph --stable "${COMMIT_ID}"
)
CUGRAPH_PYG_WHEELHOUSE=$(
RAPIDS_PY_WHEEL_NAME="cugraph-pyg_cu12" RAPIDS_PY_WHEEL_PURE="1" rapids-get-pr-artifact cugraph-gnn 413 python wheel "${COMMIT_ID}"
)
# generate constraints, accounting for 'oldest' and 'latest' dependencies
rapids-dependency-file-generator \
--output requirements \
--file-key "test_cugraph_pyg" \
--matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION};dependencies=${RAPIDS_DEPENDENCIES};include_torch_extra_index=false" \
| tee "${PIP_CONSTRAINT}"
# ensure a CUDA variant of 'torch' is used
./ci/download-torch-wheels.sh
# notes:
#
# * echo to expand wildcard before adding `[extra]` requires for pip
# * '--extra-index-url pypi.nvidia.com' can be removed when 'cugraph' and
# its dependencies are available from pypi.org
#
rapids-pip-retry install \
--dry-run \
-v \
--constraint "${PIP_CONSTRAINT}" \
--extra-index-url 'https://pypi.nvidia.com' \
"${LIBWHOLEGRAPH_WHEELHOUSE}"/*.whl \
"$(echo "${PYLIBWHOLEGRAPH_WHEELHOUSE}"/pylibwholegraph_"${RAPIDS_PY_CUDA_SUFFIX}"*.whl)" \
"$(echo "${CUGRAPH_PYG_WHEELHOUSE}"/cugraph_pyg_"${RAPIDS_PY_CUDA_SUFFIX}"*.whl)[test]" \
"cuda-bindings[all]==12.9.4" \
"cudf-cu12==26.4.0a289" \
"cugraph-cu12==26.4.0a30" \
"cuml-cu12==26.4.0a77" \
"dask-cuda==26.4.0a18" \
"distributed-ucxx-cu12==0.49.0a20" \
"libcudf-cu12==26.4.0a289" \
"libcugraph-cu12==26.4.0a30" \
"libcuml-cu12==26.4.0a77" \
"libucxx-cu12==0.49.0a20" \
"numba-cuda[cu12]==0.27.0" \
"pylibcugraph-cu12==26.4.0a30" \
"pylibcudf-cu12==26.4.0a289" \
"pylibraft-cu12==26.4.0a34" \
"raft-dask-cu12==26.4.0a33" \
"rapids-dask-dependency==26.4.0a7" \
"rmm-cu12==26.4.0a30" \
  "ucxx-cu12==0.49.0a20"

I think torch's very tight pinnings are leading to these expensive solves.
TORCH_WHEEL_DIR=$(mktemp -d)
rapids-pip-retry download \
--prefer-binary \
--no-deps \
-d "${TORCH_WHEEL_DIR}" \
--index-url "https://download.pytorch.org/whl/cu126" \
'torch==2.10'
pushd "${TORCH_WHEEL_DIR}"
pip install pkginfo
pkginfo --json *.whl
"cuda-bindings==12.9.4; platform_system == \"Linux\"",
"nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == \"Linux\"",
"nvidia-cuda-runtime-cu12==12.6.77; platform_system == \"Linux\"",
"nvidia-cuda-cupti-cu12==12.6.80; platform_system == \"Linux\"",
"nvidia-cudnn-cu12==9.10.2.21; platform_system == \"Linux\"",
"nvidia-cublas-cu12==12.6.4.1; platform_system == \"Linux\"",
"nvidia-cufft-cu12==11.3.0.4; platform_system == \"Linux\"",
"nvidia-curand-cu12==10.3.7.77; platform_system == \"Linux\"",
"nvidia-cusolver-cu12==11.7.1.2; platform_system == \"Linux\"",
"nvidia-cusparse-cu12==12.5.4.2; platform_system == \"Linux\"",
"nvidia-cusparselt-cu12==0.7.1; platform_system == \"Linux\"",
"nvidia-nccl-cu12==2.27.5; platform_system == \"Linux\"",
"nvidia-nvshmem-cu12==3.4.5; platform_system == \"Linux\"",
"nvidia-nvtx-cu12==12.6.77; platform_system == \"Linux\"",
"nvidia-nvjitlink-cu12==12.6.85; platform_system == \"Linux\"",
"nvidia-cufile-cu12==1.11.1.6; platform_system == \"Linux\"",
"triton==3.6.0; platform_system == \"Linux\"",

Pinning to the latest versions of RAPIDS nightlies as well as a few other packages is yielding solver errors like this:
ERROR: Cannot install cuda-bindings[all]==12.9.4, cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc]==12.0.0, cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc]==12.0.1, cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc]==12.1.0, cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc]==12.1.1, cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc]==12.2.0, cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc]==12.2.1, cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc]==12.2.2, cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc]==12.9.1, cudf-cu12==26.4.0a289, cuml-cu12==26.4.0a77, libcuml-cu12==26.4.0a77 and numba-cuda[cu12]==0.27.0 because these package versions have conflicting dependencies.
The conflict is caused by:
cudf-cu12 26.4.0a289 depends on cuda-toolkit==12.*
cuml-cu12 26.4.0a77 depends on cuda-toolkit==12.*
libcuml-cu12 26.4.0a77 depends on cuda-toolkit==12.*
numba-cuda[cu12] 0.27.0 depends on cuda-toolkit==12.*; extra == "cu12"
cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc] 12.9.1 depends on cuda-toolkit 12.9.1 (from https://pypi.nvidia.com/cuda-toolkit/cuda_toolkit-12.9.1-py2.py3-none-any.whl#sha256=0c8636dfacbecfe9867a949a211864f080a805bc54023ce4a361aa4e1fd8738b (from https://pypi.nvidia.com/cuda-toolkit/))
cuda-bindings[all] 12.9.4 depends on nvidia-nvjitlink-cu12>=12.3; extra == "all"
cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc] 12.2.2 depends on nvidia-nvjitlink-cu12==12.2.140.*; (sys_platform == "win32" or sys_platform == "linux") and extra == "nvjitlink"
cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc] 12.2.1 depends on nvidia-nvjitlink-cu12==12.2.128.*; (sys_platform == "win32" or sys_platform == "linux") and extra == "nvjitlink"
cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc] 12.2.0 depends on nvidia-nvjitlink-cu12==12.2.91.*; (sys_platform == "linux" or sys_platform == "win32") and extra == "nvjitlink"
cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc] 12.1.1 depends on nvidia-nvjitlink-cu12==12.1.105.*; (sys_platform == "win32" or sys_platform == "linux") and extra == "nvjitlink"
cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc] 12.1.0 depends on nvidia-nvjitlink-cu12==12.1.55.*; (sys_platform == "win32" or sys_platform == "linux") and extra == "nvjitlink"
cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc] 12.0.1 depends on nvidia-nvjitlink-cu12==12.0.140.*; (sys_platform == "linux" or sys_platform == "win32") and extra == "nvjitlink"
cuda-toolkit[cccl,cudart,nvcc,nvjitlink,nvrtc] 12.0.0 depends on nvidia-nvjitlink-cu12==12.0.76.*; (sys_platform == "win32" or sys_platform == "linux") and extra == "nvjitlink"
Additionally, some packages in these conflicts have no matching distributions available for your environment:
cuda-toolkit
nvidia-nvjitlink-cu12
To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip to attempt to solve the dependency conflict
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
Looks like in recent successful runs on main, the jobs are falling back to torch==2.9.1 wheels even though 2.10.0 wheels are available: https://github.com/rapidsai/cugraph-gnn/actions/runs/22192581186/job/64185894306#step:13:838
I've pushed eb6be78 adding a ceiling of torch<2.10.
Let's just see if that allows all the environments to be solved. If it does, maybe it's worth putting that ceiling in place temporarily and handling removing it as a follow-up issue / PR (to at least get nightly tests working again here).
Oy, this is brutal.
CI is still failing here and I see pip backtracking over a bunch of different versions of cuda-pathfinder, cuda-toolkit, and RAPIDS libraries.
I'm still testing locally, let's see if I can find a different path through this.
Ah! Ok had an idea, I think this gets us further along here.
Setting that locally-downloaded torch file as a constraint means it enters pip's resolution algorithm pretty late in the process. Passing it as a requirement upfront gets it and all of its requirements into pip's solution early, which makes the search space small enough that instead of resolution-too-deep, we get a more informative solver error.
Pushed a commit doing that: 4e923d4
Locally, I got something like this:
The conflict is caused by:
cudf-cu12 26.4.0a289 depends on cuda-toolkit==12.*
cuml-cu12 26.4.0a78 depends on cuda-toolkit==12.*
libcuml-cu12 26.4.0a78 depends on cuda-toolkit==12.*
libraft-cu12 26.4.0a33 depends on cuda-toolkit==12.*
cuda-toolkit[cublas,cufft,curand,cusolver,cusparse,nvjitlink] 12.9.1 depends on cuda-toolkit 12.9.1 (from https://pypi.nvidia.com/cuda-toolkit/cuda_toolkit-12.9.1-py2.py3-none-any.whl#sha256=0c8636dfacbecfe9867a949a211864f080a805bc54023ce4a361aa4e1fd8738b (from https://pypi.nvidia.com/cuda-toolkit/))
torch 2.9.1+cu126 depends on nvidia-cublas-cu12==12.6.4.1; platform_system == "Linux"
nvidia-cudnn-cu12 9.10.2.21 depends on nvidia-cublas-cu12
nvidia-cusolver-cu12 11.7.1.2 depends on nvidia-cublas-cu12
cuda-toolkit[cublas,cufft,curand,cusolver,cusparse,nvjitlink] 12.9.0 depends on nvidia-cublas-cu12==12.9.0.13.*; (sys_platform == "win32" or sys_platform == "linux") and extra == "cublas"
...
cuda-toolkit[cublas,cufft,curand,cusolver,cusparse,nvjitlink] 12.6.1 depends on nvidia-cublas-cu12==12.6.1.4.*; (sys_platform == "linux" or sys_platform == "win32") and extra == "cublas"
...
cuda-toolkit[cublas,cufft,curand,cusolver,cusparse,nvjitlink] 12.0.0 depends on nvidia-cublas-cu12==12.0.1.189.*; (sys_platform == "win32" or sys_platform == "linux") and extra == "cublas"
Additionally, some packages in these conflicts have no matching distributions available for your environment:
cuda-toolkit
nvidia-cublas-cu12
To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip to attempt to solve the dependency conflict
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
This is saying "you asked me to install cuda-toolkit==12.9.1, but its nvidia-cublas-cu12 pin is incompatible with torch's nvidia-cublas-cu12==12.6.4.1".
We can work with this! Just need to figure out where that cuda-toolkit==12.9.1 is coming from.
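One way to chase that down (a hedged sketch, not what CI does; in CI you would more likely grep pip's verbose output or use `pip install --dry-run --report`) is to ask the installed distributions which of them declare a requirement on the package:

```python
# Sketch: list which installed distributions declare a requirement on a given
# package, to help trace where a pin like 'cuda-toolkit==12.*' enters the solve.
import re
from importlib.metadata import distributions


def requirers(target: str):
    target = target.replace("_", "-").lower()
    for dist in distributions():
        for req in dist.requires or []:
            # requirement strings look like: cuda-toolkit==12.*; extra == "cu12"
            match = re.match(r"[A-Za-z0-9_.\-]+", req)
            if match and match.group(0).replace("_", "-").lower() == target:
                yield dist.metadata["Name"], req


for name, req in requirers("cuda-toolkit"):
    print(f"{name} -> {req}")
```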
Ok here's an interesting clue... looks like in recent successful cugraph-pyg runs, CUDA torch might have been getting replaced with a CPU-only one from pypi.org:
...
Downloading http://pip-cache.local.gha-runners.nvidia.com/packages/56/be/76eaa36c9cd032d3b01b001e2c5a05943df75f26211f68fae79e62f87734/torch-2.9.1-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (30 kB)
...
That would explain why I'm not able to get the environment to solve with similar versions as were found in those jobs!
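That fallback is easy to spot after the fact: wheels resolved from a download.pytorch.org/whl/cuNNN index carry a `+cuNNN` local version tag, while wheels from pypi.org (like the `torch-2.9.1-...` one in that log) do not. A small illustrative check (hypothetical helper, not part of this PR):

```python
# Illustrative helper (not part of this PR): wheels from a
# download.pytorch.org/whl/cuNNN index carry a '+cuNNN' local version tag;
# wheels resolved from pypi.org have no such tag.
def from_pytorch_cuda_index(version: str) -> bool:
    return version.partition("+")[2].startswith("cu")


print(from_pytorch_cuda_index("2.9.1"))        # the pypi.org fallback
print(from_pytorch_cuda_index("2.9.1+cu126"))  # wheel from the cu126 index
```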
This is a known issue (I was just late to it), where some CUDA 12 torch wheels were not installable alongside ANY cuda-toolkit wheels because they mixed == pins across CTK versions.
Documented that here: rapidsai/build-planning#255
I've pushed commits here pinning to specific known-compatible, CUDA variant torch wheels in wheel testing... hopefully that will work.
Ok lots of wheel tests are passing now! All pylibwholegraph and CUDA 12 cugraph-pyg tests are looking good (using the nightly matrix).
Looks like there was another issue hiding in here though... cugraph-pyg CUDA 13 wheel tests are failing like this:
/__w/cugraph-gnn/cugraph-gnn/python/cugraph-pyg/cugraph_pyg /__w/cugraph-gnn/cugraph-gnn
ImportError while loading conftest '/__w/cugraph-gnn/cugraph-gnn/python/cugraph-pyg/cugraph_pyg/tests/conftest.py'.
tests/conftest.py:9: in <module>
from pylibcugraph.comms import (
/pyenv/versions/3.12.12/lib/python3.12/site-packages/pylibcugraph/__init__.py:15: in <module>
import pylibcugraph.comms
/pyenv/versions/3.12.12/lib/python3.12/site-packages/pylibcugraph/comms/__init__.py:4: in <module>
from .comms_wrapper import init_subcomms
E ImportError: libcugraph.so: cannot open shared object file: No such file or directory
Error: Process completed with exit code 4.
Ignore "No such file or directory", that's misleading (we'll fix that in rapidsai/build-planning#119 at some point).
The real issue is that libcugraph.so cannot be loaded. I've opened an issue about it here: rapidsai/cugraph#5443
Nightly CI here has been failing for a couple of weeks, and the root cause is "some jobs are installing incorrect `torch` wheels". That's tracked in #410 and being worked on in #413. That work unfortunately uncovered some other significant compatibility issues that will require RAPIDS-wide fixes:

* rapidsai/build-planning#256
* rapidsai/build-planning#257
* rapidsai/cugraph#5443

As a short-term patch, this proposes allowing `cugraph-gnn` nightlies to fail for a few more weeks, so regular PR CI can be unblocked while we focus on the more permanent fix. Targeting the more permanent fix (and reverting this back to 7 days) for the 26.04 release (so over the next few weeks).

Authors:

- James Lamb (https://github.com/jameslamb)

Approvers:

- Alex Barghi (https://github.com/alexbarghi-nv)
- Gil Forsyth (https://github.com/gforsyth)

URL: #419
Contributes to #5443

Related to rapidsai/build-planning#143

`libcugraph.so` dynamically links to several CUDA Toolkit libraries:

```console
$ ldd /pyenv/versions/3.11.14/lib/python3.11/site-packages/libcugraph/lib64/libcugraph.so
...
    libcusolver.so.12 => /usr/local/cuda/lib64/libcusolver.so.12 (0x00007c616aba7000)
    libcublas.so.13 => /usr/local/cuda/lib64/libcublas.so.13 (0x00007c61675d5000)
    libcublasLt.so.13 => /usr/local/cuda/lib64/libcublasLt.so.13 (0x00007c6143c83000)
    libcusparse.so.12 => /usr/local/cuda/lib64/libcusparse.so.12 (0x00007c613987c000)
    libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007c6131161000)
...
    libnvJitLink.so.13 => /usr/local/cuda/lib64/libnvJitLink.so.13 (0x00007c612af5f000)
...
```

This proposes getting them from `cuda-toolkit` wheels, instead of system installations.

## Notes for Reviewers

### Benefits of this change

* reduces the risk of multiple copies of the same library being loaded
* allows the use of Python package versioning to manage compatibility
* consistency with other RAPIDS libraries (see rapidsai/build-planning#35)
* reduces the risk of runtime issues with other libraries that use CTK wheels, like `torch` (rapidsai/cugraph-gnn#413 (comment))

Authors:

- James Lamb (https://github.com/jameslamb)

Approvers:

- Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #5444
Fixes #410
There, some nightly wheel tests were failing because CUDA 13 packages were being installed while testing against CUDA 12 `pylibwholegraph` packages. This fixes that, along with some other improvements to wheel testing:

* CUDA variants of `torch` are always installed (no fallback to pypi.org CPU-only packages)
* the correct `torch` index (based on CUDA major version) is used