
ensure 'torch' CUDA wheels are installed in CI #2279

Open
jameslamb wants to merge 5 commits into rapidsai:main from jameslamb:torch-testing

Conversation

@jameslamb
Member

Description

Contributes to rapidsai/build-planning#256

Broken out from #2270

Proposes a stricter pattern for installing torch wheels, to prevent bugs of the form "accidentally used a CPU-only torch from pypi.org". This should help us to catch compatibility issues, improving release confidence.

Other small changes:

  • splits torch wheel testing into "oldest" (PyTorch 2.9) and "latest" (PyTorch 2.10)
  • introduces a require_gpu_pytorch matrix filter so conda jobs can explicitly request pytorch-gpu (to similarly ensure solvers don't fall back to the GPU-only variant)
  • appends rapids-generate-pip-constraint output to the file that PIP_CONSTRAINT points to
    • (to reduce duplication and the risk of failing to apply constraints)
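The failure mode described above is detectable from the wheel filename alone: CUDA-variant torch wheels from download.pytorch.org carry a local version tag such as +cu129, while the CPU-only wheels on pypi.org do not. A minimal sketch of such a check (a hypothetical helper, not code from this PR):

```python
# Hypothetical helper, not part of this PR: detect whether a torch wheel
# is a CUDA variant by looking for a "+cu" local version tag in its
# filename (e.g. torch-2.9.0+cu129-cp311-...whl).
def is_cuda_torch_wheel(wheel_filename: str) -> bool:
    # wheel filenames follow the pattern: name-version-pytag-abitag-platform.whl
    version = wheel_filename.split("-")[1]
    return "+cu" in version

print(is_cuda_torch_wheel("torch-2.9.0+cu129-cp311-cp311-manylinux_2_28_x86_64.whl"))  # True
print(is_cuda_torch_wheel("torch-2.9.0-cp311-cp311-manylinux_2_28_x86_64.whl"))        # False
```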

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@jameslamb added the non-breaking (Non-breaking change) and improvement (Improvement / enhancement to an existing function) labels Mar 6, 2026
@copy-pr-bot

copy-pr-bot bot commented Mar 6, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@jameslamb
Member Author

/ok to test

Comment on lines +423 to +426
# avoid pulling in 'torch' in places like DLFW builds that prefer to install it other ways
- matrix:
    no_pytorch: "true"
  packages:
Member Author


This follows the pattern @trxcllnt has been introducing across RAPIDS: rapidsai/cugraph-gnn#421

I think rmm never needed patches for DLFW and so was missed in that round of PRs, because its depends_on_pytorch group doesn't end up in test_python or similar commonly-used lists.

@jameslamb
Member Author

/ok to test

@coderabbitai

coderabbitai bot commented Mar 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 92c54388-7bf6-4169-a2b9-e75808aa98f6

📥 Commits

Reviewing files that changed from the base of the PR and between 9eefdea and 9efce26.

📒 Files selected for processing (1)
  • ci/test_wheel_integrations.sh

📝 Walkthrough

Summary by CodeRabbit

  • Chores
    • Added an automated PyTorch wheel downloader to ensure CUDA-targeted wheels are fetched for CI.
    • Centralized and clarified constraint handling for pip installs using an environment-driven constraint source.
    • Expanded dependency declarations to support multiple PyTorch/CUDA matrix combinations and added a new torch-only dependency group; removed the prior test-wheels grouping.
  • Tests
    • Updated GPU test flows to install CUDA-specific wheels and adjusted compatibility checks for newer CUDA versions.

Walkthrough

Adds a CI script to download CUDA-specific PyTorch wheels, updates CI test scripts to use an environment-driven PIP constraint and to download/use CUDA wheels for PyTorch tests, and restructures dependencies.yaml to replace a simple PyTorch entry with a multi-matrix depends_on_pytorch and a new torch_only group.

Changes

Cohort / File(s) Summary
PyTorch wheel downloader
ci/download-torch-wheels.sh
New executable script that generates torch-specific constraints via rapids-dependency-file-generator and downloads CUDA-variant PyTorch wheels with rapids-pip-retry into a specified directory.
CI test scripts
ci/test_python_integrations.sh, ci/test_wheel.sh, ci/test_wheel_integrations.sh
Switch constraint generation/usage to environment-driven ${PIP_CONSTRAINT}, add ;require_gpu_pytorch=true to the PyTorch GPU matrix entry, and refactor integrations flow to download/use CUDA-specific PyTorch wheels for GPU test runs.
Dependency configuration
dependencies.yaml
Remove test_wheels_pytorch file-group, add new torch_only group, and replace the simple depends_on_pytorch common block with a detailed specific multi-matrix declaration covering CUDA versions, GPU/non-GPU variants, and multiple output types (requirements, pyproject, conda).
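For context on how the downloader script targets the right wheels: the --matrix selector it passes to rapids-dependency-file-generator is assembled from CI environment variables (the exact string appears in a review comment further down). A hypothetical Python restatement of that string-building step, with the shell `%.*` trim spelled out:

```python
# Illustrative only: how the --matrix selector quoted in the review comments
# could be assembled from CI environment variables. Names mirror the
# RAPIDS_* variables used in the shell scripts.
def build_matrix(cuda_version: str, arch: str, py: str, dependencies: str) -> str:
    # "${RAPIDS_CUDA_VERSION%.*}" strips the patch component: "12.9.1" -> "12.9"
    cuda = cuda_version.rsplit(".", 1)[0]
    return (
        f"cuda={cuda};arch={arch};py={py};"
        f"dependencies={dependencies};require_gpu_pytorch=true"
    )

print(build_matrix("12.9.1", "x86_64", "3.11", "oldest"))
# cuda=12.9;arch=x86_64;py=3.11;dependencies=oldest;require_gpu_pytorch=true
```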

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: ensuring CUDA-enabled torch wheels are installed in CI, which is the primary objective across all modified files.
Description check ✅ Passed The description is directly related to the changeset, explaining the rationale for stricter torch wheel installation patterns, the split into PyTorch versions, the new matrix filter, and constraint handling improvements.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
ci/download-torch-wheels.sh (1)

27-40: Keep the generated constraint file inside TORCH_WHEEL_DIR.

Writing torch-constraints.txt to ./ leaves shared state in the working tree even though the caller already gives this helper a per-run temp directory. Keeping the file under ${TORCH_WHEEL_DIR} makes the whole download step self-contained and avoids cross-run collisions.

♻️ Suggested refactor
+TORCH_CONSTRAINTS="${TORCH_WHEEL_DIR}/torch-constraints.txt"
+
 rapids-dependency-file-generator \
     --output requirements \
     --file-key "torch_only" \
     --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION};dependencies=${RAPIDS_DEPENDENCIES};require_gpu_pytorch=true" \
-| tee ./torch-constraints.txt
+| tee "${TORCH_CONSTRAINTS}"
 
 rapids-pip-retry download \
   --isolated \
   --prefer-binary \
   --no-deps \
   -d "${TORCH_WHEEL_DIR}" \
   --constraint "${PIP_CONSTRAINT}" \
-  --constraint ./torch-constraints.txt \
+  --constraint "${TORCH_CONSTRAINTS}" \
   'torch'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/download-torch-wheels.sh` around lines 27 - 40, The generated constraint
file is written to ./torch-constraints.txt which leaves shared state; change the
rapids-dependency-file-generator pipeline so tee writes into the per-run
directory (use ${TORCH_WHEEL_DIR}/torch-constraints.txt) and update the
subsequent rapids-pip-retry download --constraint argument to point to that file
instead of ./torch-constraints.txt, keeping all references to
torch-constraints.txt, rapids-dependency-file-generator, tee,
${TORCH_WHEEL_DIR}, and the rapids-pip-retry download --constraint option
consistent.
dependencies.yaml (1)

401-409: Mirror the oldest/latest split on the conda path or explain why conda can use relaxed version constraints.

The caller passes dependencies=${RAPIDS_DEPENDENCIES} to the generator for PyTorch conda, but the conda matrices ignore this selector. The requirements (wheel) path pins specific PyTorch versions for dependencies=oldest (e.g., torch==2.9.0+cu129), whereas the conda path always uses relaxed constraints (pytorch-gpu>=2.9). Either add the same oldest/latest branching to the conda matrices, or document why conda can rely on the solver to handle any PyTorch >=2.9 version safely while wheels cannot.
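If the first option (mirroring the split) were taken, the conda matrices might look something like the following. This is a hypothetical sketch only; the version pins shown are illustrative, not values from this PR:

```yaml
# Hypothetical sketch: conda matrices branching on dependencies=oldest,
# mirroring the wheel path. Version pins are illustrative.
- output_types: [conda]
  matrices:
    - matrix: {dependencies: "oldest", require_gpu_pytorch: "true"}
      packages: [pytorch-gpu==2.9.*]
    - matrix: {dependencies: "oldest"}
      packages: [pytorch==2.9.*]
    - matrix: {require_gpu_pytorch: "true"}
      packages: [pytorch-gpu>=2.9]
    - matrix:
      packages: [pytorch>=2.9]
```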

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dependencies.yaml` around lines 401 - 409, The conda matrices under
output_types: conda currently ignore the caller's dependencies selector
(dependencies=${RAPIDS_DEPENDENCIES}) and always use relaxed package specs
(packages: - pytorch-gpu>=2.9 / - pytorch>=2.9), which diverges from the wheel
path that branches on oldest/latest and pins versions (e.g.,
torch==2.9.0+cu129); update the conda matrices to mirror the oldest/latest split
(add separate matrix entries for dependencies=oldest that pin exact
pytorch/pytorch-gpu versions and for dependencies=latest that keep >=
constraints) or add a clear comment/documentation explaining why conda can
safely use relaxed constraints and referencing the matrix keys (matrices,
require_gpu_pytorch, packages) and the caller variable RAPIDS_DEPENDENCIES so
reviewers can verify the intended behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/test_wheel_integrations.sh`:
- Around line 40-45: The CI skip log in test_wheel_integrations.sh is out of
sync with the gate that checks CUDA_MAJOR/CUDA_MINOR (the conditional using
CUDA_MAJOR and CUDA_MINOR around the if block) — update the skip/message string
(the message printed around line 66) to match the actual gate: indicate "CUDA
12.9+ (for 12.x) or 13.0" instead of "12.6-12.9 or 13.0" so it accurately
reflects the condition in the if that checks CUDA_MAJOR and CUDA_MINOR.



📥 Commits

Reviewing files that changed from the base of the PR and between d1563fc and 5d328ff.

📒 Files selected for processing (5)
  • ci/download-torch-wheels.sh
  • ci/test_python_integrations.sh
  • ci/test_wheel.sh
  • ci/test_wheel_integrations.sh
  • dependencies.yaml


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/test_wheel_integrations.sh`:
- Around line 40-45: The top comment stating "requires CUDA 12.8+" is now
inaccurate relative to the conditional that permits CUDA 12.9+ for 12.x and only
13.0 for 13.x; update that comment above the gating if-block (the block checking
CUDA_MAJOR and CUDA_MINOR) to accurately describe the new policy (e.g., require
CUDA 12.9+ on 12.x and CUDA 13.0 on 13.x) so future triage matches the condition
in the { [ "${CUDA_MAJOR}" -eq 12 ] && [ "${CUDA_MINOR}" -ge 9 ]; } || { [
"${CUDA_MAJOR}" -eq 13 ] && [ "${CUDA_MINOR}" -le 0 ]; } check.
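For clarity, the shell condition the bot quotes reduces to "CUDA 12.9+ on the 12.x series, or exactly 13.0". A hypothetical Python restatement, consistent with the skip/run results in the build logs later in this thread:

```python
# Restates the quoted shell gate (illustration only):
#   { [ CUDA_MAJOR -eq 12 ] && [ CUDA_MINOR -ge 9 ]; } ||
#   { [ CUDA_MAJOR -eq 13 ] && [ CUDA_MINOR -le 0 ]; }
def should_run_pytorch_tests(cuda_version: str) -> bool:
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    return (major == 12 and minor >= 9) or (major == 13 and minor <= 0)

print(should_run_pytorch_tests("12.2.2"))  # False (skipped in CI)
print(should_run_pytorch_tests("12.9.1"))  # True
print(should_run_pytorch_tests("13.0.2"))  # True
print(should_run_pytorch_tests("13.1.1"))  # False (skipped in CI)
```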


📥 Commits

Reviewing files that changed from the base of the PR and between 5d328ff and 9eefdea.

📒 Files selected for processing (1)
  • ci/test_wheel_integrations.sh

-v \
"${PIP_INSTALL_SHARED_ARGS[@]}" \
-r test-pytorch-requirements.txt
"${TORCH_WHEEL_DIR}"/torch-*.whl
Member Author


It looks to me like this is working and pulling in what we want!

CUDA 12.2.2, Python 3.11, arm64, ubuntu22.04, a100, latest-driver, latest-deps

(build link)

  RAPIDS logger » [03/06/26 19:41:00]
  ┌──────────────────────────────────────────────────────────────────────────┐
  |    Skipping PyTorch tests (requires CUDA 12.9+ or 13.0, found 12.2.2)    |
  └──────────────────────────────────────────────────────────────────────────┘

CUDA 12.9.1, Python 3.11, amd64, ubuntu22.04, l4, latest-driver, oldest-deps

(build link)

  Successfully installed ... torch-2.9.0+cu129 ...

CUDA 12.9.1, Python 3.14, amd64, ubuntu24.04, h100, latest-driver, latest-deps

(build link)

  Successfully installed ... torch-2.10.0+cu129 ...

CUDA 13.0.2, Python 3.12, amd64, ubuntu24.04, l4, latest-driver, latest-deps

(build link)

  Successfully installed ... torch-2.10.0+cu130 ...

CUDA 13.0.2, Python 3.12, arm64, rockylinux8, l4, latest-driver, latest-deps

(build link)

  Successfully installed ... torch-2.10.0+cu130 ...

CUDA 13.1.1, Python 3.13, amd64, rockylinux8, rtxpro6000, latest-driver, latest-deps

(build link)

  RAPIDS logger » [03/06/26 19:35:46]
  ┌──────────────────────────────────────────────────────────────────────────┐
  |    Skipping PyTorch tests (requires CUDA 12.9+ or 13.0, found 13.1.1)    |
  └──────────────────────────────────────────────────────────────────────────┘

CUDA 13.1.1, Python 3.14, amd64, ubuntu24.04, rtxpro6000, latest-driver, latest-deps

(build link)

  RAPIDS logger » [03/06/26 19:34:37]
  ┌──────────────────────────────────────────────────────────────────────────┐
  |    Skipping PyTorch tests (requires CUDA 12.9+ or 13.0, found 13.1.1)    |
  └──────────────────────────────────────────────────────────────────────────┘

CUDA 13.1.1, Python 3.14, arm64, ubuntu24.04, l4, latest-driver, latest-deps

(build link)

  RAPIDS logger » [03/06/26 19:36:06]
  ┌──────────────────────────────────────────────────────────────────────────┐
  |    Skipping PyTorch tests (requires CUDA 12.9+ or 13.0, found 13.1.1)    |
  └──────────────────────────────────────────────────────────────────────────┘

@jameslamb jameslamb changed the title WIP: ensure 'torch' CUDA wheels are installed in CI ensure 'torch' CUDA wheels are installed in CI Mar 6, 2026
@jameslamb jameslamb requested a review from bdice March 6, 2026 19:58
@jameslamb jameslamb marked this pull request as ready for review March 6, 2026 19:58
@jameslamb jameslamb requested review from a team as code owners March 6, 2026 19:58