
Conversation

@EmmaQiaoCh
Collaborator

@EmmaQiaoCh EmmaQiaoCh commented Dec 9, 2025

Summary by CodeRabbit

  • Chores
    • Updated Docker base image registry and build tags for multi-stage builds
    • Upgraded CUDA toolkit to version 13.1.0
    • Updated PyTorch from 2.9.0 to 2.9.1
    • Upgraded TensorRT, cuDNN, NCCL, and cuBLAS dependencies to latest compatible versions
    • Updated Polygraphy to version 0.49.26
    • Updated transformers library dependency to 4.57.1


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
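For illustration, a run that builds only the Docker images and keeps going past the first failure could be triggered as follows (flags combined here purely as an example of the syntax above; the stage name is the one used later in this thread):

    /bot run --disable-fail-fast --stage-list "Build-Docker-Images"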

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@EmmaQiaoCh EmmaQiaoCh requested review from a team as code owners December 9, 2025 02:54
@EmmaQiaoCh EmmaQiaoCh marked this pull request as draft December 9, 2025 02:54
@coderabbitai
Contributor

coderabbitai bot commented Dec 9, 2025

📝 Walkthrough

Version updates across Docker build configuration and Python dependencies. Changes affect base images, CUDA toolkit, PyTorch, TensorRT, Polygraphy, and package versions without modifying any control flow or introducing new features.

Changes

Cohort / File(s) — Summary

  • Docker multi-stage build configuration — docker/Dockerfile.multi
    Updated default build arguments: BASE_IMAGE to gitlab-master.nvidia.com:5005/dl/dgx/pytorch, BASE_TAG to 25.12-py3.39451658-devel, TRITON_BASE_TAG to 25.11-py3
  • Docker CUDA toolkit installer — docker/common/install_cuda_toolkit.sh
    Updated CUDA_VER from 13.0.2_580.95.05 to 13.1.0_590.44.01
  • Docker Polygraphy installer — docker/common/install_polygraphy.sh
    Updated Polygraphy version from 0.49.9 to 0.49.26
  • Docker PyTorch installer — docker/common/install_pytorch.sh
    Updated PyTorch version from 2.9.0 to 2.9.1
  • Docker TensorRT installer — docker/common/install_tensorrt.sh
    Updated TensorRT to 10.14.1.48 and CUDA stack versions: CUDA_VER to 13.1, CUDNN_VER to 9.17.0.21-1, NCCL_VER to 2.28.9-1+cuda13.0, CUBLAS_VER to 13.2.0.9-1, NVRTC_VER to 13.1.80-1, CUDA_RUNTIME to 13.1.80-1, CUDA_DRIVER_VERSION to 590.44.01-1.el8
  • Python dependencies — requirements.txt
    Updated tensorrt to ~=10.14.1, torch range to >=2.1.0a0,<=2.9.1, nvidia-nccl-cu13 to 2.28.9, transformers to 4.57.1

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verification focus: Ensure all version updates across Docker scripts and requirements.txt are consistent (e.g., CUDA and TensorRT versions align across multiple files)
  • Cross-file consistency: Check that docker/common/install_tensorrt.sh and requirements.txt TensorRT versions match (10.14.1)
  • Dependency compatibility: Verify PyTorch 2.9.1 and TensorRT 10.14.1 compatibility with CUDA 13.1
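As a rough sketch of that cross-file check (plain grep over the files touched by this PR; the patterns are illustrative, not part of any CI tooling):

    # Compare the CUDA / TensorRT / cuDNN / NCCL / cuBLAS pins across the touched files.
    grep -nE 'TRT_VER|CUDA_VER|CUDNN_VER|NCCL_VER|CUBLAS_VER' docker/common/install_tensorrt.sh
    grep -nE 'CUDA_VER' docker/common/install_cuda_toolkit.sh
    grep -nE 'tensorrt|torch|nvidia-nccl|transformers' requirements.txt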

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description check — ⚠️ Warning: The PR description is entirely template boilerplate with no substantive content. All required sections (Description, Test Coverage) are empty; only a checked checklist box is present without actual details. Resolution: fill in the Description section explaining which dependencies were updated and why, and complete the Test Coverage section listing the tests that validate the dependency updates.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed: The title 'Update dependencies' directly and accurately summarizes the main change—updating multiple dependency versions across Docker files and requirements.txt.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a3a66b and b065e62.

📒 Files selected for processing (6)
  • docker/Dockerfile.multi (1 hunks)
  • docker/common/install_cuda_toolkit.sh (1 hunks)
  • docker/common/install_polygraphy.sh (1 hunks)
  • docker/common/install_pytorch.sh (1 hunks)
  • docker/common/install_tensorrt.sh (1 hunks)
  • requirements.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: farshadghodsian
Repo: NVIDIA/TensorRT-LLM PR: 7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-10-17T13:21:31.724Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 8398
File: tensorrt_llm/_torch/pyexecutor/sampling_utils.py:237-272
Timestamp: 2025-10-17T13:21:31.724Z
Learning: The setup.py file in TensorRT-LLM explicitly requires Python 3.10+ via `python_requires=">=3.10, <4"`, making match/case statements and other Python 3.10+ features appropriate throughout the codebase.

Applied to files:

  • requirements.txt
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • requirements.txt
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

  • requirements.txt
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

  • requirements.txt
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
Repo: NVIDIA/TensorRT-LLM PR: 7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • requirements.txt
  • docker/common/install_tensorrt.sh
📚 Learning: 2025-08-20T15:04:42.885Z
Learnt from: dbari
Repo: NVIDIA/TensorRT-LLM PR: 7095
File: docker/Dockerfile.multi:168-168
Timestamp: 2025-08-20T15:04:42.885Z
Learning: In docker/Dockerfile.multi, wildcard COPY for benchmarks (${CPP_BUILD_DIR}/benchmarks/*Benchmark) is intentionally used instead of directory copy because the benchmarks directory contains various other build artifacts during C++ builds, and only specific benchmark executables should be copied to the final image.

Applied to files:

  • docker/Dockerfile.multi
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
docker/Dockerfile.multi (2)

5-5: The Triton container image nvcr.io/nvidia/tritonserver:25.11-py3 is published and available on the NGC registry. The TRITON_BASE_TAG=25.11-py3 update is valid.


2-4: MAJOR: Base image switched from public NGC registry to internal GitLab registry.

Line 2 changes the BASE_IMAGE from the public NVIDIA NGC registry (nvcr.io/nvidia/pytorch) to an internal NVIDIA GitLab registry (gitlab-master.nvidia.com:5005/dl/dgx/pytorch). This prevents external users and CI/CD systems from accessing the image, breaks reproducibility, and contradicts the established practice that TensorRT-LLM should reference only published NGC container images.

Additionally, Line 4 BASE_TAG includes an internal commit hash (py3.39451658-devel), indicating a development snapshot rather than a stable release.

Confirm whether this is intentional for the main branch, and if a public release is intended, identify and use the corresponding published NGC image instead.

requirements.txt (3)

28-28: NCCL 2.28.9 is available on PyPI.

nvidia-nccl-cu13 version 2.28.9 is confirmed available on PyPI (released Nov 18, 2025) with wheels for x86_64 and aarch64 architectures. No compatibility concerns identified.


22-22: TensorRT 10.14.1 is available on PyPI (tensorrt-10.14.1.48.post1) and is compatible with CUDA 13.1. No compatibility issues identified.


30-30: No action needed — transformers 4.57.1 is available and compatible.

transformers 4.57.1 is available on PyPI (released Oct 14, 2025) and compatible with PyTorch 2.9.1. The package is tested against PyTorch 2.2+, and practical usage confirms it works reliably with torch ≥ 2.3. The version bump from 4.56.0 to 4.57.1 is safe.

docker/common/install_cuda_toolkit.sh
  # CUDA version is usually aligned with the latest NGC CUDA image tag.
  # Only use when public CUDA image is not ready.
- CUDA_VER="13.0.2_580.95.05"
+ CUDA_VER="13.1.0_590.44.01"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

CUDA 13.1.0 driver 590.44.01 NVIDIA availability compatibility

💡 Result:

Short answer: Yes — NVIDIA driver 590.44.01 (beta, released Dec 2, 2025) is a newer driver branch than the R580 baseline required for CUDA 13.x, so it should be compatible with CUDA 13.1.0. (docs.nvidia.com)

Notes:

  • CUDA 13.x requires a driver >= the R580 series (minimum driver ~580.xx); any later driver branch (including 590) is backward‑compatible for running CUDA 13.x applications. (docs.nvidia.com)
  • 590.44.01 is a Linux beta release (Dec 2, 2025) and its release notes / community reports indicate changes to supported hardware (some older GTX 9/10-series behavior on Linux). If you use older Maxwell/Pascal GPUs, test carefully before upgrading. (nvidia.com)

If you want, I can: (a) check the exact CUDA 13.1 toolkit release notes for any additional driver-minimums, or (b) verify whether a specific GPU model you have is listed as supported with the 590.44.01 driver.



CUDA 13.1.0 and driver 590.44.01 are compatible and available.

Both versions exist and are legitimate NVIDIA releases. CUDA 13.x requires driver >= R580 series, and driver 590.44.01 is backward-compatible. However, note that 590.44.01 is a beta driver (released Dec 2, 2025) with hardware compatibility changes for older Maxwell/Pascal GPUs—test thoroughly before upgrading on systems using legacy GPU models.

🤖 Prompt for AI Agents
In docker/common/install_cuda_toolkit.sh around line 8, the script sets
CUDA_VER="13.1.0_590.44.01" but lacks documentation about driver compatibility
and the fact that 590.44.01 is a beta release with potential hardware impacts;
add an inline comment above this variable stating that CUDA 13.1 requires driver
R580+ and that driver 590.44.01 is a beta release (Dec 2, 2025) which may affect
older Maxwell/Pascal GPUs and should be tested on legacy hardware before
deployment, so operators are warned to validate hardware compatibility and
consider pinning a stable driver if needed.
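A minimal sketch of that annotation in docker/common/install_cuda_toolkit.sh (comment wording is illustrative):

    # NOTE: CUDA 13.1 requires an R580+ driver. 590.44.01 is a beta driver (Dec 2, 2025)
    # that changes support for some older Maxwell/Pascal GPUs; validate on legacy hardware
    # before deploying, or pin a stable R580-series driver if needed.
    CUDA_VER="13.1.0_590.44.01"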

docker/common/install_pytorch.sh
  # and closest to the version specified in
  # https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-10.html#rel-25-10
- TORCH_VERSION="2.9.0"
+ TORCH_VERSION="2.9.1"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

PyTorch 2.9.1 PyPI availability CUDA 13.1 compatibility

💡 Result:

Short answer: Yes — PyTorch 2.9.1 is published to PyPI (release files uploaded Nov 12, 2025). Official prebuilt CUDA wheels for 2.9.1 on download.pytorch.org are provided for CUDA 13.0 (cu130) — CUDA 13.1 / cu131 is not listed as an official wheel. [1][2][3]

Sources:

  • PyTorch 2.9.1 GA announcement (promotions to PyPI/download done). [1]
  • PyTorch wheel index showing torch-2.9.1+cu130 files. [2]
  • torch (PyPI) release files (torch‑2.9.1 upload entries). [3]

Address CUDA 13.1 compatibility gap for PyTorch 2.9.1.

PyTorch 2.9.1 is available on PyPI but official prebuilt wheels are only provided for CUDA 13.0 (cu130). CUDA 13.1 (cu131) is not officially supported. If this script targets CUDA 13.1, either use the cu130 wheel (with caution), downgrade to a version with CUDA 13.1 support, or add a note documenting this limitation.

🤖 Prompt for AI Agents
docker/common/install_pytorch.sh around line 8: TORCH_VERSION is set to "2.9.1"
but PyTorch 2.9.1 does not have official cu131 wheels; update the script to
explicitly pick a compatible CUDA wheel or document the limitation. Either (A)
set TORCH_VERSION and the corresponding CUDA tag to an officially supported
pairing (e.g., keep 2.9.1 but force the cu130 wheel tag when installing), or (B)
change TORCH_VERSION to a release that officially supports CUDA 13.1, and add a
short comment above the variable explaining which CUDA tag is targeted and that
cu131 has no official 2.9.1 wheel. Ensure the install command uses the matching
CUDA wheel tag and add the comment so future maintainers know the compatibility
choice.
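If option (A) is chosen, a hedged sketch of the install step could look like this (it assumes the script installs wheels via pip3; the actual install line in install_pytorch.sh may differ):

    # No official cu131 wheel exists for 2.9.1, so explicitly target the cu130 wheel index.
    TORCH_VERSION="2.9.1"
    pip3 install "torch==${TORCH_VERSION}" --index-url https://download.pytorch.org/whl/cu130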

docker/common/install_tensorrt.sh
Comment on lines 5 to 18
  TRT_VER="10.14.1.48"
  # Align with the pre-installed cuDNN / cuBLAS / NCCL versions from
- # https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-10.html#rel-25-10
- CUDA_VER="13.0" # 13.0.2
+ # https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-12.html#rel-25-12
+ CUDA_VER="13.1" # 13.1.0
  # Keep the installation for cuDNN if users want to install PyTorch with source codes.
  # PyTorch 2.x can compile with cuDNN v9.
- CUDNN_VER="9.14.0.64-1"
- NCCL_VER="2.27.7-1+cuda13.0"
- CUBLAS_VER="13.1.0.3-1"
+ CUDNN_VER="9.17.0.21-1"
+ NCCL_VER="2.28.9-1+cuda13.0"
+ CUBLAS_VER="13.2.0.9-1"
  # Align with the pre-installed CUDA / NVCC / NVRTC versions from
  # https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
- NVRTC_VER="13.0.88-1"
- CUDA_RUNTIME="13.0.96-1"
- CUDA_DRIVER_VERSION="580.95.05-1.el8"
+ NVRTC_VER="13.1.80-1"
+ CUDA_RUNTIME="13.1.80-1"
+ CUDA_DRIVER_VERSION="590.44.01-1.el8"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

NVIDIA CUDA 13.1 cuDNN 9.17.0 cuBLAS 13.2 compatibility

💡 Result:

Short answer: cuDNN 9.17.0 is officially supported on CUDA 13.1, but cuBLAS 13.2 is tied to CUDA 13.2 — so either use cuBLAS 13.1 with CUDA 13.1 + cuDNN 9.17.0, or upgrade the CUDA toolkit to 13.2 (and matching driver) if you want cuBLAS 13.2. [1][2][3]

Details / references:

  • cuDNN 9.17.0 support matrix — lists "cuDNN 9.17.0 for CUDA 13.x" and explicitly includes CUDA 13.1 (recommended config: cuDNN 9.17.0 with CUDA 13.1). [1]
  • cuBLAS is shipped as part of the CUDA Toolkit releases (cuBLAS 13.2 corresponds to CUDA 13.2 toolchain). Mixing a cuBLAS built for 13.2 with a CUDA 13.1 toolkit is not the recommended configuration; use the cuBLAS that matches your CUDA toolkit or upgrade CUDA. See CUDA/cuBLAS release notes and toolkit components. [2][3]

If you want, tell me which exact OS and driver version you plan to use and I’ll give the precise driver/CUDA/cuBLAS/cuDNN combo to install.

Sources:
[1] NVIDIA cuDNN 9.17.0 support matrix.
[2] NVIDIA CUDA Toolkit / cuBLAS release notes (CUDA 13.x).
[3] NVIDIA Frameworks / container support matrix (shows cuBLAS versions tied to CUDA).


🌐 Web query:

NVIDIA CUDA 13.1 package repository cuDNN 9.17.0.21-1

💡 Result:

You can install cuDNN 9.17 (for CUDA 13.1) from NVIDIA's CUDA package repository. Steps (Ubuntu/Debian example) — replace ${OS} with ubuntu2204|ubuntu2004|debian11 as appropriate:

  1. Add NVIDIA repo and keys:
    wget https://developer.download.nvidia.com/compute/cuda/repos/${OS}/x86_64/cuda-${OS}.pin
    sudo mv cuda-${OS}.pin /etc/apt/preferences.d/cuda-repository-pin-600
    sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/${OS}/x86_64/3bf863cc.pub
    sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/${OS}/x86_64/ /"
    sudo apt-get update. [1][2]

  2. Install the cuDNN 9 (CUDA‑13) packages. Example (specify exact version if you need 9.17.0.21-1):
    sudo apt-get install libcudnn9=9.17.0.21-1+cuda13.1 libcudnn9-dev=9.17.0.21-1+cuda13.1
    or use the meta-package to get the latest cuDNN 9 for CUDA 13:
    sudo apt-get install cudnn9-cuda-13. [2][5]

Notes

  • Official cuDNN 9.17 supports CUDA 13.x (including 13.1) — see support matrix. [1]
  • If you use RHEL/Rocky/SUSE use the dnf/zypper instructions and package names (cudnn9-cuda-13 or libcudnn9-cuda-13). [2][5]
  • If the exact package build string you need (9.17.0.21-1) is not available via repo, download the archived packages directly from NVIDIA’s cuDNN redist/index pages and install locally. [4][6]

References
[1] NVIDIA cuDNN support matrix (cuDNN 9.17 for CUDA 13.x).
[2] NVIDIA cuDNN installation (Linux) — repo + apt/dnf/zypper install instructions.
[4] NVIDIA cuDNN redist/index (direct package archive).
[5] cuDNN base/meta package names (libcudnn9-cuda-13, cudnn9-cuda-13).


🌐 Web query:

NCCL 2.28.9 CUDA 13.1 compatibility

💡 Result:

Short answer: Yes — NCCL 2.28.9 supports CUDA 13.x (so it is compatible with CUDA 13.1). [1][2]

Sources:

  • NCCL 2.28.9 release notes (states "This NCCL release supports CUDA 12.x and CUDA 13.x"). [1]
  • NVIDIA frameworks/support matrix showing CUDA 13.x usage across NVIDIA DL components. [2]

Fix cuBLAS version mismatch with CUDA 13.1.

Line 13 specifies cuBLAS 13.2.0.9-1, but the script uses CUDA 13.1 (line 8). cuBLAS 13.2 is tied to CUDA 13.2; using it with CUDA 13.1 is an unsupported configuration. Either downgrade cuBLAS to 13.1.0.3-1 (or later 13.1.x) to match CUDA 13.1, or upgrade CUDA to 13.2 with a matching driver version.

The other updated versions are compatible: cuDNN 9.17.0.21-1 and NCCL 2.28.9 both support CUDA 13.1.

🤖 Prompt for AI Agents
In docker/common/install_tensorrt.sh around lines 5 to 18, CUBLAS_VER is set to
13.2.0.9-1 while CUDA_VER is 13.1 which is an unsupported mix; change the
CUBLAS_VER to a 13.1-compatible release (e.g., set CUBLAS_VER="13.1.0.3-1" or
another 13.1.x package) so cuBLAS matches CUDA 13.1, or alternatively bump
CUDA_VER (and any corresponding CUDA_RUNTIME/NVRTC/CUDA_DRIVER_VERSION) to 13.2
to match the existing cuBLAS—preferably update CUBLAS_VER to 13.1.0.3-1 to keep
CUDA at 13.1.
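A sketch of the preferred option (keep CUDA at 13.1 and pin a matching cuBLAS); the exact 13.1.x build string should be confirmed against the CUDA package repository:

    CUDA_VER="13.1"          # 13.1.0
    CUBLAS_VER="13.1.0.3-1"  # a 13.1.x cuBLAS matching CUDA 13.1, instead of 13.2.0.9-1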

requirements.txt Outdated
Comment on lines 23 to 24
  # https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-10.html#rel-25-10 uses 2.9.0a0.
- torch>=2.9.0a0,<=2.9.0
+ torch>=2.1.0a0,<=2.9.1

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's confirm the actual torch version specifications in requirements.txt
cat -n requirements.txt | head -30

Repository: NVIDIA/TensorRT-LLM

Length of output: 182


🏁 Script executed:

# Check install_tensorrt.sh to understand the PyTorch release notes reference
cat -n docker/common/install_tensorrt.sh | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 182


🏁 Script executed:

# Search for PyTorch version checks or compatibility requirements in the codebase
rg -i "torch.*version" --type py -A 3 | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 182


🏁 Script executed:

# Look for CUDA version compatibility specifications
rg -i "cuda.*13\.1\|cuda.*version" --type py --type cpp -A 2 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 182


🏁 Script executed:

# Check if there are CI configurations testing specific PyTorch versions
fd "\.yml|\.yaml" | xargs grep -l "torch" 2>/dev/null | head -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 182


🌐 Web query:

PyTorch 2.1 2.2 2.3 CUDA 13.1 compatibility support

💡 Result:

Short answer: Official PyTorch binaries for 2.1 / 2.2 / 2.3 do NOT include CUDA 13.1 (cu131) wheels. PyTorch 2.1–2.3 official releases provide CUDA 11.8 / 12.1 (and similar 12.x) builds — CUDA 13.x wheels are provided only in later PyTorch releases/nightlies. If you need CUDA 13.1 with one of those older PyTorch versions you must build PyTorch from source (or use a later PyTorch wheel compiled for cu131). [1][2][3]

References

  • PyTorch 2.3 RC/installer notes: official CUDA builds listed as CUDA 11.8 and 12.1 for 2.3. [1]
  • PyTorch Get Started / installer page shows the official supported CUDA build tags (stable/pip/conda) and up through 12.x for recent stable releases (no cu131 listed for 2.1–2.3). [2]
  • Community compatibility matrix / discussion showing 2.1–2.3 map to CUDA 12.1/11.8 (no cu131 wheels) and advice to build from source when you need a different local CUDA toolkit. [3]

If you want, tell me which PyTorch version and OS/package manager you use and I’ll give the exact build (pip/conda) command or step‑by‑step source-build instructions for CUDA 13.1.


🌐 Web query:

PyTorch 2.9 CUDA 13.1 support compatibility

💡 Result:

Short answer: PyTorch 2.9 does not officially ship a CUDA 13.1 wheel. PyTorch 2.9 added CUDA 13 support (CUDA 13.0 as a prototype/variant) while shipping stable wheels for CUDA 12.8 (and older 12.x); CUDA 13.0/variant wheels are the supported CUDA‑13 option for 2.9. [1][2][3]

If you need CUDA 13.1 specifically:

  • Use the CUDA 13.0 (cu130) PyTorch 2.9 wheel if that is acceptable. [2]
  • Otherwise you must build PyTorch 2.9 from source against your CUDA 13.1 toolkit (or use a newer PyTorch release that explicitly lists cu131 wheels). Building from source is the way to ensure compatibility when an official binary for a minor CUDA point‑release is not provided. [1][3]

Sources:

  • PyTorch 2.9 release notes / blog (CUDA 13 wheel/variant support). [1]
  • PyTorch previous‑versions / download instructions listing cu130/cu128/cu126 wheels for 2.9. [2]
  • PyTorch CUDA support RFC / release discussion (CUDA 13.0 marked prototype for 2.9). [3]

🌐 Web query:

PyTorch release notes CUDA version requirements 2024 2025

💡 Result:

Summary — PyTorch CUDA requirements (2024–2025)

  • 2024 (PyTorch 2.2 release): official PyTorch release notes / install guidance for the 2.2 timeframe list CUDA 11.8 and CUDA 12.x (12.1) as the primary supported binary runtimes for PyTorch 2.2. See the PyTorch 2.2 announcement and the Get Started installer guidance. [1][2]

  • 2024–2025 (later 2024 & 2025 builds/containers): PyTorch container images and NVIDIA-optimized builds moved to CUDA 12.x series (examples: CUDA 12.4.1 in the 24.05 image, CUDA 12.6.3 in 24.11, CUDA 12.8 in 25.01, and CUDA 12.9 in 25.04). For container/NGC builds verify the specific release’s CUDA toolkits and driver requirements. [3][4][5]

Notes:

  • PyTorch pip/Conda binaries are published for specific CUDA runtimes (e.g., cu118, cu121, etc.); drivers supporting a given CUDA toolkit are required but newer drivers are typically backward-compatible. If you need to run with a different local CUDA toolkit you can build from source. Check the PyTorch “Get Started / Locally” installer for the exact CUDA tags available for each PyTorch release. [2][3]

Sources

  1. PyTorch 2.2 release announcement. [PyTorch blog]. [1] (pytorch.org)
  2. PyTorch Get Started / Locally (installation matrix & CUDA choices). [PyTorch docs]. [2] (pytorch.org)
  3. NVIDIA PyTorch release notes — PyTorch Release 24.05 (CUDA 12.4.1). [3] (docs.nvidia.com)
  4. NVIDIA PyTorch release notes — PyTorch Release 24.11 (CUDA 12.6.3). [4] (docs.nvidia.com)
  5. NVIDIA PyTorch release notes — PyTorch Release 25.04 / 25.01 (CUDA 12.9 / 12.8). [5] (docs.nvidia.com)

If you want, I can list exact PyTorch binary CUDA tags (cu116/cu118/cu121/… ) for each PyTorch release (2024 and 2025) and the matching pip/conda install commands.



🌐 Web query:

PyTorch 2.4 2.5 2.6 2.7 2.8 CUDA 13 support official wheels

💡 Result:

Short answer — official PyTorch wheels with CUDA 13 were not provided for 2.4–2.8. PyTorch began publishing official CUDA‑13 wheel support with the 2.9 release (CUDA‑13 added to the wheel-variant matrix); 2.8 focused on CUDA 12.x (and experimental wheel‑variants) and earlier releases do not have CUDA‑13 wheels. [1][2][3][4]

Sources:

  • PyTorch 2.8 release (CUDA 12.x focus, wheel‑variants experimental). [1]
  • PyTorch 2.7 release notes (CUDA 12.8 wheels mentioned). [2]
  • PyTorch 2.9 release (explicitly expands wheel-variant support to include CUDA 13). [3]
  • Community/issue reports noting lack of CUDA‑13 wheels for earlier/nightly builds. [4]

References:
[1] https://pytorch.org/blog/pytorch-2-8/ (PyTorch 2.8)
[2] https://pytorch.org/blog/pytorch-2-7/ (PyTorch 2.7)
[3] https://pytorch.org/blog/pytorch-2-9/ (PyTorch 2.9: CUDA 13 added)
[4] pytorch/pytorch#167817 (community issue re: missing CUDA 13 wheels)


CRITICAL: Torch version range expansion is incompatible with CUDA 13.1 requirements.

The change from torch>=2.9.0a0,<=2.9.0 to torch>=2.1.0a0,<=2.9.1 creates an unresolvable dependency conflict. Official PyTorch wheels for versions 2.1–2.8 do not support CUDA 13.x at all (only CUDA 11.8/12.x). PyTorch 2.9, the first release with CUDA 13 support, only provides CUDA 13.0 (cu130) wheels, not CUDA 13.1. Since the requirements specify nvidia-nccl-cu13 (CUDA 13.1), users installing torch>=2.1.0a0 will receive CUDA 12.x wheels that are incompatible with the specified NCCL library. This breaks the installation for anyone not building PyTorch from source.

If the intent is to broaden PyTorch version support, the lower bound must remain at least >=2.9.0 to ensure CUDA 13.x compatibility. Alternatively, if older PyTorch versions are needed, the NCCL requirement must be adjusted to CUDA 12.x compatible versions, or documentation must explicitly state that source builds of PyTorch are required.

🤖 Prompt for AI Agents
In requirements.txt around lines 23-24, the expanded torch range
(torch>=2.1.0a0,<=2.9.1) creates an impossible combo with the repo's
nvidia-nccl-cu13 (CUDA 13.1) because official PyTorch wheels with CUDA 13
support start at 2.9.0 and only provide cu130, not cu131; fix by restoring a
compatible lower bound (e.g., set torch>=2.9.0,<=2.9.1 or torch==2.9.*) so
installed PyTorch has CUDA 13 support, or else change the NCCL package to a CUDA
12.x variant if you must allow torch 2.1–2.8, or if neither is acceptable, add a
clear comment in requirements and README stating that PyTorch must be built from
source for CUDA 13.1 and that the current pinned combos are incompatible.
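A sketch of the first suggested option for requirements.txt (the bounds follow the reviewer's proposal and are not otherwise verified here):

    # PyTorch 2.9 is the first release with CUDA 13 (cu130) wheels; keep the lower bound there.
    torch>=2.9.0,<=2.9.1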

@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Collaborator

PR_Github #27405 [ run ] triggered by Bot. Commit: b065e62

@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Collaborator

PR_Github #27436 [ run ] triggered by Bot. Commit: 66b5473

@tensorrt-cicd
Collaborator

PR_Github #27405 [ run ] completed with state ABORTED. Commit: b065e62
LLM/main/L0_MergeRequest_PR #20939 (Blue Ocean) completed with status: ABORTED

@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Collaborator

PR_Github #27462 [ run ] triggered by Bot. Commit: fdc2030

@tensorrt-cicd
Collaborator

PR_Github #27462 [ run ] completed with state DISABLED
L0 testing is limited to prioritized users. User EmmaQiaoCh is not in the prioritized list. L0 testing cannot be triggered.

@tensorrt-cicd
Collaborator

PR_Github #27436 [ run ] completed with state FAILURE. Commit: 66b5473
/LLM/main/L0_MergeRequest_PR pipeline #20964 (Partly Tested) completed with status: 'FAILURE'

@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Collaborator

PR_Github #27495 [ run ] triggered by Bot. Commit: fdc2030

@tensorrt-cicd
Collaborator

PR_Github #27495 [ run ] completed with state FAILURE. Commit: fdc2030
/LLM/main/L0_MergeRequest_PR pipeline #20975 (Partly Tested) completed with status: 'FAILURE'

@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Collaborator

PR_Github #27593 [ run ] triggered by Bot. Commit: c47fa37

@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Collaborator

PR_Github #27672 [ run ] triggered by Bot. Commit: 3996eef

@tensorrt-cicd
Collaborator

PR_Github #27593 [ run ] completed with state ABORTED. Commit: c47fa37
LLM/main/L0_MergeRequest_PR #21063 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #27672 [ run ] completed with state FAILURE. Commit: 3996eef
/LLM/main/L0_MergeRequest_PR pipeline #21125 (Partly Tested) completed with status: 'FAILURE'

@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Collaborator

PR_Github #27689 [ run ] triggered by Bot. Commit: 3996eef

@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Collaborator

PR_Github #27778 [ run ] triggered by Bot. Commit: 5ab653d

@tensorrt-cicd
Collaborator

PR_Github #29878 [ run ] triggered by Bot. Commit: 9fbe4c6

@tensorrt-cicd
Collaborator

PR_Github #29879 [ run ] triggered by Bot. Commit: 9fbe4c6

@tensorrt-cicd
Collaborator

PR_Github #29878 [ run ] completed with state ABORTED. Commit: 9fbe4c6

@EmmaQiaoCh
Collaborator Author

/bot run --post-merge --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #29903 [ run ] triggered by Bot. Commit: c7537d6

@tensorrt-cicd
Collaborator

PR_Github #29903 [ run ] completed with state SUCCESS. Commit: c7537d6
/LLM/main/L0_MergeRequest_PR pipeline #22998 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Collaborator

PR_Github #30133 [ run ] triggered by Bot. Commit: bfe1e20

@tensorrt-cicd
Collaborator

PR_Github #30133 [ run ] completed with state FAILURE. Commit: bfe1e20
/LLM/main/L0_MergeRequest_PR pipeline #23187 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Signed-off-by: qqiao <[email protected]>
@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Collaborator

PR_Github #30165 [ run ] triggered by Bot. Commit: aabca98

@tensorrt-cicd
Collaborator

PR_Github #30165 [ run ] completed with state FAILURE. Commit: aabca98
/LLM/main/L0_MergeRequest_PR pipeline #23213 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Collaborator

PR_Github #30177 [ run ] triggered by Bot. Commit: cb0057a

@tensorrt-cicd
Collaborator

PR_Github #30177 [ run ] completed with state FAILURE. Commit: cb0057a
/LLM/main/L0_MergeRequest_PR pipeline #23224 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Collaborator

PR_Github #30221 [ run ] triggered by Bot. Commit: 894b1bd

@tensorrt-cicd
Collaborator

PR_Github #30221 [ run ] completed with state FAILURE. Commit: 894b1bd
/LLM/main/L0_MergeRequest_PR pipeline #23265 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Signed-off-by: qqiao <[email protected]>
@EmmaQiaoCh
Collaborator Author

/bot run --post-merge --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30315 [ run ] triggered by Bot. Commit: bff85f4

@tensorrt-cicd
Collaborator

PR_Github #30315 [ run ] completed with state SUCCESS. Commit: bff85f4
/LLM/main/L0_MergeRequest_PR pipeline #23349 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again
