Commit d366c9a

ZhanruiSunCh authored and chzblych committed

[TRTLLM-8994][infra] upgrade to DLFW 25.10 and pytorch 2.9.0 / triton 3.5.0 (NVIDIA#8838)

Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: Yanchao Lu <[email protected]>
Co-authored-by: Yanchao Lu <[email protected]>
Signed-off-by: FredricZ-2007 <[email protected]>

1 parent 8777917, commit d366c9a
File tree

22 files changed: +103 / -255 lines

constraints.txt

Lines changed: 1 addition & 4 deletions
@@ -1,5 +1,2 @@
-# These vulnerabilities were inherited from the base image (pytorch:25.06-py3) and should be removed when the base image
+# These vulnerabilities were inherited from the base image (pytorch:25.10-py3) and should be removed when the base image
 # is updated.
-
-# WAR against https://github.com/advisories/GHSA-8qvm-5x2c-j2w7
-protobuf>=4.25.8

docker/Dockerfile.multi

Lines changed: 7 additions & 27 deletions
@@ -1,8 +1,9 @@
 # Multi-stage Dockerfile
 ARG BASE_IMAGE=nvcr.io/nvidia/pytorch
 ARG TRITON_IMAGE=nvcr.io/nvidia/tritonserver
-ARG BASE_TAG=25.08-py3
-ARG TRITON_BASE_TAG=25.08-py3
+ARG BASE_TAG=25.10-py3
+# [TODO] Update to NVIDIA Triton 25.10 when it's available
+ARG TRITON_BASE_TAG=25.09-py3
 ARG DEVEL_IMAGE=devel
 
 FROM ${BASE_IMAGE}:${BASE_TAG} AS base
@@ -40,6 +41,9 @@ COPY docker/common/install.sh \
     docker/common/install_polygraphy.sh \
     docker/common/install_mpi4py.sh \
     docker/common/install_pytorch.sh \
+    docker/common/install_ucx.sh \
+    docker/common/install_nixl.sh \
+    docker/common/install_etcd.sh \
     ./
 
 RUN GITHUB_MIRROR=${GITHUB_MIRROR} \
@@ -71,36 +75,15 @@ RUN GITHUB_MIRROR=${GITHUB_MIRROR} bash ./install.sh --mpi4py && rm install_mpi4
 ARG TORCH_INSTALL_TYPE="skip"
 RUN TORCH_INSTALL_TYPE=${TORCH_INSTALL_TYPE} bash ./install.sh --pytorch && rm install_pytorch.sh
 
-RUN bash ./install.sh --opencv && bash ./install.sh --protobuf && rm install.sh
-
-# wait for new triton to be published
-# Rename pytorch_triton package to triton
-RUN if [ -f /etc/redhat-release ]; then \
-        echo "Rocky8 detected, skipping symlink and ldconfig steps"; \
-    else \
-        cd /usr/local/lib/python3.12/dist-packages/ && \
-        ls -la | grep pytorch_triton && \
-        mv pytorch_triton-3.3.1+gitc8757738.dist-info triton-3.3.1+gitc8757738.dist-info && \
-        cd triton-3.3.1+gitc8757738.dist-info && \
-        echo "Current directory: $(pwd)" && \
-        echo "Files in directory:" && \
-        ls -la && \
-        sed -i 's/^Name: pytorch-triton/Name: triton/' METADATA && \
-        sed -i 's|pytorch_triton-3.3.1+gitc8757738.dist-info/|triton-3.3.1+gitc8757738.dist-info/|g' RECORD && \
-        echo "METADATA after update:" && \
-        grep "^Name:" METADATA; \
-    fi
+RUN bash ./install.sh --opencv && rm install.sh
 
 # Install UCX first
-COPY docker/common/install_ucx.sh install_ucx.sh
 RUN GITHUB_MIRROR=${GITHUB_MIRROR} bash ./install_ucx.sh && rm install_ucx.sh
 
 # Install NIXL
-COPY docker/common/install_nixl.sh install_nixl.sh
 RUN GITHUB_MIRROR=${GITHUB_MIRROR} bash ./install_nixl.sh && rm install_nixl.sh
 
 # Install etcd
-COPY docker/common/install_etcd.sh install_etcd.sh
 RUN bash ./install_etcd.sh && rm install_etcd.sh
 
 FROM ${TRITON_IMAGE}:${TRITON_BASE_TAG} AS triton
@@ -115,9 +98,6 @@ COPY --from=triton /opt/tritonserver/caches /opt/tritonserver/caches
 
 # Copy all installation scripts at once to reduce layers
 COPY docker/common/install_triton.sh \
-    docker/common/install_ucx.sh \
-    docker/common/install_nixl.sh \
-    docker/common/install_etcd.sh \
     ./
 
 RUN bash ./install_triton.sh && rm install_triton.sh

docker/Makefile

Lines changed: 4 additions & 3 deletions
@@ -192,16 +192,17 @@ jenkins-rockylinux8_%: PYTHON_VERSION_TAG_ID = $(if $(findstring 3.12,${PYTHON_V
 jenkins-rockylinux8_%: IMAGE_WITH_TAG = $(shell . ../jenkins/current_image_tags.properties && echo $$LLM_ROCKYLINUX8_${PYTHON_VERSION_TAG_ID}_DOCKER_IMAGE)
 jenkins-rockylinux8_%: STAGE = tritondevel
 jenkins-rockylinux8_%: BASE_IMAGE = nvcr.io/nvidia/cuda
-jenkins-rockylinux8_%: BASE_TAG = 13.0.0-devel-rockylinux8
+# [TODO] Update to NVIDIA CUDA 13.0.2 when it's available
+jenkins-rockylinux8_%: BASE_TAG = 13.0.1-devel-rockylinux8
 
 rockylinux8_%: STAGE = tritondevel
 rockylinux8_%: BASE_IMAGE = nvcr.io/nvidia/cuda
-rockylinux8_%: BASE_TAG = 13.0.0-devel-rockylinux8
+rockylinux8_%: BASE_TAG = 13.0.1-devel-rockylinux8
 
 # For x86_64 and aarch64
 ubuntu22_%: STAGE = tritondevel
 ubuntu22_%: BASE_IMAGE = nvcr.io/nvidia/cuda
-ubuntu22_%: BASE_TAG = 13.0.0-devel-ubuntu22.04
+ubuntu22_%: BASE_TAG = 13.0.1-devel-ubuntu22.04
 
 trtllm_%: STAGE = release
 trtllm_%: PUSH_TO_STAGING := 0

docker/common/install.sh

Lines changed: 0 additions & 13 deletions
@@ -16,7 +16,6 @@ polygraphy=0
 mpi4py=0
 pytorch=0
 opencv=0
-protobuf=0
 
 while [[ $# -gt 0 ]]; do
     case $1 in
@@ -56,10 +55,6 @@ while [[ $# -gt 0 ]]; do
         opencv=1
         shift 1
         ;;
-    --protobuf)
-        protobuf=1
-        shift 1
-        ;;
     --all)
         base=1
         cmake=1
@@ -70,7 +65,6 @@ while [[ $# -gt 0 ]]; do
         mpi4py=1
         pytorch=1
         opencv=1
-        protobuf=1
         shift 1
         ;;
     *)
@@ -135,10 +129,3 @@ if [ $opencv -eq 1 ]; then
     rm -rf /usr/local/lib/python3*/dist-packages/cv2/
     pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir
 fi
-
-# WARs against security issues inherited from pytorch:25.06
-# * https://github.com/advisories/GHSA-8qvm-5x2c-j2w7
-if [ $protobuf -eq 1 ]; then
-    pip3 install --upgrade --no-cache-dir \
-        "protobuf>=4.25.8"
-fi

docker/common/install_cuda_toolkit.sh

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ set -ex
 # This script is used for reinstalling CUDA on Rocky Linux 8 with the run file.
 # CUDA version is usually aligned with the latest NGC CUDA image tag.
 # Only use when public CUDA image is not ready.
-CUDA_VER="13.0.0_580.65.06"
+CUDA_VER="13.0.2_580.95.05"
 CUDA_VER_SHORT="${CUDA_VER%_*}"
 
 NVCC_VERSION_OUTPUT=$(nvcc --version)
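The `CUDA_VER_SHORT` line in this script derives the toolkit version via POSIX parameter expansion; a minimal sketch of how the `%_*` pattern strips the driver-version suffix:

```shell
# Sketch of the suffix-stripping used in install_cuda_toolkit.sh:
# "%_*" deletes the shortest trailing match of "_*", leaving the toolkit version.
CUDA_VER="13.0.2_580.95.05"
CUDA_VER_SHORT="${CUDA_VER%_*}"
echo "$CUDA_VER_SHORT"   # 13.0.2
```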

docker/common/install_mpi4py.sh

Lines changed: 9 additions & 2 deletions
@@ -27,12 +27,15 @@ diff --git a/src/mpi4py/futures/_lib.py b/src/mpi4py/futures/_lib.py
 index f14934d1..eebfb8fc 100644
 --- a/src/mpi4py/futures/_lib.py
 +++ b/src/mpi4py/futures/_lib.py
-@@ -278,6 +278,40 @@ def _manager_comm(pool, options, comm, full=True):
+@@ -278,6 +278,43 @@ def _manager_comm(pool, options, comm, full=True):
 
 
  def _manager_split(pool, options, comm, root):
 +    if(os.getenv("TRTLLM_USE_MPI_KVCACHE")=="1"):
-+        from cuda import cudart
++        try:
++            from cuda.bindings import runtime as cudart
++        except ImportError:
++            from cuda import cudart
 +        has_slurm_rank=False
 +        has_ompi_rank=False
 +        slurm_rank=0
@@ -71,6 +74,10 @@ index f14934d1..eebfb8fc 100644
 EOF
 
 # Install with pip and clean up cache
+ARCH=$(uname -m)
+if [ "$ARCH" = "aarch64" ]; then
+    pip3 install --no-cache-dir Cython==0.29.37
+fi
 pip3 install --no-cache-dir "$TMP_DIR/mpi4py-${MPI4PY_VERSION}"
 
 # Clean up
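The patched hunk above switches the CUDA runtime import to a try/except fallback: newer cuda-python releases expose the runtime as `cuda.bindings.runtime`, while older ones expose `cuda.cudart`. A generic sketch of that fallback pattern (the helper name `import_first` is illustrative, not part of the patch):

```python
import importlib


def import_first(*candidates):
    """Return the first module in `candidates` that imports cleanly.

    Same idea as the patch's try/except: prefer the new binding path,
    fall back to the legacy one.
    """
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate importable: " + "; ".join(errors))


# In the patched mpi4py this corresponds to:
#   cudart = import_first("cuda.bindings.runtime", "cuda.cudart")
```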

docker/common/install_pytorch.sh

Lines changed: 4 additions & 4 deletions
@@ -4,8 +4,8 @@ set -ex
 
 # Use latest stable version from https://pypi.org/project/torch/#history
 # and closest to the version specified in
-# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-08.html#rel-25-08
-TORCH_VERSION="2.8.0"
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-10.html#rel-25-10
+TORCH_VERSION="2.9.0"
 SYSTEM_ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
 
 prepare_environment() {
@@ -69,8 +69,8 @@ install_from_pypi() {
     if [ "$ARCH" = "amd64" ];then ARCH="x86_64";fi
     if [ "$ARCH" = "aarch64" ];then ARCH="sbsa";fi
 
-    pip3 uninstall -y torch torchvision torchaudio
-    pip3 install torch==${TORCH_VERSION} torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+    pip3 uninstall -y torch torchvision
+    pip3 install torch==${TORCH_VERSION} torchvision --index-url https://download.pytorch.org/whl/cu130
 }
 
 case "$1" in

docker/common/install_tensorrt.sh

Lines changed: 8 additions & 11 deletions
@@ -2,23 +2,20 @@
 
 set -ex
 
-TRT_VER="10.13.2.6"
+TRT_VER="10.13.3.9"
 # Align with the pre-installed cuDNN / cuBLAS / NCCL versions from
-# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-08.html#rel-25-08
-CUDA_VER="13.0" # 13.0.0
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-10.html#rel-25-10
+CUDA_VER="13.0" # 13.0.2
 # Keep the installation for cuDNN if users want to install PyTorch with source codes.
 # PyTorch 2.x can compile with cuDNN v9.
-CUDNN_VER="9.12.0.46-1"
-# NCCL version 2.26.x used in the NGC PyTorch 25.05 image but has a performance regression issue.
-# Use NCCL version 2.27.5 which has the fixes.
+CUDNN_VER="9.14.0.64-1"
 NCCL_VER="2.27.7-1+cuda13.0"
-# Use cuBLAS version 13.0.0.19 instead.
-CUBLAS_VER="13.0.0.19-1"
+CUBLAS_VER="13.1.0.3-1"
 # Align with the pre-installed CUDA / NVCC / NVRTC versions from
 # https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
-NVRTC_VER="13.0.48-1"
-CUDA_RUNTIME="13.0.48-1"
-CUDA_DRIVER_VERSION="580.65.06-1.el8"
+NVRTC_VER="13.0.88-1"
+CUDA_RUNTIME="13.0.96-1"
+CUDA_DRIVER_VERSION="580.95.05-1.el8"
 
 for i in "$@"; do
     case $i in
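The pins above mix several formats: plain dotted versions, package revisions like `-1`, and CUDA-qualified suffixes like `+cuda13.0`. When auditing an upgrade of this kind it can help to split a pin into its parts; a small illustrative parser (the function name and regex are assumptions for this sketch, not part of the install script):

```python
import re


def split_pin(ver):
    """Split a version pin such as '2.27.7-1+cuda13.0' into
    (upstream, package_revision, cuda_suffix); missing parts are None."""
    m = re.fullmatch(r"([\d.]+)(?:-([\w.]+?))?(?:\+cuda([\d.]+))?", ver)
    if not m:
        raise ValueError(f"unrecognized pin: {ver}")
    return m.group(1), m.group(2), m.group(3)
```

For example, `split_pin("2.27.7-1+cuda13.0")` separates the NCCL upstream version from its revision and CUDA qualifier, which makes cross-checking against release notes less error-prone than eyeballing the raw strings.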

docs/source/installation/build-from-source-linux.md

Lines changed: 0 additions & 5 deletions
@@ -147,11 +147,6 @@ check <https://github.com/NVIDIA/TensorRT-LLM/tree/main/docker>.
 
 ## Build TensorRT LLM
 
-```{tip}
-:name: build-from-source-tip-cuda-version
-TensorRT LLM 1.1 supports both CUDA 12.9 and 13.0 while some dependency changes are required. The `requirements.txt` contains dependencies needed by CUDA 13.0. If you are using CUDA 12.9, please uncomment lines end with `# <For CUDA 12.9>` and comment out the next lines.
-```
-
 ### Option 1: Full Build with C++ Compilation
 
 The following command compiles the C++ code and packages the compiled libraries along with the Python files into a wheel. When developing C++ code, you need this full build command to apply your code changes.

docs/source/installation/linux.md

Lines changed: 2 additions & 9 deletions
@@ -12,23 +12,16 @@
 Install CUDA Toolkit following the [CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/) and
 make sure `CUDA_HOME` environment variable is properly set.
 
-```{tip}
-:name: installation-linux-tip-cuda-version
-TensorRT LLM 1.1 supports both CUDA 12.9 and 13.0. The wheel package release only supports CUDA 12.9, while CUDA 13.0 is only supported through NGC container release.
-```
-
 ```bash
-# Optional step: Only required for NVIDIA Blackwell GPUs and SBSA platform
-pip3 install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+# By default, PyTorch CUDA 12.8 package is installed. Install PyTorch CUDA 13.0 package to align with the CUDA version used for building TensorRT LLM wheels.
+pip3 install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
 
 sudo apt-get -y install libopenmpi-dev
 
 # Optional step: Only required for disagg-serving
 sudo apt-get -y install libzmq3-dev
 ```
 
-PyTorch CUDA 12.8 package is required for supporting NVIDIA Blackwell GPUs and SBSA platform. On prior GPUs or Linux x86_64 platform, this extra installation is not required.
-
 ```{tip}
 Instead of manually installing the preqrequisites as described
 above, it is also possible to use the pre-built [TensorRT LLM Develop container
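After installing the CUDA 13.0 torch wheel as the updated docs describe, a quick sanity check that torch is present and which CUDA line it was built against can look like this (an illustrative sketch; `torch.version.cuda` is the CUDA build string reported by the installed wheel, and the helper returns None when torch is absent):

```python
import importlib.util


def torch_cuda_summary():
    """Return (torch_version, cuda_build) for the installed torch,
    or None if torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch  # imported lazily so the check degrades gracefully without torch

    return torch.__version__, torch.version.cuda


# For a cu130 wheel of torch 2.9.0 one would expect the CUDA build to
# report the 13.0 line; the exact strings depend on the wheel.
```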
