Commit 631850e

Merge branch 'main' into 5628848
2 parents 26d2e16 + 78bb245 commit 631850e

133 files changed: +1035 -492 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.<
 [![python](https://img.shields.io/badge/python-3.10-green)](https://www.python.org/downloads/release/python-31012/)
 [![cuda](https://img.shields.io/badge/cuda-13.0.0-green)](https://developer.nvidia.com/cuda-downloads)
 [![torch](https://img.shields.io/badge/torch-2.9.0-green)](https://pytorch.org)
-[![version](https://img.shields.io/badge/release-1.2.0rc7-green)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/version.py)
+[![version](https://img.shields.io/badge/release-1.2.0rc8-green)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/version.py)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/LICENSE)
 
 [Architecture](https://nvidia.github.io/TensorRT-LLM/developer-guide/overview.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Performance](https://nvidia.github.io/TensorRT-LLM/developer-guide/perf-overview.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](https://nvidia.github.io/TensorRT-LLM/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Roadmap](https://github.com/NVIDIA/TensorRT-LLM/issues?q=is%3Aissue%20state%3Aopen%20label%3Aroadmap)

Lines changed: 21 additions & 17 deletions
@@ -1,19 +1,23 @@
 # Feature Combination Matrix
 
-| Feature | Overlap Scheduler | CUDA Graph | Attention Data Parallelism | Disaggregated Serving | Chunked Prefill | MTP | EAGLE-3(One Model Engine) | EAGLE-3(Two Model Engine) | Torch Sampler | TLLM C++ Sampler | KV Cache Reuse | Slide Window Attention | Logits Post Processor | Guided Decoding | LoRA |
-| -------------------------- | ----------------- | ---------- | -------------------------- | --------------------- | --------------- | -------- | ------------------------- | ------------------------- | ------------- | ---------------- | -------------- | ---------------------- | --------------------- | --------------- | ---- |
-| Overlap Scheduler | --- | | | | | | | | | | | | | | |
-| CUDA Graph | Yes | --- | | | | | | | | | | | | | |
-| Attention Data Parallelism | Yes | Yes | --- | | | | | | | | | | | | |
-| Disaggregated Serving | Yes | Yes | Yes | --- | | | | | | | | | | | |
-| Chunked Prefill | Yes | Yes | Yes | Yes | --- | | | | | | | | | | |
-| MTP | Yes | Yes | Yes | Yes | Yes | --- | | | | | | | | | |
-| EAGLE-3(One Model Engine) | Yes | Yes | Yes | Yes | Yes | No | --- | | | | | | | | |
-| EAGLE-3(Two Model Engine) | Yes | Yes | Yes | Yes | Yes | No | No | --- | | | | | | | |
-| Torch Sampler | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | --- | | | | | | |
-| TLLM C++ Sampler | Yes | Yes | Yes | Yes | Yes | No | No | No | No | --- | | | | | |
-| KV Cache Reuse | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | --- | | | | |
-| Slide Window Attention | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | --- | | | |
-| Logits Post Processor | Yes | Yes | Yes | No | Yes | No | No | No | Yes | Yes | Yes | Yes | --- | | |
-| Guided Decoding | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | --- | |
-| LoRA | Yes | No | Untested | Untested | Untested | Untested | Untested | Untested | Yes | Yes | Yes | Yes | Yes | Untested | --- |
+| Feature | Overlap Scheduler | CUDA Graph | Tensor Parallelism | Pipeline Parallelism | Expert Parallelism | Helix Parallelism | Attention Data Parallelism | Disaggregated Serving | Chunked Prefill | MTP | EAGLE-3(One Model Engine) | EAGLE-3(Two Model Engine) | Torch Sampler | TLLM C++ Sampler | KV Cache Reuse | Slide Window Attention | Logits Post Processor | Guided Decoding | LoRA |
+| -------------------------- | ----------------- | ---------- | ------------------ | -------------------- | ------------------ | ----------------- | -------------------------- | --------------------- | --------------- | -------- | ------------------------- | ------------------------- | ------------- | ---------------- | -------------- | ---------------------- | --------------------- | --------------- | -------- |
+| Overlap Scheduler | --- | | | | | | | | | | | | | | | | | | |
+| CUDA Graph | Yes | --- | | | | | | | | | | | | | | | | | |
+| Tensor Parallelism | Yes | Yes | --- | | | | | | | | | | | | | | | | |
+| Pipeline Parallelism | Yes | Yes | Yes | --- | | | | | | | | | | | | | | | |
+| Expert Parallelism | Yes | Yes | Yes | Yes | --- | | | | | | | | | | | | | | |
+| Helix Parallelism | Untested | Yes | Yes | Yes | Yes | --- | | | | | | | | | | | | | |
+| Attention Data Parallelism | Yes | Yes | Yes | Yes | Yes | Known issues | --- | | | | | | | | | | | | |
+| Disaggregated Serving | Yes | Yes | Yes | Yes | Yes | Yes | Yes | --- | | | | | | | | | | | |
+| Chunked Prefill | Yes | Yes | Yes | Untested | Yes | Yes | Yes | Yes | --- | | | | | | | | | | |
+| MTP | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | --- | | | | | | | | | |
+| EAGLE-3(One Model Engine) | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | No | --- | | | | | | | | |
+| EAGLE-3(Two Model Engine) | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | No | No | --- | | | | | | | |
+| Torch Sampler | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | --- | | | | | | |
+| TLLM C++ Sampler | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | --- | | | | | |
+| KV Cache Reuse | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | --- | | | | |
+| Slide Window Attention | Yes | Yes | Yes | Yes | Yes | Untested | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | --- | | | |
+| Logits Post Processor | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | No | No | No | Yes | Yes | Yes | Yes | --- | | |
+| Guided Decoding | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | --- | |
+| LoRA | Yes | No | Yes | Yes | Untested | Untested | Untested | Untested | Yes | Untested | Untested | Untested | Yes | Yes | Yes | Yes | Yes | Untested | --- |
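
The matrix is lower-triangular: each pair of features appears once, in the row of whichever feature is listed later. A minimal sketch of a lookup that works in either argument order, assuming only a hand-copied subset of the added rows; `COMBINATIONS` and `combination_status` are hypothetical names for illustration and are not part of TensorRT-LLM.

```python
# Hand-copied subset of the added rows of the matrix (not the full table).
# Keys are unordered pairs, so the lookup is independent of argument order.
COMBINATIONS = {
    frozenset({"CUDA Graph", "Overlap Scheduler"}): "Yes",
    frozenset({"Helix Parallelism", "Overlap Scheduler"}): "Untested",
    frozenset({"Attention Data Parallelism", "Helix Parallelism"}): "Known issues",
    frozenset({"MTP", "Pipeline Parallelism"}): "No",
    frozenset({"Chunked Prefill", "Pipeline Parallelism"}): "Untested",
    frozenset({"LoRA", "CUDA Graph"}): "No",
}


def combination_status(feature_a, feature_b):
    """Return the matrix cell for a feature pair, regardless of order."""
    if feature_a == feature_b:
        # The diagonal of the matrix is marked with '---'.
        return "---"
    # Pairs outside the copied subset are reported explicitly, not guessed.
    return COMBINATIONS.get(frozenset({feature_a, feature_b}), "not in this subset")


print(combination_status("Pipeline Parallelism", "MTP"))
print(combination_status("Attention Data Parallelism", "Helix Parallelism"))
```

Storing each pair as a `frozenset` avoids having to remember which feature owns the row; a full version would load every row of the table instead of this subset.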

docs/source/models/supported-models.md

Lines changed: 2 additions & 1 deletion
@@ -40,10 +40,11 @@ Note: Support for other models may vary. Features marked "N/A" are not applicabl
 | `Qwen3MoeForCausalLM` | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | N/A | Yes | Yes |
 | `Qwen3NextForCausalLM` | Yes | Yes | No | Untested | Yes | No | No | No | Yes | Yes | No | No | Untested | Untested |
 | `Llama4ForConditionalGeneration` | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Untested | N/A | Yes | Yes |
-| `GptOssForCausalLM` | Yes | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | No | N/A | Yes | Yes |
+| `GptOssForCausalLM` | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes [^3] | Yes | Yes | Yes | N/A | Yes | Yes |
 
 [^1]: Chunked Prefill for MLA can only be enabled on SM100/SM103.
 [^2]: KV cache reuse for MLA can only be enabled on SM90/SM100/SM103 and in BF16/FP8 KV cache dtype.
+[^3]: Overlap scheduler isn't supported when using EAGLE-3(Two Model Engine) for GPT-OSS.
 
 
 # Multimodal Feature Support Matrix (PyTorch Backend)

examples/constraints.txt

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-tensorrt_llm==1.2.0rc7
+tensorrt_llm==1.2.0rc8
 evaluate~=0.4.1
 rouge_score~=0.1.2

examples/layer_wise_benchmarks/run.py

Lines changed: 4 additions & 1 deletion
@@ -10,10 +10,11 @@
 import yaml
 
 from tensorrt_llm._torch.autotuner import AutoTuner, autotune
+from tensorrt_llm._torch.distributed import MPIDist, TorchDist
 from tensorrt_llm._torch.modules.fused_moe.fused_moe_cutlass import CutlassFusedMoE
 from tensorrt_llm._torch.modules.fused_moe.interface import AlltoallMethodType
 from tensorrt_llm._torch.modules.multi_stream_utils import with_multi_stream
-from tensorrt_llm._utils import local_mpi_rank, mpi_rank, mpi_world_size
+from tensorrt_llm._utils import local_mpi_rank, mpi_disabled, mpi_rank, mpi_world_size
 from tensorrt_llm.logger import logger
 from tensorrt_llm.tools.layer_wise_benchmarks import BalanceMethod, get_runner_cls, mark_ranges
 
@@ -173,6 +174,8 @@ def comma_separated_floats(s):
 )
 if args.enable_autotuner:
     cache_path = os.getenv("TLLM_AUTOTUNER_CACHE_PATH") or None
+    dist = TorchDist(mapping=mapping) if mpi_disabled() else MPIDist(mapping=mapping)
+    AutoTuner.get().setup_distributed_state(mapping, dist)
     with autotune(cache_path=cache_path):
         run_pack()
 else:
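
The second hunk picks a distributed helper based on whether MPI is disabled and hands it to the autotuner before tuning starts. A minimal sketch of that selection pattern, using stand-in classes in place of the real `tensorrt_llm` ones: `StubMapping`, `StubTorchDist`, `StubMPIDist`, and `select_dist` are hypothetical names introduced for illustration, and the boolean argument stands in for the result of `mpi_disabled()`.

```python
from dataclasses import dataclass


@dataclass
class StubMapping:
    """Hypothetical stand-in for the mapping object used in run.py."""
    world_size: int = 1
    rank: int = 0


class StubTorchDist:
    """Hypothetical stand-in for TorchDist; only records the mapping."""

    def __init__(self, mapping):
        self.mapping = mapping


class StubMPIDist:
    """Hypothetical stand-in for MPIDist; only records the mapping."""

    def __init__(self, mapping):
        self.mapping = mapping


def select_dist(mapping, mpi_is_disabled):
    """Mirror the conditional in the hunk: Torch helper if MPI is disabled, MPI helper otherwise."""
    return StubTorchDist(mapping) if mpi_is_disabled else StubMPIDist(mapping)


mapping = StubMapping(world_size=2, rank=0)
print(type(select_dist(mapping, mpi_is_disabled=True)).__name__)
print(type(select_dist(mapping, mpi_is_disabled=False)).__name__)
```

Both helpers receive the same mapping, so the caller that consumes the result does not need to know which backend was chosen.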

examples/models/core/mistral_large_3/README.md

Lines changed: 2 additions & 1 deletion
@@ -19,7 +19,8 @@ mpirun -n 1 --allow-run-as-root --oversubscribe python3 examples/llm-api/quickst
 --max_tokens 100 \
 --checkpoint_format mistral \
 --model_type mistral_large_3 \
---moe_backend TRTLLM
+--moe_backend TRTLLM \
+--image_format pil
 ```
 
 ## LLM-only run

jenkins/L0_Test.groovy

Lines changed: 1 addition & 1 deletion
@@ -808,7 +808,7 @@ def getPytestBaseCommandLine(
 portEnvVars,
 pytestUtil,
 "pytest",
-"-v",
+"-vv",
 testFilter[(DETAILED_LOG)] ? "-s" : "",
 "--timeout-method=thread",
 "--apply-test-list-correction",

security_scanning/docs/poetry.lock

Lines changed: 3 additions & 3 deletions
Generated file; diff not rendered.
