`docs/source/features/sampling.md` (4 additions, 28 deletions)
````diff
@@ -16,30 +16,6 @@ The PyTorch backend supports a wide variety of features, listed below:
 
 ## General usage
 
-There are two sampling backends available.
-
-* Torch Sampler
-* TRTLLM Sampler
-
-Torch Sampler currently supports a superset of features of TRTLLM Sampler, and is intended as the long-term solution. One can specify which sampler to use explicitly with:
…
-By default, the sampling backend is chosen to be `auto`. This will use:
-
-* TRTLLM Sampler when using Beam Search.
-* Torch Sampler otherwise.
-
 Here is an example to run a model with basic usage of sampling parameters. This example prepares two identical prompts which will give different results due to the sampling parameters chosen:
 
 ```python
@@ -73,7 +49,7 @@ llm.generate(["Hello, my name is",
              sampling_params_1])
 ```
 
-### LLM API sampling behavior when using Torch Sampler
+### LLM API sampling behavior
 
 * The sampling is controlled via `SamplingParams`.
@@ -105,17 +81,17 @@ llm.generate(["Hello, my name is",
 
 ### Performance
 
-The Torch Sampler leverages the optimized sampling kernels provided by
+The sampler leverages the optimized sampling kernels provided by
 [FlashInfer](https://docs.flashinfer.ai/api/sampling.html). The sampler
 also uses the [sorting-free implementations](https://flashinfer.ai/2025/03/10/sampling.html)
 whenever possible. This optimization does not compute the complete set of token sampling probabilities
 (after top-k / top-p masking etc.), which typically can be omitted unless requested by the user or
 required for speculative decoding (rejection sampling).
-In case of unexpected problems, the use of FlashInfer in Torch Sampler can
+In case of unexpected problems, the use of FlashInfer in the sampler can
 be disabled via the `disable_flashinfer_sampling` config option (note that this option is likely
 to be removed in a future TensorRT LLM release).
 
-Moreover, Torch Sampler internally batches requests with compatible sampling parameters. This
+Moreover, the sampler internally batches requests with compatible sampling parameters. This
 can greatly reduce the overall latency of the sampling step when request batches are comprised
 of requests with very heterogeneous sampling strategies (e.g. a mix of requests using greedy and top-p-after-top-k sampling).
````
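The "top-p-after-top-k" strategy mentioned in the hunk above can be illustrated with a small, self-contained sketch. This is not TensorRT LLM's FlashInfer-backed implementation; the function name and toy logits below are illustrative assumptions, showing only the masking step the docs refer to:

```python
import math

def top_k_top_p_filter(logits, top_k, top_p):
    """Illustrative top-k-then-top-p masking (not TensorRT LLM's kernels).

    Keeps the top_k highest logits, then restricts those to the smallest
    prefix whose softmax probabilities sum to >= top_p. Returns the list
    of token ids that remain sampleable, highest-probability first.
    """
    # Rank token ids by logit, highest first, and keep the top_k of them.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the surviving logits only (max-subtracted for stability).
    mx = max(logits[i] for i in ranked)
    exps = [math.exp(logits[i] - mx) for i in ranked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus (top-p) cut: smallest prefix with cumulative probability >= top_p.
    kept, cum = [], 0.0
    for tok, p in zip(ranked, probs):
        kept.append(tok)
        cum += p
        if cum >= top_p:
            break
    return kept
```

An actual sampler would then draw a token from the renormalized probabilities over the surviving ids; the sorting-free FlashInfer kernels mentioned above avoid materializing this full masked distribution.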
`tensorrt_llm/llmapi/llm_args.py` (3 additions, 2 deletions)
````diff
@@ -3014,8 +3014,9 @@ class TorchLlmArgs(BaseLlmArgs):
     sampler_type: Union[str, SamplerType] = Field(
         default=SamplerType.auto,
         description=
-        "The type of sampler to use. Options are TRTLLMSampler, TorchSampler or auto. Defaults to auto, which will use TorchSampler unless BeamSearch is requested.",
-        status="beta")
+        "The type of sampler to use. Options are TRTLLMSampler, TorchSampler or auto. Defaults to auto, which will use TorchSampler. "
+        "TRTLLMSampler is deprecated and will be removed in release 1.4.",
+        status="beta")
````