Commit 301b78b

Add documentation for eagle3+disagg+dynamo (NVIDIA#6072)

Signed-off-by: Iman Tabrizian <[email protected]>

1 parent: e30d7be

File tree

1 file changed (+5 −0 lines)

docs/source/advanced/speculative-decoding.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -10,6 +10,7 @@
 - [Limitations](#limitations)
 - [ReDrafter](#redrafter)
 - [EAGLE](#eagle)
+- [Disaggregated Serving](#disaggregated-serving)
 - [Lookahead decoding](#lookahead-decoding)

 ## About Speculative Sampling
```
```diff
@@ -169,6 +170,10 @@ The EAGLE approach enhances the single-model Medusa method by predicting and ver

 Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine. EAGLE-1 and EAGLE-2 are both supported, while EAGLE-2 is currently in the experimental stage. Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.

+### Disaggregated Serving
+
+[Disaggregated Serving](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/disaggregated-service.md) with EAGLE3 using the two-model approach is supported in the PyTorch backend. Please refer to the following [Dynamo example](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/llama4_plus_eagle.md) on how to run EAGLE3 with Disaggregated Serving for Llama 4 Maverick.
+
 ## Lookahead Decoding

 Lookahead decoding operates through two parallel computation branches within the same model: a lookahead branch that generates n-grams using a fixed-size 2D window, and a verification branch that validates promising n-gram candidates. This approach eliminates the need for additional model training or fine-tuning and can be enabled for any autoregressive model. Refer to the [Lookahead decoding README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/lookahead/README.md) for information about building and running the model.
```
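The paragraph this commit adds names the requirements (EAGLE3, two-model approach, PyTorch backend) but leaves the configuration implicit. As a rough illustration only — the option names below are assumptions modeled on TensorRT-LLM's extra LLM-API speculative-decoding options, not content from this commit — the two-model EAGLE3 setup passed to each disaggregated server boils down to options of this shape:

```python
# Hypothetical sketch, not part of this commit: the approximate shape of the
# speculative-decoding options a two-model EAGLE3 + disaggregated deployment
# would use. Keys and values are assumptions; check the linked TensorRT-LLM
# and Dynamo docs for the authoritative schema.
spec_config = {
    "decoding_type": "Eagle",   # select the EAGLE speculative decoder
    "max_draft_len": 4,         # draft tokens proposed per step (illustrative value)
    "speculative_model_dir": "/path/to/eagle3-draft-model",  # placeholder path
    "eagle3_one_model": False,  # False = two-model approach, per the added doc
}

# In a disaggregated deployment, both the context and generation servers
# would carry this same speculative configuration.
print(spec_config["decoding_type"])
```

Both servers sharing one speculative configuration keeps draft-token semantics consistent when KV cache is transferred between context and generation workers.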
