[None][fix] Use dynamic tree SpecTreeManager in kv_cache_relocation test + add docs

sunnyqgg · sunnyqgg · commit 34a3f35461e1 · 2026-03-31T21:38:52.000-07:00
Switch SpecTreeManager in test_llama_verification_with_kv_cache_relocation
from static tree (use_dynamic_tree=False) to dynamic tree mode: drop the
flat eagle_choices list (eagle_choices=None is passed instead) and set
is_spec_dec_dynamic_tree to follow the tree flag. Fixes a RuntimeError on
H100 (sm<100) where flat single-level eagle_choices produced empty
top_k_list tensors.
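
For reviewers unfamiliar with the choices format, the sketch below shows why
the old list was "flat". It assumes eagle_choices follows the usual
path-per-node convention (each entry is the list of top-k indices taken from
the root to that node); nodes_per_depth is a hypothetical helper written for
this note, not part of SpecTreeManager.

```python
from collections import Counter


def nodes_per_depth(choices):
    """Count tree nodes at each depth, treating each choice as a root-to-node path.

    Assumption: a path of length d describes a node at depth d.
    """
    return Counter(len(path) for path in choices)


max_draft = 4

# The list the test used to build: every path has length 1.
flat = [[i] for i in range(max_draft)]

# A linear chain of the same size under the same convention, for contrast.
chain = [[0] * depth for depth in range(1, max_draft + 1)]

print(dict(nodes_per_depth(flat)))   # {1: 4}
print(dict(nodes_per_depth(chain)))  # {1: 1, 2: 1, 3: 1, 4: 1}
```

Under that assumption the old list describes max_draft siblings at depth 1
and nothing deeper, which is consistent with the empty per-level tensors
described above.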

Also add EAGLE3 dynamic tree mode documentation to speculative-decoding.md
per reviewer request.
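
The new docs state the budget rule
max_draft_len <= max_total_draft_tokens <= dynamic_tree_max_topK * max_draft_len,
with the upper bound as the default. A minimal sketch of that rule follows;
resolve_total_draft_tokens is a hypothetical helper for illustration only and
is not the validation code in TensorRT-LLM.

```python
def resolve_total_draft_tokens(max_draft_len, top_k, max_total_draft_tokens=None):
    """Apply the documented bounds and default for the dynamic tree budget.

    Hypothetical helper: mirrors the rule stated in the docs, nothing more.
    """
    upper = top_k * max_draft_len
    if max_total_draft_tokens is None:
        # Documented default: the largest budget the tree can use.
        return upper
    if not (max_draft_len <= max_total_draft_tokens <= upper):
        raise ValueError(
            f"max_total_draft_tokens={max_total_draft_tokens} must be in "
            f"[{max_draft_len}, {upper}]"
        )
    return max_total_draft_tokens


# Values from the documentation example: 6 layers, top-K of 10, budget 60.
print(resolve_total_draft_tokens(6, 10, 60))  # 60

# The updated test passes max_total_draft_tokens == max_draft_len,
# which sits exactly on the lower bound.
print(resolve_total_draft_tokens(5, 10, 5))   # 5
```

The test change therefore exercises the smallest budget the documented rule
allows.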

Signed-off-by: qgai <qgai@nvidia.com>
diff --git a/docs/source/features/speculative-decoding.md b/docs/source/features/speculative-decoding.md
@@ -28,7 +28,7 @@ llm = LLM("/path/to/target_model", speculative_config=speculative_config, disabl
 ### EAGLE 3
 
 The EAGLE 3 algorithm is described in the paper [EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test](https://arxiv.org/pdf/2503.01840).
-TRT-LLM supports a modified version of the algorithm presented in the paper: tree structures for draft sequences are not supported. Instead, each request uses a single sequence of draft tokens with length `max_draft_len`.
+By default, each request uses a single sequence (linear chain) of draft tokens with length `max_draft_len`. Optionally, dynamic tree draft generation can be enabled to improve acceptance rates — see [Dynamic Tree Mode](#dynamic-tree-mode) below.
 
 The following draft model checkpoints can be used for EAGLE 3:
 * Llama 3 variants: [use the checkpoints from the authors of the original EAGLE 3 paper](https://huggingface.co/yuhuili).
@@ -50,6 +50,36 @@ llm = LLM(model, speculative_config=speculative_config)
 
 EAGLE 3 can be combined with the [Suffix Automaton enhancement](#suffix-automaton-sa-enhancement) for improved acceptance rates on repetitive content. See the SA section below for details.
 
+#### Dynamic Tree Mode
+
+Dynamic tree mode enables tree-structured draft generation for EAGLE 3, where the drafter expands multiple candidate tokens at each layer instead of a single token. This can improve acceptance rates compared to linear drafting at the cost of additional compute per generation step.
+
+To enable dynamic tree mode, set `use_dynamic_tree=True` on the `Eagle3DecodingConfig` and provide the following parameters:
+
+* `use_dynamic_tree` (`bool`): Enables dynamic tree draft generation. Mutually exclusive with `eagle_choices` (static tree).
+* `dynamic_tree_max_topK` (`int`): Maximum number of tokens to expand per node at each draft layer.
+* `max_total_draft_tokens` (`int`, optional): Total draft token budget for the tree. Must satisfy `max_draft_len <= max_total_draft_tokens <= dynamic_tree_max_topK * max_draft_len`. Defaults to `dynamic_tree_max_topK * max_draft_len` if not set.
+* `max_batch_size` (`int`): Required when `use_dynamic_tree=True` for pre-allocating dynamic tree CUDA buffers.
+
+```python
+from tensorrt_llm.llmapi import Eagle3DecodingConfig
+
+speculative_config = Eagle3DecodingConfig(
+    max_draft_len=6,
+    speculative_model="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
+    use_dynamic_tree=True,
+    dynamic_tree_max_topK=10,
+    max_total_draft_tokens=60,
+    max_batch_size=4,
+)
+
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)
+```
+
+```{note}
+Dynamic tree mode is currently **not supported** for models that use sliding window attention or MLA (Multi-Latent Attention), such as DeepSeek and gpt-oss models.
+```
+
 ### NGram
 
 The NGram method is an implementation of [this Prompt Lookup Decoding algorithm](https://github.com/apoorvumang/prompt-lookup-decoding).
@@ -199,6 +229,18 @@ speculative_config:
   speculative_model: /path/to/draft/model
 ```
 
+```yaml
+# Dynamic tree mode
+speculative_config:
+  decoding_type: Eagle3
+  max_draft_len: 6
+  speculative_model: /path/to/eagle3_model
+  use_dynamic_tree: true
+  dynamic_tree_max_topK: 10
+  max_total_draft_tokens: 60
+  max_batch_size: 4
+```
+
 ```yaml
 # SA combination: enable Suffix Automaton enhancement with any supported technique
 speculative_config:
diff --git a/tests/unittest/_torch/modeling/test_modeling_llama.py b/tests/unittest/_torch/modeling/test_modeling_llama.py
@@ -611,13 +611,12 @@ def run_forward(input_ids, position_ids, attn_metadata):
         spec_metadata_phase1 = None
         if is_tree_phase1:
             max_draft_1 = gen_input_ids_1.size(-1) - 1
-            eagle_choices_phase1 = [[i] for i in range(max_draft_1)]
             spec_tree_mgr_phase1 = SpecTreeManager(
                 max_num_requests=1,
-                use_dynamic_tree=False,
+                use_dynamic_tree=True,
                 max_total_draft_tokens=max_draft_1,
                 max_draft_len=max_draft_1,
-                eagle_choices=eagle_choices_phase1,
+                eagle_choices=None,
                 dynamic_tree_max_topK=10,
             )
             spec_metadata_phase1 = SpecMetadata(
@@ -630,7 +629,7 @@ def run_forward(input_ids, position_ids, attn_metadata):
             batch_size=batch_size,
             is_spec_decoding_enabled=is_spec_decoding_enabled,
             is_spec_dec_tree=is_tree_phase1,
-            is_spec_dec_dynamic_tree=False,
+            is_spec_dec_dynamic_tree=is_tree_phase1,
             max_draft_len=gen_input_ids_1.size(-1) - 1,
             max_total_draft_tokens=gen_input_ids_1.size(-1) - 1,
             model_is_wrapped=False,
@@ -687,13 +686,12 @@ def run_forward(input_ids, position_ids, attn_metadata):
         spec_metadata_ref = None
         if is_tree_ref:
             max_draft_ref = gen_input_ids_ref.size(-1) - 1
-            eagle_choices_ref = [[i] for i in range(max_draft_ref)]
             spec_tree_mgr_ref = SpecTreeManager(
                 max_num_requests=1,
-                use_dynamic_tree=False,
+                use_dynamic_tree=True,
                 max_total_draft_tokens=max_draft_ref,
                 max_draft_len=max_draft_ref,
-                eagle_choices=eagle_choices_ref,
+                eagle_choices=None,
                 dynamic_tree_max_topK=10,
             )
             spec_metadata_ref = SpecMetadata(
@@ -706,7 +704,7 @@ def run_forward(input_ids, position_ids, attn_metadata):
             batch_size=batch_size,
             is_spec_decoding_enabled=is_spec_decoding_enabled,
             is_spec_dec_tree=is_tree_ref,
-            is_spec_dec_dynamic_tree=False,
+            is_spec_dec_dynamic_tree=is_tree_ref,
             max_draft_len=gen_input_ids_ref.size(-1) - 1,
             max_total_draft_tokens=gen_input_ids_ref.size(-1) - 1,
             model_is_wrapped=False,