2 changes: 1 addition & 1 deletion docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md
@@ -84,7 +84,7 @@ kv_cache_config:
enable_block_reuse: false
free_gpu_memory_fraction: 0.8
speculative_config:
decoding_type: Eagle
decoding_type: Eagle3
max_draft_len: 3
speculative_model_dir: /config/models/eagle/
cuda_graph_config:
@@ -68,7 +68,7 @@ docker run -d --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8000:8000 --gpus=all -e "TRTLLM_ENABLE_PDL=1" \
-v /path/to/maverick:/config/models/maverick -v /path/to/eagle:/config/models/eagle \
docker.io/<username>/tensorrt_llm:main sh \
-c "echo -e 'enable_autotuner: false\nenable_attention_dp: false\nenable_min_latency: true\ncuda_graph_config:\n max_batch_size: 8\nspeculative_config:\n decoding_type: Eagle\n max_draft_len: 3\n speculative_model_dir: /config/models/eagle\n eagle3_one_model: true\nkv_cache_config:\n enable_block_reuse: false' > c.yaml && \
-c "echo -e 'enable_autotuner: false\nenable_attention_dp: false\nenable_min_latency: true\ncuda_graph_config:\n max_batch_size: 8\nspeculative_config:\n decoding_type: Eagle3\n max_draft_len: 3\n speculative_model_dir: /config/models/eagle\n eagle3_one_model: true\nkv_cache_config:\n enable_block_reuse: false' > c.yaml && \
TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
trtllm-serve /config/models/maverick \
--host 0.0.0.0 --port 8000 \
10 changes: 6 additions & 4 deletions docs/source/features/speculative-decoding.md
@@ -53,12 +53,12 @@ The following draft model checkpoints can be used for EAGLE 3:
* Llama 4 Maverick: [use the checkpoint from the NVIDIA HuggingFace repository](https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3).

```python
from tensorrt_llm.llmapi import EagleDecodingConfig
from tensorrt_llm.llmapi import Eagle3DecodingConfig

# Enable to use the faster one-model implementation for Llama 4.
eagle3_one_model = False

speculative_config = EagleDecodingConfig(
speculative_config = Eagle3DecodingConfig(
max_draft_len=3, speculative_model_dir="/path/to/draft_model", eagle3_one_model=eagle3_one_model)

# Only need to disable overlap scheduler if eagle3_one_model is False.
@@ -131,16 +131,18 @@ llm = LLM("/path/to/target_model", speculative_config=speculative_config)
Speculative decoding options must be specified via `--config config.yaml` for both `trtllm-bench` and `trtllm-serve`. All speculative decoding options can be set in this YAML file, and an additional `decoding_type` option selects the type of speculation. The available options are:

* `MTP`
* `Eagle` (for EAGLE 3)
* `Eagle3`
* `NGram`
* `DraftTarget`

> Note: The PyTorch backend supports only `Eagle3`. `decoding_type: Eagle` is accepted as a backward-compatible alias for `Eagle3`, but EAGLE (v1/v2) draft checkpoints are incompatible.

The rest of the argument names/valid values are the same as in their corresponding configuration class described in the Quick Start section. For example, a YAML configuration could look like this:

```
disable_overlap_scheduler: true
speculative_config:
decoding_type: Eagle
decoding_type: Eagle3
max_draft_len: 4
speculative_model: /path/to/draft/model
```
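
For reference, the same configuration can be built programmatically through the LLM API. This is a minimal sketch assuming the `Eagle3DecodingConfig` fields shown in the Quick Start above (`max_draft_len`, `speculative_model_dir`) and the PyTorch backend's `disable_overlap_scheduler` knob; the paths are placeholders.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import Eagle3DecodingConfig

# Mirrors the YAML above: 4 draft tokens per step, draft weights from a local path.
speculative_config = Eagle3DecodingConfig(
    max_draft_len=4,
    speculative_model_dir="/path/to/draft/model",
)

# disable_overlap_scheduler=True corresponds to `disable_overlap_scheduler: true`.
llm = LLM(
    "/path/to/target_model",
    speculative_config=speculative_config,
    disable_overlap_scheduler=True,
)
```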
@@ -96,7 +96,7 @@ speculative_config:
mtp_eagle_one_model: False # Not supported

speculative_config:
decoding_type: "Eagle"
decoding_type: "Eagle3"
eagle3_one_model: False # Not supported
```

2 changes: 2 additions & 0 deletions docs/source/legacy/advanced/speculative-decoding.md
@@ -171,6 +171,8 @@ The EAGLE approach enhances the single-model Medusa method by predicting and ver

Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.

> **EAGLE3 note.** If the EAGLE3 draft head config omits `draft_vocab_size`, TensorRT-LLM assumes it matches `vocab_size` and emits a warning. Set `draft_vocab_size` explicitly if the draft head uses a different vocabulary.
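
As an illustrative sketch only (the checkpoint path and vocabulary sizes are hypothetical), the draft head's `config.json` can be patched so that `draft_vocab_size` is always present:

```python
import json

# Hypothetical path to the EAGLE3 draft head checkpoint directory.
cfg_path = "/path/to/eagle3_draft/config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# If draft_vocab_size is missing or None, set it explicitly. This assumes the draft
# head shares the target vocabulary; use the real draft vocabulary size otherwise.
if cfg.get("draft_vocab_size") is None:
    cfg["draft_vocab_size"] = cfg["vocab_size"]

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```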

### Disaggregated Serving

[Disaggregated Serving](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/features/disaggregated-service.md) with EAGLE3 using the two-model approach is supported in the PyTorch backend. Please refer to the [Dynamo example](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/llama4_plus_eagle.md) for how to run EAGLE3 with Disaggregated Serving for Llama 4 Maverick.
4 changes: 2 additions & 2 deletions examples/llm-api/llm_speculative_decoding.py
@@ -6,7 +6,7 @@
import click

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import (EagleDecodingConfig, KvCacheConfig,
from tensorrt_llm.llmapi import (Eagle3DecodingConfig, KvCacheConfig,
MTPDecodingConfig, NGramDecodingConfig)

prompts = [
@@ -33,7 +33,7 @@ def run_MTP(model: Optional[str] = None):


def run_Eagle3():
spec_config = EagleDecodingConfig(
spec_config = Eagle3DecodingConfig(
max_draft_len=3,
speculative_model_dir="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
eagle3_one_model=True)
4 changes: 2 additions & 2 deletions examples/llm-api/quickstart_advanced.py
@@ -5,7 +5,7 @@
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import (AttentionDpConfig, AutoDecodingConfig,
CudaGraphConfig, DraftTargetDecodingConfig,
EagleDecodingConfig, KvCacheConfig, MoeConfig,
Eagle3DecodingConfig, KvCacheConfig, MoeConfig,
MTPDecodingConfig, NGramDecodingConfig,
TorchCompileConfig)

@@ -222,7 +222,7 @@ def setup_llm(args, **kwargs):
mtp_eagle_one_model=args.use_one_model,
speculative_model_dir=args.model_dir)
elif spec_decode_algo == "EAGLE3":
spec_config = EagleDecodingConfig(
spec_config = Eagle3DecodingConfig(
max_draft_len=args.spec_decode_max_draft_len,
speculative_model_dir=args.draft_model_dir,
eagle3_one_model=args.use_one_model,
6 changes: 3 additions & 3 deletions examples/models/core/qwen/README.md
@@ -837,8 +837,8 @@ settings for your specific use case.

Qwen3 now supports speculative decoding with Eagle3. To enable Eagle3 on Qwen3, set the following arguments when running `trtllm-bench` or `trtllm-serve`:

- `speculative_config.decoding_type: Eagle`
Set the decoding type to "Eagle" to enable Eagle3 speculative decoding.
- `speculative_config.decoding_type: Eagle3`
Set the decoding type to `Eagle3` to enable Eagle3 speculative decoding.
- `speculative_config.max_draft_len: 3`
Set the maximum number of draft tokens generated per step (this value can be adjusted as needed).
- `speculative_config.speculative_model_dir: <EAGLE3_DRAFT_MODEL_PATH>`
@@ -855,7 +855,7 @@ Example `config.yml` snippet for Eagle3:
echo "
enable_attention_dp: false
speculative_config:
decoding_type: Eagle
decoding_type: Eagle3
max_draft_len: 3
speculative_model_dir: <EAGLE3_DRAFT_MODEL_PATH>
kv_cache_config:
16 changes: 11 additions & 5 deletions tensorrt_llm/_torch/models/modeling_auto.py
@@ -24,12 +24,18 @@ def from_config(
vision_encoder_cls, vlm_base_model = vision_encoder_info
return vision_encoder_cls(config, vlm_base_model)

# Hack to detect eagle3 checkpoints. TODO: should we provide
# our own checkpoints with the correct arch? It would let us
# avoid nasty stuff like this.
model_arch = model_arch.replace("Eagle3",
"") # Strip the appended EAGLE3
# Hack to detect eagle3 checkpoints.
# Why it exists:
# - Eagle3 checkpoints have draft_vocab_size in config.json (even if None)
# - Some community checkpoints append "Eagle3" to architecture names ("LlamaForCausalLMEagle3")
# - Some checkpoints don't include "Eagle3" in arch name at all ("LlamaForCausalLM")
# - TensorRT-LLM's MODEL_CLASS_MAPPING expects prefixed names like EAGLE3LlamaForCausalLM
# - Hence: LlamaForCausalLMEagle3 -> EAGLE3LlamaForCausalLM
# LlamaForCausalLM (with draft_vocab_size) -> EAGLE3LlamaForCausalLM
# TODO: should we provide our own checkpoints with the correct arch? It would let us avoid nasty stuff like this.
if hasattr(config.pretrained_config, "draft_vocab_size"):
# It's an Eagle3 checkpoint - strip "Eagle3" suffix if present, then add prefix
model_arch = model_arch.replace("Eagle3", "")
model_arch = "EAGLE3" + model_arch
if model_arch in (
"DeepseekV3ForCausalLM", "Glm4MoeForCausalLM"
24 changes: 19 additions & 5 deletions tensorrt_llm/_torch/models/modeling_speculative.py
@@ -4,6 +4,8 @@
from torch import nn
from transformers import LlamaConfig, PretrainedConfig

from tensorrt_llm.logger import logger

from ...functional import PositionEmbeddingType
from ..attention_backend import AttentionMetadata
from ..attention_backend.interface import PositionalEmbeddingParams, RopeParams
@@ -24,6 +26,18 @@
register_auto_model)


def _ensure_draft_vocab_size(config: PretrainedConfig) -> None:
if hasattr(config,
"draft_vocab_size") and config.draft_vocab_size is not None:
return

logger.warning(
"Missing 'draft_vocab_size' in pretrained config; defaulting to 'vocab_size'. "
"Set 'draft_vocab_size' explicitly if the draft head uses a different vocabulary."
)
config.draft_vocab_size = config.vocab_size


class Eagle3Attention(Attention):

def __init__(
@@ -417,9 +431,8 @@ def __init__(
model_config: ModelConfig[PretrainedConfig],
start_layer_idx: int = 0,
):
draft_vocab_size = model_config.pretrained_config.vocab_size
if model_config.pretrained_config.draft_vocab_size is not None:
draft_vocab_size = model_config.pretrained_config.draft_vocab_size
config = model_config.pretrained_config
_ensure_draft_vocab_size(config)

# Determine if we should use MLA attention based on config
# MLA is used for DeepSeekV3-style models that have kv_lora_rank
@@ -435,8 +448,8 @@
super().__init__(
draft_model,
config=model_config,
hidden_size=model_config.pretrained_config.hidden_size,
vocab_size=draft_vocab_size,
hidden_size=config.hidden_size,
vocab_size=config.draft_vocab_size,
)
self.load_lm_head_from_target = True

@@ -598,6 +611,7 @@ def forward(


# We use MistralLarge3 as the base architecture for EAGLE3 draft layers
# NOTE: Class name says "Eagle" not "Eagle3" to match checkpoint naming (e.g., "Mistral-Large-3-675B-Instruct-2512-Eagle")
@register_auto_model("MistralLarge3EagleForCausalLM")
class MistralLarge3EagleForCausalLM(DecoderModelForCausalLM):

10 changes: 6 additions & 4 deletions tensorrt_llm/llmapi/__init__.py
@@ -10,10 +10,11 @@
CapacitySchedulerPolicy, ContextChunkingPolicy,
CudaGraphConfig, DeepSeekSparseAttentionConfig,
DraftTargetDecodingConfig, DynamicBatchConfig,
EagleDecodingConfig, ExtendedRuntimePerfKnobConfig,
KvCacheConfig, LlmArgs, LookaheadDecodingConfig,
MedusaDecodingConfig, MoeConfig, MTPDecodingConfig,
NGramDecodingConfig, RocketSparseAttentionConfig,
Eagle3DecodingConfig, EagleDecodingConfig,
ExtendedRuntimePerfKnobConfig, KvCacheConfig, LlmArgs,
LookaheadDecodingConfig, MedusaDecodingConfig, MoeConfig,
MTPDecodingConfig, NGramDecodingConfig,
RocketSparseAttentionConfig,
SaveHiddenStatesDecodingConfig, SchedulerConfig,
SkipSoftmaxAttentionConfig, TorchCompileConfig,
TorchLlmArgs, TrtLlmArgs, UserProvidedDecodingConfig)
@@ -38,6 +39,7 @@
'LookaheadDecodingConfig',
'MedusaDecodingConfig',
'EagleDecodingConfig',
'Eagle3DecodingConfig',
'MTPDecodingConfig',
'SchedulerConfig',
'CapacitySchedulerPolicy',
42 changes: 38 additions & 4 deletions tensorrt_llm/llmapi/llm_args.py
@@ -708,6 +708,8 @@ def _validate_acceptance_length_threshold(cls, v: Optional[float]):
_allow_chain_drafter: bool = PrivateAttr(True)
# If set, drafting uses greedy sampling, irrespective of sampling parameters.
_allow_greedy_draft_tokens: bool = PrivateAttr(True)
# Internal: record decoding_type alias used during parsing (for warnings).
_decoding_type_alias: Optional[str] = PrivateAttr(default=None)

@field_validator('draft_len_schedule')
@classmethod
@@ -755,13 +757,14 @@ def validate_draft_len_schedule_and_sort(cls, v, info):
return v

@classmethod
def from_dict(cls, data: dict):
def from_dict(cls, data: dict, backend: Optional[str] = None):
# dispatch to the correct decoding config
decoding_type = data.get("decoding_type")
config_classes = {
"MTP": MTPDecodingConfig,
"Medusa": MedusaDecodingConfig,
"Eagle": EagleDecodingConfig,
"Eagle3": Eagle3DecodingConfig,
"Lookahead": LookaheadDecodingConfig,
"NGram": NGramDecodingConfig,
"DraftTarget": DraftTargetDecodingConfig,
@@ -770,6 +773,14 @@ def from_dict(cls, data: dict):
"AUTO": AutoDecodingConfig,
}

backend = backend.lower() if isinstance(backend, str) else backend
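# Backward compatibility: on the PyTorch and AutoDeploy backends, "Eagle" is
# accepted as an alias for "Eagle3". Record the alias so validation can warn
# about it later.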
if decoding_type == "Eagle" and backend in ("pytorch", "_autodeploy"):
data = dict(data)
data.pop("decoding_type")
spec_cfg = Eagle3DecodingConfig(**data)
spec_cfg._decoding_type_alias = "Eagle"
return spec_cfg

config_class = config_classes.get(decoding_type)
if config_class is None:
raise ValueError(f"Invalid decoding type: {decoding_type}")
@@ -966,6 +977,10 @@ def is_linear_tree(self) -> bool:
return False


class Eagle3DecodingConfig(EagleDecodingConfig):
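# Identical to EagleDecodingConfig except for the registered decoding_type ("Eagle3").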
decoding_type: ClassVar[str] = "Eagle3"


class SaveHiddenStatesDecodingConfig(DecodingBaseConfig):
output_directory: str
write_interval: int = 20
@@ -2506,9 +2521,14 @@ def validate_speculative_config(self):
decoding_mode=DecodingMode.Medusa(),
medusa_choices=self.speculative_config.medusa_choices)

elif isinstance(self.speculative_config, Eagle3DecodingConfig):
raise ValueError(
"speculative_config.decoding_type 'Eagle3' is only supported on the PyTorch backend. "
"Use decoding_type 'Eagle' for the TensorRT backend.")

elif isinstance(self.speculative_config, EagleDecodingConfig):
assert self.speculative_config.max_draft_len > 0
assert self.speculative_config.speculative_model_dir is not None, "Path to EAGLE3 weights must be specified."
assert self.speculative_config.speculative_model_dir is not None, "Path to EAGLE weights must be specified."
self.build_config.max_draft_len = self.speculative_config.max_draft_len
self.build_config.speculative_decoding_mode = SpeculativeDecodingMode.EAGLE
eagle_config = _EagleConfig(
@@ -3024,6 +3044,14 @@ def validate_speculative_config(self):
f"support backend {self.backend}")

if isinstance(self.speculative_config, EagleDecodingConfig):
if (getattr(self.speculative_config, "_decoding_type_alias",
None) == "Eagle" or type(self.speculative_config)
is EagleDecodingConfig):
logger.warning(
"speculative_config.decoding_type 'Eagle' is not supported on the PyTorch backend; only 'Eagle3' is supported. "
"'Eagle' is treated as 'Eagle3' for backward compatibility. "
"EAGLE (v1/v2) draft checkpoints are incompatible with Eagle3—use an Eagle3 draft model."
)
assert self.speculative_config.max_draft_len > 0
assert self.speculative_config.speculative_model_dir is not None, "Path to EAGLE3 weights must be specified."
elif isinstance(self.speculative_config, NGramDecodingConfig):
@@ -3323,8 +3351,14 @@ def update_llm_args_with_extra_dict(
if field_name in llm_args_dict:
# Some fields need to be converted manually.
if field_name in ["speculative_config", "sparse_attention_config"]:
llm_args_dict[field_name] = field_type.from_dict(
llm_args_dict[field_name])
if field_name == "speculative_config":
backend = llm_args_dict.get("backend") or llm_args.get(
"backend")
llm_args_dict[field_name] = field_type.from_dict(
llm_args_dict[field_name], backend=backend)
else:
llm_args_dict[field_name] = field_type.from_dict(
llm_args_dict[field_name])
else:
llm_args_dict[field_name] = field_type(
**llm_args_dict[field_name])
5 changes: 3 additions & 2 deletions tensorrt_llm/llmapi/llm_utils.py
@@ -30,8 +30,8 @@
from .build_cache import (BuildCache, BuildCacheConfig, CachedStage,
get_build_cache_config_from_env)
from .llm_args import (CalibConfig, CudaGraphConfig, DraftTargetDecodingConfig,
EagleDecodingConfig, KvCacheConfig, LlmArgs,
LookaheadDecodingConfig, MedusaDecodingConfig,
Eagle3DecodingConfig, EagleDecodingConfig, KvCacheConfig,
LlmArgs, LookaheadDecodingConfig, MedusaDecodingConfig,
MTPDecodingConfig, NGramDecodingConfig,
UserProvidedDecodingConfig, _ModelFormatKind,
_ModelWrapper, _ParallelConfig,
@@ -923,6 +923,7 @@ class LlmBuildStats:
'KvCacheConfig',
'CachedModelLoader',
'EagleDecodingConfig',
'Eagle3DecodingConfig',
'update_llm_args_with_extra_dict',
'update_llm_args_with_extra_options',
]
6 changes: 3 additions & 3 deletions tests/integration/defs/accuracy/test_cli_flow.py
@@ -14,7 +14,7 @@
# limitations under the License.
import pytest

from tensorrt_llm.llmapi import (EagleDecodingConfig, LookaheadDecodingConfig,
from tensorrt_llm.llmapi import (Eagle3DecodingConfig, LookaheadDecodingConfig,
MedusaDecodingConfig)
from tensorrt_llm.quantization import QuantAlgo

@@ -476,7 +476,7 @@ def test_eagle(self, cuda_graph, chunked_context, typical_acceptance,
extra_summarize_args.extend(
["--eagle_posterior_threshold=0.09", "--temperature=0.7"])

self.run(spec_dec_algo=EagleDecodingConfig.decoding_type,
self.run(spec_dec_algo=Eagle3DecodingConfig.decoding_type,
extra_convert_args=[
f"--eagle_model_dir={self.EAGLE_MODEL_PATH}",
"--max_draft_len=63", "--num_eagle_layers=4",
@@ -503,7 +503,7 @@ def test_eagle_2(self, cuda_graph, chunked_context, mocker):
if chunked_context:
extra_summarize_args.append("--enable_chunked_context")

self.run(spec_dec_algo=EagleDecodingConfig.decoding_type,
self.run(spec_dec_algo=Eagle3DecodingConfig.decoding_type,
extra_convert_args=[
f"--eagle_model_dir={self.EAGLE_MODEL_PATH}",
"--max_draft_len=63", "--num_eagle_layers=4",