2 changes: 1 addition & 1 deletion docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md
@@ -84,7 +84,7 @@ kv_cache_config:
enable_block_reuse: false
free_gpu_memory_fraction: 0.8
speculative_config:
decoding_type: Eagle
decoding_type: Eagle3
max_draft_len: 3
speculative_model_dir: /config/models/eagle/
cuda_graph_config:
@@ -68,7 +68,7 @@ docker run -d --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8000:8000 --gpus=all -e "TRTLLM_ENABLE_PDL=1" \
-v /path/to/maverick:/config/models/maverick -v /path/to/eagle:/config/models/eagle \
docker.io/<username>/tensorrt_llm:main sh \
-c "echo -e 'enable_autotuner: false\nenable_attention_dp: false\nenable_min_latency: true\ncuda_graph_config:\n max_batch_size: 8\nspeculative_config:\n decoding_type: Eagle\n max_draft_len: 3\n speculative_model_dir: /config/models/eagle\n eagle3_one_model: true\nkv_cache_config:\n enable_block_reuse: false' > c.yaml && \
-c "echo -e 'enable_autotuner: false\nenable_attention_dp: false\nenable_min_latency: true\ncuda_graph_config:\n max_batch_size: 8\nspeculative_config:\n decoding_type: Eagle3\n max_draft_len: 3\n speculative_model_dir: /config/models/eagle\n eagle3_one_model: true\nkv_cache_config:\n enable_block_reuse: false' > c.yaml && \
TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
trtllm-serve /config/models/maverick \
--host 0.0.0.0 --port 8000 \
10 changes: 6 additions & 4 deletions docs/source/features/speculative-decoding.md
@@ -53,12 +53,12 @@ The following draft model checkpoints can be used for EAGLE 3:
* Llama 4 Maverick: [use the checkpoint from the NVIDIA HuggingFace repository](https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3).

```python
from tensorrt_llm.llmapi import EagleDecodingConfig
from tensorrt_llm.llmapi import Eagle3DecodingConfig

# Enable to use the faster one-model implementation for Llama 4.
eagle3_one_model = False

speculative_config = EagleDecodingConfig(
speculative_config = Eagle3DecodingConfig(
max_draft_len=3, speculative_model_dir="/path/to/draft_model", eagle3_one_model=eagle3_one_model)

# Only need to disable overlap scheduler if eagle3_one_model is False.
@@ -131,16 +131,18 @@ llm = LLM("/path/to/target_model", speculative_config=speculative_config)
Speculative decoding options must be specified via `--config config.yaml` for both `trtllm-bench` and `trtllm-serve`. All speculative decoding options can be set in this YAML file, and an additional `decoding_type` option selects the type of speculation. The available options are:

* `MTP`
* `Eagle` (for EAGLE 3)
* `Eagle3`
* `NGram`
* `DraftTarget`

> Note: The PyTorch backend supports only `Eagle3`. `decoding_type: Eagle` is accepted as a backward-compatible alias for `Eagle3`, but EAGLE (v1/v2) draft checkpoints are incompatible.

The rest of the argument names/valid values are the same as in their corresponding configuration class described in the Quick Start section. For example, a YAML configuration could look like this:

```
disable_overlap_scheduler: true
speculative_config:
decoding_type: Eagle
decoding_type: Eagle3
max_draft_len: 4
speculative_model: /path/to/draft/model
```
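
For reference, the same configuration can be built programmatically through the LLM API. This is a minimal sketch assuming the `Eagle3DecodingConfig` fields shown in the Quick Start above (`max_draft_len`, `speculative_model_dir`) and the PyTorch backend's `disable_overlap_scheduler` knob; the paths are placeholders.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import Eagle3DecodingConfig

# Mirrors the YAML above: 4 draft tokens per step, draft weights from a local path.
speculative_config = Eagle3DecodingConfig(
    max_draft_len=4,
    speculative_model_dir="/path/to/draft/model",
)

# disable_overlap_scheduler=True corresponds to `disable_overlap_scheduler: true`.
llm = LLM(
    "/path/to/target_model",
    speculative_config=speculative_config,
    disable_overlap_scheduler=True,
)
```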
@@ -96,7 +96,7 @@ speculative_config:
mtp_eagle_one_model: False # Not supported

speculative_config:
decoding_type: "Eagle"
decoding_type: "Eagle3"
eagle3_one_model: False # Not supported
```

2 changes: 2 additions & 0 deletions docs/source/legacy/advanced/speculative-decoding.md
@@ -171,6 +171,8 @@ The EAGLE approach enhances the single-model Medusa method by predicting and ver

Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.

> **EAGLE3 note.** If the EAGLE3 draft head config omits `draft_vocab_size`, TensorRT-LLM assumes it matches `vocab_size` and emits a warning. Set `draft_vocab_size` explicitly if the draft head uses a different vocabulary.
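
As an illustrative sketch only (the checkpoint path and vocabulary sizes are hypothetical), the draft head's `config.json` can be patched so that `draft_vocab_size` is always present:

```python
import json

# Hypothetical path to the EAGLE3 draft head checkpoint directory.
cfg_path = "/path/to/eagle3_draft/config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# If draft_vocab_size is missing or None, set it explicitly. This assumes the draft
# head shares the target vocabulary; use the real draft vocabulary size otherwise.
if cfg.get("draft_vocab_size") is None:
    cfg["draft_vocab_size"] = cfg["vocab_size"]

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```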

### Disaggregated Serving

[Disaggregated Serving](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/features/disaggregated-service.md) with EAGLE3 using the two-model approach is supported in the PyTorch backend. Please refer to the [Dynamo example](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/llama4_plus_eagle.md) for how to run EAGLE3 with Disaggregated Serving for Llama 4 Maverick.
4 changes: 2 additions & 2 deletions examples/llm-api/llm_speculative_decoding.py
@@ -6,7 +6,7 @@
import click

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import (EagleDecodingConfig, KvCacheConfig,
from tensorrt_llm.llmapi import (Eagle3DecodingConfig, KvCacheConfig,
MTPDecodingConfig, NGramDecodingConfig)

prompts = [
@@ -33,7 +33,7 @@ def run_MTP(model: Optional[str] = None):


def run_Eagle3():
spec_config = EagleDecodingConfig(
spec_config = Eagle3DecodingConfig(
max_draft_len=3,
speculative_model_dir="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
eagle3_one_model=True)
4 changes: 2 additions & 2 deletions examples/llm-api/quickstart_advanced.py
@@ -5,7 +5,7 @@
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import (AttentionDpConfig, AutoDecodingConfig,
CudaGraphConfig, DraftTargetDecodingConfig,
EagleDecodingConfig, KvCacheConfig, MoeConfig,
Eagle3DecodingConfig, KvCacheConfig, MoeConfig,
MTPDecodingConfig, NGramDecodingConfig,
TorchCompileConfig)

@@ -222,7 +222,7 @@ def setup_llm(args, **kwargs):
mtp_eagle_one_model=args.use_one_model,
speculative_model_dir=args.model_dir)
elif spec_decode_algo == "EAGLE3":
spec_config = EagleDecodingConfig(
spec_config = Eagle3DecodingConfig(
max_draft_len=args.spec_decode_max_draft_len,
speculative_model_dir=args.draft_model_dir,
eagle3_one_model=args.use_one_model,
6 changes: 3 additions & 3 deletions examples/models/core/qwen/README.md
@@ -837,8 +837,8 @@ settings for your specific use case.

Qwen3 now supports speculative decoding with Eagle3. To enable Eagle3 on Qwen3, set the following arguments when running `trtllm-bench` or `trtllm-serve`:

- `speculative_config.decoding_type: Eagle`
Set the decoding type to "Eagle" to enable Eagle3 speculative decoding.
- `speculative_config.decoding_type: Eagle3`
Set the decoding type to `Eagle3` to enable Eagle3 speculative decoding.
- `speculative_config.max_draft_len: 3`
Set the maximum number of draft tokens generated per step (this value can be adjusted as needed).
- `speculative_config.speculative_model_dir: <EAGLE3_DRAFT_MODEL_PATH>`
@@ -855,7 +855,7 @@ Example `config.yml` snippet for Eagle3:
echo "
enable_attention_dp: false
speculative_config:
decoding_type: Eagle
decoding_type: Eagle3
max_draft_len: 3
speculative_model_dir: <EAGLE3_DRAFT_MODEL_PATH>
kv_cache_config:
16 changes: 11 additions & 5 deletions tensorrt_llm/_torch/models/modeling_auto.py
@@ -24,12 +24,18 @@ def from_config(
vision_encoder_cls, vlm_base_model = vision_encoder_info
return vision_encoder_cls(config, vlm_base_model)

# Hack to detect eagle3 checkpoints. TODO: should we provide
# our own checkpoints with the correct arch? It would let us
# avoid nasty stuff like this.
model_arch = model_arch.replace("Eagle3",
"") # Strip the appended EAGLE3
# Hack to detect eagle3 checkpoints.
# Why it exists:
# - Eagle3 checkpoints have draft_vocab_size in config.json (even if None)
# - Some community checkpoints append "Eagle3" to architecture names ("LlamaForCausalLMEagle3")
# - Some checkpoints don't include "Eagle3" in arch name at all ("LlamaForCausalLM")
# - TensorRT-LLM's MODEL_CLASS_MAPPING expects prefixed names like EAGLE3LlamaForCausalLM
# - Hence: LlamaForCausalLMEagle3 -> EAGLE3LlamaForCausalLM
# LlamaForCausalLM (with draft_vocab_size) -> EAGLE3LlamaForCausalLM
# TODO: should we provide our own checkpoints with the correct arch? It would let us avoid nasty stuff like this.
if hasattr(config.pretrained_config, "draft_vocab_size"):
# It's an Eagle3 checkpoint - strip "Eagle3" suffix if present, then add prefix
model_arch = model_arch.replace("Eagle3", "")
model_arch = "EAGLE3" + model_arch
if model_arch in (
"DeepseekV3ForCausalLM", "Glm4MoeForCausalLM"
24 changes: 19 additions & 5 deletions tensorrt_llm/_torch/models/modeling_speculative.py
@@ -4,6 +4,8 @@
from torch import nn
from transformers import LlamaConfig, PretrainedConfig

from tensorrt_llm.logger import logger

from ...functional import PositionEmbeddingType
from ..attention_backend import AttentionMetadata
from ..attention_backend.interface import PositionalEmbeddingParams, RopeParams
@@ -24,6 +26,18 @@
register_auto_model)


def _ensure_draft_vocab_size(config: PretrainedConfig) -> None:
if hasattr(config,
"draft_vocab_size") and config.draft_vocab_size is not None:
return

logger.warning(
"Missing 'draft_vocab_size' in pretrained config; defaulting to 'vocab_size'. "
"Set 'draft_vocab_size' explicitly if the draft head uses a different vocabulary."
)
config.draft_vocab_size = config.vocab_size


class Eagle3Attention(Attention):

def __init__(
@@ -417,9 +431,8 @@ def __init__(
model_config: ModelConfig[PretrainedConfig],
start_layer_idx: int = 0,
):
draft_vocab_size = model_config.pretrained_config.vocab_size
if model_config.pretrained_config.draft_vocab_size is not None:
draft_vocab_size = model_config.pretrained_config.draft_vocab_size
config = model_config.pretrained_config
_ensure_draft_vocab_size(config)

# Determine if we should use MLA attention based on config
# MLA is used for DeepSeekV3-style models that have kv_lora_rank
@@ -435,8 +448,8 @@
super().__init__(
draft_model,
config=model_config,
hidden_size=model_config.pretrained_config.hidden_size,
vocab_size=draft_vocab_size,
hidden_size=config.hidden_size,
vocab_size=config.draft_vocab_size,
)
self.load_lm_head_from_target = True

@@ -598,6 +611,7 @@ def forward(


# We use MistralLarge3 as the base architecture for EAGLE3 draft layers
# NOTE: Class name says "Eagle" not "Eagle3" to match checkpoint naming (e.g., "Mistral-Large-3-675B-Instruct-2512-Eagle")
@register_auto_model("MistralLarge3EagleForCausalLM")
class MistralLarge3EagleForCausalLM(DecoderModelForCausalLM):

10 changes: 6 additions & 4 deletions tensorrt_llm/llmapi/__init__.py
@@ -10,10 +10,11 @@
CapacitySchedulerPolicy, ContextChunkingPolicy,
CudaGraphConfig, DeepSeekSparseAttentionConfig,
DraftTargetDecodingConfig, DynamicBatchConfig,
EagleDecodingConfig, ExtendedRuntimePerfKnobConfig,
KvCacheConfig, LlmArgs, LookaheadDecodingConfig,
MedusaDecodingConfig, MoeConfig, MTPDecodingConfig,
NGramDecodingConfig, RocketSparseAttentionConfig,
Eagle3DecodingConfig, EagleDecodingConfig,
ExtendedRuntimePerfKnobConfig, KvCacheConfig, LlmArgs,
LookaheadDecodingConfig, MedusaDecodingConfig, MoeConfig,
MTPDecodingConfig, NGramDecodingConfig,
RocketSparseAttentionConfig,
SaveHiddenStatesDecodingConfig, SchedulerConfig,
SkipSoftmaxAttentionConfig, TorchCompileConfig,
TorchLlmArgs, TrtLlmArgs, UserProvidedDecodingConfig)
@@ -38,6 +39,7 @@
'LookaheadDecodingConfig',
'MedusaDecodingConfig',
'EagleDecodingConfig',
'Eagle3DecodingConfig',
'MTPDecodingConfig',
'SchedulerConfig',
'CapacitySchedulerPolicy',
42 changes: 38 additions & 4 deletions tensorrt_llm/llmapi/llm_args.py
@@ -708,6 +708,8 @@ def _validate_acceptance_length_threshold(cls, v: Optional[float]):
_allow_chain_drafter: bool = PrivateAttr(True)
# If set, drafting uses greedy sampling, irrespective of sampling parameters.
_allow_greedy_draft_tokens: bool = PrivateAttr(True)
# Internal: record decoding_type alias used during parsing (for warnings).
_decoding_type_alias: Optional[str] = PrivateAttr(default=None)

@field_validator('draft_len_schedule')
@classmethod
@@ -755,13 +757,14 @@ def validate_draft_len_schedule_and_sort(cls, v, info):
return v

@classmethod
def from_dict(cls, data: dict):
def from_dict(cls, data: dict, backend: Optional[str] = None):
# dispatch to the correct decoding config
decoding_type = data.get("decoding_type")
config_classes = {
"MTP": MTPDecodingConfig,
"Medusa": MedusaDecodingConfig,
"Eagle": EagleDecodingConfig,
"Eagle3": Eagle3DecodingConfig,
"Lookahead": LookaheadDecodingConfig,
"NGram": NGramDecodingConfig,
"DraftTarget": DraftTargetDecodingConfig,
@@ -770,6 +773,14 @@ def from_dict(cls, data: dict):
"AUTO": AutoDecodingConfig,
}

backend = backend.lower() if isinstance(backend, str) else backend
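# Backward compatibility: on the PyTorch and AutoDeploy backends, "Eagle" is
# accepted as an alias for "Eagle3". Record the alias so validation can warn
# about it later.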
if decoding_type == "Eagle" and backend in ("pytorch", "_autodeploy"):
data = dict(data)
data.pop("decoding_type")
spec_cfg = Eagle3DecodingConfig(**data)
spec_cfg._decoding_type_alias = "Eagle"
return spec_cfg

config_class = config_classes.get(decoding_type)
if config_class is None:
raise ValueError(f"Invalid decoding type: {decoding_type}")
@@ -966,6 +977,10 @@ def is_linear_tree(self) -> bool:
return False


class Eagle3DecodingConfig(EagleDecodingConfig):
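# Identical to EagleDecodingConfig except for the registered decoding_type ("Eagle3").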
decoding_type: ClassVar[str] = "Eagle3"


class SaveHiddenStatesDecodingConfig(DecodingBaseConfig):
output_directory: str
write_interval: int = 20
@@ -2506,9 +2521,14 @@ def validate_speculative_config(self):
decoding_mode=DecodingMode.Medusa(),
medusa_choices=self.speculative_config.medusa_choices)

elif isinstance(self.speculative_config, Eagle3DecodingConfig):
raise ValueError(
"speculative_config.decoding_type 'Eagle3' is only supported on the PyTorch backend. "
"Use decoding_type 'Eagle' for the TensorRT backend.")

elif isinstance(self.speculative_config, EagleDecodingConfig):
assert self.speculative_config.max_draft_len > 0
assert self.speculative_config.speculative_model_dir is not None, "Path to EAGLE3 weights must be specified."
assert self.speculative_config.speculative_model_dir is not None, "Path to EAGLE weights must be specified."
self.build_config.max_draft_len = self.speculative_config.max_draft_len
self.build_config.speculative_decoding_mode = SpeculativeDecodingMode.EAGLE
eagle_config = _EagleConfig(
@@ -3024,6 +3044,14 @@ def validate_speculative_config(self):
f"support backend {self.backend}")

if isinstance(self.speculative_config, EagleDecodingConfig):
if (getattr(self.speculative_config, "_decoding_type_alias",
None) == "Eagle" or type(self.speculative_config)
is EagleDecodingConfig):
logger.warning(
"speculative_config.decoding_type 'Eagle' is not supported on the PyTorch backend; only 'Eagle3' is supported. "
"'Eagle' is treated as 'Eagle3' for backward compatibility. "
"EAGLE (v1/v2) draft checkpoints are incompatible with Eagle3—use an Eagle3 draft model."
)
assert self.speculative_config.max_draft_len > 0
assert self.speculative_config.speculative_model_dir is not None, "Path to EAGLE3 weights must be specified."
elif isinstance(self.speculative_config, NGramDecodingConfig):
@@ -3323,8 +3351,14 @@ def update_llm_args_with_extra_dict(
if field_name in llm_args_dict:
# Some fields need to be converted manually.
if field_name in ["speculative_config", "sparse_attention_config"]:
llm_args_dict[field_name] = field_type.from_dict(
llm_args_dict[field_name])
if field_name == "speculative_config":
backend = llm_args_dict.get("backend") or llm_args.get(
"backend")
llm_args_dict[field_name] = field_type.from_dict(
llm_args_dict[field_name], backend=backend)
else:
llm_args_dict[field_name] = field_type.from_dict(
llm_args_dict[field_name])
else:
llm_args_dict[field_name] = field_type(
**llm_args_dict[field_name])
5 changes: 3 additions & 2 deletions tensorrt_llm/llmapi/llm_utils.py
@@ -30,8 +30,8 @@
from .build_cache import (BuildCache, BuildCacheConfig, CachedStage,
get_build_cache_config_from_env)
from .llm_args import (CalibConfig, CudaGraphConfig, DraftTargetDecodingConfig,
EagleDecodingConfig, KvCacheConfig, LlmArgs,
LookaheadDecodingConfig, MedusaDecodingConfig,
Eagle3DecodingConfig, EagleDecodingConfig, KvCacheConfig,
LlmArgs, LookaheadDecodingConfig, MedusaDecodingConfig,
MTPDecodingConfig, NGramDecodingConfig,
UserProvidedDecodingConfig, _ModelFormatKind,
_ModelWrapper, _ParallelConfig,
@@ -923,6 +923,7 @@ class LlmBuildStats:
'KvCacheConfig',
'CachedModelLoader',
'EagleDecodingConfig',
'Eagle3DecodingConfig',
'update_llm_args_with_extra_dict',
'update_llm_args_with_extra_options',
]
6 changes: 3 additions & 3 deletions tests/integration/defs/accuracy/test_cli_flow.py
@@ -14,7 +14,7 @@
# limitations under the License.
import pytest

from tensorrt_llm.llmapi import (EagleDecodingConfig, LookaheadDecodingConfig,
from tensorrt_llm.llmapi import (Eagle3DecodingConfig, LookaheadDecodingConfig,
MedusaDecodingConfig)
from tensorrt_llm.quantization import QuantAlgo

@@ -476,7 +476,7 @@ def test_eagle(self, cuda_graph, chunked_context, typical_acceptance,
extra_summarize_args.extend(
["--eagle_posterior_threshold=0.09", "--temperature=0.7"])

self.run(spec_dec_algo=EagleDecodingConfig.decoding_type,
self.run(spec_dec_algo=Eagle3DecodingConfig.decoding_type,
extra_convert_args=[
f"--eagle_model_dir={self.EAGLE_MODEL_PATH}",
"--max_draft_len=63", "--num_eagle_layers=4",
@@ -503,7 +503,7 @@ def test_eagle_2(self, cuda_graph, chunked_context, mocker):
if chunked_context:
extra_summarize_args.append("--enable_chunked_context")

self.run(spec_dec_algo=EagleDecodingConfig.decoding_type,
self.run(spec_dec_algo=Eagle3DecodingConfig.decoding_type,
extra_convert_args=[
f"--eagle_model_dir={self.EAGLE_MODEL_PATH}",
"--max_draft_len=63", "--num_eagle_layers=4",