[None][feat] Auto download speculative models from HF for pytorch backend, add speculative_model field alias #10099
Conversation
Force-pushed: 659d01e to b4d5011 (Compare)
📝 Walkthrough
This PR renames the `speculative_model_dir` field to `speculative_model` across the LLM API (keeping the old name as an alias) and enables automatic download of speculative draft models from HuggingFace for the PyTorch backend.
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/unittest/llmapi/test_llm_args.py (1)
448-460: Alias test correctly validates speculative_model_dir → speculative_model. The test covers the intended alias path (`EagleDecodingConfig` with `speculative_model_dir` flowing into `TorchLlmArgs.speculative_model`) and looks correct.
If you touch imports in this file again, consider replacing the `from tensorrt_llm.llmapi.llm_args import *` star import with explicit names to satisfy Ruff F405 and match the namespace-import guideline, but no need to block on it here.
examples/models/core/qwen/README.md (1)
840-864: Eagle3 docs correctly describe speculative_model; consider wrapping bare URL. The updated text and YAML example correctly document `speculative_config.speculative_model` as taking either a Hugging Face ID or a local path, aligned with the code changes.
To satisfy markdownlint (MD034) and improve readability, you could wrap the bare Hugging Face URL in link syntax and tweak the wording slightly. Optional doc tweak for the Eagle3 bullet:
```diff
- Specify the Eagle3 draft model either as a Huggingface model ID or a local path. You can find ready-to-use Eagle3 draft models at https://huggingface.co/collections/nvidia/speculative-decoding-modules.
+ Specify the Eagle3 draft model either as a Hugging Face model ID or a local path. You can find ready-to-use Eagle3 draft models in [this Hugging Face collection](https://huggingface.co/collections/nvidia/speculative-decoding-modules).
```
tensorrt_llm/llmapi/llm_args.py (1)
621-626: Consider documenting the field alias in the comment. The field alias implementation is correct and provides backward compatibility. However, the comment doesn't mention that `speculative_model_dir` is still supported as an alias. Consider adding a note like:
```python
# The speculative (draft) model. Accepts either:
# - A HuggingFace Hub model ID (str), e.g., "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
#   which will be automatically downloaded.
# - A local filesystem path to a downloaded model directory.
# Note: speculative_model_dir is supported as an alias for backward compatibility.
```
This would help users who are migrating from the old parameter name.
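For readers unfamiliar with how such an alias works, here is a minimal, self-contained Pydantic v2 sketch of the pattern (the class name and exact declaration are illustrative; the real field lives on the decoding config classes in `llm_args.py`):

```python
from typing import Optional

from pydantic import AliasChoices, BaseModel, Field


class SpecConfigSketch(BaseModel):
    """Illustrative stand-in for the real decoding config classes."""

    # Accepts an HF Hub model ID or a local path; the old name
    # `speculative_model_dir` is kept as a validation alias.
    speculative_model: Optional[str] = Field(
        default=None,
        validation_alias=AliasChoices("speculative_model",
                                      "speculative_model_dir"),
    )


# Both spellings populate the same field:
a = SpecConfigSketch(speculative_model="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B")
b = SpecConfigSketch(speculative_model_dir="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B")
assert a.speculative_model == b.speculative_model
```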
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (37)
docs/source/features/speculative-decoding.md (3 hunks)
examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py (1 hunk)
examples/llm-api/_tensorrt_engine/llm_eagle_decoding.py (1 hunk)
examples/llm-api/_tensorrt_engine/llm_medusa_decoding.py (1 hunk)
examples/llm-api/llm_speculative_decoding.py (1 hunk)
examples/llm-api/quickstart_advanced.py (2 hunks)
examples/models/core/qwen/README.md (2 hunks)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1 hunk)
tensorrt_llm/_torch/models/modeling_speculative.py (1 hunk)
tensorrt_llm/_torch/pyexecutor/model_loader.py (1 hunk)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunk)
tensorrt_llm/llmapi/llm_args.py (7 hunks)
tensorrt_llm/llmapi/llm_utils.py (3 hunks)
tensorrt_llm/llmapi/utils.py (1 hunk)
tests/integration/defs/accuracy/test_disaggregated_serving.py (2 hunks)
tests/integration/defs/accuracy/test_llm_api.py (2 hunks)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (12 hunks)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (1 hunk)
tests/integration/defs/examples/serve/test_spec_decoding_metrics.py (2 hunks)
tests/integration/defs/examples/test_ad_speculative_decoding.py (1 hunk)
tests/integration/defs/perf/pytorch_model_config.py (1 hunk)
tests/integration/defs/perf/test_perf.py (3 hunks)
tests/integration/defs/test_e2e.py (1 hunk)
tests/scripts/perf-sanity/l0_dgx_b200.yaml (1 hunk)
tests/scripts/perf-sanity/run_benchmark_serve.py (5 hunks)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_speculative_decoding.py (1 hunk)
tests/unittest/_torch/speculative/test_draft_len_schedule.py (3 hunks)
tests/unittest/_torch/speculative/test_draft_target.py (1 hunk)
tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py (1 hunk)
tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py (1 hunk)
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py (1 hunk)
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunk)
tests/unittest/_torch/speculative/test_eagle3.py (9 hunks)
tests/unittest/_torch/speculative/test_kv_cache_reuse.py (1 hunk)
tests/unittest/_torch/speculative/test_spec_gate.py (1 hunk)
tests/unittest/llmapi/test_llm.py (4 hunks)
tests/unittest/llmapi/test_llm_args.py (1 hunk)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: `some_file.py`
Python classes should use PascalCase naming: `class SomeClass`
Python functions and methods should use snake_case naming: `def my_awesome_function():`
Python local variables should use snake_case naming: `my_variable = ...`
Python variable names that start with a number should be prefixed with 'k': `k_99th_percentile = ...`
Python global variables should use upper snake_case with prefix 'G': `G_MY_GLOBAL = ...`
Python constants should use upper snake_case naming: `MY_CONSTANT = ...`
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
Files:
tensorrt_llm/_torch/models/modeling_speculative.py
tests/scripts/perf-sanity/run_benchmark_serve.py
tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tests/integration/defs/examples/serve/test_spec_decoding_metrics.py
tests/unittest/_torch/speculative/test_draft_len_schedule.py
tests/unittest/_torch/speculative/test_kv_cache_reuse.py
tests/integration/defs/accuracy/test_llm_api.py
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tests/integration/defs/perf/test_perf.py
tests/unittest/llmapi/test_llm.py
tensorrt_llm/llmapi/utils.py
tests/unittest/_torch/speculative/test_spec_gate.py
examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py
tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py
examples/llm-api/llm_speculative_decoding.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_speculative_decoding.py
tests/integration/defs/examples/test_ad_speculative_decoding.py
tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py
examples/llm-api/_tensorrt_engine/llm_medusa_decoding.py
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py
examples/llm-api/_tensorrt_engine/llm_eagle_decoding.py
tensorrt_llm/_torch/pyexecutor/model_loader.py
tests/integration/defs/perf/pytorch_model_config.py
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
tests/integration/defs/test_e2e.py
tests/unittest/llmapi/test_llm_args.py
examples/llm-api/quickstart_advanced.py
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/llmapi/llm_utils.py
tests/unittest/_torch/speculative/test_eagle3.py
tests/unittest/_torch/speculative/test_draft_target.py
**/*.{cpp,h,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Files:
Same 34 Python files as listed under **/*.py above.
🧠 Learnings (13)
📓 Common learnings
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Learnt from: achartier
Repo: NVIDIA/TensorRT-LLM PR: 6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tests/unittest/llmapi/test_llm_args.py
tensorrt_llm/llmapi/llm_utils.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/llmapi/llm_utils.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/_torch/pyexecutor/model_loader.py
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py
tensorrt_llm/_torch/pyexecutor/model_loader.py
📚 Learning: 2025-12-12T03:27:08.565Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 9655
File: tensorrt_llm/_torch/pyexecutor/sampler.py:3031-3031
Timestamp: 2025-12-12T03:27:08.565Z
Learning: In files under tensorrt_llm/_torch/pyexecutor, avoid accessing torch.Tensor objects inside for-loops when iterating over requests. Convert batched tensors to Python lists beforehand using tensor.tolist(), and then iterate over those lists. This improves performance by reducing tensor-bound operations inside hot loops. Apply this pattern to similar code paths that process batches to access simple Python data structures (lists) inside loops.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/_torch/pyexecutor/model_loader.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py
tests/integration/defs/examples/test_ad_speculative_decoding.py
examples/llm-api/_tensorrt_engine/llm_eagle_decoding.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/llmapi/llm_utils.py
📚 Learning: 2025-09-03T13:16:06.824Z
Learnt from: nvpohanh
Repo: NVIDIA/TensorRT-LLM PR: 7478
File: tensorrt_llm/_torch/models/modeling_llama.py:1315-1315
Timestamp: 2025-09-03T13:16:06.824Z
Learning: The Llama4VisionEncoder.load_weights method signature is `def load_weights(self, weights: Dict)` and should not be confused with Llama4ForConditionalGeneration.load_weights which has a different signature including weight_mapper parameter.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_loader.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/unittest/llmapi/test_llm_args.py
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/unittest/llmapi/test_llm_args.py
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.
Applied to files:
tensorrt_llm/llmapi/llm_utils.py
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
Repo: NVIDIA/TensorRT-LLM PR: 6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.
Applied to files:
tensorrt_llm/llmapi/llm_utils.py
🧬 Code graph analysis (29)
tensorrt_llm/_torch/models/modeling_speculative.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/scripts/perf-sanity/run_benchmark_serve.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/scripts/perf-sanity/run_benchmark_serve.py (1)
llm_models_root(173-174)
tests/integration/defs/conftest.py (1)
llm_models_root(80-94)
tests/unittest/_torch/speculative/test_kv_cache_reuse.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/integration/defs/accuracy/test_llm_api.py (2)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/scripts/perf-sanity/run_benchmark_serve.py (1)
llm_models_root(173-174)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (2)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tensorrt_llm/_torch/models/modeling_llava_next.py (1)
model_path(77-78)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (8)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (1)
model_path(75-80)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)
model_path(90-91)
tests/unittest/_torch/sampler/test_trtllm_sampler.py (1)
model_path(11-12)
tensorrt_llm/inputs/registry.py (1)
model_path(348-349)
tests/unittest/_torch/executor/test_overlap_scheduler.py (1)
model_path(20-21)
tensorrt_llm/_torch/models/modeling_llava_next.py (1)
model_path(77-78)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
model_path(807-808)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/integration/defs/perf/test_perf.py (2)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/scripts/perf-sanity/run_benchmark_serve.py (1)
llm_models_root(173-174)
tests/unittest/llmapi/test_llm.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tensorrt_llm/llmapi/utils.py (1)
tensorrt_llm/logger.py (1)
info(138-139)
tests/unittest/_torch/speculative/test_spec_gate.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py (1)
tensorrt_llm/llmapi/llm_args.py (2)
EagleDecodingConfig(815-932)
speculative_model(2091-2092)
tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
examples/llm-api/llm_speculative_decoding.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_speculative_decoding.py (2)
examples/auto_deploy/build_and_run_ad.py (1)
ExperimentConfig(126-239)
tensorrt_llm/llmapi/llm_args.py (1)
DraftTargetDecodingConfig(1036-1049)
tests/integration/defs/examples/test_ad_speculative_decoding.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
examples/llm-api/_tensorrt_engine/llm_medusa_decoding.py (1)
tensorrt_llm/llmapi/llm_args.py (2)
MedusaDecodingConfig(797-812)
speculative_model(2091-2092)
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
examples/llm-api/_tensorrt_engine/llm_eagle_decoding.py (1)
tensorrt_llm/llmapi/llm_args.py (2)
EagleDecodingConfig(815-932)
speculative_model(2091-2092)
tensorrt_llm/_torch/pyexecutor/model_loader.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/integration/defs/test_e2e.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/unittest/llmapi/test_llm_args.py (1)
tensorrt_llm/llmapi/llm_args.py (3)
EagleDecodingConfig(815-932)
TorchLlmArgs(2674-3212)
speculative_model(2091-2092)
examples/llm-api/quickstart_advanced.py (1)
tensorrt_llm/llmapi/llm_args.py (4)
speculative_model(2091-2092)
model_dir(1803-1805)
model_dir(1808-1812)
EagleDecodingConfig(815-932)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (1)
model_path(75-80)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tensorrt_llm/llmapi/llm_args.py (1)
tensorrt_llm/builder.py (1)
default(45-50)
tests/unittest/_torch/speculative/test_eagle3.py (2)
tests/scripts/perf-sanity/run_benchmark_serve.py (1)
llm_models_root(173-174)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/unittest/_torch/speculative/test_draft_target.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
🪛 markdownlint-cli2 (0.18.1)
examples/models/core/qwen/README.md
845-845: Bare URL used
(MD034, no-bare-urls)
🪛 Ruff (0.14.8)
tests/unittest/llmapi/test_llm_args.py
451-451: EagleDecodingConfig may be undefined, or defined from star imports
(F405)
457-457: TorchLlmArgs may be undefined, or defined from star imports
(F405)
tests/unittest/_torch/speculative/test_eagle3.py
157-157: Unused function argument: request
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (44)
tests/integration/defs/perf/pytorch_model_config.py (1)
226-226: LGTM! The key rename from `speculative_model_dir` to `speculative_model` aligns with the broader API change across the codebase.
examples/llm-api/_tensorrt_engine/llm_eagle_decoding.py (1)
26-31: LGTM! The parameter rename from `speculative_model_dir` to `speculative_model` correctly reflects the updated EagleDecodingConfig API, and the comment has been updated accordingly.
tests/scripts/perf-sanity/l0_dgx_b200.yaml (1)
283-283: LGTM! The YAML configuration key has been correctly updated from `speculative_model_dir` to `speculative_model`.
examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py (1)
26-31: LGTM! The parameter and comment updates correctly align with the EagleDecodingConfig API change from `speculative_model_dir` to `speculative_model`.
tensorrt_llm/_torch/models/modeling_speculative.py (1)
585-585: LGTM! The property access has been correctly updated from `speculative_model_dir` to `speculative_model`, consistent with the API changes in the spec_config interface.
tests/integration/defs/perf/test_perf.py (3)
597-597: LGTM! The configuration key has been correctly updated to read `speculative_model` instead of `speculative_model_dir`.
705-706: Database field name retained for schema compatibility. The database field name remains `s_speculative_model_dir` (line 706) while the code now reads from `speculative_model` (line 597). This preserves backward compatibility with the existing database schema while adopting the new configuration key name.
719-727: LGTM! The configuration path handling has been correctly updated to reference `speculative_model` throughout the nested dictionary access and path construction logic.
tensorrt_llm/llmapi/utils.py (1)
227-235: LGTM! The logging statements added around the HuggingFace model download provide useful visibility into the download process without affecting functionality.
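The logging in question is roughly of this shape (a sketch, not the exact implementation; the helper name and message text are assumptions):

```python
from pathlib import Path
from typing import Optional

from huggingface_hub import snapshot_download

from tensorrt_llm.logger import logger


def download_hf_model_sketch(model: str,
                             revision: Optional[str] = None) -> Path:
    """Download an HF Hub snapshot, logging before and after the call."""
    logger.info(f"Downloading model {model} from HuggingFace Hub")
    hf_folder = snapshot_download(model, revision=revision)
    logger.info(f"Finished downloading model {model} to {hf_folder}")
    return Path(hf_folder)
```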
tests/integration/defs/examples/serve/test_spec_decoding_metrics.py (2)
96-96: LGTM! The test configuration has been correctly updated to use `speculative_model` instead of `speculative_model_dir`.
177-177: LGTM! The test configuration has been correctly updated to use `speculative_model` in the two-model mode test as well.
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py (1)
26-26: LGTM! Parameter rename applied correctly. The change from `speculative_model_dir` to `speculative_model` aligns with the API update across the codebase. The local variable `eagle_model_dir` appropriately retains its descriptive name.
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
401-401: LGTM! Field access updated correctly. The change correctly accesses the renamed `speculative_model` field from the spec_config when creating the draft model engine.
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)
754-754: LGTM! Field access updated correctly. The change correctly accesses the renamed `speculative_model` field when creating the draft model engine for the AutoDeploy backend.
examples/llm-api/_tensorrt_engine/llm_medusa_decoding.py (1)
51-54: LGTM! Comment and code updated consistently. Both the comment (line 51) and the parameter usage (line 54) have been updated from `speculative_model_dir` to `speculative_model`, ensuring documentation stays in sync with the code.
tests/integration/defs/accuracy/test_disaggregated_serving.py (2)
540-541: LGTM! Configuration dictionary updated correctly. The dictionary key has been updated from `"speculative_model_dir"` to `"speculative_model"` in the speculative decoding configuration, ensuring consistency with the API change.
639-640: LGTM! Configuration dictionary updated correctly. The dictionary key has been updated from `"speculative_model_dir"` to `"speculative_model"` in the speculative decoding configuration, consistent with the same change in the test_eagle3 method.
tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py (1)
41-41: LGTM! Parameter rename applied correctly. The change from `speculative_model_dir` to `speculative_model` is consistent with the API update. The local variable `eagle_model_dir` appropriately retains its descriptive name.
tests/unittest/_torch/speculative/test_draft_target.py (1)
48-48: LGTM! Parameter rename applied correctly. The change from `speculative_model_dir` to `speculative_model` in the DraftTargetDecodingConfig is consistent with the API update. The local variable `draft_model_dir` appropriately retains its descriptive name.
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (1)
403-403: LGTM! Parameter change correctly applied. The parameter `speculative_model` is the current field in EagleDecodingConfig, and your change at line 403 correctly uses this parameter. The codebase maintains backwards compatibility through an alias for the deprecated `speculative_model_dir` parameter, so this is a compatible migration rather than a breaking change. Both the old and new parameter names remain functional across the codebase.
tests/unittest/_torch/speculative/test_kv_cache_reuse.py (1)
53-57: EagleDecodingConfig keyword rename is consistent with new API. Switching to `speculative_model=eagle_model_dir` matches the updated config field name while preserving behavior, since the value is still a local draft model path.
tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py (1)
86-94: speculative_model rename in EagleDecodingConfig is correct here. Using `speculative_model=eagle_model_dir` aligns this test with the updated EagleDecodingConfig API; the rest of the setup (including Eagle3ResourceManager and SpecTreeManager) remains unaffected.
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
57-62: Dynamic spec decode test updated to new speculative_model keyword. The change to `speculative_model=eagle_model_dir` correctly tracks the config field rename; draft model selection behavior is unchanged.
tensorrt_llm/_torch/pyexecutor/model_loader.py (1)
278-283: Draft weight loading now correctly sources `self.spec_config.speculative_model`. Using `self.spec_config.speculative_model` in the draft `load_weights` call matches the renamed config field and the validator contract (`speculative_model` must be set whenever `need_load_draft_weights()` is true). This should keep existing local-path behavior intact while enabling the new HF-ID-based flow for PyTorch/AutoDeploy.
tests/integration/defs/test_e2e.py (1)
3345-3350: Eagle3 consistency test now uses speculative_model field. Updating the EagleDecodingConfig here to `speculative_model=eagle_model_dir` is consistent with the new public field name and keeps the test semantics (one-model Eagle3 draft) unchanged.
tests/integration/defs/accuracy/test_llm_api.py (1)
472-483: Accuracy tests updated to use speculative_model for Eagle configs. Both Eagle-based Vicuna accuracy tests now construct EagleDecodingConfig with `speculative_model=...`, matching the renamed field while preserving the same local draft model directory. This keeps the speculative-decoding behavior and accuracy coverage the same under the new API.
Also applies to: 498-505
tests/integration/defs/examples/test_ad_speculative_decoding.py (1)
64-68: DraftTargetDecodingConfig now correctly uses `speculative_model`. The test wiring for DraftTargetDecodingConfig is consistent with the new `speculative_model` field while preserving the existing `speculative_model_dir` argument name in the helper.
tests/unittest/_torch/speculative/test_draft_len_schedule.py (1)
78-82: DraftTargetDecodingConfig updates align with renamed `speculative_model` field. All three DraftTargetDecodingConfig callsites now pass the draft model via `speculative_model`, matching the updated API without changing the test behavior.
Also applies to: 124-128, 187-191
tests/unittest/_torch/speculative/test_spec_gate.py (1)
48-56: EagleDecodingConfig correctly switched to `speculative_model`. The EAGLE3 spec config now uses `speculative_model=eagle_model_dir`, matching the updated config interface with no behavioral change.
examples/llm-api/llm_speculative_decoding.py (1)
35-47: Example now demonstrates HF-based `speculative_model` usage. `run_Eagle3` correctly uses the new `speculative_model` parameter with a HuggingFace model ID, consistent with the updated API and docs.
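As a reference for the pattern being reviewed, here is a minimal sketch of that HF-based usage (model IDs are examples from this PR; the authoritative version lives in `examples/llm-api/llm_speculative_decoding.py`):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import EagleDecodingConfig

# The draft model is an HF model ID; with this PR the PyTorch backend
# downloads it automatically instead of requiring a pre-downloaded path.
spec_config = EagleDecodingConfig(
    max_draft_len=3,
    speculative_model="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config=spec_config,
)
for output in llm.generate(["The capital of France is"]):
    print(output.outputs[0].text)
```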
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
286-289: Speculative config tests updated consistently to `speculative_model`. The various EagleDecodingConfig and MTPDecodingConfig callsites now all use the `speculative_model` parameter, covering both local paths and draft checkpoints for multiple models (Llama, Qwen, DeepSeek, Mistral). This keeps test semantics intact while exercising the renamed field.
Also applies to: 383-385, 635-636, 1361-1363, 2941-2943, 3419-3421, 3800-3804, 3850-3854, 4949-4953, 4999-5003
examples/llm-api/quickstart_advanced.py (1)
203-228: Quickstart speculative branches correctly mapped to `speculative_model`. `setup_llm` now routes `args.model_dir` / `args.draft_model_dir` into `speculative_model` for MTP, EAGLE3, and DraftTarget, matching the new configuration API while preserving the CLI surface.
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_speculative_decoding.py (1)
16-40: New parametrization cleanly tests HF ID vs local speculative model. The test now covers both a HuggingFace ID and a resolved local path via `speculative_model`, which is exactly what's needed to validate the new auto-download behavior for AutoDeploy + PyTorch.
docs/source/features/speculative-decoding.md (1)
40-49: Speculative-decoding docs are aligned with the new `speculative_model` API. The updated examples and YAML snippets correctly use `speculative_model` (with both HF IDs and local paths) and clearly document `speculative_model_dir` as an alias, matching the implementation and tests across the PR.
Also applies to: 59-76, 144-151, 153-169
tests/scripts/perf-sanity/run_benchmark_serve.py (1)
221-221: LGTM! Clean rename from `speculative_model_dir` to `speculative_model`. The changes consistently update the parameter name throughout the config metrics, class initialization, YAML generation, and parsing logic. The renaming aligns with the PR objective to clarify that the field need not be a directory.
Also applies to: 262-262, 288-288, 348-350, 503-504
tests/unittest/llmapi/test_llm.py (1)
1220-1220: LGTM! Test updates align with the API rename. The parameter name changes from `speculative_model_dir` to `speculative_model` in test configurations are consistent with the broader API changes across the codebase.
Also applies to: 1259-1259, 1297-1297, 1344-1344
tensorrt_llm/llmapi/llm_utils.py (2)
112-113: LGTM! Consistent rename throughout the file. The changes from `speculative_model_dir` to `speculative_model` maintain consistency with the API rename across attribute access and conditional checks.
Also applies to: 443-444
650-664: Error handling is already in place. The `_node_download_hf_model` function is decorated with `@print_traceback_on_error`, which wraps the call to `download_hf_model` with comprehensive exception handling. Any exceptions from `snapshot_download` (network issues, invalid model IDs, etc.) will be caught, logged with a full traceback, and propagated appropriately to the caller.
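For context, the wrap-log-reraise pattern described above looks roughly like this (a generic sketch; the real `print_traceback_on_error` in TensorRT-LLM may differ in detail):

```python
import functools
import traceback


def print_traceback_on_error(func):
    """Log a full traceback when the wrapped call raises, then re-raise."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            traceback.print_exc()  # full traceback goes to the logs
            raise  # propagate to the caller

    return wrapper
```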
tests/unittest/_torch/speculative/test_eagle3.py (2)
202-202: LGTM! Consistent API updates across all test cases. All occurrences of `speculative_model_dir` have been correctly updated to `speculative_model` in the EagleDecodingConfig instantiations throughout the test file.
Also applies to: 277-277, 358-358, 482-482, 583-583, 642-642, 695-695, 763-763
95-95: Verify network-dependent test behavior and unused fixture parameter. The new test variant at line 149 introduces HuggingFace model auto-download functionality, which requires network access during test execution. This could introduce:
- Increased test duration (model download time)
- Test flakiness due to network issues
- CI/CD environment compatibility concerns
Additionally, the static analysis tool reports that the `request` parameter at line 157 is unused. If this pytest fixture isn't required for test isolation or cleanup, consider removing it.
Please confirm:
- Is the network-dependent test acceptable for your CI environment, or should it be marked with a specific pytest marker (e.g., `@pytest.mark.require_network`)?
- Is the `request` fixture parameter actually needed, or can it be safely removed?
Also applies to: 156-169
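If a marker is adopted, registration and gating could look like this sketch (marker name follows the suggestion above; it is not an existing TRT-LLM convention):

```python
# conftest.py (sketch): register the custom marker so CI can deselect it
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "require_network: test downloads models from HuggingFace")


# test module (sketch)
import pytest


@pytest.mark.require_network
def test_eagle3_from_hf_id():
    ...  # variant that passes an HF model ID as speculative_model


# CI hosts without network access can then run:
#   pytest -m "not require_network"
```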
tensorrt_llm/llmapi/llm_args.py (4)
887-888: LGTM! Validation logic correctly uses the renamed field `speculative_model`.
2091-2092: LGTM! Property accessor correctly renamed from `speculative_model_dir` to `speculative_model`.
2432-2432: LGTM! Assertion correctly uses the renamed field `speculative_model`.
2452-2453: LGTM! The `getattr` calls correctly use the field name `"speculative_model"` rather than the alias. This is proper because the alias only affects input validation and serialization, while attribute access on instantiated objects uses the actual field name.
Also applies to: 2981-2982
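A tiny sketch of why that is the right pattern (standard Pydantic v2 semantics; the class is illustrative):

```python
from typing import Optional

from pydantic import AliasChoices, BaseModel, Field


class Cfg(BaseModel):
    speculative_model: Optional[str] = Field(
        default=None,
        validation_alias=AliasChoices("speculative_model",
                                      "speculative_model_dir"),
    )


cfg = Cfg(speculative_model_dir="/ckpts/eagle3")  # alias accepted at init
assert getattr(cfg, "speculative_model") == "/ckpts/eagle3"  # field name works
assert not hasattr(cfg, "speculative_model_dir")  # the alias never becomes an attribute
```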
govind-ramnarayan left a comment
I'm not sure if we are permitted to put downloads from HuggingFace in the CI. See: https://nvidia.slack.com/archives/C059LSY62BT/p1765940901371369?thread_ts=1765895541.514539&cid=C059LSY62BT
I believe the existing behavior in the test_ad_speculative_decoding.py smoke test on CI machines would be to just use the local model from llm-models, but please verify this.
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_speculative_decoding.py
Superjomn left a comment
LGTM on the llmapi changes.
govind-ramnarayan left a comment
Thanks for the edits! LGTM
Force-pushed: 8a5f5f4 to 1827f84 (Compare)
/bot run
PR_Github #29393 [ run ] triggered by Bot. Commit:
PR_Github #29393 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #29457 [ run ] triggered by Bot. Commit:
PR_Github #29457 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #29485 [ run ] triggered by Bot. Commit:
PR_Github #29485 [ run ] completed with state
Force-pushed: 831cb1a to 8ea8885 (Compare)
PR_Github #30773 [ run ] triggered by Bot. Commit:
PR_Github #30773 [ run ] completed with state
QiJune left a comment
LGTM
Force-pushed: 76f01fc to 49f7b27 (Compare)
/bot run --disable-fail-fast
PR_Github #30921 [ run ] triggered by Bot. Commit:
PR_Github #30921 [ run ] completed with state
/bot run
PR_Github #30957 [ run ] triggered by Bot. Commit:
PR_Github #30957 [ run ] completed with state
…kend, add speculative_model field alias
Signed-off-by: Anish Shanbhag <[email protected]>
Force-pushed: 49f7b27 to 4511bda (Compare)
/bot run --disable-fail-fast
PR_Github #31030 [ run ] triggered by Bot. Commit:
PR_Github #31030 [ run ] completed with state
/bot run
PR_Github #31108 [ run ] triggered by Bot. Commit:
PR_Github #31108 [ run ] completed with state
/bot run
PR_Github #31117 [ run ] triggered by Bot. Commit:
PR_Github #31117 [ run ] completed with state
Background
Currently, for the PyTorch / AutoDeploy backends, the draft model for speculative decoding needs to be provided as a local path. This means that users must manually download the draft model as an additional step.
The C++ backend already supports auto-downloading speculative models from HuggingFace via this logic, but this is not used for PyTorch/AutoDeploy.
Description
To resolve the above gap, this PR makes two main changes:
- Enables auto-download of speculative (draft) models from HuggingFace for the PyTorch backend, reusing the existing download logic (see the sketch after this list).
- Renames the `speculative_config.speculative_model_dir` field in `LlmArgs` to `speculative_config.speculative_model` to make it clear that this doesn't have to be a directory, while keeping the original `speculative_model_dir` as an alias to avoid a breaking change. All internal usage was updated to the `speculative_model` name for the sake of consistency, but the original `speculative_model_dir` can still be used by users. Added this note in the docs to clarify this.
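Concretely, after this change both spellings below should configure the same draft model (a sketch based on the config classes touched in this PR; treat exact imports as illustrative):

```python
from tensorrt_llm.llmapi import EagleDecodingConfig

# New spelling: may be an HF model ID (downloaded automatically by the
# PyTorch backend) or a local path as before.
new_style = EagleDecodingConfig(
    max_draft_len=3,
    speculative_model="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
)

# Old spelling: still accepted via the alias, so existing configs keep working.
old_style = EagleDecodingConfig(
    max_draft_len=3,
    speculative_model_dir="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
)

assert new_style.speculative_model == old_style.speculative_model
```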
Test Coverage
- Updated `test_llama_eagle3` (PyTorch backend) with an additional case that downloads from HF
- Updated `test_ad_speculative_decoding_smoke` with an additional case that downloads from HF
- Added `test_speculative_model_alias` to verify that the original `speculative_model_dir` field is still supported

PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ... — Provide a user-friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill — Kill all running builds associated with pull request.
skip
skip --comment COMMENT — Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipelineReuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
Release Notes
New Features
- Speculative (draft) models can now be specified as HuggingFace model IDs and are downloaded automatically for the PyTorch backend.
Documentation
- Updated speculative-decoding docs and examples to use `speculative_model` and note the `speculative_model_dir` alias.
Refactor
- Renamed the `speculative_model_dir` parameter to `speculative_model` across decoding configurations for clarity. Backward compatibility maintained via alias.