[None][feat] Auto download speculative models from HF for pytorch backend, add speculative_model field alias #10099
Conversation
Force-pushed: 659d01e to b4d5011 (Compare)
📝 Walkthrough
This PR renames the `speculative_model_dir` field to `speculative_model` across the LLM API (keeping the old name as an alias) and enables automatic download of speculative draft models from HuggingFace for the PyTorch backend.
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/unittest/llmapi/test_llm_args.py (1)
448-460: Alias test correctly validates speculative_model_dir → speculative_model. The test covers the intended alias path (`EagleDecodingConfig` with `speculative_model_dir` flowing into `TorchLlmArgs.speculative_model`) and looks correct.
If you touch imports in this file again, consider replacing the `from tensorrt_llm.llmapi.llm_args import *` star import with explicit names to satisfy Ruff F405 and match the namespace-import guideline, but no need to block on it here.
examples/models/core/qwen/README.md (1)
840-864: Eagle3 docs correctly describe speculative_model; consider wrapping bare URL. The updated text and YAML example correctly document `speculative_config.speculative_model` as taking either a Hugging Face ID or a local path, aligned with the code changes.
To satisfy markdownlint (MD034) and improve readability, you could wrap the bare Hugging Face URL in link syntax and tweak the wording slightly. Optional doc tweak for the Eagle3 bullet:
```diff
- Specify the Eagle3 draft model either as a Huggingface model ID or a local path. You can find ready-to-use Eagle3 draft models at https://huggingface.co/collections/nvidia/speculative-decoding-modules.
+ Specify the Eagle3 draft model either as a Hugging Face model ID or a local path. You can find ready-to-use Eagle3 draft models in [this Hugging Face collection](https://huggingface.co/collections/nvidia/speculative-decoding-modules).
```
tensorrt_llm/llmapi/llm_args.py (1)
621-626: Consider documenting the field alias in the comment. The field alias implementation is correct and provides backward compatibility. However, the comment doesn't mention that `speculative_model_dir` is still supported as an alias. Consider adding a note like:
```python
# The speculative (draft) model. Accepts either:
# - A HuggingFace Hub model ID (str), e.g., "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
#   which will be automatically downloaded.
# - A local filesystem path to a downloaded model directory.
# Note: speculative_model_dir is supported as an alias for backward compatibility.
```
This would help users who are migrating from the old parameter name.
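For readers unfamiliar with how such an alias works, here is a minimal, self-contained Pydantic v2 sketch of the pattern (the class name and exact declaration are illustrative; the real field lives on the decoding config classes in `llm_args.py`):

```python
from typing import Optional

from pydantic import AliasChoices, BaseModel, Field


class SpecConfigSketch(BaseModel):
    """Illustrative stand-in for the real decoding config classes."""

    # Accepts an HF Hub model ID or a local path; the old name
    # `speculative_model_dir` is kept as a validation alias.
    speculative_model: Optional[str] = Field(
        default=None,
        validation_alias=AliasChoices("speculative_model",
                                      "speculative_model_dir"),
    )


# Both spellings populate the same field:
a = SpecConfigSketch(speculative_model="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B")
b = SpecConfigSketch(speculative_model_dir="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B")
assert a.speculative_model == b.speculative_model
```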
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (37)
docs/source/features/speculative-decoding.md (3 hunks)
examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py (1 hunk)
examples/llm-api/_tensorrt_engine/llm_eagle_decoding.py (1 hunk)
examples/llm-api/_tensorrt_engine/llm_medusa_decoding.py (1 hunk)
examples/llm-api/llm_speculative_decoding.py (1 hunk)
examples/llm-api/quickstart_advanced.py (2 hunks)
examples/models/core/qwen/README.md (2 hunks)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1 hunk)
tensorrt_llm/_torch/models/modeling_speculative.py (1 hunk)
tensorrt_llm/_torch/pyexecutor/model_loader.py (1 hunk)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunk)
tensorrt_llm/llmapi/llm_args.py (7 hunks)
tensorrt_llm/llmapi/llm_utils.py (3 hunks)
tensorrt_llm/llmapi/utils.py (1 hunk)
tests/integration/defs/accuracy/test_disaggregated_serving.py (2 hunks)
tests/integration/defs/accuracy/test_llm_api.py (2 hunks)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (12 hunks)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (1 hunk)
tests/integration/defs/examples/serve/test_spec_decoding_metrics.py (2 hunks)
tests/integration/defs/examples/test_ad_speculative_decoding.py (1 hunk)
tests/integration/defs/perf/pytorch_model_config.py (1 hunk)
tests/integration/defs/perf/test_perf.py (3 hunks)
tests/integration/defs/test_e2e.py (1 hunk)
tests/scripts/perf-sanity/l0_dgx_b200.yaml (1 hunk)
tests/scripts/perf-sanity/run_benchmark_serve.py (5 hunks)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_speculative_decoding.py (1 hunk)
tests/unittest/_torch/speculative/test_draft_len_schedule.py (3 hunks)
tests/unittest/_torch/speculative/test_draft_target.py (1 hunk)
tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py (1 hunk)
tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py (1 hunk)
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py (1 hunk)
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunk)
tests/unittest/_torch/speculative/test_eagle3.py (9 hunks)
tests/unittest/_torch/speculative/test_kv_cache_reuse.py (1 hunk)
tests/unittest/_torch/speculative/test_spec_gate.py (1 hunk)
tests/unittest/llmapi/test_llm.py (4 hunks)
tests/unittest/llmapi/test_llm_args.py (1 hunk)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: `some_file.py`
Python classes should use PascalCase naming: `class SomeClass`
Python functions and methods should use snake_case naming: `def my_awesome_function():`
Python local variables should use snake_case naming: `my_variable = ...`
Python variable names that start with a number should be prefixed with 'k': `k_99th_percentile = ...`
Python global variables should use upper snake_case with prefix 'G': `G_MY_GLOBAL = ...`
Python constants should use upper snake_case naming: `MY_CONSTANT = ...`
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
Files:
tensorrt_llm/_torch/models/modeling_speculative.py
tests/scripts/perf-sanity/run_benchmark_serve.py
tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tests/integration/defs/examples/serve/test_spec_decoding_metrics.py
tests/unittest/_torch/speculative/test_draft_len_schedule.py
tests/unittest/_torch/speculative/test_kv_cache_reuse.py
tests/integration/defs/accuracy/test_llm_api.py
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tests/integration/defs/perf/test_perf.py
tests/unittest/llmapi/test_llm.py
tensorrt_llm/llmapi/utils.py
tests/unittest/_torch/speculative/test_spec_gate.py
examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py
tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py
examples/llm-api/llm_speculative_decoding.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_speculative_decoding.py
tests/integration/defs/examples/test_ad_speculative_decoding.py
tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py
examples/llm-api/_tensorrt_engine/llm_medusa_decoding.py
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py
examples/llm-api/_tensorrt_engine/llm_eagle_decoding.py
tensorrt_llm/_torch/pyexecutor/model_loader.py
tests/integration/defs/perf/pytorch_model_config.py
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
tests/integration/defs/test_e2e.py
tests/unittest/llmapi/test_llm_args.py
examples/llm-api/quickstart_advanced.py
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/llmapi/llm_utils.py
tests/unittest/_torch/speculative/test_eagle3.py
tests/unittest/_torch/speculative/test_draft_target.py
**/*.{cpp,h,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Files:
Same 34 Python files as listed under **/*.py above.
🧠 Learnings (13)
📓 Common learnings
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Learnt from: achartier
Repo: NVIDIA/TensorRT-LLM PR: 6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tests/unittest/llmapi/test_llm_args.py
tensorrt_llm/llmapi/llm_utils.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/llmapi/llm_utils.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/_torch/pyexecutor/model_loader.py
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py
tensorrt_llm/_torch/pyexecutor/model_loader.py
📚 Learning: 2025-12-12T03:27:08.565Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 9655
File: tensorrt_llm/_torch/pyexecutor/sampler.py:3031-3031
Timestamp: 2025-12-12T03:27:08.565Z
Learning: In files under tensorrt_llm/_torch/pyexecutor, avoid accessing torch.Tensor objects inside for-loops when iterating over requests. Convert batched tensors to Python lists beforehand using tensor.tolist(), and then iterate over those lists. This improves performance by reducing tensor-bound operations inside hot loops. Apply this pattern to similar code paths that process batches to access simple Python data structures (lists) inside loops.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/_torch/pyexecutor/model_loader.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py
tests/integration/defs/examples/test_ad_speculative_decoding.py
examples/llm-api/_tensorrt_engine/llm_eagle_decoding.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/llmapi/llm_utils.py
📚 Learning: 2025-09-03T13:16:06.824Z
Learnt from: nvpohanh
Repo: NVIDIA/TensorRT-LLM PR: 7478
File: tensorrt_llm/_torch/models/modeling_llama.py:1315-1315
Timestamp: 2025-09-03T13:16:06.824Z
Learning: The Llama4VisionEncoder.load_weights method signature is `def load_weights(self, weights: Dict)` and should not be confused with Llama4ForConditionalGeneration.load_weights which has a different signature including weight_mapper parameter.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_loader.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/unittest/llmapi/test_llm_args.py
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/unittest/llmapi/test_llm_args.py
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.
Applied to files:
tensorrt_llm/llmapi/llm_utils.py
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
Repo: NVIDIA/TensorRT-LLM PR: 6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.
Applied to files:
tensorrt_llm/llmapi/llm_utils.py
🧬 Code graph analysis (29)
tensorrt_llm/_torch/models/modeling_speculative.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/scripts/perf-sanity/run_benchmark_serve.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/scripts/perf-sanity/run_benchmark_serve.py (1)
llm_models_root(173-174)
tests/integration/defs/conftest.py (1)
llm_models_root(80-94)
tests/unittest/_torch/speculative/test_kv_cache_reuse.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/integration/defs/accuracy/test_llm_api.py (2)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/scripts/perf-sanity/run_benchmark_serve.py (1)
llm_models_root(173-174)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (2)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tensorrt_llm/_torch/models/modeling_llava_next.py (1)
model_path(77-78)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (8)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (1)
model_path(75-80)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)
model_path(90-91)
tests/unittest/_torch/sampler/test_trtllm_sampler.py (1)
model_path(11-12)
tensorrt_llm/inputs/registry.py (1)
model_path(348-349)
tests/unittest/_torch/executor/test_overlap_scheduler.py (1)
model_path(20-21)
tensorrt_llm/_torch/models/modeling_llava_next.py (1)
model_path(77-78)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
model_path(807-808)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/integration/defs/perf/test_perf.py (2)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/scripts/perf-sanity/run_benchmark_serve.py (1)
llm_models_root(173-174)
tests/unittest/llmapi/test_llm.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tensorrt_llm/llmapi/utils.py (1)
tensorrt_llm/logger.py (1)
info(138-139)
tests/unittest/_torch/speculative/test_spec_gate.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py (1)
tensorrt_llm/llmapi/llm_args.py (2)
EagleDecodingConfig(815-932)
speculative_model(2091-2092)
tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
examples/llm-api/llm_speculative_decoding.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_speculative_decoding.py (2)
examples/auto_deploy/build_and_run_ad.py (1)
ExperimentConfig(126-239)
tensorrt_llm/llmapi/llm_args.py (1)
DraftTargetDecodingConfig(1036-1049)
tests/integration/defs/examples/test_ad_speculative_decoding.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
examples/llm-api/_tensorrt_engine/llm_medusa_decoding.py (1)
tensorrt_llm/llmapi/llm_args.py (2)
MedusaDecodingConfig(797-812)
speculative_model(2091-2092)
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
examples/llm-api/_tensorrt_engine/llm_eagle_decoding.py (1)
tensorrt_llm/llmapi/llm_args.py (2)
EagleDecodingConfig(815-932)
speculative_model(2091-2092)
tensorrt_llm/_torch/pyexecutor/model_loader.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/integration/defs/test_e2e.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/unittest/llmapi/test_llm_args.py (1)
tensorrt_llm/llmapi/llm_args.py (3)
EagleDecodingConfig(815-932)
TorchLlmArgs(2674-3212)
speculative_model(2091-2092)
examples/llm-api/quickstart_advanced.py (1)
tensorrt_llm/llmapi/llm_args.py (4)
speculative_model(2091-2092)
model_dir(1803-1805)
model_dir(1808-1812)
EagleDecodingConfig(815-932)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (1)
model_path(75-80)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tensorrt_llm/llmapi/llm_args.py (1)
tensorrt_llm/builder.py (1)
default(45-50)
tests/unittest/_torch/speculative/test_eagle3.py (2)
tests/scripts/perf-sanity/run_benchmark_serve.py (1)
llm_models_root(173-174)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
tests/unittest/_torch/speculative/test_draft_target.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
speculative_model(2091-2092)
🪛 markdownlint-cli2 (0.18.1)
examples/models/core/qwen/README.md
845-845: Bare URL used
(MD034, no-bare-urls)
🪛 Ruff (0.14.8)
tests/unittest/llmapi/test_llm_args.py
451-451: EagleDecodingConfig may be undefined, or defined from star imports
(F405)
457-457: TorchLlmArgs may be undefined, or defined from star imports
(F405)
tests/unittest/_torch/speculative/test_eagle3.py
157-157: Unused function argument: request
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (44)
tests/integration/defs/perf/pytorch_model_config.py (1)
226-226: LGTM! The key rename from `speculative_model_dir` to `speculative_model` aligns with the broader API change across the codebase.
examples/llm-api/_tensorrt_engine/llm_eagle_decoding.py (1)
26-31: LGTM! The parameter rename from `speculative_model_dir` to `speculative_model` correctly reflects the updated EagleDecodingConfig API, and the comment has been updated accordingly.
tests/scripts/perf-sanity/l0_dgx_b200.yaml (1)
283-283: LGTM! The YAML configuration key has been correctly updated from `speculative_model_dir` to `speculative_model`.
examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py (1)
26-31: LGTM! The parameter and comment updates correctly align with the EagleDecodingConfig API change from `speculative_model_dir` to `speculative_model`.
tensorrt_llm/_torch/models/modeling_speculative.py (1)
585-585: LGTM! The property access has been correctly updated from `speculative_model_dir` to `speculative_model`, consistent with the API changes in the spec_config interface.
tests/integration/defs/perf/test_perf.py (3)
597-597: LGTM! The configuration key has been correctly updated to read `speculative_model` instead of `speculative_model_dir`.
705-706: Database field name retained for schema compatibility. The database field name remains `s_speculative_model_dir` (line 706) while the code now reads from `speculative_model` (line 597). This preserves backward compatibility with the existing database schema while adopting the new configuration key name.
719-727: LGTM! The configuration path handling has been correctly updated to reference `speculative_model` throughout the nested dictionary access and path construction logic.
tensorrt_llm/llmapi/utils.py (1)
227-235: LGTM! The logging statements added around the HuggingFace model download provide useful visibility into the download process without affecting functionality.
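The logging in question is roughly of this shape (a sketch, not the exact implementation; the helper name and message text are assumptions):

```python
from pathlib import Path
from typing import Optional

from huggingface_hub import snapshot_download

from tensorrt_llm.logger import logger


def download_hf_model_sketch(model: str,
                             revision: Optional[str] = None) -> Path:
    """Download an HF Hub snapshot, logging before and after the call."""
    logger.info(f"Downloading model {model} from HuggingFace Hub")
    hf_folder = snapshot_download(model, revision=revision)
    logger.info(f"Finished downloading model {model} to {hf_folder}")
    return Path(hf_folder)
```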
tests/integration/defs/examples/serve/test_spec_decoding_metrics.py (2)
96-96: LGTM! The test configuration has been correctly updated to use `speculative_model` instead of `speculative_model_dir`.
177-177: LGTM! The test configuration has been correctly updated to use `speculative_model` in the two-model mode test as well.
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py (1)
26-26: LGTM! Parameter rename applied correctly. The change from `speculative_model_dir` to `speculative_model` aligns with the API update across the codebase. The local variable `eagle_model_dir` appropriately retains its descriptive name.
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
401-401: LGTM! Field access updated correctly. The change correctly accesses the renamed `speculative_model` field from the spec_config when creating the draft model engine.
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)
754-754: LGTM! Field access updated correctly. The change correctly accesses the renamed `speculative_model` field when creating the draft model engine for the AutoDeploy backend.
examples/llm-api/_tensorrt_engine/llm_medusa_decoding.py (1)
51-54: LGTM! Comment and code updated consistently. Both the comment (line 51) and the parameter usage (line 54) have been updated from `speculative_model_dir` to `speculative_model`, ensuring documentation stays in sync with the code.
tests/integration/defs/accuracy/test_disaggregated_serving.py (2)
540-541: LGTM! Configuration dictionary updated correctly. The dictionary key has been updated from `"speculative_model_dir"` to `"speculative_model"` in the speculative decoding configuration, ensuring consistency with the API change.
639-640: LGTM! Configuration dictionary updated correctly. The dictionary key has been updated from `"speculative_model_dir"` to `"speculative_model"` in the speculative decoding configuration, consistent with the same change in the test_eagle3 method.
tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py (1)
41-41: LGTM! Parameter rename applied correctly. The change from `speculative_model_dir` to `speculative_model` is consistent with the API update. The local variable `eagle_model_dir` appropriately retains its descriptive name.
tests/unittest/_torch/speculative/test_draft_target.py (1)
48-48: LGTM! Parameter rename applied correctly. The change from `speculative_model_dir` to `speculative_model` in the DraftTargetDecodingConfig is consistent with the API update. The local variable `draft_model_dir` appropriately retains its descriptive name.
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (1)
403-403: LGTM! Parameter change correctly applied. The parameter `speculative_model` is the current field in EagleDecodingConfig, and your change at line 403 correctly uses this parameter. The codebase maintains backwards compatibility through an alias for the deprecated `speculative_model_dir` parameter, so this is a compatible migration rather than a breaking change. Both the old and new parameter names remain functional across the codebase.
tests/unittest/_torch/speculative/test_kv_cache_reuse.py (1)
53-57: EagleDecodingConfig keyword rename is consistent with new API. Switching to `speculative_model=eagle_model_dir` matches the updated config field name while preserving behavior, since the value is still a local draft model path.
tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py (1)
86-94: speculative_model rename in EagleDecodingConfig is correct here. Using `speculative_model=eagle_model_dir` aligns this test with the updated EagleDecodingConfig API; the rest of the setup (including Eagle3ResourceManager and SpecTreeManager) remains unaffected.
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
57-62: Dynamic spec decode test updated to new speculative_model keyword. The change to `speculative_model=eagle_model_dir` correctly tracks the config field rename; draft model selection behavior is unchanged.
tensorrt_llm/_torch/pyexecutor/model_loader.py (1)
278-283: Draft weight loading now correctly sources `self.spec_config.speculative_model`. Using `self.spec_config.speculative_model` in the draft `load_weights` call matches the renamed config field and the validator contract (`speculative_model` must be set whenever `need_load_draft_weights()` is true). This should keep existing local-path behavior intact while enabling the new HF-ID-based flow for PyTorch/AutoDeploy.
tests/integration/defs/test_e2e.py (1)
3345-3350: Eagle3 consistency test now uses speculative_model field. Updating the EagleDecodingConfig here to `speculative_model=eagle_model_dir` is consistent with the new public field name and keeps the test semantics (one-model Eagle3 draft) unchanged.
tests/integration/defs/accuracy/test_llm_api.py (1)
472-483: Accuracy tests updated to use speculative_model for Eagle configs. Both Eagle-based Vicuna accuracy tests now construct EagleDecodingConfig with `speculative_model=...`, matching the renamed field while preserving the same local draft model directory. This keeps the speculative-decoding behavior and accuracy coverage the same under the new API.
Also applies to: 498-505
tests/integration/defs/examples/test_ad_speculative_decoding.py (1)
64-68: DraftTargetDecodingConfig now correctly uses `speculative_model`. The test wiring for DraftTargetDecodingConfig is consistent with the new `speculative_model` field while preserving the existing `speculative_model_dir` argument name in the helper.
tests/unittest/_torch/speculative/test_draft_len_schedule.py (1)
78-82: DraftTargetDecodingConfig updates align with renamed `speculative_model` field. All three DraftTargetDecodingConfig callsites now pass the draft model via `speculative_model`, matching the updated API without changing the test behavior.
Also applies to: 124-128, 187-191
tests/unittest/_torch/speculative/test_spec_gate.py (1)
48-56: EagleDecodingConfig correctly switched to `speculative_model`. The EAGLE3 spec config now uses `speculative_model=eagle_model_dir`, matching the updated config interface with no behavioral change.
examples/llm-api/llm_speculative_decoding.py (1)
35-47: Example now demonstrates HF-based `speculative_model` usage. `run_Eagle3` correctly uses the new `speculative_model` parameter with a HuggingFace model ID, consistent with the updated API and docs.
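As a reference for the pattern being reviewed, here is a minimal sketch of that HF-based usage (model IDs are examples from this PR; the authoritative version lives in `examples/llm-api/llm_speculative_decoding.py`):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import EagleDecodingConfig

# The draft model is an HF model ID; with this PR the PyTorch backend
# downloads it automatically instead of requiring a pre-downloaded path.
spec_config = EagleDecodingConfig(
    max_draft_len=3,
    speculative_model="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config=spec_config,
)
for output in llm.generate(["The capital of France is"]):
    print(output.outputs[0].text)
```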
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
286-289: Speculative config tests updated consistently to `speculative_model`. The various EagleDecodingConfig and MTPDecodingConfig callsites now all use the `speculative_model` parameter, covering both local paths and draft checkpoints for multiple models (Llama, Qwen, DeepSeek, Mistral). This keeps test semantics intact while exercising the renamed field.
Also applies to: 383-385, 635-636, 1361-1363, 2941-2943, 3419-3421, 3800-3804, 3850-3854, 4949-4953, 4999-5003
examples/llm-api/quickstart_advanced.py (1)
203-228: Quickstart speculative branches correctly mapped to `speculative_model`. `setup_llm` now routes `args.model_dir` / `args.draft_model_dir` into `speculative_model` for MTP, EAGLE3, and DraftTarget, matching the new configuration API while preserving the CLI surface.
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_speculative_decoding.py (1)
16-40: New parametrization cleanly tests HF ID vs local speculative model. The test now covers both a HuggingFace ID and a resolved local path via `speculative_model`, which is exactly what's needed to validate the new auto-download behavior for AutoDeploy + PyTorch.
docs/source/features/speculative-decoding.md (1)
40-49: Speculative-decoding docs are aligned with the new `speculative_model` API. The updated examples and YAML snippets correctly use `speculative_model` (with both HF IDs and local paths) and clearly document `speculative_model_dir` as an alias, matching the implementation and tests across the PR.
Also applies to: 59-76, 144-151, 153-169
tests/scripts/perf-sanity/run_benchmark_serve.py (1)
221-221: LGTM! Clean rename from `speculative_model_dir` to `speculative_model`. The changes consistently update the parameter name throughout the config metrics, class initialization, YAML generation, and parsing logic. The renaming aligns with the PR objective to clarify that the field need not be a directory.
Also applies to: 262-262, 288-288, 348-350, 503-504
tests/unittest/llmapi/test_llm.py (1)
1220-1220: LGTM! Test updates align with the API rename. The parameter name changes from `speculative_model_dir` to `speculative_model` in test configurations are consistent with the broader API changes across the codebase.
Also applies to: 1259-1259, 1297-1297, 1344-1344
tensorrt_llm/llmapi/llm_utils.py (2)
112-113: LGTM! Consistent rename throughout the file. The changes from `speculative_model_dir` to `speculative_model` maintain consistency with the API rename across attribute access and conditional checks.
Also applies to: 443-444
650-664: Error handling is already in place. The `_node_download_hf_model` function is decorated with `@print_traceback_on_error`, which wraps the call to `download_hf_model` with comprehensive exception handling. Any exceptions from `snapshot_download` (network issues, invalid model IDs, etc.) will be caught, logged with a full traceback, and propagated appropriately to the caller.
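For context, the wrap-log-reraise pattern described above looks roughly like this (a generic sketch; the real `print_traceback_on_error` in TensorRT-LLM may differ in detail):

```python
import functools
import traceback


def print_traceback_on_error(func):
    """Log a full traceback when the wrapped call raises, then re-raise."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            traceback.print_exc()  # full traceback goes to the logs
            raise  # propagate to the caller

    return wrapper
```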
tests/unittest/_torch/speculative/test_eagle3.py (2)
202-202: LGTM! Consistent API updates across all test cases. All occurrences of `speculative_model_dir` have been correctly updated to `speculative_model` in the EagleDecodingConfig instantiations throughout the test file.
Also applies to: 277-277, 358-358, 482-482, 583-583, 642-642, 695-695, 763-763
95-95: Verify network-dependent test behavior and unused fixture parameter. The new test variant at line 149 introduces HuggingFace model auto-download functionality, which requires network access during test execution. This could introduce:
- Increased test duration (model download time)
- Test flakiness due to network issues
- CI/CD environment compatibility concerns
Additionally, the static analysis tool reports that the `request` parameter at line 157 is unused. If this pytest fixture isn't required for test isolation or cleanup, consider removing it.
Please confirm:
- Is the network-dependent test acceptable for your CI environment, or should it be marked with a specific pytest marker (e.g., `@pytest.mark.require_network`)?
- Is the `request` fixture parameter actually needed, or can it be safely removed?
Also applies to: 156-169
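If a marker is adopted, registration and gating could look like this sketch (marker name follows the suggestion above; it is not an existing TRT-LLM convention):

```python
# conftest.py (sketch): register the custom marker so CI can deselect it
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "require_network: test downloads models from HuggingFace")


# test module (sketch)
import pytest


@pytest.mark.require_network
def test_eagle3_from_hf_id():
    ...  # variant that passes an HF model ID as speculative_model


# CI hosts without network access can then run:
#   pytest -m "not require_network"
```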
tensorrt_llm/llmapi/llm_args.py (4)
887-888: LGTM! Validation logic correctly uses the renamed field `speculative_model`.
2091-2092: LGTM! Property accessor correctly renamed from `speculative_model_dir` to `speculative_model`.
2432-2432: LGTM! Assertion correctly uses the renamed field `speculative_model`.
2452-2453: LGTM! The `getattr` calls correctly use the field name `"speculative_model"` rather than the alias. This is proper because the alias only affects input validation and serialization, while attribute access on instantiated objects uses the actual field name.
Also applies to: 2981-2982
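A tiny sketch of why that is the right pattern (standard Pydantic v2 semantics; the class is illustrative):

```python
from typing import Optional

from pydantic import AliasChoices, BaseModel, Field


class Cfg(BaseModel):
    speculative_model: Optional[str] = Field(
        default=None,
        validation_alias=AliasChoices("speculative_model",
                                      "speculative_model_dir"),
    )


cfg = Cfg(speculative_model_dir="/ckpts/eagle3")  # alias accepted at init
assert getattr(cfg, "speculative_model") == "/ckpts/eagle3"  # field name works
assert not hasattr(cfg, "speculative_model_dir")  # the alias never becomes an attribute
```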
govind-ramnarayan left a comment
I'm not sure if we are permitted to put downloads from HuggingFace in the CI. See: https://nvidia.slack.com/archives/C059LSY62BT/p1765940901371369?thread_ts=1765895541.514539&cid=C059LSY62BT
I believe the existing behavior in the test_ad_speculative_decoding.py smoke test on CI machines would be to just use the local model from llm-models, but please verify this.
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_speculative_decoding.py
Superjomn left a comment
LGTM on the llmapi changes.
govind-ramnarayan left a comment
Thanks for the edits! LGTM
Force-pushed: 8a5f5f4 to 1827f84 (Compare)
/bot run
PR_Github #29393 [ run ] triggered by Bot. Commit:
PR_Github #29393 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #29457 [ run ] triggered by Bot. Commit:
PR_Github #29457 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #29485 [ run ] triggered by Bot. Commit:
PR_Github #29485 [ run ] completed with state
Force-pushed: 831cb1a to 8ea8885 (Compare)
PR_Github #30773 [ run ] triggered by Bot. Commit:
PR_Github #30773 [ run ] completed with state
QiJune left a comment
LGTM
Force-pushed: 76f01fc to 49f7b27 (Compare)
/bot run --disable-fail-fast
PR_Github #30921 [ run ] triggered by Bot. Commit:
PR_Github #30921 [ run ] completed with state
/bot run
PR_Github #30957 [ run ] triggered by Bot. Commit:
PR_Github #30957 [ run ] completed with state
…kend, add speculative_model field alias
Signed-off-by: Anish Shanbhag <[email protected]>
Force-pushed: 49f7b27 to 4511bda (Compare)
/bot run --disable-fail-fast
PR_Github #31030 [ run ] triggered by Bot. Commit:
PR_Github #31030 [ run ] completed with state
/bot run
PR_Github #31108 [ run ] triggered by Bot. Commit:
PR_Github #31108 [ run ] completed with state
/bot run
PR_Github #31117 [ run ] triggered by Bot. Commit:
PR_Github #31117 [ run ] completed with state
Background
Currently, for the PyTorch / AutoDeploy backends, the draft model for speculative decoding needs to be provided as a local path. This means that users must manually download the draft model as an additional step.
The C++ backend already supports auto-downloading speculative models from HuggingFace via this logic, but this is not used for PyTorch/AutoDeploy.
Description
To resolve the above gap, this PR makes two main changes:
- Enables auto-download of speculative (draft) models from HuggingFace for the PyTorch backend, reusing the existing download logic (see the sketch after this list).
- Renames the `speculative_config.speculative_model_dir` field in `LlmArgs` to `speculative_config.speculative_model` to make it clear that this doesn't have to be a directory, while keeping the original `speculative_model_dir` as an alias to avoid a breaking change. All internal usage was updated to the `speculative_model` name for the sake of consistency, but the original `speculative_model_dir` can still be used by users. Added this note in the docs to clarify this.
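Concretely, after this change both spellings below should configure the same draft model (a sketch based on the config classes touched in this PR; treat exact imports as illustrative):

```python
from tensorrt_llm.llmapi import EagleDecodingConfig

# New spelling: may be an HF model ID (downloaded automatically by the
# PyTorch backend) or a local path as before.
new_style = EagleDecodingConfig(
    max_draft_len=3,
    speculative_model="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
)

# Old spelling: still accepted via the alias, so existing configs keep working.
old_style = EagleDecodingConfig(
    max_draft_len=3,
    speculative_model_dir="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
)

assert new_style.speculative_model == old_style.speculative_model
```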
Test Coverage
- Updated `test_llama_eagle3` (PyTorch backend) with an additional case that downloads from HF
- Updated `test_ad_speculative_decoding_smoke` with an additional case that downloads from HF
- Added `test_speculative_model_alias` to verify that the original `speculative_model_dir` field is still supported

PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ... — Provide a user-friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill — Kill all running builds associated with pull request.
skip
skip --comment COMMENT — Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipelineReuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
Release Notes
New Features
- Speculative (draft) models can now be specified as HuggingFace model IDs and are downloaded automatically for the PyTorch backend.
Documentation
- Updated speculative-decoding docs and examples to use `speculative_model` and note the `speculative_model_dir` alias.
Refactor
- Renamed the `speculative_model_dir` parameter to `speculative_model` across decoding configurations for clarity. Backward compatibility maintained via alias.