[#8241][feat] Support model_kwargs for pytorch backend #10351

taylor-yb-lee · 2025-12-30T23:07:27Z

Description

Currently, pytorch backend does not support model_kwargs, so this PR adds model_kwargs in BaseLlmArgs and updates the new key values in the model_kwargs onto the LlmArgs recursively.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai · 2025-12-30T23:10:17Z

📝 Walkthrough

Walkthrough

These changes introduce a new model_kwargs parameter for overriding model configuration settings throughout the TensorRT LLM framework. The parameter is added to the public BaseLlmArgs API, propagated through the model loader to the checkpoint loader, and applied recursively to update pretrained configurations with special handling for dtype conversions.

Changes

Cohort / File(s)	Summary
Public API Extension `tensorrt_llm/llmapi/llm_args.py`	Added `model_kwargs: Dict[str, Any]` field to `BaseLlmArgs` with default empty dict and beta-status annotation for passing custom model configuration overrides.
Configuration Override Logic `tensorrt_llm/_torch/model_config.py`	Implements recursive config override mechanism in `from_pretrained` to merge `model_kwargs` into `pretrained_config`, with special handling for torch\_dtype/dtype string-to-object conversion, logging of updates, and warnings for unknown keys. Added `Any` to typing imports.
Model Loader & Checkpoint Integration `tensorrt_llm/_torch/pyexecutor/model_loader.py`	Passes `model_kwargs` to `BaseCheckpointLoader.load_config()` call; introduces deprecation warning for `TLLM_OVERRIDE_LAYER_NUM` environment variable with guidance toward `model_kwargs[num_hidden_layers]`; retains backward compatibility. Updated public method signature for `BaseCheckpointLoader.load_config(...)` to accept `model_kwargs` parameter.

Sequence Diagram

sequenceDiagram
    actor User
    participant BaseLlmArgs as BaseLlmArgs<br/>(llm_args.py)
    participant ModelLoader as ModelLoader<br/>(model_loader.py)
    participant BaseCheckpointLoader as BaseCheckpointLoader<br/>(checkpoint_loader.py)
    participant ModelConfig as from_pretrained<br/>(model_config.py)

    User->>BaseLlmArgs: Create with model_kwargs
    BaseLlmArgs->>ModelLoader: Pass model_kwargs during initialization
    ModelLoader->>BaseCheckpointLoader: Call load_config(model_kwargs)
    Note over BaseCheckpointLoader: Check for deprecated<br/>TLLM_OVERRIDE_LAYER_NUM env var
    BaseCheckpointLoader->>ModelConfig: load_pretrained_config() + model_kwargs
    rect rgb(200, 220, 240)
        Note over ModelConfig: Recursive merge phase
        ModelConfig->>ModelConfig: Merge model_kwargs into<br/>pretrained_config
        ModelConfig->>ModelConfig: Convert dtype strings<br/>to torch.dtype objects
    end
    ModelConfig-->>User: Return updated config

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Description check	❓ Inconclusive	The PR description is minimal and lacks required sections. While it briefly explains what was added (model_kwargs support), it provides insufficient detail on why this change is needed and has incomplete sections.	Complete the description by: (1) providing the PR title with ticket number and type prefix (e.g., '[TRTLLM-8241][feat]'), (2) expanding the 'Description' section to explain the motivation and design approach, and (3) populating the 'Test Coverage' section with specific test cases or testing strategy that validates the changes.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: adding model_kwargs support for the PyTorch backend, which is the primary objective reflected across all three modified files.

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

tensorrt_llm/_torch/model_config.py (2)

459-459: Remove redundant import.

logger is already imported at the module level (lines 14 and 22). This import inside the function is unnecessary.

🔎 Proposed fix

         if model_kwargs:
-            from tensorrt_llm.logger import logger
-
             def _recursive_update_config(config: transformers.PretrainedConfig,

485-496: Improve error handling for invalid dtype strings.

getattr(torch, value_new) will raise an AttributeError if value_new is not a valid torch attribute (e.g., "invalid_dtype"). The subsequent assertion also provides a cryptic message. Consider catching the exception and providing a clearer error message.

🔎 Proposed fix

                     elif (key in ["torch_dtype", "dtype"]
                           and isinstance(value_new, str)
                           and value_new != "auto"):
-                        # check special handling of torch_dtype (DEPRECATED!) and dtype keys to ensure we
-                        # use the correct torch.dtype object instead of a string.
-                        dtype = getattr(torch, value_new)
-                        assert isinstance(dtype,
-                                          torch.dtype), f"Invalid {dtype=}"
+                        # Special handling for torch_dtype/dtype keys to convert string to torch.dtype
+                        dtype = getattr(torch, value_new, None)
+                        if not isinstance(dtype, torch.dtype):
+                            raise ValueError(
+                                f"model_kwargs['{key}']={value_new!r} is not a valid torch dtype. "
+                                f"Expected values like 'float16', 'bfloat16', 'float32', etc."
+                            )
                         setattr(config, key, dtype)

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74832a1 and 12ad8c2.

📒 Files selected for processing (3)

tensorrt_llm/_torch/model_config.py
tensorrt_llm/_torch/pyexecutor/model_loader.py
tensorrt_llm/llmapi/llm_args.py

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: some_file.py
Python classes should use PascalCase naming: class SomeClass
Python functions and methods should use snake_case naming: def my_awesome_function():
Python local variables should use snake_case naming: my_variable = ...
Python variable names that start with a number should be prefixed with 'k': k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G': G_MY_GLOBAL = ...
Python constants should use upper snake_case naming: MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic

Files:

tensorrt_llm/_torch/model_config.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/model_loader.py

**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

tensorrt_llm/_torch/model_config.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/model_loader.py

🧠 Learnings (5)

📓 Common learnings

Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Learnt from: nvyocox
Repo: NVIDIA/TensorRT-LLM PR: 10117
File: tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py:336-339
Timestamp: 2025-12-19T06:31:54.973Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, the cast to torch.float16 for qkv_node before creating the AttentionPlugin is intentional and required because DriveOS LLM expects float16 dtype specifically. This should not be changed to preserve original dtype or made configurable for bfloat16 models in the DriveOS LLM ONNX export path.

Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 6760
File: tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py:81-98
Timestamp: 2025-08-09T02:04:49.623Z
Learning: In TensorRT-LLM's auto_deploy module, torch.dtype values in configuration dictionaries must be stored as string representations (e.g., "float16" instead of torch.float16) because OmegaConf.merge does not support torch.dtype types. These string representations are converted to actual torch.dtype objects in downstream code.

Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

📚 Learning: 2025-08-26T09:37:10.463Z

Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

tensorrt_llm/_torch/model_config.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/model_loader.py

📚 Learning: 2025-08-14T15:38:01.771Z

Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

tensorrt_llm/llmapi/llm_args.py

📚 Learning: 2025-08-26T06:07:02.166Z

Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

tensorrt_llm/_torch/pyexecutor/model_loader.py

📚 Learning: 2025-12-12T03:27:08.565Z

Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 9655
File: tensorrt_llm/_torch/pyexecutor/sampler.py:3031-3031
Timestamp: 2025-12-12T03:27:08.565Z
Learning: In files under tensorrt_llm/_torch/pyexecutor, avoid accessing torch.Tensor objects inside for-loops when iterating over requests. Convert batched tensors to Python lists beforehand using tensor.tolist(), and then iterate over those lists. This improves performance by reducing tensor-bound operations inside hot loops. Apply this pattern to similar code paths that process batches to access simple Python data structures (lists) inside loops.

Applied to files:

tensorrt_llm/_torch/pyexecutor/model_loader.py

🧬 Code graph analysis (1)

tensorrt_llm/_torch/model_config.py (2)

tensorrt_llm/models/modeling_utils.py (1)

PretrainedConfig (369-570)

tensorrt_llm/logger.py (2)

warning (132-133)

info (138-139)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (3)

tensorrt_llm/_torch/pyexecutor/model_loader.py (2)

364-365: LGTM!

The model_kwargs parameter is correctly propagated to load_config, enabling model-specific configuration overrides through the new API.

377-384: LGTM!

The deprecation warning provides clear guidance, showing users the exact model_kwargs syntax to use as a replacement for the environment variable.

tensorrt_llm/llmapi/llm_args.py (1)

1870-1878: LGTM!

The model_kwargs field is well-defined:

Uses default_factory=dict correctly for mutable defaults

Comprehensive description explaining the precedence order (defaults → config file → model_kwargs)

Appropriately marked as beta status for a new feature

Placed in BaseLlmArgs for shared access across backends

taylor-yb-lee · 2025-12-30T23:12:31Z

@coderabbitai generate docstrings

coderabbitai · 2025-12-30T23:12:36Z

✅ Actions performed

Initiated docstring generation; will generate only if new commits exist.

coderabbitai · 2025-12-30T23:12:39Z

Caution

Docstrings generation - FAILED

An unexpected error occurred while opening a pull request: Reference update failed - https://docs.github.com/rest/git/refs#create-a-reference

tensorrt_llm/llmapi/llm_args.py

taylor-yb-lee · 2026-01-03T01:24:13Z

/bot run

tensorrt-cicd · 2026-01-03T01:29:59Z

PR_Github #30421 [ run ] triggered by Bot. Commit: 6ea5a1f

tensorrt-cicd · 2026-01-03T02:48:51Z

PR_Github #30421 [ run ] completed with state SUCCESS. Commit: 6ea5a1f
/LLM/main/L0_MergeRequest_PR pipeline #23449 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

taylor-yb-lee · 2026-01-04T06:27:56Z

/bot run

github-actions · 2026-01-05T01:01:37Z