
[None][chore] AutoDeploy: cleanup old inference optimizer configs#8039

Merged
lucaslie merged 9 commits into NVIDIA:main from nv-auto-deploy:haoguo/old-optimizer-cleanup
Oct 17, 2025

Conversation


@h-guo18 h-guo18 commented Sep 28, 2025

Summary by CodeRabbit

  • New Features
    • Added optional graph visualization step (disabled by default).
    • Expanded configuration: attention/rope layout (bsnd), cache memory ratio, backend selection for attention, and compile options including CUDA graph batch sizes.
  • Performance
    • Frees GPU memory after optimization to reduce fragmentation.
  • Refactor
    • Simplified and consolidated configuration under transforms; removed deprecated optimizer entrypoint and legacy fields.
    • Streamlined internal imports and utility usage without changing behavior.
  • Chores
    • Updated dependency constraints for CUDA 12.9 compatibility.
  • Tests
    • Refactored helpers and parametrization to use the new transforms-based configuration.

Description

This PR cleans up the old inference optimizer in AD. Main changes include:

  • Removed the old inference optimizer code;
  • Moved transform-related configs from ADConfig to ADConfig.transforms and set their values in transformers.yaml and default.yaml;
  • Moved visualization into the new inference optimizer as a new transform; TODO: there are no unit tests for it yet.
  • Updated unit tests according to the ADConfig changes.
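To illustrate the nesting, here is a minimal sketch of what the new ADConfig.transforms layout looks like when consumed from Python. The dict mirrors the bench.yaml examples below; the `transform_option` helper is hypothetical, not the actual ADConfig API.

```python
from typing import Any, Dict

# Mirrors the new nested layout: per-transform options live under "transforms".
ad_config: Dict[str, Any] = {
    "runtime": "trtllm",
    "skip_loading_weights": True,
    "transforms": {
        "resize_kv_cache": {"free_mem_ratio": 0.8},
        "compile_model": {
            "cuda_graph_batch_sizes": [1, 2, 4, 8],
            "backend": "torch-cudagraph",
        },
    },
}


def transform_option(config: Dict[str, Any], transform: str, key: str, default=None):
    """Read one option from the nested transforms section (hypothetical helper)."""
    return config.get("transforms", {}).get(transform, {}).get(key, default)


print(transform_option(ad_config, "resize_kv_cache", "free_mem_ratio"))  # 0.8
```

The point of the nesting is that each transform owns its options, so adding a new transform does not grow the top-level ADConfig surface.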

Test Coverage

TRTLLM-bench

  • Verified that free_mem_ratio, attn_backend, and compile_backend pass correctly from extra_llm_args.yaml to the inference optimizer.
  • Benchmark performance is identical before/after the change.
  • Checked that the old bench.yaml is still compatible.

NOTE: absolute numbers are not indicative at the moment due to repeat_kv bug

  • TODO: insert benchmark numbers

bench command

trtllm-bench --model meta-llama/Meta-Llama-3.1-8B  throughput --dataset /tmp/synthetic_128_128.txt --backend _autodeploy --extra_llm_api_options examples/auto_deploy/bench.yaml --tp 4 --max_batch_size=512 

NOTE: existing bench.yaml is compatible but the new one can also be used and is preferred moving forward

bench.yaml format after changes:

runtime: trtllm
skip_loading_weights: true
transforms:
  resize_kv_cache:
    free_mem_ratio: 0.8
  compile_model:
    cuda_graph_batch_sizes: [1,2,4,8,16,32,64,128,256,512]
    backend: torch-cudagraph

Results with new bench.yaml

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     89.2243
Total Output Throughput (tokens/sec):             11420.7070
Total Token Throughput (tokens/sec):              22841.4139
Total Latency (ms):                               11207.7125
Average request latency (ms):                     7764.3731
Per User Output Throughput [w/ ctx] (tps/user):   18.3602
Per GPU Output Throughput (tps/gpu):              2855.1767
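CUDA graphs are captured only for the batch sizes listed in cuda_graph_batch_sizes; at runtime a request batch is typically served by the smallest captured size that can hold it (padding up), falling back to eager mode when nothing fits. A hypothetical sketch of that selection, not the actual runtime code:

```python
import bisect
from typing import List, Optional


def pick_cuda_graph_batch_size(batch: int, captured: List[int]) -> Optional[int]:
    """Return the smallest captured batch size >= batch, or None for eager fallback.

    Illustrative selection logic only; the real runtime may differ.
    """
    sizes = sorted(captured)
    i = bisect.bisect_left(sizes, batch)
    return sizes[i] if i < len(sizes) else None


# A batch of 48 would replay the graph captured for batch size 64.
print(pick_cuda_graph_batch_size(48, [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]))  # 64
```

This is also why the list above is roughly a power-of-two ladder up to max_batch_size: it bounds padding waste while keeping the number of captured graphs small.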

bench_old.yaml:

compile_backend: torch-simple
runtime: trtllm
skip_loading_weights: true
free_mem_ratio: 0.7
cuda_graph_batch_sizes: [1,2,4,8,16,32,64,128,256,512]

Results with bench_old.yaml on this branch:

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     88.7720
Total Output Throughput (tokens/sec):             11362.8199
Total Token Throughput (tokens/sec):              22725.6399
Total Latency (ms):                               11264.8093
Average request latency (ms):                     7814.5593
Per User Output Throughput [w/ ctx] (tps/user):   18.1956
Per GPU Output Throughput (tps/gpu):              2840.7050

Results with bench_old.yaml before these changes

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     88.8474
Total Output Throughput (tokens/sec):             11372.4676
Total Token Throughput (tokens/sec):              22744.9351
Total Latency (ms):                               11255.2530
Average request latency (ms):                     7822.8995
Per User Output Throughput [w/ ctx] (tps/user):   18.1545
Per GPU Output Throughput (tps/gpu):              2843.1169
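For existing configs, migrating from the old flat keys to the new nested transforms layout is mechanical. The mapping below follows the two bench.yaml examples above; the helper itself is hypothetical, not shipped migration tooling.

```python
from typing import Any, Dict

# Old flat key -> (transform name, field name), per the examples above.
FLAT_TO_NESTED = {
    "free_mem_ratio": ("resize_kv_cache", "free_mem_ratio"),
    "compile_backend": ("compile_model", "backend"),
    "cuda_graph_batch_sizes": ("compile_model", "cuda_graph_batch_sizes"),
}


def migrate_bench_config(old: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite a flat bench config into the nested transforms layout (sketch)."""
    new: Dict[str, Any] = {"transforms": {}}
    for key, value in old.items():
        if key in FLAT_TO_NESTED:
            transform, field = FLAT_TO_NESTED[key]
            new["transforms"].setdefault(transform, {})[field] = value
        else:
            # runtime, skip_loading_weights, ... stay top-level.
            new[key] = value
    return new


old_cfg = {"runtime": "trtllm", "free_mem_ratio": 0.7, "compile_backend": "torch-simple"}
print(migrate_bench_config(old_cfg))
```

Since the old flat format is still accepted on this branch, this migration is optional for now but preferred moving forward.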

TRTLLM-serve

  • Verified that attn_backend, compile_backend, and free_mem_ratio pass correctly from extra_llm_args.yaml to the inference optimizer;
  • Verified that requests to the service return reasonable responses.

commands:

trtllm-serve meta-llama/Meta-Llama-3.1-8B --backend _autodeploy --extra_llm_api_options examples/auto_deploy/serve.yaml

runtime: trtllm
world_size: 2
transforms:
  resize_kv_cache:
    free_mem_ratio: 0.0
  compile_model:
    cuda_graph_batch_sizes: [1,2,4,8,16,32,64,128,256,512]
    backend: torch-cudagraph


Note: trtllm-serve is broken at the moment; a fix will be posted later.

## PR Checklist

Please review the following before submitting your PR:
- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR Follows [TRT-LLM CODING GUIDELINES](https://github.com/NVIDIA/TensorRT-LLM/blob/main/CODING_GUIDELINES.md) to the best of your knowledge.
- Test cases are provided for new code paths (see [test instructions](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tests#1-how-does-the-ci-work))
- Any new dependencies have been scanned for license and vulnerabilities
- [CODEOWNERS](https://github.com/NVIDIA/TensorRT-LLM/blob/main/.github/CODEOWNERS) updated if ownership changes
- Documentation updated as needed
- The reviewers assigned automatically/manually are appropriate for the PR.


- [x] Please check this after reviewing the above items as appropriate for this PR.

## GitHub Bot Help

`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

<details>

`run  [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

`--reuse-test (optional)pipeline-id ` *(OPTIONAL)* : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

`--disable-reuse-test ` *(OPTIONAL)* : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

`--disable-fail-fast ` *(OPTIONAL)* : Disable fail fast on build/tests/infra failures.

`--skip-test ` *(OPTIONAL)* : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does **NOT** update GitHub check status.

`--stage-list "A10-PyTorch-1, xxx"` *(OPTIONAL)* : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does **NOT** update GitHub check status.

`--gpu-type "A30, H100_PCIe"` *(OPTIONAL)* : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does **NOT** update GitHub check status.

`--test-backend "pytorch, cpp"` *(OPTIONAL)* : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does **NOT** update GitHub pipeline status.

`--only-multi-gpu-test ` *(OPTIONAL)* : Only run the multi-GPU tests. Note: Does **NOT** update GitHub check status.

`--disable-multi-gpu-test ` *(OPTIONAL)* : Disable the multi-GPU tests. Note: Does **NOT** update GitHub check status.

`--add-multi-gpu-test ` *(OPTIONAL)* : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

`--post-merge ` *(OPTIONAL)* : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

`--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` *(OPTIONAL)* : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

`--detailed-log ` *(OPTIONAL)* : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

`--debug ` *(OPTIONAL)* : **Experimental feature**. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does **NOT** update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md`
and the `scripts/test_to_stage_mapping.py` helper.

### kill

`kill  `

Kill all running builds associated with pull request.

### skip

`skip --comment COMMENT `

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

### reuse-pipeline

`reuse-pipeline `

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

</details>

@h-guo18 h-guo18 self-assigned this Sep 28, 2025
@h-guo18 h-guo18 changed the title cleanup old AD inference optimizer configs [None][chore]cleanup old AD inference optimizer configs Sep 29, 2025
@h-guo18 h-guo18 changed the title [None][chore]cleanup old AD inference optimizer configs [None][chore]Cleanup old AD inference optimizer configs Sep 29, 2025
@h-guo18 h-guo18 force-pushed the haoguo/old-optimizer-cleanup branch from 3e2ab44 to 860b4e4 Compare September 29, 2025 05:38
@h-guo18 h-guo18 changed the title [None][chore]Cleanup old AD inference optimizer configs [None][chore] Cleanup old AD inference optimizer configs Sep 29, 2025
@h-guo18 h-guo18 requested a review from lucaslie September 29, 2025 07:16
@lucaslie lucaslie (Member) left a comment

Before merging this, let's make sure to inform the team and go through the downstream changes like the dashboard and update it accordingly

@h-guo18 h-guo18 force-pushed the haoguo/old-optimizer-cleanup branch 2 times, most recently from 124d29b to 77a0b83 Compare October 12, 2025 23:26
@h-guo18 h-guo18 marked this pull request as ready for review October 12, 2025 23:27
@h-guo18 h-guo18 requested review from a team as code owners October 12, 2025 23:27
@h-guo18 h-guo18 marked this pull request as draft October 12, 2025 23:28
@h-guo18 h-guo18 requested a review from lucaslie October 12, 2025 23:28

coderabbitai bot commented Oct 12, 2025

📝 Walkthrough

Dependency versions updated for CUDA 12.9. Auto-deploy pipeline refactored: configs moved into nested transforms, new VISUALIZE stage and visualization transform added, attention layout config switched to attn_layout, KV cache adds free_mem_ratio validation, load/build device handling revised, imports consolidated under utils._graph, legacy transformations entrypoints removed, tests updated accordingly.

Changes

Cohort / File(s) Summary
Dependencies (CUDA 12.9 alignment)
requirements.txt
Loosened/retargeted CUDA-related package ranges to CUDA 12.x; commented newer variants; no code changes.
Configs (pipeline stages and params)
tensorrt_llm/_torch/auto_deploy/config/default.yaml, .../config/transformers.yaml
Added VISUALIZE stage entry and visualize_namespace transform (disabled), new fields: attn_layout bsnd, expected_layout bsnd, checkpoint_device, attn_backend flashinfer, free_mem_ratio (0.8 default; 0.0 in transformers.yaml), compile backend fields (cuda_graph_batch_sizes, compile_backend).
Optimizer and interface
.../transform/interface.py, .../transform/optimizer.py, .../shim/ad_executor.py
Added VISUALIZE stage to enum; removed attn_backend from SharedConfig; InferenceOptimizer now imported/used from transform.optimizer; optimizer adds cuda and GC cache clears after run.
Attention layout refactor
.../transform/library/attention.py, tests/.../transformations/library/test_attention_matcher*.py
Replaced attention_op descriptor with attn_layout Literal["bsnd","bnsd"]; removed descriptor checks; tests updated to pass attn_layout strings.
KV cache config
.../transform/library/kvcache.py, tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py, tests/.../test_ad_trtllm_bench.py
free_mem_ratio field gains [0.0,1.0] validation; test utilities centralize transforms config; bench test nests compile settings under transforms.compile_model.
Load/build device handling
.../transform/library/load_weights.py, .../transform/library/build_model.py
Renamed device to checkpoint_device; device resolution moved to cm.device; removed some config plumbing in load/build transforms; updated move_to_device usage.
Visualization transform introduction
.../transform/library/visualization.py, .../config/default.yaml, .../transformations/library/__init__.py
New VisualizeNamespace transform registered and optional dependency on model_explorer; old visualize export removed from transformations package; config wires transform disabled by default.
Consolidated graph utils imports
.../export/export.py, .../utils/_graph.py, tests importing move_to_device (tests/.../models/test_*)
Moved imports from transformations._graph to utils._graph; tests updated to new path; no functional changes.
Legacy transformations removal
.../transformations/transform.py, .../transformations/__init__.py
Removed legacy InferenceOptimizer entrypoint and deprecation docstring; codepath now via transform.optimizer.
LLM args cleanup
.../llm_args.py, tensorrt_llm/bench/dataclasses/configuration.py
Removed top-level fields (attn_backend, free_mem_ratio, compile_backend, etc.) from AutoDeployConfig; logic now reads from transforms; bench config stops injecting attn_backend.
Test refactors (config construction and params)
tests/.../unit/multigpu/test_ad_build_small_multi.py, tests/.../unit/singlegpu/test_ad_build_small_single.py, tests/.../unit/singlegpu/models/test_hybrid_patches.py
Tests build transforms via new helper get_transforms_config; parameterization updated; ADEngine build monkeypatch adjusted; removed deprecated helpers/imports.
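On the attn_layout refactor: the literal "bsnd"/"bnsd" strings name the axis order, i.e. (batch, seq, num_heads, head_dim) vs. (batch, num_heads, seq, head_dim). A small illustrative helper (not the library's API) for permuting between the two layouts:

```python
from typing import Sequence, Tuple


def layout_permutation(src: str, dst: str) -> Tuple[int, ...]:
    """Axis permutation taking a tensor from layout src to dst, e.g. 'bnsd' -> 'bsnd'."""
    assert sorted(src) == sorted(dst) == ["b", "d", "n", "s"], "expected b/s/n/d axes"
    return tuple(src.index(axis) for axis in dst)


def permute_shape(shape: Sequence[int], src: str, dst: str) -> Tuple[int, ...]:
    """Apply the layout permutation to a shape tuple (illustrative only)."""
    return tuple(shape[i] for i in layout_permutation(src, dst))


# (batch=2, heads=8, seq=16, head_dim=64) in bnsd becomes bsnd:
print(permute_shape((2, 8, 16, 64), "bnsd", "bsnd"))  # (2, 16, 8, 64)
```

Encoding the layout as a validated literal string, rather than an op descriptor, keeps the config declarative and lets the matcher derive the permutation when needed.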

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant ADEngine
  participant InferenceOptimizer
  participant Transforms as Transform Pipeline
  participant CUDA as CUDA/GC

  User->>ADEngine: build_from_config(config)
  ADEngine->>InferenceOptimizer: __call__(cm) with transforms config
  InferenceOptimizer->>Transforms: Execute stages (e.g., load_weights, build, match_layouts)
  alt Optional visualization
    Note over Transforms: VISUALIZE stage (visualize_namespace) if enabled
  end
  Transforms-->>InferenceOptimizer: Optimized GraphModule
  InferenceOptimizer->>CUDA: torch.cuda.empty_cache()
  InferenceOptimizer->>CUDA: gc.collect()
  InferenceOptimizer-->>ADEngine: Optimized module
  ADEngine-->>User: Engine ready
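The empty_cache/gc steps at the end of the diagram correspond to the post-optimization memory cleanup. A minimal stdlib-safe sketch of that pattern (the torch import is probed dynamically so the snippet also runs on CPU-only machines; the function name is illustrative):

```python
import gc
import importlib.util


def release_cached_memory() -> None:
    """Sketch of the post-optimization cleanup: drop Python garbage, then
    release CUDA caching-allocator blocks so other allocators can use them."""
    # Collect Python-level garbage first so dangling graph/tensor objects
    # release their device memory before the allocator cache is flushed.
    gc.collect()
    if importlib.util.find_spec("torch") is not None:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()


release_cached_memory()
```

Clearing the allocator cache between optimization and serving reduces fragmentation when the KV cache is subsequently resized to consume the remaining free memory.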

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • lucaslie

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title "[None][chore] AutoDeploy: cleanup old inference optimizer configs" is directly related to the main change in this pull request. The changeset involves removing old inference optimizer infrastructure—specifically the deletion of the InferenceOptimizer class from transformations/transform.py, removal of top-level configuration fields from AutoDeployConfig (such as attn_backend, free_mem_ratio, compile_backend, etc.), and reorganization of these configurations into the transforms dictionary structure. The title accurately captures this cleanup and refactoring effort, clearly identifying both the component (AutoDeploy) and the primary action (cleanup of old inference optimizer configs). The title is sufficiently specific and avoids vague terminology, making it understandable in a PR history scan.
Description Check ✅ Passed The pull request description is comprehensive and addresses the key requirements from the template. The Description section clearly explains the main changes: removing old inference optimizer code, relocating transform-related configurations from ADConfig to ADConfig.transforms with values in yaml files, integrating visualization as a new transform, and updating unit tests accordingly. The Test Coverage section provides detailed testing evidence from both TRTLLM-bench and TRTLLM-serve with actual performance numbers and configuration examples. The PR Checklist section is completed with one item marked as reviewed. While the PR title format could be more explicitly stated at the beginning of the description text, the substance of the PR is well-documented with concrete examples, commands, and results that support the described changes.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py (1)

1-1: Add required copyright header.

Per coding guidelines, all Python source files must include the NVIDIA Apache-2.0 copyright header at the top of the file.

Add the copyright header at the top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Transformation to the graph to render nicely in model_explorer."""

Based on coding guidelines.

♻️ Duplicate comments (1)
tensorrt_llm/_torch/auto_deploy/config/default.yaml (1)

153-153: Address past review comment: set free_mem_ratio to 0.0.

A past review comment suggested setting free_mem_ratio to 0.0 instead of 0.8 to align with the default in the resize_kv_cache transform. Please verify and update if needed.

Based on past review comments.
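The [0.0, 1.0] bound on free_mem_ratio discussed here is a simple range validation. A stdlib-only sketch of the idea (the real field reportedly uses pydantic-style validation; this class is illustrative, not the actual one):

```python
from dataclasses import dataclass


@dataclass
class ResizeKVCacheConfig:
    """Illustrative config with the free_mem_ratio range check; not the real class."""

    free_mem_ratio: float = 0.8

    def __post_init__(self) -> None:
        # Reject ratios outside [0.0, 1.0]: the value is the fraction of free
        # GPU memory the resized KV cache is allowed to consume.
        if not 0.0 <= self.free_mem_ratio <= 1.0:
            raise ValueError(
                f"free_mem_ratio must be in [0.0, 1.0], got {self.free_mem_ratio}"
            )


print(ResizeKVCacheConfig(free_mem_ratio=0.0).free_mem_ratio)  # 0.0
```

Note that 0.0 (as used in transformers.yaml) is a valid value meaning "do not grow the KV cache into free memory".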

🧹 Nitpick comments (4)
requirements.txt (1)

32-35: Pin versions for nvidia-nccl-cu12 and nvidia-cuda-nvrtc-cu12.

Both nvidia-nccl-cu12 and nvidia-cuda-nvrtc-cu12 lack version constraints, which can lead to unpredictable dependency resolution and potential breakage when new versions are released.

Apply this diff to add version pins:

-nvidia-nccl-cu12  # <For CUDA 12.9>
+nvidia-nccl-cu12>=2.24.3,<3  # <For CUDA 12.9>
 # nvidia-nccl-cu13
-nvidia-cuda-nvrtc-cu12  # <For CUDA 12.9>
+nvidia-cuda-nvrtc-cu12>=12.9.0,<13  # <For CUDA 12.9>
 # nvidia-cuda-nvrtc

Verify the latest compatible versions:

#!/bin/bash
# Check latest versions of nvidia-nccl-cu12 and nvidia-cuda-nvrtc-cu12
echo "=== nvidia-nccl-cu12 ==="
pip index versions nvidia-nccl-cu12 2>/dev/null | head -10

echo "=== nvidia-cuda-nvrtc-cu12 ==="
pip index versions nvidia-cuda-nvrtc-cu12 2>/dev/null | head -10
tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (1)

442-442: Remove commented code.

This commented line appears to be leftover from development and should be removed to keep the codebase clean.

Apply this diff:

-        # "compile_backend": "torch-simple",
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1)

44-119: LGTM! Improved test parameterization with granular backend control.

The refactored parameterization using modle_hub_id and transform_args provides better flexibility for testing different backend combinations.

Note: "modle_hub_id" appears to be a typo. Consider renaming to "model_hub_id" for consistency.

tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py (1)

90-96: Prefix unused parameters with underscore.

The parameters factory and shared_config are required by the BaseTransform._apply signature but are not used in this implementation. Per Python convention, prefix them with an underscore to indicate they are intentionally unused.

Apply this diff:

     def _apply(
         self,
         gm: GraphModule,
         cm: CachedSequenceInterface,
-        factory: ModelFactory,
-        shared_config: SharedConfig,
+        _factory: ModelFactory,
+        _shared_config: SharedConfig,
     ) -> Tuple[GraphModule, TransformInfo]:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fac47e2 and 77a0b83.

📒 Files selected for processing (28)
  • requirements.txt (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/config/transformers.yaml (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/export/export.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/load_weights.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/__init__.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/transform.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1 hunks)
  • tensorrt_llm/bench/dataclasses/configuration.py (0 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_patches.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (6 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (3 hunks)
💤 Files with no reviewable changes (4)
  • tensorrt_llm/bench/dataclasses/configuration.py
  • tensorrt_llm/_torch/auto_deploy/transformations/transform.py
  • tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py
  • tensorrt_llm/_torch/auto_deploy/transformations/__init__.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
  • tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/load_weights.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_patches.py
  • tensorrt_llm/_torch/auto_deploy/export/export.py
  • tensorrt_llm/_torch/auto_deploy/utils/_graph.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py
  • tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
  • tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/load_weights.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_patches.py
  • tensorrt_llm/_torch/auto_deploy/export/export.py
  • tensorrt_llm/_torch/auto_deploy/utils/_graph.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py
  • tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
  • tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/load_weights.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_patches.py
  • tensorrt_llm/_torch/auto_deploy/export/export.py
  • tensorrt_llm/_torch/auto_deploy/utils/_graph.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py
  • tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
🧬 Code graph analysis (16)
tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py (2)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • build_and_load_model (242-275)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
  • device (218-219)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1)
  • InferenceOptimizer (24-91)
tensorrt_llm/_torch/auto_deploy/llm.py (1)
  • factory (110-113)
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
  • Field (70-97)
tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (2)
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1)
  • InferenceOptimizer (24-91)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • LlmArgs (234-348)
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)
  • SharedConfig (61-66)
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py (2)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
  • move_to_device (135-142)
tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (2)
  • get_small_model_config (524-563)
  • get_transforms_config (491-521)
tensorrt_llm/_torch/auto_deploy/transform/library/load_weights.py (3)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
  • move_to_device (135-142)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
  • device (218-219)
  • to (542-550)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
  • to (49-53)
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (1)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
  • move_to_device (135-142)
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_patches.py (1)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
  • move_to_device (135-142)
tensorrt_llm/_torch/auto_deploy/export/export.py (1)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (4)
  • canonicalize_graph (174-187)
  • lift_to_meta (79-92)
  • load_buffers_and_params (32-68)
  • tree_to (71-75)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
  • is_op (179-202)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (1)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
  • move_to_device (135-142)
tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py (3)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • ModelFactory (23-266)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
  • CachedSequenceInterface (11-88)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (6)
  • BaseTransform (185-480)
  • SharedConfig (61-66)
  • TransformInfo (121-146)
  • TransformRegistry (483-511)
  • register (489-496)
  • _apply (470-480)
tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py (1)
tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (2)
  • get_small_model_config (524-563)
  • get_transforms_config (491-521)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (3)
tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (2)
  • get_small_model_config (524-563)
  • get_transforms_config (491-521)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)
  • ADEngine (71-297)
  • build_from_config (84-120)
tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py (1)
  • test_build_ad (20-33)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
  • add_graph_input (246-309)
🪛 Ruff (0.13.3)
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py

8-8: Unused blanket noqa directive

Remove unused noqa directive

(RUF100)

tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py

94-94: Unused method argument: factory

(ARG002)


95-95: Unused method argument: shared_config

(ARG002)

tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py

128-129: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear

Parenthesize the and subexpression

(RUF021)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (49)
requirements.txt (2)

1-78: Clarify relationship between dependency updates and PR objectives.

The PR summary states this PR is about "Cleanup old AD inference optimizer configs," but this requirements.txt file only contains CUDA 12.9 dependency updates with no relation to AD optimizer cleanup. This inconsistency suggests either:

  1. The requirements.txt changes should be in a separate PR, or
  2. The PR summary is incomplete.

Please clarify whether these dependency updates are intentional for this PR or should be split into a separate infrastructure/dependency update PR.


1-1: Verify PyTorch wheel availability for CUDA 12.9
requirements.txt line 1: updating --extra-index-url to cu129 returned no wheels; confirm correct index URL for CUDA 12.9 or revert to cu128 to prevent mismatches.

tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py (1)

79-85: LGTM! Configuration structure updated to nested transforms.

The change correctly nests the compile_backend under transforms.compile_model with the required stage and cuda_graph_batch_sizes fields, aligning with the new transforms-driven configuration model introduced in this PR.
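Based on the description above, the nested layout can be pictured with a hypothetical sketch (key names `compile_model`, `stage`, `compile_backend`, and `cuda_graph_batch_sizes` are taken from this review; the exact schema in the codebase may differ):

```python
# Hypothetical sketch of the nested transforms configuration described
# above; key names follow the review comments, the exact schema may differ.
transforms_config = {
    "compile_model": {
        "stage": "compile",
        "compile_backend": "torch-opt",
        "cuda_graph_batch_sizes": [1, 2, 4, 8],
    },
}

# The compile backend is now read from the nested entry rather than a
# flat top-level field on the config object.
backend = transforms_config["compile_model"]["compile_backend"]
```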

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (3)

4-4: LGTM! Import cleanup.

Removed unused Callable import, keeping the typing imports minimal and focused.


17-17: LGTM! Import path updated to utils module.

The import path for move_to_device has been correctly updated from the legacy location to the new utils._graph module, aligning with the broader refactor across the codebase.


49-49: LGTM! Configuration updated to use attn_layout.

The configuration correctly switches from the removed attention_op field to the new attn_layout field with the string literal "bsnd", matching the API changes in MatchAttentionLayoutConfig.

tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (4)

3-3: LGTM! Import added for garbage collection.

The gc module import is correctly added to support the cleanup logic introduced later in the file.


6-6: LGTM! Import added for CUDA cache management.

The torch import is correctly added to support the torch.cuda.empty_cache() cleanup logic.


32-35: LGTM! Formatting improvement for SharedConfig construction.

The multi-line format with explicit trailing comma improves readability and follows Python best practices.


89-90: LGTM! Resource cleanup added after transforms.

The addition of torch.cuda.empty_cache() and gc.collect() helps free GPU memory and Python objects after all transforms complete, which is appropriate for long-running inference optimization pipelines.
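The cleanup step amounts to the following pattern (a minimal sketch, not the actual `InferenceOptimizer` code; `torch` is imported lazily here so the helper degrades gracefully when it is unavailable):

```python
import gc


def free_accelerator_memory() -> None:
    """Collect unreachable Python objects and release cached CUDA blocks.

    Sketch of the cleanup the optimizer performs after all transforms
    complete.
    """
    gc.collect()
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        # No torch in this environment; nothing to release on the GPU side.
        pass
```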

tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (3)

5-5: LGTM! Literal type added for layout constraint.

The Literal import is correctly added to support the new attn_layout field with constrained string values.


698-702: LGTM! Configuration simplified to use string literal for layout.

The change from attention_op: Type[AttentionDescriptor] to attn_layout: Literal["bsnd", "bnsd"] simplifies the API by removing the dependency on AttentionDescriptor classes and using explicit string literals instead. This makes the configuration more straightforward and type-safe.
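For illustration, the constrained field can be sketched with `typing.Literal` plus an explicit runtime check (the real config presumably relies on pydantic validating the `Literal` annotation; the helper name here is made up):

```python
from typing import Literal, cast

AttnLayout = Literal["bsnd", "bnsd"]


def validate_attn_layout(value: str) -> AttnLayout:
    # Mirror the constraint expressed by Literal["bsnd", "bnsd"]:
    # only the two supported attention layouts are accepted.
    if value not in ("bsnd", "bnsd"):
        raise ValueError(f"attn_layout must be 'bsnd' or 'bnsd', got {value!r}")
    return cast(AttnLayout, value)
```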


726-726: LGTM! Condition updated to use new attn_layout field.

The condition correctly references self.config.attn_layout to check if the backend expects "bnsd" layout, matching the new configuration structure.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (1)

1305-1305: LGTM! Test configuration updated to use attn_layout.

The test correctly updates the configuration from "attention_op": MockAttentionDescriptor to "attn_layout": layout, using the parameterized layout variable to test both "bsnd" and "bnsd" values, matching the new API.

tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (1)

8-8: LGTM! Import path updated to utils module.

The import path for move_to_device has been correctly updated from the legacy transformations._graph module to the new utils._graph module, aligning with the broader import path refactor across the codebase.

tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_patches.py (1)

7-7: LGTM! Import path updated to utils module.

The import path for move_to_device has been correctly updated from the legacy transformations._graph module to the new utils._graph module, consistent with the import path refactor across the codebase.

tensorrt_llm/_torch/auto_deploy/config/transformers.yaml (1)

26-26: Confirm intentional free_mem_ratio=0.0 override in transformers mode
In default.yaml free_mem_ratio defaults to 0.8, and in kvcache.py values are validated at 0.0–1.0, yet transformers.yaml sets it to 0.0. Reserving no buffer can risk OOM during inference. Confirm this override is intentional for transformers mode and whether its memory management differs to mitigate potential OOM issues.

tensorrt_llm/_torch/auto_deploy/export/export.py (1)

13-13: LGTM: Import path consolidation.

The graph utilities are now imported from ..utils._graph, consolidating graph-related code into a central location. This improves code organization.

tensorrt_llm/_torch/auto_deploy/transform/interface.py (2)

19-25: LGTM: Graph utilities consolidated and expanded.

The import path has been updated to ..utils._graph and now includes placeholders_on_meta and run_shape_prop, which are used by the transform infrastructure for graph cleanup and shape propagation.


50-50: LGTM: New visualization stage added.

The new VISUALIZE stage enables graph visualization as a transform step, aligning with the PR's objective to integrate visualization functionality into the new inference optimizer.

tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)

18-19: LGTM: Imports adjusted for relative paths.

The imports have been updated to use relative paths within the utils package, which is appropriate since this module is now the canonical location for graph utilities.

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)

28-28: LGTM: Import path updated.

The import path has been updated to reflect the new module structure where InferenceOptimizer is now located in transform.optimizer.


117-117: LGTM: Configuration refactored to nested structure.

The instantiation now correctly passes ad_config.transforms instead of the entire ad_config, reflecting the move to a nested transforms configuration structure.

tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py (1)

80-80: LGTM: Device sourced from context.

The device is now sourced from cm.device instead of self.config.device, which is appropriate for this transform that loads weights. This differs from the parent BuildModel class which still uses self.config.device for building without loading weights.

tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py (1)

36-38: LGTM: Test updated to use nested transforms config.

The test now uses get_transforms_config helper to build the nested transforms configuration, which is consistent with the refactored configuration structure.

tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (2)

14-14: LGTM: Import path updated.

The import has been updated to use ...utils._graph, consistent with the consolidation of graph utilities.


224-224: LGTM: Validation constraints added.

The free_mem_ratio field now includes proper validation constraints (ge=0.0, le=1.0) to ensure the value remains within the valid range for a memory ratio.
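The effect of the constraint can be sketched with a plain dataclass (the real class uses a pydantic `Field(ge=0.0, le=1.0)`; the class name below is illustrative only):

```python
from dataclasses import dataclass


@dataclass
class KVCacheConfigSketch:
    """Illustrative stand-in for the kvcache transform config."""

    free_mem_ratio: float = 0.8  # fraction of free GPU memory for the KV cache

    def __post_init__(self) -> None:
        # Equivalent of pydantic's Field(ge=0.0, le=1.0).
        if not 0.0 <= self.free_mem_ratio <= 1.0:
            raise ValueError(
                f"free_mem_ratio must be in [0.0, 1.0], got {self.free_mem_ratio}"
            )
```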

tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (6)

7-7: LGTM: Import path updated.

The import has been updated to reflect the new module structure where InferenceOptimizer is located in transform.optimizer.


19-47: LGTM: Test updated for nested transforms configuration.

The test has been correctly updated to use the new nested transforms configuration structure. Configuration fields like free_mem_ratio, attn_backend, and simple_shard_only are now properly nested under their respective transform keys.


53-65: LGTM: Validation logic moved to InferenceOptimizer.

The test now correctly validates free_mem_ratio through InferenceOptimizer instantiation, which is where the validation logic now resides. This ensures the constraints (ge=0.0, le=1.0) are properly enforced.


90-103: LGTM: Test fixture updated for nested configuration.

The test fixture now uses the nested transforms structure, consistent with the refactored configuration approach.


159-167: LGTM: Assertions updated to read from nested structure.

The assertions have been correctly updated to read values from the nested transforms configuration using the appropriate keys.


233-239: LGTM: Test updated to use nested transforms.

The test now correctly passes the transforms dictionary with nested insert_cached_attention configuration, aligning with the new configuration structure.

tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (1)

491-522: LGTM! Well-structured transform config helper.

The new helper function cleanly separates transform configuration logic for different modes, making test setup more modular and maintainable.

tensorrt_llm/_torch/auto_deploy/config/default.yaml (5)

42-42: LGTM! Attention layout config additions look good.

The explicit attn_layout and expected_layout configuration keys properly specify the BSND layout for attention and RoPE operations, which aligns with the transform refactoring.

Also applies to: 47-47


92-92: LGTM! Checkpoint device configuration is well-designed.

The optional checkpoint_device field with a null default provides flexibility for checkpoint loading while maintaining backward compatibility.


127-131: Verify the TODO comment about visualization errors.

The visualization transform is currently disabled with a TODO noting errors. Ensure that the visualization functionality is properly tested and that leaving it disabled is the intended state for this release.

Is the visualization functionality expected to have errors in this PR, or should this be addressed before merging?


139-139: LGTM! FlashInfer as default attention backend is appropriate.

Setting flashinfer as the default attention backend is a good choice for performance.


159-160: LGTM! Compile model configuration additions are appropriate.

The cuda_graph_batch_sizes and compile_backend fields provide necessary control over compilation behavior.

tensorrt_llm/_torch/auto_deploy/llm_args.py (3)

3-3: LGTM! Import cleanup is appropriate.

The removal of List from typing imports aligns with the removal of list-typed fields from the class.


137-149: LGTM! Config surface refactored to use transforms.

The migration from flat inference optimizer fields to the nested transforms configuration simplifies the config surface and provides better modularity. This is a breaking change but aligns with the PR's objectives to consolidate transform configuration.


169-179: LGTM! Validation logic correctly accesses nested transforms.

The updated update_attn_page_size validation properly navigates the nested transforms configuration for both graph and transformers modes, using safe dictionary access patterns.
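The safe-access pattern reads roughly like this (a sketch; the key names `insert_cached_attention` and `backend` are assumptions drawn from this review, not the verified schema):

```python
def get_backend(transforms: dict, default: str = "flashinfer") -> str:
    # Chained dict.get calls never raise KeyError, so a partially
    # specified transforms config falls back to the default backend.
    return transforms.get("insert_cached_attention", {}).get("backend", default)
```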

tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (2)

4-4: LGTM! Import updates align with test refactoring.

The updated imports properly reference the new test utilities and the ADEngine entrypoint from the shim module.

Also applies to: 8-8


120-152: LGTM! Test function properly refactored to use new config helpers.

The test setup correctly builds the transforms configuration and experiment config using the new helper functions, and the monkeypatch logic properly validates the config before delegating to the original build function.

tensorrt_llm/_torch/auto_deploy/transform/library/load_weights.py (4)

10-10: LGTM! Import consolidation improves organization.

Moving move_to_device to utils._graph consolidates graph-related utilities in a more appropriate location.


23-26: LGTM! Renamed field clarifies checkpoint loading intent.

The rename from device to checkpoint_device with an optional type better expresses that this is specifically for checkpoint initialization, with a fallback to the runtime device.


48-50: LGTM! Two-stage device handling is well-designed.

The logic correctly uses checkpoint_device (if specified) for initial loading and then ensures the model ends up on cm.device, providing flexibility for checkpoint loading while guaranteeing correct final placement.
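The fallback for stage one can be captured in a single expression (a sketch; `resolve_load_device` is a hypothetical helper name, and stage two always moves the model to the runtime device afterwards):

```python
from typing import Optional


def resolve_load_device(checkpoint_device: Optional[str], runtime_device: str) -> str:
    """Stage-1 target: the configured checkpoint_device if set, else the
    runtime device."""
    return checkpoint_device if checkpoint_device is not None else runtime_device
```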


68-70: Verify the necessity of this transform.

The TODO comment correctly questions whether this transform is needed, as cm.to(cm.device) appears redundant since cm should already be on its own device. Consider verifying if this transform can be removed or if there's a specific edge case it addresses.

Can you confirm whether this transform is necessary, or if it can be safely removed?

tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py (1)

23-26: LGTM!

The refactoring to use helper functions get_transforms_config and get_small_model_config improves test maintainability by centralizing configuration logic.

@lucaslie lucaslie left a comment


Can you double-check that our integrations with trtllm-bench and trtllm-serve still work as expected?

There are two potential issues I see:

  • handling of free_mem_ratio
  • handling of the attention backend

If everything looks good, let me know and I can set up a dashboard run and look at the remaining issue regarding the device.


@FrankD412 FrankD412 left a comment


Approving the one-line change for bench.

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
@h-guo18 h-guo18 force-pushed the haoguo/old-optimizer-cleanup branch from da40367 to dd9e29e on October 14, 2025 00:47
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
@lucaslie lucaslie self-assigned this Oct 14, 2025
@lucaslie lucaslie moved this from Backlog to In progress in AutoDeploy Board Oct 16, 2025
@lucaslie lucaslie moved this from In progress to In review in AutoDeploy Board Oct 16, 2025
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
@lucaslie lucaslie marked this pull request as ready for review October 16, 2025 22:43
@lucaslie lucaslie requested a review from a team as a code owner October 16, 2025 22:44

@nv-guomingz nv-guomingz left a comment


Overall LGTM on doc part.

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
@lucaslie

/bot run

@lucaslie lucaslie enabled auto-merge (squash) October 17, 2025 15:11
@tensorrt-cicd

PR_Github #21713 [ run ] triggered by Bot. Commit: 2e1db16

@lucaslie lucaslie added the AutoDeploy <NV> AutoDeploy Backend label Oct 17, 2025
@tensorrt-cicd

PR_Github #21713 [ run ] completed with state SUCCESS. Commit: 2e1db16
/LLM/main/L0_MergeRequest_PR pipeline #16361 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@lucaslie lucaslie merged commit 55fed18 into NVIDIA:main Oct 17, 2025
7 of 9 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in AutoDeploy Board Oct 17, 2025
@lucaslie lucaslie deleted the haoguo/old-optimizer-cleanup branch October 17, 2025 20:45
govind-ramnarayan pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Oct 21, 2025
…IDIA#8039)

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Oct 24, 2025
…IDIA#8039)

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 1, 2025
…IDIA#8039)

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
…IDIA#8039)

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>