[None][chore] AutoDeploy: cleanup old inference optimizer configs #8039
lucaslie merged 9 commits into NVIDIA:main
Conversation
lucaslie
left a comment
Before merging this, let's make sure to inform the team and go through the downstream changes like the dashboard and update it accordingly
📝 Walkthrough
Dependency versions updated for CUDA 12.9. Auto-deploy pipeline refactored: configs moved into nested transforms, new VISUALIZE stage and visualization transform added, attention layout config switched to `attn_layout`, KV cache adds `free_mem_ratio` validation, load/build device handling revised, imports consolidated under `utils._graph`, legacy transformations entrypoints removed, tests updated accordingly.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User
    participant ADEngine
    participant InferenceOptimizer
    participant Transforms as Transform Pipeline
    participant CUDA as CUDA/GC
    User->>ADEngine: build_from_config(config)
    ADEngine->>InferenceOptimizer: __call__(cm) with transforms config
    InferenceOptimizer->>Transforms: Execute stages (e.g., load_weights, build, match_layouts)
    alt Optional visualization
        Note over Transforms: VISUALIZE stage (visualize_namespace) if enabled
    end
    Transforms-->>InferenceOptimizer: Optimized GraphModule
    InferenceOptimizer->>CUDA: torch.cuda.empty_cache()
    InferenceOptimizer->>CUDA: gc.collect()
    InferenceOptimizer-->>ADEngine: Optimized module
    ADEngine-->>User: Engine ready
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py (1)
1-1: Add required copyright header. Per coding guidelines, all Python source files must include the NVIDIA Apache-2.0 copyright header at the top of the file.
Add the copyright header at the top of the file:

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Transformation to the graph to render nicely in model_explorer."""
```

Based on coding guidelines.
♻️ Duplicate comments (1)
tensorrt_llm/_torch/auto_deploy/config/default.yaml (1)
153-153: Address past review comment: set free_mem_ratio to 0.0. A past review comment suggested setting `free_mem_ratio` to `0.0` instead of `0.8` to align with the default in the resize_kv_cache transform. Please verify and update if needed. Based on past review comments.
🧹 Nitpick comments (4)
requirements.txt (1)
32-35: Pin versions for nvidia-nccl-cu12 and nvidia-cuda-nvrtc-cu12. Both `nvidia-nccl-cu12` and `nvidia-cuda-nvrtc-cu12` lack version constraints, which can lead to unpredictable dependency resolution and potential breakage when new versions are released. Apply this diff to add version pins:

```diff
-nvidia-nccl-cu12 # <For CUDA 12.9>
+nvidia-nccl-cu12>=2.24.3,<3 # <For CUDA 12.9>
 # nvidia-nccl-cu13
-nvidia-cuda-nvrtc-cu12 # <For CUDA 12.9>
+nvidia-cuda-nvrtc-cu12>=12.9.0,<13 # <For CUDA 12.9>
 # nvidia-cuda-nvrtc
```

Verify the latest compatible versions:

```bash
#!/bin/bash
# Check latest versions of nvidia-nccl-cu12 and nvidia-cuda-nvrtc-cu12
echo "=== nvidia-nccl-cu12 ==="
pip index versions nvidia-nccl-cu12 2>/dev/null | head -10
echo "=== nvidia-cuda-nvrtc-cu12 ==="
pip index versions nvidia-cuda-nvrtc-cu12 2>/dev/null | head -10
```

tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (1)
442-442: Remove commented code. This commented line appears to be leftover from development and should be removed to keep the codebase clean.
Apply this diff:

```diff
-    # "compile_backend": "torch-simple",
```

tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1)
44-119: LGTM! Improved test parameterization with granular backend control. The refactored parameterization using `modle_hub_id` and `transform_args` provides better flexibility for testing different backend combinations.
Note: `modle_hub_id` appears to be a typo. Consider renaming to `model_hub_id` for consistency.
tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py (1)
90-96: Prefix unused parameters with underscore. The parameters `factory` and `shared_config` are required by the `BaseTransform._apply` signature but are not used in this implementation. Per Python convention, prefix them with an underscore to indicate they are intentionally unused.
Apply this diff:

```diff
 def _apply(
     self,
     gm: GraphModule,
     cm: CachedSequenceInterface,
-    factory: ModelFactory,
-    shared_config: SharedConfig,
+    _factory: ModelFactory,
+    _shared_config: SharedConfig,
 ) -> Tuple[GraphModule, TransformInfo]:
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (28)
- requirements.txt (2 hunks)
- tensorrt_llm/_torch/auto_deploy/config/default.yaml (4 hunks)
- tensorrt_llm/_torch/auto_deploy/config/transformers.yaml (1 hunk)
- tensorrt_llm/_torch/auto_deploy/export/export.py (1 hunk)
- tensorrt_llm/_torch/auto_deploy/llm_args.py (3 hunks)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/interface.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (3 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py (1 hunk)
- tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/load_weights.py (4 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (3 hunks)
- tensorrt_llm/_torch/auto_deploy/transformations/__init__.py (0 hunks)
- tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py (0 hunks)
- tensorrt_llm/_torch/auto_deploy/transformations/transform.py (0 hunks)
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1 hunk)
- tensorrt_llm/bench/dataclasses/configuration.py (0 hunks)
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (2 hunks)
- tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py (1 hunk)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py (2 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (1 hunk)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_patches.py (1 hunk)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (6 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (2 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py (1 hunk)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (1 hunk)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (3 hunks)
💤 Files with no reviewable changes (4)
- tensorrt_llm/bench/dataclasses/configuration.py
- tensorrt_llm/_torch/auto_deploy/transformations/transform.py
- tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py
- tensorrt_llm/_torch/auto_deploy/transformations/__init__.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py
- tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
- tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
- tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
- tensorrt_llm/_torch/auto_deploy/transform/library/load_weights.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_patches.py
- tensorrt_llm/_torch/auto_deploy/export/export.py
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py
- tensorrt_llm/_torch/auto_deploy/llm_args.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
- tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py
- tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py
- tensorrt_llm/_torch/auto_deploy/transform/interface.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
- tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py
- tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
- tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
- tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
- tensorrt_llm/_torch/auto_deploy/transform/library/load_weights.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_patches.py
- tensorrt_llm/_torch/auto_deploy/export/export.py
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py
- tensorrt_llm/_torch/auto_deploy/llm_args.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
- tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py
- tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py
- tensorrt_llm/_torch/auto_deploy/transform/interface.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
- tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py
- tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
- tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
- tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py
- tensorrt_llm/_torch/auto_deploy/transform/library/load_weights.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_patches.py
- tensorrt_llm/_torch/auto_deploy/export/export.py
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py
- tensorrt_llm/_torch/auto_deploy/llm_args.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
- tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py
- tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py
- tensorrt_llm/_torch/auto_deploy/transform/interface.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
- tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
🧬 Code graph analysis (16)
tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py (2)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (1): build_and_load_model (242-275)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1): device (218-219)

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)
- tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1): InferenceOptimizer (24-91)
- tensorrt_llm/_torch/auto_deploy/llm.py (1): factory (110-113)

tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (1)
- tensorrt_llm/llmapi/llm_args.py (1): Field (70-97)

tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (2)
- tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1): InferenceOptimizer (24-91)
- tensorrt_llm/_torch/auto_deploy/llm_args.py (1): LlmArgs (234-348)

tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1)
- tensorrt_llm/_torch/auto_deploy/transform/interface.py (1): SharedConfig (61-66)

tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py (2)
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1): move_to_device (135-142)
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (2): get_small_model_config (524-563), get_transforms_config (491-521)

tensorrt_llm/_torch/auto_deploy/transform/library/load_weights.py (3)
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1): move_to_device (135-142)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2): device (218-219), to (542-550)
- tensorrt_llm/_torch/auto_deploy/shim/interface.py (1): to (49-53)

tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (1)
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1): move_to_device (135-142)

tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_patches.py (1)
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1): move_to_device (135-142)

tensorrt_llm/_torch/auto_deploy/export/export.py (1)
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py (4): canonicalize_graph (174-187), lift_to_meta (79-92), load_buffers_and_params (32-68), tree_to (71-75)

tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
- tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1): is_op (179-202)

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (1)
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1): move_to_device (135-142)

tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py (3)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1): ModelFactory (23-266)
- tensorrt_llm/_torch/auto_deploy/shim/interface.py (1): CachedSequenceInterface (11-88)
- tensorrt_llm/_torch/auto_deploy/transform/interface.py (6): BaseTransform (185-480), SharedConfig (61-66), TransformInfo (121-146), TransformRegistry (483-511), register (489-496), _apply (470-480)

tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py (1)
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (2): get_small_model_config (524-563), get_transforms_config (491-521)

tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (3)
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (2): get_small_model_config (524-563), get_transforms_config (491-521)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2): ADEngine (71-297), build_from_config (84-120)
- tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py (1): test_build_ad (20-33)

tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1): add_graph_input (246-309)
🪛 Ruff (0.13.3)
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py
8-8: Unused blanket noqa directive
Remove unused noqa directive
(RUF100)
tensorrt_llm/_torch/auto_deploy/transform/library/visualization.py
94-94: Unused method argument: factory
(ARG002)
95-95: Unused method argument: shared_config
(ARG002)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
128-129: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear
Parenthesize the and subexpression
(RUF021)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (49)
requirements.txt (2)
1-78: Clarify relationship between dependency updates and PR objectives. The PR summary states this PR is about "Cleanup old AD inference optimizer configs," but this requirements.txt file only contains CUDA 12.9 dependency updates with no relation to AD optimizer cleanup. This inconsistency suggests either:
- The requirements.txt changes should be in a separate PR, or
- The PR summary is incomplete.
Please clarify whether these dependency updates are intentional for this PR or should be split into a separate infrastructure/dependency update PR.
1-1: Verify PyTorch wheel availability for CUDA 12.9.
requirements.txt line 1: updating `--extra-index-url` to cu129 returned no wheels; confirm the correct index URL for CUDA 12.9 or revert to cu128 to prevent mismatches.
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py (1)
79-85: LGTM! Configuration structure updated to nested transforms. The change correctly nests the `compile_backend` under `transforms.compile_model` with the required `stage` and `cuda_graph_batch_sizes` fields, aligning with the new transforms-driven configuration model introduced in this PR.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (3)
4-4: LGTM! Import cleanup. Removed unused `Callable` import, keeping the typing imports minimal and focused.
17-17: LGTM! Import path updated to utils module. The import path for `move_to_device` has been correctly updated from the legacy location to the new `utils._graph` module, aligning with the broader refactor across the codebase.
49-49: LGTM! Configuration updated to use attn_layout. The configuration correctly switches from the removed `attention_op` field to the new `attn_layout` field with the string literal `"bsnd"`, matching the API changes in `MatchAttentionLayoutConfig`.
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (4)
3-3: LGTM! Import added for garbage collection. The `gc` module import is correctly added to support the cleanup logic introduced later in the file.
6-6: LGTM! Import added for CUDA cache management. The `torch` import is correctly added to support the `torch.cuda.empty_cache()` cleanup logic.
32-35: LGTM! Formatting improvement for SharedConfig construction. The multi-line format with explicit trailing comma improves readability and follows Python best practices.
89-90: LGTM! Resource cleanup added after transforms. The addition of `torch.cuda.empty_cache()` and `gc.collect()` helps free GPU memory and Python objects after all transforms complete, which is appropriate for long-running inference optimization pipelines.
tensorrt_llm/_torch/auto_deploy/transform/library/attention.py (3)
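The run-transforms-then-clean-up pattern described above can be sketched in isolation. This is an illustrative stand-in, not the actual `InferenceOptimizer` code; the `torch` call is guarded so the sketch also runs on a CPU-only machine:

```python
import gc

def run_transforms(gm, transforms):
    """Apply each transform in order, then release transient memory."""
    for transform in transforms:
        gm = transform(gm)
    # Free cached GPU blocks and unreachable Python objects once the
    # pipeline is done; intermediate graphs can hold large tensors.
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # torch not installed; nothing GPU-side to release
    gc.collect()
    return gm

# Tiny demonstration with identity "transforms" on a plain object.
result = run_transforms({"nodes": 3}, [lambda g: g, lambda g: g])
print(result)
```

Doing the cleanup once at the end of the pipeline, rather than after every transform, keeps the cost of `empty_cache()` and `gc.collect()` off the per-transform hot path.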
5-5: LGTM! Literal type added for layout constraint. The `Literal` import is correctly added to support the new `attn_layout` field with constrained string values.
698-702: LGTM! Configuration simplified to use string literal for layout. The change from `attention_op: Type[AttentionDescriptor]` to `attn_layout: Literal["bsnd", "bnsd"]` simplifies the API by removing the dependency on `AttentionDescriptor` classes and using explicit string literals instead. This makes the configuration more straightforward and type-safe.
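As a sketch of this pattern (a hypothetical `LayoutConfig`, not the real `MatchAttentionLayoutConfig`), a `Literal`-typed field can be checked against its allowed values at runtime:

```python
from dataclasses import dataclass
from typing import Literal, get_args

# The two layouts named in the PR: batch/seq/num_heads/dim vs batch/num_heads/seq/dim.
AttnLayout = Literal["bsnd", "bnsd"]

@dataclass
class LayoutConfig:
    """Hypothetical stand-in for the layout portion of an attention config."""
    attn_layout: AttnLayout = "bsnd"

    def __post_init__(self):
        # Enforce the Literal constraint at runtime, not just at type-check time.
        allowed = get_args(AttnLayout)  # ("bsnd", "bnsd")
        if self.attn_layout not in allowed:
            raise ValueError(f"attn_layout must be one of {allowed}")

print(LayoutConfig(attn_layout="bnsd").attn_layout)  # bnsd
```

A string literal with a closed set of values is easier to serialize into YAML configs than a class reference like `Type[AttentionDescriptor]`, which is presumably part of why the field was switched.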
726-726: LGTM! Condition updated to use new attn_layout field. The condition correctly references `self.config.attn_layout` to check if the backend expects `"bnsd"` layout, matching the new configuration structure.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (1)
1305-1305: LGTM! Test configuration updated to use attn_layout. The test correctly updates the configuration from `"attention_op": MockAttentionDescriptor` to `"attn_layout": layout`, using the parameterized `layout` variable to test both `"bsnd"` and `"bnsd"` values, matching the new API.
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (1)
8-8: LGTM! Import path updated to utils module. The import path for `move_to_device` has been correctly updated from the legacy `transformations._graph` module to the new `utils._graph` module, aligning with the broader import path refactor across the codebase.
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_patches.py (1)
7-7: LGTM! Import path updated to utils module. The import path for `move_to_device` has been correctly updated from the legacy `transformations._graph` module to the new `utils._graph` module, consistent with the import path refactor across the codebase.
tensorrt_llm/_torch/auto_deploy/config/transformers.yaml (1)
26-26: Confirm intentional free_mem_ratio=0.0 override in transformers mode
In `default.yaml`, `free_mem_ratio` defaults to 0.8, and in `kvcache.py` values are validated at 0.0–1.0, yet `transformers.yaml` sets it to 0.0. Reserving no buffer can risk OOM during inference. Confirm this override is intentional for transformers mode and whether its memory management differs to mitigate potential OOM issues.
tensorrt_llm/_torch/auto_deploy/export/export.py (1)
13-13: LGTM: Import path consolidation. The graph utilities are now imported from `..utils._graph`, consolidating graph-related code into a central location. This improves code organization.
tensorrt_llm/_torch/auto_deploy/transform/interface.py (2)
19-25: LGTM: Graph utilities consolidated and expanded. The import path has been updated to `..utils._graph` and now includes `placeholders_on_meta` and `run_shape_prop`, which are used by the transform infrastructure for graph cleanup and shape propagation.
50-50: LGTM: New visualization stage added. The new `VISUALIZE` stage enables graph visualization as a transform step, aligning with the PR's objective to integrate visualization functionality into the new inference optimizer.
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
18-19: LGTM: Imports adjusted for relative paths. The imports have been updated to use relative paths within the `utils` package, which is appropriate since this module is now the canonical location for graph utilities.
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)
28-28: LGTM: Import path updated. The import path has been updated to reflect the new module structure where `InferenceOptimizer` is now located in `transform.optimizer`.
117-117: LGTM: Configuration refactored to nested structure. The instantiation now correctly passes `ad_config.transforms` instead of the entire `ad_config`, reflecting the move to a nested transforms configuration structure.
tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py (1)
80-80: LGTM: Device sourced from context. The device is now sourced from `cm.device` instead of `self.config.device`, which is appropriate for this transform that loads weights. This differs from the parent `BuildModel` class, which still uses `self.config.device` for building without loading weights.
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py (1)
36-38: LGTM: Test updated to use nested transforms config. The test now uses the `get_transforms_config` helper to build the nested transforms configuration, which is consistent with the refactored configuration structure.
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (2)
14-14: LGTM: Import path updated. The import has been updated to use `...utils._graph`, consistent with the consolidation of graph utilities.
224-224: LGTM: Validation constraints added. The `free_mem_ratio` field now includes proper validation constraints (ge=0.0, le=1.0) to ensure the value remains within the valid range for a memory ratio.
tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (6)
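The ge/le constraint behaves like the following plain-Python sketch (the real field uses a pydantic `Field(ge=0.0, le=1.0)`; the function name here is illustrative):

```python
def validate_free_mem_ratio(value: float) -> float:
    """Mirror a Field(ge=0.0, le=1.0) bound check in plain Python."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"free_mem_ratio must be in [0.0, 1.0], got {value}")
    return value

print(validate_free_mem_ratio(0.8))  # 0.8
```

Both endpoints are accepted: 0.0 reserves no free-memory buffer and 1.0 would try to reserve all of it; anything outside that range is rejected at config-validation time rather than surfacing later as an allocation failure.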
7-7: LGTM: Import path updated. The import has been updated to reflect the new module structure where `InferenceOptimizer` is located in `transform.optimizer`.
19-47: LGTM: Test updated for nested transforms configuration. The test has been correctly updated to use the new nested `transforms` configuration structure. Configuration fields like `free_mem_ratio`, `attn_backend`, and `simple_shard_only` are now properly nested under their respective transform keys.
53-65: LGTM: Validation logic moved to InferenceOptimizer. The test now correctly validates `free_mem_ratio` through `InferenceOptimizer` instantiation, which is where the validation logic now resides. This ensures the constraints (ge=0.0, le=1.0) are properly enforced.
90-103: LGTM: Test fixture updated for nested configuration. The test fixture now uses the nested `transforms` structure, consistent with the refactored configuration approach.
159-167: LGTM: Assertions updated to read from nested structure. The assertions have been correctly updated to read values from the nested `transforms` configuration using the appropriate keys.
233-239: LGTM: Test updated to use nested transforms. The test now correctly passes the `transforms` dictionary with nested `insert_cached_attention` configuration, aligning with the new configuration structure.
tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (1)
491-522: LGTM! Well-structured transform config helper. The new helper function cleanly separates transform configuration logic for different modes, making test setup more modular and maintainable.
tensorrt_llm/_torch/auto_deploy/config/default.yaml (5)
42-42: LGTM! Attention layout config additions look good. The explicit `attn_layout` and `expected_layout` configuration keys properly specify the BSND layout for attention and RoPE operations, which aligns with the transform refactoring. Also applies to: 47-47
92-92: LGTM! Checkpoint device configuration is well-designed. The optional `checkpoint_device` field with a `null` default provides flexibility for checkpoint loading while maintaining backward compatibility.
127-131: Verify the TODO comment about visualization errors. The visualization transform is currently disabled with a TODO noting errors. Ensure that the visualization functionality is properly tested and that leaving it disabled is the intended state for this release.
Is the visualization functionality expected to have errors in this PR, or should this be addressed before merging?
139-139: LGTM! FlashInfer as default attention backend is appropriate. Setting `flashinfer` as the default attention backend is a good choice for performance.
159-160: LGTM! Compile model configuration additions are appropriate. The `cuda_graph_batch_sizes` and `compile_backend` fields provide necessary control over compilation behavior.
tensorrt_llm/_torch/auto_deploy/llm_args.py (3)
3-3: LGTM! Import cleanup is appropriate. The removal of `List` from typing imports aligns with the removal of list-typed fields from the class.
137-149: LGTM! Config surface refactored to use transforms. The migration from flat inference optimizer fields to the nested `transforms` configuration simplifies the config surface and provides better modularity. This is a breaking change but aligns with the PR's objectives to consolidate transform configuration.
169-179: LGTM! Validation logic correctly accesses nested transforms. The updated `update_attn_page_size` validation properly navigates the nested transforms configuration for both graph and transformers modes, using safe dictionary access patterns.
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (2)
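The safe-access pattern over a nested transforms dict can be illustrated with a small helper (the helper name is made up; the key names mirror the config discussed in this PR):

```python
def get_transform_option(transforms, transform, key, default=None):
    """Read transforms[transform][key] without raising on missing levels."""
    # .get with a dict fallback avoids KeyError when a transform is absent.
    return transforms.get(transform, {}).get(key, default)

transforms = {"resize_kv_cache": {"free_mem_ratio": 0.0}}
print(get_transform_option(transforms, "resize_kv_cache", "free_mem_ratio"))  # 0.0
print(get_transform_option(transforms, "compile_model", "backend", "torch-simple"))  # torch-simple
```

Chained `.get` calls with defaults keep validation code tolerant of partially specified configs, which matters once options live two levels deep instead of flat on the args object.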
4-4: LGTM! Import updates align with test refactoring. The updated imports properly reference the new test utilities and the ADEngine entrypoint from the shim module.
Also applies to: 8-8
120-152: LGTM! Test function properly refactored to use new config helpers. The test setup correctly builds the transforms configuration and experiment config using the new helper functions, and the monkeypatch logic properly validates the config before delegating to the original build function.
tensorrt_llm/_torch/auto_deploy/transform/library/load_weights.py (4)
10-10: LGTM! Import consolidation improves organization. Moving `move_to_device` to `utils._graph` consolidates graph-related utilities in a more appropriate location.
23-26: LGTM! Renamed field clarifies checkpoint loading intent. The rename from `device` to `checkpoint_device` with an optional type better expresses that this is specifically for checkpoint initialization, with a fallback to the runtime device.
48-50: LGTM! Two-stage device handling is well-designed. The logic correctly uses `checkpoint_device` (if specified) for initial loading and then ensures the model ends up on `cm.device`, providing flexibility for checkpoint loading while guaranteeing correct final placement.
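The two-stage placement can be sketched as follows. This is a hypothetical helper operating on plain dicts; the real transform moves a GraphModule and reads the runtime device from `cm.device`:

```python
from typing import Dict, Optional

def place_weights(state: Dict[str, object],
                  checkpoint_device: Optional[str],
                  runtime_device: str) -> Dict[str, str]:
    """Return the final device per weight after two-stage placement."""
    # Stage 1: load on checkpoint_device when given (e.g. "cpu" to avoid
    # GPU memory spikes during deserialization), else straight to runtime.
    load_device = checkpoint_device if checkpoint_device is not None else runtime_device
    loaded = {name: load_device for name in state}
    # Stage 2: always end up on the runtime device, regardless of stage 1.
    return {name: runtime_device for name in loaded}

print(place_weights({"w1": 0, "w2": 0}, "cpu", "cuda:0"))
```

The point of the split is that stage 1 is a tunable optimization while stage 2 is an invariant: whatever `checkpoint_device` is set to, the caller is guaranteed the model finishes on the runtime device.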
68-70: Verify the necessity of this transform. The TODO comment correctly questions whether this transform is needed, as `cm.to(cm.device)` appears redundant since `cm` should already be on its own device. Consider verifying if this transform can be removed or if there's a specific edge case it addresses.
Can you confirm whether this transform is necessary, or if it can be safely removed?
tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py (1)
23-26: LGTM! The refactoring to use helper functions `get_transforms_config` and `get_small_model_config` improves test maintainability by centralizing configuration logic.
lucaslie
left a comment
Can you double-check if our integrations with trtllm-bench and trtllm-serve still work as expected.
There are two potential issues I see:
- handling of free_mem_ratio
- handling of attention backend
If everything looks good let me know and I can set up a dashboard run and look at the remaining issue regarding the device
FrankD412
left a comment
Approving the one-line change for bench.
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
nv-guomingz
left a comment
Overall LGTM on doc part.
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
/bot run
PR_Github #21713 [ run ] triggered by Bot. Commit:
PR_Github #21713 [ run ] completed with state
…IDIA#8039) Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
…IDIA#8039) Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Summary by CodeRabbit
Description
This PR cleans up the old inference optimizer in AD. Main changes include:
- Moved inference optimizer configs from `ADConfig` to `ADConfig.transforms`; set their values in `transformer.yaml` and `default.yaml`;
- Moved `visualization` to the new inference optimizer as a new transform; TODO: there are no unit tests for it yet.
- Removed the old inference optimizer fields from `ADConfig`;

Test Coverage
TRTLLM-bench
`free_mem_ratio`, `attn_backend`, `compile_backend` passing correctly from `extra_llm_args.yaml` to the inference optimizer. `bench.yaml` is also still compatible.
NOTE: absolute numbers are not indicative at the moment due to the repeat_kv bug
bench command
NOTE: existing `bench.yaml` is compatible but the new one can also be used and is preferred moving forward.
`bench.yaml` format after changes:
Results with new `bench.yaml`:
`bench_old.yaml`:
Results with `bench_old.yaml` on this branch:
Results with `bench_old.yaml` before these changes:
TRTLLM-serve
`attn_backend`, `compile_backend`, `free_mem_ratio` passing correctly from `extra_llm_args.yaml` to the inference optimizer;
runtime: trtllm
world_size: 2
transforms:
resize_kv_cache:
free_mem_ratio: 0.0
compile_model:
cuda_graph_batch_sizes: [1,2,4,8,16,32,64,128,256,512]
backend: torch-cudagraph