[TRTLLM-10060][feat] Enable attention dp for Nemotron Super v3. #10347

base: main
Conversation
4a342dd to 7fe7dab (Compare)
📝 Walkthrough

This pull request adds support for attention data-parallel (attention DP) mode across multiple TensorRT LLM modules. When enabled, the flag causes conditional initialization of tensor parallelism configurations, adjusted weight mapping for Mamba2 mixers, conditional AllReduce instantiation, a modified embedding strategy, and conditional tp_size adjustments in cache management. The changes maintain backward compatibility by preserving the original behavior when the flag is disabled.
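The recurring pattern behind these changes, condensed into a standalone helper purely for illustration (the `select_mapping` name is not part of the PR; the actual modules inline this logic, as the suggestion diffs further down show):

```python
from tensorrt_llm.mapping import Mapping


def select_mapping(config):
    """Pick the mapping a module should use (sketch of the PR's pattern)."""
    if config.mapping.enable_attention_dp:
        # Attention DP: every rank keeps full, unsharded weights; only
        # pipeline parallelism remains, so tp_size collapses to 1.
        return Mapping(
            world_size=config.mapping.pp_size,
            tp_size=1,
            pp_size=config.mapping.pp_size,
            rank=config.mapping.rank,
            gpus_per_node=config.mapping.gpus_per_node,
            enable_attention_dp=True,
        )
    # Flag disabled: original behavior, reuse the model-wide mapping.
    return config.mapping
```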
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
🧹 Nitpick comments (5)
tensorrt_llm/_torch/modules/mlp.py (1)
43-43: Pre-existing issue: `config` is accessed before the None-check.

Line 26 accesses `config.mapping.enable_attention_dp` before line 43, where `config = config or ModelConfig()` provides the fallback. If `config` is `None`, this will raise an `AttributeError`. This appears to be a pre-existing issue, but the new code makes it more prominent. Consider moving the None-check earlier or adding a guard.
Suggested fix
```diff
 def __init__(self,
              *,
              hidden_size: int,
              intermediate_size: int,
              bias: bool,
              activation: Callable[[torch.Tensor], torch.Tensor] = None,
              dtype: Optional[torch.dtype] = None,
              config: Optional[ModelConfig] = None,
              layer_idx: Optional[int] = None,
              reduce_output: bool = True):
+    config = config or ModelConfig()
     if config.mapping.enable_attention_dp:
         mapping = Mapping(
             world_size=config.mapping.pp_size,
             tp_size=1,
             pp_size=config.mapping.pp_size,
             rank=config.mapping.rank,
             gpus_per_node=config.mapping.gpus_per_node,
             enable_attention_dp=True,
         )
     else:
         mapping = config.mapping

     super().__init__()
     self.layer_idx = layer_idx
     self.hidden_size = hidden_size
     self.intermediate_size = intermediate_size
     self.activation = activation
-    config = config or ModelConfig()

     self.up_lora = LoraLayer([LoraModuleType.MLP_H_TO_4H],
```

tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (2)
59-76: Line 59 assignment is redundant.
`self.mapping` is assigned at line 59, but it's immediately overwritten in both branches of the conditional (lines 63-70 or line 74). Consider removing line 59 for clarity.

Suggested simplification
```diff
 config = config or ModelConfig()
-self.mapping = config.mapping
 if config.mapping.enable_attention_dp:
     from tensorrt_llm.mapping import Mapping
     self.mapping = Mapping(
         world_size=config.mapping.pp_size,
         tp_size=1,
         pp_size=config.mapping.pp_size,
         rank=config.mapping.rank,
         gpus_per_node=config.mapping.gpus_per_node,
         enable_attention_dp=True,
     )
     tp_size = 1
     tp_rank = 0
 else:
     self.mapping = config.mapping
     tp_size = config.mapping.tp_size
     tp_rank = config.mapping.tp_rank
```
1-1: Update copyright year to include 2025.

Per coding guidelines, the copyright header should include the year of the latest meaningful modification. Since this file is being modified in 2025, update the copyright range.
```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (1)
1-1: Consider adding SPDX copyright header.

Per coding guidelines, Python files should contain an NVIDIA copyright header. This file is missing the standard header present in other files.
Suggested addition
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import torch
```

tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)
1-1: Update copyright year to include 2025.

Per coding guidelines, the copyright header should reflect the year of latest modification.
```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
- tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
- tensorrt_llm/_torch/modules/mlp.py
- tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: `some_file.py`
Python classes should use PascalCase naming: `class SomeClass`
Python functions and methods should use snake_case naming: `def my_awesome_function():`
Python local variables should use snake_case naming: `my_variable = ...`
Python variable names that start with a number should be prefixed with 'k': `k_99th_percentile = ...`
Python global variables should use upper snake_case with prefix 'G': `G_MY_GLOBAL = ...`
Python constants should use upper snake_case naming: `MY_CONSTANT = ...`
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
Files:
- tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
- tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tensorrt_llm/_torch/modules/mlp.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
**/*.{cpp,h,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Files:
- tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
- tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tensorrt_llm/_torch/modules/mlp.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
🧠 Learnings (14)
📓 Common learnings
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: nvyocox
Repo: NVIDIA/TensorRT-LLM PR: 10117
File: tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py:336-339
Timestamp: 2025-12-19T06:31:54.973Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, the cast to torch.float16 for qkv_node before creating the AttentionPlugin is intentional and required because DriveOS LLM expects float16 dtype specifically. This should not be changed to preserve original dtype or made configurable for bfloat16 models in the DriveOS LLM ONNX export path.
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
- tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
- tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Applied to files:
- tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
- tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
- tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
- tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
- tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
- tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tensorrt_llm/_torch/modules/mlp.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
📚 Learning: 2025-12-19T06:31:54.973Z
Learnt from: nvyocox
Repo: NVIDIA/TensorRT-LLM PR: 10117
File: tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py:336-339
Timestamp: 2025-12-19T06:31:54.973Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, the cast to torch.float16 for qkv_node before creating the AttentionPlugin is intentional and required because DriveOS LLM expects float16 dtype specifically. This should not be changed to preserve original dtype or made configurable for bfloat16 models in the DriveOS LLM ONNX export path.
Applied to files:
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
- tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
📚 Learning: 2025-12-12T03:27:08.565Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 9655
File: tensorrt_llm/_torch/pyexecutor/sampler.py:3031-3031
Timestamp: 2025-12-12T03:27:08.565Z
Learning: In files under tensorrt_llm/_torch/pyexecutor, avoid accessing torch.Tensor objects inside for-loops when iterating over requests. Convert batched tensors to Python lists beforehand using tensor.tolist(), and then iterate over those lists. This improves performance by reducing tensor-bound operations inside hot loops. Apply this pattern to similar code paths that process batches to access simple Python data structures (lists) inside loops.
Applied to files:
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Applied to files:
- tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
- tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tensorrt_llm/_torch/modules/mlp.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (1)
- tensorrt_llm/_torch/distributed/communicator.py (1)
  - tp_size (64-65)

tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (2)
- tensorrt_llm/_torch/distributed/communicator.py (1)
  - tp_size (64-65)
- tensorrt_llm/_torch/utils.py (1)
  - split (394-402)

tensorrt_llm/_torch/modules/mlp.py (4)
- tensorrt_llm/mapping.py (1)
  - Mapping (363-542)
- tensorrt_llm/_torch/distributed/communicator.py (2)
  - pp_size (60-61)
  - tp_size (64-65)
- tensorrt_llm/_torch/peft/lora/layer.py (2)
  - LoraLayer (85-156)
  - LoraModuleType (7-82)
- tensorrt_llm/_torch/modules/linear.py (1)
  - Linear (2045-2297)

tensorrt_llm/_torch/models/modeling_nemotron_h.py (5)
- tensorrt_llm/_torch/modules/linear.py (1)
  - Linear (2045-2297)
- tensorrt_llm/_torch/distributed/ops.py (1)
  - AllReduce (637-850)
- tensorrt_llm/_torch/distributed/communicator.py (1)
  - tp_size (64-65)
- tensorrt_llm/_torch/modules/embedding.py (1)
  - Embedding (203-283)
- tensorrt_llm/_torch/model_config.py (1)
  - torch_dtype (179-184)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (1)
48-48: LGTM!

The conditional `tp_size` logic correctly disables tensor-parallel partitioning for the Mamba cache when attention data parallelism is enabled. This aligns with the broader attention DP pattern introduced across the PR.
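As a rough sketch of what that conditional amounts to for cache sizing (the helper below is hypothetical, not the cache manager's literal code):

```python
def mamba_cache_tp_size(mapping) -> int:
    """Effective TP degree used when sizing per-rank Mamba cache buffers."""
    # Under attention DP each rank keeps the full Mamba state, so the cache
    # must not be partitioned along the tensor-parallel dimension.
    return 1 if mapping.enable_attention_dp else mapping.tp_size
```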
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (1)

16-20: LGTM!

The `mamba_tp_size` and `mamba_tp_rank` derivation correctly handles the attention DP case, ensuring Mamba layers receive unsplit weights when attention data parallelism is enabled.
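Sketched in isolation, with a hypothetical `split_dim0` standing in for the mapper's real split helper, the approved behavior looks roughly like this:

```python
import torch


def split_dim0(weight: torch.Tensor, tp_size: int, tp_rank: int) -> torch.Tensor:
    """Stand-in for the mapper's TP split: keep this rank's slice along dim 0."""
    return torch.chunk(weight, tp_size, dim=0)[tp_rank]


def map_mamba_weight(weight: torch.Tensor, mapping) -> torch.Tensor:
    # With attention DP the Mamba mixer is unsharded, so the checkpoint weight
    # is loaded whole (tp_size=1, tp_rank=0); otherwise it is split across ranks.
    mamba_tp_size = 1 if mapping.enable_attention_dp else mapping.tp_size
    mamba_tp_rank = 0 if mapping.enable_attention_dp else mapping.tp_rank
    return split_dim0(weight, mamba_tp_size, mamba_tp_rank)
```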
tensorrt_llm/_torch/models/modeling_nemotron_h.py (5)

88-103: LGTM!

The `reduce_output` parameter correctly propagates to the parent `Attention` class, enabling conditional all-reduce behavior based on attention data parallelism settings.
161-170: LGTM!

Setting `reduce_output=False` for shared experts is correct since the all-reduce is performed at the MoE level (lines 278-279) after combining shared and routed outputs.
200-207: LGTM!

The conditional `AllReduce` instantiation properly handles the attention DP case. The forward method at lines 278-279 correctly guards the `allreduce` call with both `enable_attention_dp` and `tp_size` checks.
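Read together with the shared-expert note above, the shape of the logic being approved is roughly the following. This is a hedged reconstruction from these descriptions, not the file's literal code, and the `AllReduce` constructor arguments are an assumption:

```python
import torch
from tensorrt_llm._torch.distributed.ops import AllReduce


class MoEBlockSketch(torch.nn.Module):
    def __init__(self, mapping):
        super().__init__()
        self.mapping = mapping
        # Build the all-reduce op only when attention DP is off; with attention
        # DP every rank works on its own tokens and never reduces activations.
        self.allreduce = None if mapping.enable_attention_dp else AllReduce(mapping=mapping)

    def forward(self, routed_out: torch.Tensor, shared_out: torch.Tensor) -> torch.Tensor:
        # Shared experts are built with reduce_output=False, so both partial
        # results are combined first and reduced once at the MoE level.
        out = routed_out + shared_out
        if not self.mapping.enable_attention_dp and self.mapping.tp_size > 1:
            out = self.allreduce(out)
        return out
```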
323-328: LGTM!

The `reduce_output` logic correctly determines when all-reduce is needed based on both attention DP mode and tensor parallelism size.
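One plausible formulation of that decision, stated as an assumption based on the description above rather than a quote of the file:

```python
def needs_output_allreduce(mapping) -> bool:
    # Reduce layer output across TP ranks only when TP sharding is active and
    # attention DP has not already collapsed the module's TP degree to 1.
    return (not mapping.enable_attention_dp) and mapping.tp_size > 1
```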
371-388: LGTM!

The conditional embedding strategy is correct. The comment clearly explains why parallel embedding can't be used with attention DP (varying problem sizes across ranks). Using a simple non-sharded `Embedding` when attention DP is enabled avoids the all-reduce incompatibility.
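A minimal sketch of that choice, assuming a caller-supplied factory for the TP-sharded variant (the factory and function names are hypothetical; the model actually uses TensorRT-LLM's `Embedding` module):

```python
from torch import nn


def build_embedding(vocab_size: int, hidden_size: int, mapping, make_tp_embedding) -> nn.Module:
    if mapping.enable_attention_dp:
        # Ranks process different numbers of tokens, so the usual shard-then-
        # all-reduce embedding cannot be used; keep a plain replicated table.
        return nn.Embedding(vocab_size, hidden_size)
    # Otherwise use the vocabulary-sharded embedding whose output is reduced
    # across TP ranks (TensorRT-LLM's parallel Embedding in the real model).
    return make_tp_embedding(vocab_size, hidden_size, mapping)
```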
tensorrt_llm/_torch/modules/mlp.py (1)

26-36: The rank validation gap when `enable_attention_dp=True` is intentional but acknowledged as incomplete.

When `enable_attention_dp=True`, the `Mapping` constructor intentionally skips rank validation (see the `tensorrt_llm/mapping.py` lines 204-208 TODO comment: "skip check for enable_attention_dp temporarily, will support attention_dp_size"). This means `config.mapping.rank` from the original mapping can exceed `pp_size - 1` without triggering a validation error.

This is a known limitation with a TODO comment in the Mapping class. While the pattern is used consistently in other modules (e.g., `mamba2_mixer.py`), consider either of the following (a sketch of the first option follows the list):
- Adding a comment in the code explaining why this rank inconsistency is acceptable in the attention_dp case, or
- Ensuring documentation clarifies the rank constraints when this feature is enabled.
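For the first option, the added comment could look roughly like this. It mirrors the conditional already sketched in the walkthrough section and is purely illustrative; the exact wording and placement are up to the author:

```python
from tensorrt_llm.mapping import Mapping


def attention_dp_mapping(config):
    """Illustrative placement of the suggested explanatory comment."""
    if config.mapping.enable_attention_dp:
        # NOTE: Mapping intentionally skips its rank-vs-world_size validation
        # when enable_attention_dp=True (see the TODO in tensorrt_llm/mapping.py),
        # so config.mapping.rank is carried over unchecked even though tp_size
        # collapses to 1 here. This is acceptable until attention_dp_size
        # support lands and the check is restored.
        return Mapping(
            world_size=config.mapping.pp_size,
            tp_size=1,
            pp_size=config.mapping.pp_size,
            rank=config.mapping.rank,
            gpus_per_node=config.mapping.gpus_per_node,
            enable_attention_dp=True,
        )
    return config.mapping
```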
66d1e62 to 697699c (Compare)

/bot run

PR_Github #30192 [ run ] triggered by Bot. Commit:

PR_Github #30192 [ run ] completed with state

697699c to e40b51d (Compare)

Checked locally and verified that trtllm-serve runs well with this attention DP change.
yechank-nvidia left a comment

LGTM.
e40b51d to ad1924e (Compare)
HuiGao-NV left a comment

LGTM
af00463 to 4562534 (Compare)
/bot run --add-multi-gpu-test

PR_Github #30578 [ run ] triggered by Bot. Commit:

PR_Github #30578 [ run ] completed with state

PR_Github #30742 [ run ] completed with state
f96e0a8 to 4499603 (Compare)
/bot run --only-multi-gpu-test

PR_Github #30808 [ run ] triggered by Bot. Commit:

PR_Github #30808 [ run ] completed with state

/bot run --only-multi-gpu-test --disabla-fail-fast

/bot run --only-multi-gpu-test --disable-fail-fast

PR_Github #30856 Bot args parsing error: usage: /bot [-h]

PR_Github #30857 [ run ] triggered by Bot. Commit:

PR_Github #30857 [ run ] completed with state

/bot run --only-multi-gpu-test --disable-fail-fast

PR_Github #30962 [ run ] triggered by Bot. Commit:

PR_Github #30962 [ run ] completed with state
4499603 to 1077cea (Compare)
/bot run --only-multi-gpu-test --disable-fail-fast

PR_Github #31079 [ run ] triggered by Bot. Commit:

PR_Github #31079 [ run ] completed with state
1077cea to 8aef50a (Compare)
/bot run --only-multi-gpu-test --disable-fail-fast

PR_Github #31165 [ run ] triggered by Bot. Commit:
8aef50a to ad1590c (Compare)
/bot run --only-multi-gpu-test --disable-fail-fast

PR_Github #31180 [ run ] triggered by Bot. Commit:

PR_Github #31165 [ run ] completed with state
ad1590c to 026eb26 (Compare)
Signed-off-by: nv-guomingz <[email protected]>
026eb26 to 598de12 (Compare)
/bot run --add-multi-gpu-test --disable-fail-fast

PR_Github #31282 [ run ] triggered by Bot. Commit:
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

Details

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.