
Conversation

@nv-guomingz nv-guomingz (Collaborator) commented Dec 30, 2025

Summary by CodeRabbit

  • Refactor
    • Optimized the tensor parallelism strategy for Nemotron-H models with support for an attention data parallelism mode, improving distributed inference efficiency and weight-mapping configuration.


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force-run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
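
As a concrete example, a contributor who wants to re-run only one test stage with fail-fast disabled could comment the following (the stage name is the placeholder used in the usage text above, not a guaranteed real stage; look up real names via the CI overview doc and the mapping helper script):

    /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast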

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

@nv-guomingz nv-guomingz requested review from a team as code owners December 30, 2025 15:32
@nv-guomingz nv-guomingz force-pushed the user/guomingz/adp_nemotron_super_v3 branch from 4a342dd to 7fe7dab on December 30, 2025 15:33
@coderabbitai coderabbitai bot (Contributor) commented Dec 30, 2025

📝 Walkthrough

This pull request adds support for attention data-parallel (attention DP) mode across multiple TensorRT LLM modules. When enabled, the flag causes conditional initialization of tensor parallelism configurations, adjusted weight mapping for Mamba2 mixers, conditional AllReduce instantiation, modified embedding strategies, and conditional tp_size adjustments in cache management. Changes maintain backward compatibility by preserving original behavior when the flag is disabled.

Changes

  • Weight Mapping & Configuration (tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py): Derives mamba_tp_size and mamba_tp_rank from the enable_attention_dp flag; replaces hard tp_size/tp_rank splits with these derived values in _split_mamba2_mixer_in_proj for the in_proj, A/D/dt_bias, convolution, and norm.weight paths.
  • Model Architecture & Initialization (tensorrt_llm/_torch/models/modeling_nemotron_h.py): Adds a reduce_output parameter to TransformerLayer; conditionally creates AllReduce in NemotronHMOE (disabled when enable_attention_dp is true); passes an explicit reduce_output=False to the shared-experts MLP; conditionally initializes Embedding with or without sharding/mapping based on enable_attention_dp.
  • Mamba2 Mixer Configuration (tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py): Constructs a conditional Mapping when enable_attention_dp is true, with world_size=pp_size, tp_size=1, and the flag enabled; forces tp_size=1 and tp_rank=0 in this mode; preserves the original behavior otherwise.
  • MLP Layer Handling (tensorrt_llm/_torch/modules/mlp.py): Creates a conditional Mapping based on enable_attention_dp; replaces config.mapping.tp_size references with the computed mapping.tp_size; propagates the new Mapping to the up_proj and down_proj layers.
  • Cache Management (tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py): Sets tp_size to 1 when enable_attention_dp is enabled; otherwise uses the standard mapping.tp_size.

All five files follow the same enable_attention_dp conditional; a minimal consolidated sketch is shown below.
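
For orientation, here is a minimal sketch of that shared conditional-mapping pattern, pieced together from the summaries above and the review diffs later in this page. The helper name resolve_mapping is hypothetical; the real code inlines this logic in each module's __init__.

    from tensorrt_llm.mapping import Mapping

    def resolve_mapping(config):
        """Pick the Mapping a module should use for weight sharding (sketch only)."""
        if config.mapping.enable_attention_dp:
            # Attention DP: each rank keeps a full (unsplit) copy of the Mamba/MLP
            # weights, so local tensor parallelism collapses to tp_size=1.
            return Mapping(
                world_size=config.mapping.pp_size,
                tp_size=1,
                pp_size=config.mapping.pp_size,
                rank=config.mapping.rank,
                gpus_per_node=config.mapping.gpus_per_node,
                enable_attention_dp=True,
            )
        # Default: reuse the model-wide mapping unchanged.
        return config.mapping

    # The Mamba cache manager applies the same rule to its partitioning factor:
    #     tp_size = 1 if mapping.enable_attention_dp else mapping.tp_size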

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description check (⚠️ Warning): The PR description is mostly empty; only the template structure remains, without content explaining the issue, solution, test coverage, or implementation details. Resolution: add a comprehensive description explaining what attention DP is, why it is needed for Nemotron Super v3, how it is implemented, and which tests validate these changes.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title clearly identifies the feature (attention data-parallel) for a specific model (Nemotron Super v3), with a proper JIRA ticket reference and type annotation.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
tensorrt_llm/_torch/modules/mlp.py (1)

43-43: Pre-existing issue: config is accessed before the None-check.

Line 26 accesses config.mapping.enable_attention_dp before line 43 where config = config or ModelConfig() provides the fallback. If config is None, this will raise an AttributeError.

This appears to be a pre-existing issue, but the new code makes it more prominent. Consider moving the None-check earlier or adding a guard.

Suggested fix
     def __init__(self,
                  *,
                  hidden_size: int,
                  intermediate_size: int,
                  bias: bool,
                  activation: Callable[[torch.Tensor], torch.Tensor] = None,
                  dtype: Optional[torch.dtype] = None,
                  config: Optional[ModelConfig] = None,
                  layer_idx: Optional[int] = None,
                  reduce_output: bool = True):
+        config = config or ModelConfig()
         if config.mapping.enable_attention_dp:
             mapping = Mapping(
                 world_size=config.mapping.pp_size,
                 tp_size=1,
                 pp_size=config.mapping.pp_size,
                 rank=config.mapping.rank,
                 gpus_per_node=config.mapping.gpus_per_node,
                 enable_attention_dp=True,
             )
         else:
             mapping = config.mapping
         super().__init__()
         self.layer_idx = layer_idx
         self.hidden_size = hidden_size
         self.intermediate_size = intermediate_size
         self.activation = activation

-        config = config or ModelConfig()
         self.up_lora = LoraLayer([LoraModuleType.MLP_H_TO_4H],
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (2)

59-76: Line 59 assignment is redundant.

self.mapping is assigned at line 59, but it's immediately overwritten in both branches of the conditional (lines 63-70 or line 74). Consider removing line 59 for clarity.

Suggested simplification
         config = config or ModelConfig()
-        self.mapping = config.mapping

         if config.mapping.enable_attention_dp:
             from tensorrt_llm.mapping import Mapping
             self.mapping = Mapping(
                 world_size=config.mapping.pp_size,
                 tp_size=1,
                 pp_size=config.mapping.pp_size,
                 rank=config.mapping.rank,
                 gpus_per_node=config.mapping.gpus_per_node,
                 enable_attention_dp=True,
             )
             tp_size = 1
             tp_rank = 0
         else:
             self.mapping = config.mapping
             tp_size = config.mapping.tp_size
             tp_rank = config.mapping.tp_rank

1-1: Update copyright year to include 2025.

Per coding guidelines, the copyright header should include the year of latest meaningful modification. Since this file is being modified in 2025, update the copyright range.

-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (1)

1-1: Consider adding SPDX copyright header.

Per coding guidelines, Python files should contain an NVIDIA copyright header. This file is missing the standard header present in other files.

Suggested addition
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import torch
tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)

1-1: Update copyright year to include 2025.

Per coding guidelines, the copyright header should reflect the year of latest modification.

-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74832a1 and 7fe7dab.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: some_file.py
Python classes should use PascalCase naming: class SomeClass
Python functions and methods should use snake_case naming: def my_awesome_function():
Python local variables should use snake_case naming: my_variable = ...
Python variable names that start with a number should be prefixed with 'k': k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G': G_MY_GLOBAL = ...
Python constants should use upper snake_case naming: MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic

Files:

  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
  • tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
  • tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
🧠 Learnings (14)
📓 Common learnings
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: nvyocox
Repo: NVIDIA/TensorRT-LLM PR: 10117
File: tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py:336-339
Timestamp: 2025-12-19T06:31:54.973Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, the cast to torch.float16 for qkv_node before creating the AttentionPlugin is intentional and required because DriveOS LLM expects float16 dtype specifically. This should not be changed to preserve original dtype or made configurable for bfloat16 models in the DriveOS LLM ONNX export path.
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
  • tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
  • tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
  • tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
  • tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
📚 Learning: 2025-12-19T06:31:54.973Z
Learnt from: nvyocox
Repo: NVIDIA/TensorRT-LLM PR: 10117
File: tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py:336-339
Timestamp: 2025-12-19T06:31:54.973Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, the cast to torch.float16 for qkv_node before creating the AttentionPlugin is intentional and required because DriveOS LLM expects float16 dtype specifically. This should not be changed to preserve original dtype or made configurable for bfloat16 models in the DriveOS LLM ONNX export path.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
📚 Learning: 2025-12-12T03:27:08.565Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 9655
File: tensorrt_llm/_torch/pyexecutor/sampler.py:3031-3031
Timestamp: 2025-12-12T03:27:08.565Z
Learning: In files under tensorrt_llm/_torch/pyexecutor, avoid accessing torch.Tensor objects inside for-loops when iterating over requests. Convert batched tensors to Python lists beforehand using tensor.tolist(), and then iterate over those lists. This improves performance by reducing tensor-bound operations inside hot loops. Apply this pattern to similar code paths that process batches to access simple Python data structures (lists) inside loops.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tensorrt_llm/_torch/modules/mlp.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (1)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • tp_size (64-65)
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (2)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • tp_size (64-65)
tensorrt_llm/_torch/utils.py (1)
  • split (394-402)
tensorrt_llm/_torch/modules/mlp.py (4)
tensorrt_llm/mapping.py (1)
  • Mapping (363-542)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • pp_size (60-61)
  • tp_size (64-65)
tensorrt_llm/_torch/peft/lora/layer.py (2)
  • LoraLayer (85-156)
  • LoraModuleType (7-82)
tensorrt_llm/_torch/modules/linear.py (1)
  • Linear (2045-2297)
tensorrt_llm/_torch/models/modeling_nemotron_h.py (5)
tensorrt_llm/_torch/modules/linear.py (1)
  • Linear (2045-2297)
tensorrt_llm/_torch/distributed/ops.py (1)
  • AllReduce (637-850)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • tp_size (64-65)
tensorrt_llm/_torch/modules/embedding.py (1)
  • Embedding (203-283)
tensorrt_llm/_torch/model_config.py (1)
  • torch_dtype (179-184)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (1)

48-48: LGTM!

The conditional tp_size logic correctly disables tensor-parallel partitioning for Mamba cache when attention data parallelism is enabled. This aligns with the broader attention DP pattern introduced across the PR.

tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (1)

16-20: LGTM!

The mamba_tp_size and mamba_tp_rank derivation correctly handles the attention DP case, ensuring Mamba layers receive unsplit weights when attention data parallelism is enabled.
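
For readers following along, a hedged sketch of what that derivation looks like; variable names follow the Changes summary above, and the split call is only illustrative (it may not match the helper's exact signature in tensorrt_llm/_torch/utils.py):

    def derive_mamba_tp(mapping):
        """Return (mamba_tp_size, mamba_tp_rank) under attention DP vs. plain TP (sketch)."""
        if mapping.enable_attention_dp:
            return 1, 0  # Mamba weights stay unsplit on each rank
        return mapping.tp_size, mapping.tp_rank

    # Each weight path (in_proj, A/D/dt_bias, conv1d, norm.weight) is then split
    # with the derived values, e.g. (assumed helper signature):
    #     shard = split(full_weight, mamba_tp_size, mamba_tp_rank)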

tensorrt_llm/_torch/models/modeling_nemotron_h.py (5)

88-103: LGTM!

The reduce_output parameter correctly propagates to the parent Attention class, enabling conditional all-reduce behavior based on attention data parallelism settings.


161-170: LGTM!

Setting reduce_output=False for shared experts is correct since the all-reduce is performed at the MoE level (lines 278-279) after combining shared and routed outputs.


200-207: LGTM!

The conditional AllReduce instantiation properly handles the attention DP case. The forward method at lines 278-279 correctly guards the allreduce call with both enable_attention_dp and tp_size checks.


323-328: LGTM!

The reduce_output logic correctly determines when all-reduce is needed based on both attention DP mode and tensor parallelism size.


371-388: LGTM!

The conditional embedding strategy is correct. The comment clearly explains why parallel embedding can't be used with attention DP (varying problem sizes across ranks). Using a simple non-sharded Embedding when attention DP is enabled avoids the all-reduce incompatibility.
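
A compact sketch of the two conditional choices discussed in these comments. Embedding stands in for tensorrt_llm/_torch/modules/embedding.Embedding, and the constructor keywords are assumptions for illustration, not the exact signatures in modeling_nemotron_h.py:

    def build_embedding(model_config, vocab_size, hidden_size, dtype):
        if model_config.mapping.enable_attention_dp:
            # Plain, non-sharded embedding: ranks serve different requests, so a
            # parallel-embedding all-reduce would mix unrelated problem sizes.
            return Embedding(vocab_size, hidden_size, dtype=dtype)
        # Original sharded path, driven by the model-wide mapping.
        return Embedding(vocab_size, hidden_size, dtype=dtype,
                         mapping=model_config.mapping)

    # The per-layer all-reduce decision follows the same rule (sketch):
    #     reduce_output = (not mapping.enable_attention_dp) and mapping.tp_size > 1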

tensorrt_llm/_torch/modules/mlp.py (1)

26-36: The rank validation gap when enable_attention_dp=True is intentional but acknowledged as incomplete.

When enable_attention_dp=True, the Mapping constructor intentionally skips rank validation (see the TODO comment at lines 204-208 of tensorrt_llm/mapping.py: "skip check for enable_attention_dp temporarily, will support attention_dp_size"). This means config.mapping.rank from the original mapping can exceed pp_size - 1 without triggering a validation error.

This is a known limitation with a TODO comment in the Mapping class. While the pattern is used consistently in other modules (e.g., mamba2_mixer.py), consider either:

  1. Adding a comment in the code explaining why this rank inconsistency is acceptable in the attention_dp case, or
  2. Ensuring documentation clarifies the rank constraints when this feature is enabled.
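
A sketch of option 1, reusing the Mapping construction from the suggested fix earlier in this review; the comment wording is illustrative only:

    if config.mapping.enable_attention_dp:
        # NOTE: rank is carried over from the original mapping and may exceed
        # pp_size - 1 here; Mapping deliberately skips rank validation when
        # enable_attention_dp=True (see the TODO in tensorrt_llm/mapping.py).
        mapping = Mapping(
            world_size=config.mapping.pp_size,
            tp_size=1,
            pp_size=config.mapping.pp_size,
            rank=config.mapping.rank,
            gpus_per_node=config.mapping.gpus_per_node,
            enable_attention_dp=True,
        )
    else:
        mapping = config.mapping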

@nv-guomingz nv-guomingz force-pushed the user/guomingz/adp_nemotron_super_v3 branch 2 times, most recently from 66d1e62 to 697699c on December 30, 2025 17:00
@nv-guomingz (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #30192 [ run ] triggered by Bot. Commit: 697699c

@tensorrt-cicd (Collaborator)

PR_Github #30192 [ run ] completed with state SUCCESS. Commit: 697699c
/LLM/main/L0_MergeRequest_PR pipeline #23238 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@Wanli-Jiang (Collaborator)

Checked locally and verified that trtllm-serve runs well with this attention-DP (ADP) change.

@yechank-nvidia yechank-nvidia (Collaborator) left a comment

LGTM.

@nv-guomingz nv-guomingz force-pushed the user/guomingz/adp_nemotron_super_v3 branch from e40b51d to ad1924e on December 31, 2025 08:15
@HuiGao-NV HuiGao-NV (Collaborator) left a comment

LGTM

@nv-guomingz nv-guomingz force-pushed the user/guomingz/adp_nemotron_super_v3 branch 2 times, most recently from af00463 to 4562534 on January 5, 2026 08:56
@nv-guomingz (Collaborator Author)

/bot run --add-multi-gpu-test

@nv-guomingz nv-guomingz enabled auto-merge (squash) January 5, 2026 08:56
@tensorrt-cicd (Collaborator)

PR_Github #30578 [ run ] triggered by Bot. Commit: 4562534

@tensorrt-cicd (Collaborator)

PR_Github #30578 [ run ] completed with state FAILURE. Commit: 4562534
/LLM/main/L0_MergeRequest_PR pipeline #23589 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@tensorrt-cicd (Collaborator)

PR_Github #30742 [ run ] completed with state SUCCESS. Commit: f96e0a8
/LLM/main/L0_MergeRequest_PR pipeline #23725 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nv-guomingz nv-guomingz force-pushed the user/guomingz/adp_nemotron_super_v3 branch from f96e0a8 to 4499603 on January 7, 2026 02:22
@nv-guomingz (Collaborator Author)

/bot run --only-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #30808 [ run ] triggered by Bot. Commit: 4499603

@tensorrt-cicd (Collaborator)

PR_Github #30808 [ run ] completed with state SUCCESS. Commit: 4499603
/LLM/main/L0_MergeRequest_PR pipeline #23789 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nv-guomingz (Collaborator Author)

/bot run --only-multi-gpu-test --disabla-fail-fast

@nv-guomingz (Collaborator Author)

/bot run --only-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #30856 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --disabla-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #30857 [ run ] triggered by Bot. Commit: 4499603

@tensorrt-cicd (Collaborator)

PR_Github #30857 [ run ] completed with state SUCCESS. Commit: 4499603
/LLM/main/L0_MergeRequest_PR pipeline #23826 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nv-guomingz (Collaborator Author)

/bot run --only-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #30962 [ run ] triggered by Bot. Commit: 4499603

@tensorrt-cicd (Collaborator)

PR_Github #30962 [ run ] completed with state SUCCESS. Commit: 4499603
/LLM/main/L0_MergeRequest_PR pipeline #23922 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nv-guomingz nv-guomingz force-pushed the user/guomingz/adp_nemotron_super_v3 branch from 4499603 to 1077cea on January 8, 2026 14:30
@nv-guomingz (Collaborator Author)

/bot run --only-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #31079 [ run ] triggered by Bot. Commit: 1077cea

@tensorrt-cicd (Collaborator)

PR_Github #31079 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 AM PST on 1/8.

@nv-guomingz nv-guomingz force-pushed the user/guomingz/adp_nemotron_super_v3 branch from 1077cea to 8aef50a on January 9, 2026 01:29
@nv-guomingz (Collaborator Author)

/bot run --only-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #31165 [ run ] triggered by Bot. Commit: 8aef50a

@nv-guomingz nv-guomingz force-pushed the user/guomingz/adp_nemotron_super_v3 branch from 8aef50a to ad1590c on January 9, 2026 03:05
@nv-guomingz (Collaborator Author)

/bot run --only-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #31180 [ run ] triggered by Bot. Commit: ad1590c

@tensorrt-cicd (Collaborator)

PR_Github #31165 [ run ] completed with state ABORTED. Commit: 8aef50a

@nv-guomingz nv-guomingz force-pushed the user/guomingz/adp_nemotron_super_v3 branch from ad1590c to 026eb26 on January 9, 2026 17:18
@nv-guomingz nv-guomingz force-pushed the user/guomingz/adp_nemotron_super_v3 branch from 026eb26 to 598de12 on January 9, 2026 17:19
@nv-guomingz nv-guomingz disabled auto-merge January 9, 2026 17:19
@nv-guomingz (Collaborator Author)

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #31282 [ run ] triggered by Bot. Commit: 598de12
