Conversation

@benzh-2025
Collaborator

@benzh-2025 benzh-2025 commented Dec 5, 2025

Summary by CodeRabbit

  • New Features

    • Introduced FP4 all-reduce GEMM operation with dynamic auto-tuning for optimized distributed tensor computations.
    • Added fused all-reduce support in linear layers for NVFP4-quantized models, improving performance for distributed inference and training scenarios.
  • Tests

    • Added comprehensive multi-GPU test coverage for FP4 row-wise linear all-reduce operations.


Description

@benzh-2025 benzh-2025 requested review from a team as code owners December 5, 2025 08:12
@benzh-2025 benzh-2025 requested review from liji-nv and yuxianq December 5, 2025 08:12
@coderabbitai
Contributor

coderabbitai bot commented Dec 5, 2025

📝 Walkthrough

This pull request introduces a new fused FP4 GEMM all-reduce optimization for tensor-parallel linear operations. It adds C++ backend support and a tunable PyTorch custom operation with auto-tuner integration, wires a fused execution path into the Linear module behind a conditional flag, and adds multi-GPU test coverage.

Changes

Cohort / File(s) — Summary

Build Configuration
  cpp/tensorrt_llm/thop/CMakeLists.txt
    Added fusedGemmAllreduceOp.cpp to the th_common library source list alongside dsv3RopeOp.cpp.

PyTorch Custom Operations
  tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
    Introduced the new Fp4GemmAllreduceRunner class (extends TunableRunner) with get_valid_tactics() and forward() methods for FP4 GEMM all-reduce tuning and execution. Added the nvfp4_gemm_allreduce() custom torch operation with auto-tuner integration and a corresponding fake registration for TorchScript support (see the illustrative sketch after this table).

Linear Module Integration
  tensorrt_llm/_torch/modules/linear.py
    Added the abstract method apply_linear_allreduce() to LinearMethodBase and implemented it (or stubbed it with NotImplementedError) across multiple concrete linear method classes. Extended Linear.__init__ with a new parameter use_fused_gemm_allreduce: bool = False. Modified Linear.forward() to conditionally invoke the fused all-reduce path when the flag is enabled and the tensor-parallel mode is ROW.

Multi-GPU Tests
  tests/unittest/_torch/multi_gpu/test_linear.py
    Added the utility functions check_accuracy(), fp4_row_linear_allreduce(), and fp4_row_linear_allreduce_run_single_rank(). Implemented test_fp4_row_linear_allreduce() to validate FP4 GEMM all-reduce correctness across multiple GPU ranks using mpi_pool_executor.
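
For readers unfamiliar with the tunable custom-op pattern, here is a minimal sketch of how such an op and its fake (meta) registration are typically wired up with torch.library. The op name, argument list, and output shape below are illustrative assumptions, not the exact signatures added in this PR.

import torch


# Illustrative only: op name, arguments, and shapes are assumptions.
@torch.library.custom_op("example::nvfp4_gemm_allreduce", mutates_args=())
def nvfp4_gemm_allreduce_sketch(act_fp4: torch.Tensor, weight_fp4: torch.Tensor,
                                output_dtype: torch.dtype) -> torch.Tensor:
    # The real op dispatches to the C++ fused GEMM + all-reduce kernel and
    # returns the already-reduced output; a zero tensor stands in here.
    m, n = act_fp4.shape[0], weight_fp4.shape[0]
    return torch.zeros((m, n), dtype=output_dtype, device=act_fp4.device)


@nvfp4_gemm_allreduce_sketch.register_fake
def _(act_fp4, weight_fp4, output_dtype):
    # Fake registration: declares only output shape/dtype so tracing and
    # torch.compile can reason about the op without running the kernel.
    return act_fp4.new_empty((act_fp4.shape[0], weight_fp4.shape[0]),
                             dtype=output_dtype)

In the PR itself, the runner additionally exposes get_valid_tactics() and forward() so the AutoTuner can profile candidate tactics before dispatch.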

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Linear as Linear.forward()
    participant Runner as Fp4GemmAllreduceRunner
    participant AutoTuner as AutoTuner
    participant CppBackend as C++ Backend<br/>(fusedGemmAllreduceOp)
    participant AllReduce as All-Reduce

    User->>Linear: forward(input, bias)
    
    alt use_fused_gemm_allreduce enabled & ROW mode
        Linear->>Linear: Check all_reduce_params
        Linear->>Runner: apply_linear_allreduce()
        
        Runner->>AutoTuner: get_valid_tactics(inputs, profile)
        AutoTuner-->>Runner: best_tactic_id
        
        Runner->>CppBackend: execute(act_fp4, weight, scales, tactic)
        CppBackend->>CppBackend: FP4 GEMM + All-Reduce
        CppBackend-->>Runner: fused_output
        
        Runner-->>Linear: output
    else Standard path
        Linear->>Linear: apply_linear(input, bias)
        Linear->>AllReduce: all_reduce(output)
        AllReduce-->>Linear: reduced_output
    end
    
    Linear-->>User: output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Linear module changes: Careful review needed across multiple method implementations (UnquantizedLinearMethod, FP8QDQLinearMethod, NVFP4LinearMethod, etc.) to verify abstract method stubs and forward-path conditional logic are correctly implemented and consistent.
  • Custom operation integration: Verify Fp4GemmAllreduceRunner tuning logic, auto-tuner queries, and fake-registration align with PyTorch extension requirements.
  • Forward path control flow: Confirm the conditional branching logic in Linear.forward() correctly routes between fused and standard paths based on quantization type and configuration flags (a simplified sketch follows this list).
  • Test coverage: Validate multi-GPU test setup, FP4 quantization initialization, and reference computation correctness across ranks.
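
To make that third point concrete, the gating condition under review can be summarized as follows. This is a simplified, self-contained sketch: the enum values and helper name are stand-ins, not the module's actual API.

from enum import Enum, auto
from typing import Optional


class TensorParallelMode(Enum):      # stand-in for the real enum
    ROW = auto()
    COLUMN = auto()


class AllReduceFusionOp(Enum):       # stand-in for the real enum
    NONE = auto()
    RESIDUAL_RMS_NORM = auto()


def should_use_fused_gemm_allreduce(flag_enabled: bool,
                                    tp_mode: TensorParallelMode,
                                    enable_allreduce: Optional[bool],
                                    fusion_op: Optional[AllReduceFusionOp]) -> bool:
    """Fused path only for ROW parallelism and a plain (non-fused) all-reduce."""
    if not flag_enabled or tp_mode != TensorParallelMode.ROW:
        return False
    if enable_allreduce is None:     # no all_reduce_params supplied
        return True
    return bool(enable_allreduce) and fusion_op == AllReduceFusionOp.NONE


# Plain all-reduce is eligible; a fused residual/norm all-reduce is not.
assert should_use_fused_gemm_allreduce(True, TensorParallelMode.ROW, True,
                                       AllReduceFusionOp.NONE)
assert not should_use_fused_gemm_allreduce(True, TensorParallelMode.ROW, True,
                                           AllReduceFusionOp.RESIDUAL_RMS_NORM)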

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Description check — ⚠️ Warning
  Explanation: The pull request description is empty except for the template comments. No actual implementation details, rationale, or explanation of the changes are provided.
  Resolution: Add a Description section explaining what the PR does, why it's needed, and the key implementation details. Also add a Test Coverage section listing relevant tests and the PR Checklist confirmation.

Docstring Coverage — ⚠️ Warning
  Explanation: Docstring coverage is 2.33% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

Title check — ✅ Passed
  Explanation: The title clearly summarizes the main change: adding FP4 GEMM with all-reduce functionality, which aligns with the changes across CMakeLists.txt, torch_custom_ops.py, linear.py, and test files.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (5)
tests/unittest/_torch/multi_gpu/test_linear.py (3)

346-357: Consider reusing the existing check_accuracy utility.

This function duplicates check_accuracy from tests/unittest/utils/util.py. Consider importing and using the existing utility to reduce code duplication and maintain consistency.

+from utils.util import check_accuracy
-
-def check_accuracy(a, b, atol, rtol, percent):
-    assert a.shape == b.shape
-    assert a.dtype == b.dtype
-    a = a.to(torch.float32)
-    b = b.to(torch.float32)
-    left = torch.abs(a - b)
-    right = atol + rtol * torch.abs(b)
-    count = torch.sum(left > right)
-    mismatch_percent = count / a.numel()
-    if not (mismatch_percent < 1 - percent):
-        raise Exception("Mismatch percentage is %f for rtol %f" %
-                        (mismatch_percent, rtol))

360-405: Remove unused parameter and commented-out code.

  1. The seq_len parameter (line 361) is unused in this function.
  2. Line 401 contains commented-out torch.compile code that should either be removed or explained with a comment.
 @torch.inference_mode
-def fp4_row_linear_allreduce(tp_size, local_rank, seq_len, output_size,
+def fp4_row_linear_allreduce(tp_size, local_rank, output_size,
                              hidden_size, dtype, output_ref, x_sf_global,
                              w_sf_global, x_fp4s, w_fp4, x_sf_blocks,
                              w_sf_block_unswizzled):

Note: Removing seq_len would also require updating the caller fp4_row_linear_allreduce_run_single_rank and test_fp4_row_linear_allreduce.


482-489: Consider adding strict=True to zip() for safety.

Adding strict=True ensures the iterables have matching lengths, which can catch bugs early if argument counts mismatch.

     results = mpi_pool_executor.map(
         fp4_row_linear_allreduce_run_single_rank,
-        *zip(*[(fp4_row_linear_allreduce, tp_size, seq_len, output_size,
+        *zip(*[(fp4_row_linear_allreduce, tp_size, output_size,
                 hidden_size, dtype, output_ref, x_sf_global, w_sf_global,
-                x_fp4s, w_fp4, x_sf_blocks, w_sf_block_unswizzled)] * tp_size))
+                x_fp4s, w_fp4, x_sf_blocks, w_sf_block_unswizzled)] * tp_size, strict=True))

Note: strict=True requires Python 3.10+. Verify compatibility with the project's minimum Python version.
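
As a quick, standalone illustration of what strict=True buys you (plain Python 3.10+, not the test's actual data):

args_per_rank = [("fn", 0), ("fn", 1)]
ranks = [0, 1, 2]

# Without strict=True the mismatch is silently truncated to two pairs.
print(list(zip(args_per_rank, ranks)))

# With strict=True the same mismatch fails loudly at the call site.
try:
    list(zip(args_per_rank, ranks, strict=True))
except ValueError as err:
    print(f"caught: {err}")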

tensorrt_llm/_torch/modules/linear.py (2)

979-999: LGTM with optional deduplication opportunity.

The implementation correctly delegates to nvfp4_gemm_allreduce and handles the FP4 input preparation. The input handling code (lines 982-988) duplicates lines 937-943 in apply. Consider extracting a helper method to reduce duplication.
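
One possible shape for that extraction, sketched with placeholder names (the real preparation steps live in the lines referenced above and are not reproduced here):

import torch


class Nvfp4LinearMethodSketch:
    """Illustrative only: class, method, and helper names are assumptions."""

    def _quantize_to_fp4(self, module, input):
        # Stand-in for the real FP4 activation quantization.
        return input.to(torch.uint8), torch.ones(1)

    def _prepare_fp4_input(self, module, input):
        # Shared helper so apply() and apply_linear_allreduce() no longer
        # duplicate the activation-preparation block.
        if isinstance(input, tuple):     # already (act_fp4, act_sf)
            return input
        return self._quantize_to_fp4(module, input)

    def apply(self, module, input, bias):
        act_fp4, act_sf = self._prepare_fp4_input(module, input)
        ...  # local FP4 GEMM using act_fp4 / act_sf

    def apply_linear_allreduce(self, module, input, bias, tp_rank, tp_group):
        act_fp4, act_sf = self._prepare_fp4_input(module, input)
        ...  # fused FP4 GEMM + all-reduce using act_fp4 / act_sf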


2273-2289: Use truthiness check instead of == True.

Replace the explicit == True comparison with a direct truthiness check for cleaner code.

             use_fused_gemm_allreduce = self.use_fused_gemm_allreduce and (
                 all_reduce_params is None or
-                (all_reduce_params.enable_allreduce == True
+                (all_reduce_params.enable_allreduce
                  and all_reduce_params.fusion_op == AllReduceFusionOp.NONE))

The conditional flow for fused vs. standard allreduce path looks correct - it correctly bypasses the fused path when fusion operations are requested.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a736226 and 80485c7.

📒 Files selected for processing (4)
  • cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
  • tensorrt_llm/_torch/modules/linear.py (12 hunks)
  • tests/unittest/_torch/multi_gpu/test_linear.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic

Files:

  • tests/unittest/_torch/multi_gpu/test_linear.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/modules/linear.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • tests/unittest/_torch/multi_gpu/test_linear.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/modules/linear.py
🧠 Learnings (11)
📓 Common learnings
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

  • cpp/tensorrt_llm/thop/CMakeLists.txt
  • tensorrt_llm/_torch/modules/linear.py
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/thop/CMakeLists.txt
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/thop/CMakeLists.txt
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/thop/CMakeLists.txt
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • cpp/tensorrt_llm/thop/CMakeLists.txt
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_linear.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_linear.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_linear.py
  • tensorrt_llm/_torch/modules/linear.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_linear.py
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧬 Code graph analysis (1)
tests/unittest/_torch/multi_gpu/test_linear.py (3)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-47)
tensorrt_llm/models/modeling_utils.py (2)
  • QuantConfig (131-271)
  • quant_algo (550-551)
tests/unittest/utils/util.py (1)
  • check_accuracy (458-469)
🪛 Ruff (0.14.7)
tests/unittest/_torch/multi_gpu/test_linear.py

356-357: Create your own exception

(TRY002)


361-361: Unused function argument: seq_len

(ARG001)


484-486: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

1253-1253: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


1280-1280: Unused method argument: inputs

(ARG002)


1281-1281: Unused method argument: profile

(ARG002)


1281-1281: Unused method argument: kwargs

(ARG002)


1334-1334: Unused function argument: act_sf

(ARG001)


1335-1335: Unused function argument: weight_scale

(ARG001)


1336-1336: Unused function argument: alpha

(ARG001)


1338-1338: Unused function argument: tp_rank

(ARG001)


1339-1339: Unused function argument: tp_group

(ARG001)

tensorrt_llm/_torch/modules/linear.py

981-981: Unused method argument: args

(ARG002)


981-981: Unused method argument: kwargs

(ARG002)


1418-1418: Redefinition of unused apply_linear_allreduce from line 1413

(F811)


2275-2275: Avoid equality comparisons to True; use all_reduce_params.enable_allreduce: for truth checks

Replace with all_reduce_params.enable_allreduce

(E712)

🔇 Additional comments (4)
cpp/tensorrt_llm/thop/CMakeLists.txt (1)

107-108: LGTM!

The addition of fusedGemmAllreduceOp.cpp to the th_common library sources is correctly placed and follows the existing file listing pattern.

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2)

1300-1342: LGTM!

The nvfp4_gemm_allreduce custom operation and its fake registration follow the established patterns in this file. The unused parameters in the fake registration are required for signature matching and are consistent with other implementations.


1269-1278: Verify whether tp_rank should be included in the runner instance cache key.

The instance_key (line 1269) excludes tp_rank, causing all ranks with the same output_dtype and tp_group_str to share a single cached runner instance. However, tp_rank is passed to the C++ runner constructor. If the C++ Fp4GemmAllreduceRunner maintains rank-specific state or behavior, the cache key should include tp_rank to ensure separate instances per rank. Verify that the underlying C++ implementation correctly handles shared runner instances across different ranks, or if rank-specific instances are required for correctness.
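
If per-rank state does turn out to matter, the fix would look roughly like this (a sketch of the caching pattern only; the actual cache in the PR may be structured differently):

# Sketch: include tp_rank in the key so each rank caches its own runner.
_runner_cache = {}


def get_runner(output_dtype, tp_group_str, tp_rank, make_runner):
    instance_key = (output_dtype, tp_group_str, tp_rank)   # tp_rank added
    if instance_key not in _runner_cache:
        _runner_cache[instance_key] = make_runner(output_dtype, tp_group_str,
                                                  tp_rank)
    return _runner_cache[instance_key]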

tensorrt_llm/_torch/modules/linear.py (1)

2233-2245: Verify LoRA behavior with fused GEMM allreduce.

The LoRA result is added after the fused GEMM+allreduce operation. In the standard path, LoRA is applied before the separate allreduce. This difference in ordering may affect correctness for tensor-parallel LoRA scenarios where the LoRA contribution also needs to be reduced.

Confirm that adding LoRA after the fused allreduce produces correct results, especially when LoRA weights are sharded across tensor-parallel ranks.
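
A tiny numeric illustration of the concern, assuming the LoRA contribution is itself sharded across ranks (pure Python, not the module's code):

# Two ranks, each holding a partial GEMM output and a partial LoRA output.
gemm_partials = [1.0, 2.0]
lora_partials = [0.1, 0.3]

# Standard path: LoRA is added before the all-reduce, so it is reduced too.
standard = sum(g + l for g, l in zip(gemm_partials, lora_partials))   # 3.4

# If LoRA were added only locally after a fused GEMM+all-reduce, each rank
# would see a different, under-counted result.
fused_rank0 = sum(gemm_partials) + lora_partials[0]                   # 3.1

assert standard != fused_rank0   # equivalent only if LoRA output is not sharded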

Comment on lines 1413 to 1459
    def apply_linear_allreduce(self, module: Linear, input: torch.Tensor,
                               bias: Optional[torch.Tensor], tp_rank: int,
                               tp_group: List[int], *args, **kwargs):
        raise NotImplementedError

    def apply_linear_allreduce(self, module: Linear, input: torch.Tensor,
                               bias: Optional[torch.Tensor], tp_rank: int,
                               tp_group: List[int], *args, **kwargs):
        raise NotImplementedError
Contributor

⚠️ Potential issue | 🟡 Minor

Remove duplicate method definition.

The apply_linear_allreduce method is defined twice consecutively in W4A8MXFP4FP8LinearMethod. The second definition shadows the first and creates dead code.

     def apply_linear_allreduce(self, module: Linear, input: torch.Tensor,
                                bias: Optional[torch.Tensor], tp_rank: int,
                                tp_group: List[int], *args, **kwargs):
         raise NotImplementedError
-
-    def apply_linear_allreduce(self, module: Linear, input: torch.Tensor,
-                               bias: Optional[torch.Tensor], tp_rank: int,
-                               tp_group: List[int], *args, **kwargs):
-        raise NotImplementedError
🧰 Tools
🪛 Ruff (0.14.7)

1418-1418: Redefinition of unused apply_linear_allreduce from line 1413

(F811)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/linear.py around lines 1413 to 1421, there are
two identical consecutive definitions of apply_linear_allreduce in
W4A8MXFP4FP8LinearMethod; remove the duplicate so the method is defined only
once (delete the second definition block), ensuring any references or decorators
remain intact and preserving the intended implementation and signature of the
remaining method.

Comment on lines 2132 to 2160
        self.use_fused_gemm_allreduce = use_fused_gemm_allreduce and self.quant_config.layer_quant_mode.has_nvfp4(
        )
Contributor

⚠️ Potential issue | 🟠 Major

Potential AttributeError if quant_config is None.

When use_fused_gemm_allreduce=True is passed but quant_config is None, accessing self.quant_config.layer_quant_mode.has_nvfp4() will raise AttributeError.

-        self.use_fused_gemm_allreduce = use_fused_gemm_allreduce and self.quant_config.layer_quant_mode.has_nvfp4(
-        )
+        self.use_fused_gemm_allreduce = (
+            use_fused_gemm_allreduce
+            and self.quant_config is not None
+            and self.quant_config.layer_quant_mode.has_nvfp4()
+        )
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/linear.py around lines 2132 to 2133, the code
unconditionally calls self.quant_config.layer_quant_mode.has_nvfp4() which will
raise AttributeError when quant_config is None; update the assignment to guard
the call by first checking that self.quant_config is not None (and that
layer_quant_mode exists) and only call has_nvfp4() when present, otherwise treat
the result as False (i.e., set use_fused_gemm_allreduce =
use_fused_gemm_allreduce and bool(self.quant_config and
getattr(self.quant_config, "layer_quant_mode", None) and
self.quant_config.layer_quant_mode.has_nvfp4())).

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from 80485c7 to cb163d0 Compare December 5, 2025 08:18
@yihwang-nv
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #27098 [ run ] triggered by Bot. Commit: cb163d0

@tensorrt-cicd
Collaborator

PR_Github #27098 [ run ] completed with state FAILURE. Commit: cb163d0
/LLM/main/L0_MergeRequest_PR pipeline #20672 completed with status: 'FAILURE'

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from cb163d0 to 74d7798 Compare December 5, 2025 08:58
@yihwang-nv
Collaborator

/bot run

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Dec 5, 2025
@yihwang-nv
Collaborator

/bot run

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from 74d7798 to e11bbc1 Compare December 8, 2025 06:59
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #27263 [ run ] triggered by Bot. Commit: e11bbc1

@tensorrt-cicd
Collaborator

PR_Github #27263 [ run ] completed with state FAILURE. Commit: e11bbc1
/LLM/main/L0_MergeRequest_PR pipeline #20818 completed with status: 'FAILURE'

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from e11bbc1 to d813fba Compare December 8, 2025 07:58
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #27268 [ run ] triggered by Bot. Commit: d813fba

@tensorrt-cicd
Collaborator

PR_Github #27268 [ run ] completed with state SUCCESS. Commit: d813fba
/LLM/main/L0_MergeRequest_PR pipeline #20822 completed with status: 'FAILURE'

        output = F.linear(input, module.weight, bias)
        return output

    def apply_linear_allreduce(self, module: Linear, input: torch.Tensor,
Collaborator

We don't need to repeat this method in all LinearMethodBase subclasses. We have already defined it in LinearMethodBase.

Collaborator Author

In LinearMethodBase, it is an abstractmethod.

Collaborator

We don't need to mark it as an abstractmethod if we have a proper default implementation, which in this case is to raise NotImplementedError.

Collaborator Author

good point, let me try.
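
For reference, a minimal sketch of the suggested approach, a plain default implementation on the base class instead of an abstractmethod (class names here are placeholders; the signature follows the snippets quoted above):

from typing import List, Optional

import torch


class LinearMethodBaseSketch:
    # Plain default: subclasses that do not support the fused path simply
    # inherit this instead of redeclaring an identical stub.
    def apply_linear_allreduce(self, module, input: torch.Tensor,
                               bias: Optional[torch.Tensor], tp_rank: int,
                               tp_group: List[int], *args, **kwargs):
        raise NotImplementedError


class FusedNvfp4MethodSketch(LinearMethodBaseSketch):
    # Only methods that actually support fused GEMM + all-reduce override it.
    def apply_linear_allreduce(self, module, input, bias, tp_rank, tp_group,
                               *args, **kwargs):
        ...  # dispatch to the fused FP4 GEMM + all-reduce custom op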

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from d813fba to 2ae463d Compare December 11, 2025 10:07
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #27841 [ run ] triggered by Bot. Commit: 2ae463d

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch 2 times, most recently from b19e134 to 3c42ff0 Compare December 11, 2025 10:30
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #29867 [ run ] triggered by Bot. Commit: cfc6143

@tensorrt-cicd
Collaborator

PR_Github #29868 [ run ] triggered by Bot. Commit: cfc6143

@tensorrt-cicd
Collaborator

PR_Github #29868 [ run ] completed with state SUCCESS. Commit: cfc6143
/LLM/main/L0_MergeRequest_PR pipeline #22969 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #29937 [ run ] triggered by Bot. Commit: cfc6143

@tensorrt-cicd
Collaborator

PR_Github #29937 [ run ] completed with state SUCCESS. Commit: cfc6143
/LLM/main/L0_MergeRequest_PR pipeline #23025 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from cfc6143 to 8186180 Compare December 30, 2025 06:45
@benzh-2025
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30141 [ run ] triggered by Bot. Commit: 8186180

@tensorrt-cicd
Collaborator

PR_Github #30141 [ run ] completed with state SUCCESS. Commit: 8186180
/LLM/main/L0_MergeRequest_PR pipeline #23194 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from 8186180 to e68b10e Compare December 31, 2025 05:49
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30241 [ run ] triggered by Bot. Commit: e68b10e

@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30241 [ run ] completed with state SUCCESS. Commit: e68b10e
/LLM/main/L0_MergeRequest_PR pipeline #23283 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@tensorrt-cicd
Collaborator

PR_Github #30254 [ run ] triggered by Bot. Commit: e68b10e

@tensorrt-cicd
Collaborator

PR_Github #30254 [ run ] completed with state SUCCESS. Commit: e68b10e
/LLM/main/L0_MergeRequest_PR pipeline #23292 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from e68b10e to 32c2545 Compare January 2, 2026 01:30
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30357 [ run ] triggered by Bot. Commit: 32c2545

@tensorrt-cicd
Collaborator

PR_Github #30357 [ run ] completed with state SUCCESS. Commit: 32c2545
/LLM/main/L0_MergeRequest_PR pipeline #23388 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30369 [ run ] triggered by Bot. Commit: 32c2545

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from 32c2545 to 289a4e5 Compare January 2, 2026 07:20
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30378 [ run ] triggered by Bot. Commit: 289a4e5

@tensorrt-cicd
Collaborator

PR_Github #30378 [ run ] completed with state SUCCESS. Commit: 289a4e5
/LLM/main/L0_MergeRequest_PR pipeline #23407 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again
