Conversation

@benzh-2025
Collaborator

@benzh-2025 benzh-2025 commented Dec 5, 2025

Summary by CodeRabbit

  • New Features

    • Introduced FP4 all-reduce GEMM operation with dynamic auto-tuning for optimized distributed tensor computations.
    • Added fused all-reduce support in linear layers for NVFP4-quantized models, improving performance for distributed inference and training scenarios.
  • Tests

    • Added comprehensive multi-GPU test coverage for FP4 row-wise linear all-reduce operations.


Description

@benzh-2025 benzh-2025 requested review from a team as code owners December 5, 2025 08:12
@benzh-2025 benzh-2025 requested review from liji-nv and yuxianq December 5, 2025 08:12
@coderabbitai
Contributor

coderabbitai bot commented Dec 5, 2025

📝 Walkthrough

This pull request introduces a new fused FP4 GEMM all-reduce optimization for tensor-parallel linear operations. It adds C++ backend support and a tunable PyTorch custom operation with auto-tuner integration, wires a fused execution path into the Linear module behind a conditional flag, and adds multi-GPU test coverage.

Changes

Cohort / File(s) — Summary

Build Configuration
  cpp/tensorrt_llm/thop/CMakeLists.txt
    Added fusedGemmAllreduceOp.cpp to the th_common library source list alongside dsv3RopeOp.cpp.

PyTorch Custom Operations
  tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
    Introduced the new Fp4GemmAllreduceRunner class (extends TunableRunner) with get_valid_tactics() and forward() methods for FP4 GEMM all-reduce tuning and execution. Added the nvfp4_gemm_allreduce() custom torch operation with auto-tuner integration and a corresponding fake registration for TorchScript support (see the illustrative sketch after this table).

Linear Module Integration
  tensorrt_llm/_torch/modules/linear.py
    Added the abstract method apply_linear_allreduce() to LinearMethodBase and implemented it (or stubbed it with NotImplementedError) across multiple concrete linear method classes. Extended Linear.__init__ with a new parameter use_fused_gemm_allreduce: bool = False. Modified Linear.forward() to conditionally invoke the fused all-reduce path when the flag is enabled and the tensor-parallel mode is ROW.

Multi-GPU Tests
  tests/unittest/_torch/multi_gpu/test_linear.py
    Added the utility functions check_accuracy(), fp4_row_linear_allreduce(), and fp4_row_linear_allreduce_run_single_rank(). Implemented test_fp4_row_linear_allreduce() to validate FP4 GEMM all-reduce correctness across multiple GPU ranks using mpi_pool_executor.
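
For readers unfamiliar with the tunable custom-op pattern, here is a minimal sketch of how such an op and its fake (meta) registration are typically wired up with torch.library. The op name, argument list, and output shape below are illustrative assumptions, not the exact signatures added in this PR.

import torch


# Illustrative only: op name, arguments, and shapes are assumptions.
@torch.library.custom_op("example::nvfp4_gemm_allreduce", mutates_args=())
def nvfp4_gemm_allreduce_sketch(act_fp4: torch.Tensor, weight_fp4: torch.Tensor,
                                output_dtype: torch.dtype) -> torch.Tensor:
    # The real op dispatches to the C++ fused GEMM + all-reduce kernel and
    # returns the already-reduced output; a zero tensor stands in here.
    m, n = act_fp4.shape[0], weight_fp4.shape[0]
    return torch.zeros((m, n), dtype=output_dtype, device=act_fp4.device)


@nvfp4_gemm_allreduce_sketch.register_fake
def _(act_fp4, weight_fp4, output_dtype):
    # Fake registration: declares only output shape/dtype so tracing and
    # torch.compile can reason about the op without running the kernel.
    return act_fp4.new_empty((act_fp4.shape[0], weight_fp4.shape[0]),
                             dtype=output_dtype)

In the PR itself, the runner additionally exposes get_valid_tactics() and forward() so the AutoTuner can profile candidate tactics before dispatch.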

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Linear as Linear.forward()
    participant Runner as Fp4GemmAllreduceRunner
    participant AutoTuner as AutoTuner
    participant CppBackend as C++ Backend<br/>(fusedGemmAllreduceOp)
    participant AllReduce as All-Reduce

    User->>Linear: forward(input, bias)
    
    alt use_fused_gemm_allreduce enabled & ROW mode
        Linear->>Linear: Check all_reduce_params
        Linear->>Runner: apply_linear_allreduce()
        
        Runner->>AutoTuner: get_valid_tactics(inputs, profile)
        AutoTuner-->>Runner: best_tactic_id
        
        Runner->>CppBackend: execute(act_fp4, weight, scales, tactic)
        CppBackend->>CppBackend: FP4 GEMM + All-Reduce
        CppBackend-->>Runner: fused_output
        
        Runner-->>Linear: output
    else Standard path
        Linear->>Linear: apply_linear(input, bias)
        Linear->>AllReduce: all_reduce(output)
        AllReduce-->>Linear: reduced_output
    end
    
    Linear-->>User: output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Linear module changes: Careful review needed across multiple method implementations (UnquantizedLinearMethod, FP8QDQLinearMethod, NVFP4LinearMethod, etc.) to verify abstract method stubs and forward-path conditional logic are correctly implemented and consistent.
  • Custom operation integration: Verify Fp4GemmAllreduceRunner tuning logic, auto-tuner queries, and fake-registration align with PyTorch extension requirements.
  • Forward path control flow: Confirm the conditional branching logic in Linear.forward() correctly routes between fused and standard paths based on quantization type and configuration flags (a simplified sketch follows this list).
  • Test coverage: Validate multi-GPU test setup, FP4 quantization initialization, and reference computation correctness across ranks.
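
To make that third point concrete, the gating condition under review can be summarized as follows. This is a simplified, self-contained sketch: the enum values and helper name are stand-ins, not the module's actual API.

from enum import Enum, auto
from typing import Optional


class TensorParallelMode(Enum):      # stand-in for the real enum
    ROW = auto()
    COLUMN = auto()


class AllReduceFusionOp(Enum):       # stand-in for the real enum
    NONE = auto()
    RESIDUAL_RMS_NORM = auto()


def should_use_fused_gemm_allreduce(flag_enabled: bool,
                                    tp_mode: TensorParallelMode,
                                    enable_allreduce: Optional[bool],
                                    fusion_op: Optional[AllReduceFusionOp]) -> bool:
    """Fused path only for ROW parallelism and a plain (non-fused) all-reduce."""
    if not flag_enabled or tp_mode != TensorParallelMode.ROW:
        return False
    if enable_allreduce is None:     # no all_reduce_params supplied
        return True
    return bool(enable_allreduce) and fusion_op == AllReduceFusionOp.NONE


# Plain all-reduce is eligible; a fused residual/norm all-reduce is not.
assert should_use_fused_gemm_allreduce(True, TensorParallelMode.ROW, True,
                                       AllReduceFusionOp.NONE)
assert not should_use_fused_gemm_allreduce(True, TensorParallelMode.ROW, True,
                                           AllReduceFusionOp.RESIDUAL_RMS_NORM)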

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Description check — ⚠️ Warning
  Explanation: The pull request description is empty except for the template comments. No actual implementation details, rationale, or explanation of the changes are provided.
  Resolution: Add a Description section explaining what the PR does, why it's needed, and the key implementation details. Also add a Test Coverage section listing relevant tests and the PR Checklist confirmation.

Docstring Coverage — ⚠️ Warning
  Explanation: Docstring coverage is 2.33% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

Title check — ✅ Passed
  Explanation: The title clearly summarizes the main change: adding FP4 GEMM with all-reduce functionality, which aligns with the changes across CMakeLists.txt, torch_custom_ops.py, linear.py, and test files.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (5)
tests/unittest/_torch/multi_gpu/test_linear.py (3)

346-357: Consider reusing the existing check_accuracy utility.

This function duplicates check_accuracy from tests/unittest/utils/util.py. Consider importing and using the existing utility to reduce code duplication and maintain consistency.

+from utils.util import check_accuracy
-
-def check_accuracy(a, b, atol, rtol, percent):
-    assert a.shape == b.shape
-    assert a.dtype == b.dtype
-    a = a.to(torch.float32)
-    b = b.to(torch.float32)
-    left = torch.abs(a - b)
-    right = atol + rtol * torch.abs(b)
-    count = torch.sum(left > right)
-    mismatch_percent = count / a.numel()
-    if not (mismatch_percent < 1 - percent):
-        raise Exception("Mismatch percentage is %f for rtol %f" %
-                        (mismatch_percent, rtol))

360-405: Remove unused parameter and commented-out code.

  1. The seq_len parameter (line 361) is unused in this function.
  2. Line 401 contains commented-out torch.compile code that should either be removed or explained with a comment.
 @torch.inference_mode
-def fp4_row_linear_allreduce(tp_size, local_rank, seq_len, output_size,
+def fp4_row_linear_allreduce(tp_size, local_rank, output_size,
                              hidden_size, dtype, output_ref, x_sf_global,
                              w_sf_global, x_fp4s, w_fp4, x_sf_blocks,
                              w_sf_block_unswizzled):

Note: Removing seq_len would also require updating the caller fp4_row_linear_allreduce_run_single_rank and test_fp4_row_linear_allreduce.


482-489: Consider adding strict=True to zip() for safety.

Adding strict=True ensures the iterables have matching lengths, which can catch bugs early if argument counts mismatch.

     results = mpi_pool_executor.map(
         fp4_row_linear_allreduce_run_single_rank,
-        *zip(*[(fp4_row_linear_allreduce, tp_size, seq_len, output_size,
+        *zip(*[(fp4_row_linear_allreduce, tp_size, output_size,
                 hidden_size, dtype, output_ref, x_sf_global, w_sf_global,
-                x_fp4s, w_fp4, x_sf_blocks, w_sf_block_unswizzled)] * tp_size))
+                x_fp4s, w_fp4, x_sf_blocks, w_sf_block_unswizzled)] * tp_size, strict=True))

Note: strict=True requires Python 3.10+. Verify compatibility with the project's minimum Python version.
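
As a quick, standalone illustration of what strict=True buys you (plain Python 3.10+, not the test's actual data):

args_per_rank = [("fn", 0), ("fn", 1)]
ranks = [0, 1, 2]

# Without strict=True the mismatch is silently truncated to two pairs.
print(list(zip(args_per_rank, ranks)))

# With strict=True the same mismatch fails loudly at the call site.
try:
    list(zip(args_per_rank, ranks, strict=True))
except ValueError as err:
    print(f"caught: {err}")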

tensorrt_llm/_torch/modules/linear.py (2)

979-999: LGTM with optional deduplication opportunity.

The implementation correctly delegates to nvfp4_gemm_allreduce and handles the FP4 input preparation. The input handling code (lines 982-988) duplicates lines 937-943 in apply. Consider extracting a helper method to reduce duplication.
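
One possible shape for that extraction, sketched with placeholder names (the real preparation steps live in the lines referenced above and are not reproduced here):

import torch


class Nvfp4LinearMethodSketch:
    """Illustrative only: class, method, and helper names are assumptions."""

    def _quantize_to_fp4(self, module, input):
        # Stand-in for the real FP4 activation quantization.
        return input.to(torch.uint8), torch.ones(1)

    def _prepare_fp4_input(self, module, input):
        # Shared helper so apply() and apply_linear_allreduce() no longer
        # duplicate the activation-preparation block.
        if isinstance(input, tuple):     # already (act_fp4, act_sf)
            return input
        return self._quantize_to_fp4(module, input)

    def apply(self, module, input, bias):
        act_fp4, act_sf = self._prepare_fp4_input(module, input)
        ...  # local FP4 GEMM using act_fp4 / act_sf

    def apply_linear_allreduce(self, module, input, bias, tp_rank, tp_group):
        act_fp4, act_sf = self._prepare_fp4_input(module, input)
        ...  # fused FP4 GEMM + all-reduce using act_fp4 / act_sf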


2273-2289: Use truthiness check instead of == True.

Replace the explicit == True comparison with a direct truthiness check for cleaner code.

             use_fused_gemm_allreduce = self.use_fused_gemm_allreduce and (
                 all_reduce_params is None or
-                (all_reduce_params.enable_allreduce == True
+                (all_reduce_params.enable_allreduce
                  and all_reduce_params.fusion_op == AllReduceFusionOp.NONE))

The conditional flow for fused vs. standard allreduce path looks correct - it correctly bypasses the fused path when fusion operations are requested.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a736226 and 80485c7.

📒 Files selected for processing (4)
  • cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
  • tensorrt_llm/_torch/modules/linear.py (12 hunks)
  • tests/unittest/_torch/multi_gpu/test_linear.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic

Files:

  • tests/unittest/_torch/multi_gpu/test_linear.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/modules/linear.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • tests/unittest/_torch/multi_gpu/test_linear.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/modules/linear.py
🧠 Learnings (11)
📓 Common learnings
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

  • cpp/tensorrt_llm/thop/CMakeLists.txt
  • tensorrt_llm/_torch/modules/linear.py
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/thop/CMakeLists.txt
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/thop/CMakeLists.txt
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/thop/CMakeLists.txt
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • cpp/tensorrt_llm/thop/CMakeLists.txt
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_linear.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_linear.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_linear.py
  • tensorrt_llm/_torch/modules/linear.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_linear.py
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧬 Code graph analysis (1)
tests/unittest/_torch/multi_gpu/test_linear.py (3)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-47)
tensorrt_llm/models/modeling_utils.py (2)
  • QuantConfig (131-271)
  • quant_algo (550-551)
tests/unittest/utils/util.py (1)
  • check_accuracy (458-469)
🪛 Ruff (0.14.7)
tests/unittest/_torch/multi_gpu/test_linear.py

356-357: Create your own exception

(TRY002)


361-361: Unused function argument: seq_len

(ARG001)


484-486: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

1253-1253: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


1280-1280: Unused method argument: inputs

(ARG002)


1281-1281: Unused method argument: profile

(ARG002)


1281-1281: Unused method argument: kwargs

(ARG002)


1334-1334: Unused function argument: act_sf

(ARG001)


1335-1335: Unused function argument: weight_scale

(ARG001)


1336-1336: Unused function argument: alpha

(ARG001)


1338-1338: Unused function argument: tp_rank

(ARG001)


1339-1339: Unused function argument: tp_group

(ARG001)

tensorrt_llm/_torch/modules/linear.py

981-981: Unused method argument: args

(ARG002)


981-981: Unused method argument: kwargs

(ARG002)


1418-1418: Redefinition of unused apply_linear_allreduce from line 1413

(F811)


2275-2275: Avoid equality comparisons to True; use all_reduce_params.enable_allreduce: for truth checks

Replace with all_reduce_params.enable_allreduce

(E712)

🔇 Additional comments (4)
cpp/tensorrt_llm/thop/CMakeLists.txt (1)

107-108: LGTM!

The addition of fusedGemmAllreduceOp.cpp to the th_common library sources is correctly placed and follows the existing file listing pattern.

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2)

1300-1342: LGTM!

The nvfp4_gemm_allreduce custom operation and its fake registration follow the established patterns in this file. The unused parameters in the fake registration are required for signature matching and are consistent with other implementations.


1269-1278: Verify whether tp_rank should be included in the runner instance cache key.

The instance_key (line 1269) excludes tp_rank, causing all ranks with the same output_dtype and tp_group_str to share a single cached runner instance. However, tp_rank is passed to the C++ runner constructor. If the C++ Fp4GemmAllreduceRunner maintains rank-specific state or behavior, the cache key should include tp_rank to ensure separate instances per rank. Verify that the underlying C++ implementation correctly handles shared runner instances across different ranks, or if rank-specific instances are required for correctness.
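
If per-rank state does turn out to matter, the fix would look roughly like this (a sketch of the caching pattern only; the actual cache in the PR may be structured differently):

# Sketch: include tp_rank in the key so each rank caches its own runner.
_runner_cache = {}


def get_runner(output_dtype, tp_group_str, tp_rank, make_runner):
    instance_key = (output_dtype, tp_group_str, tp_rank)   # tp_rank added
    if instance_key not in _runner_cache:
        _runner_cache[instance_key] = make_runner(output_dtype, tp_group_str,
                                                  tp_rank)
    return _runner_cache[instance_key]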

tensorrt_llm/_torch/modules/linear.py (1)

2233-2245: Verify LoRA behavior with fused GEMM allreduce.

The LoRA result is added after the fused GEMM+allreduce operation. In the standard path, LoRA is applied before the separate allreduce. This difference in ordering may affect correctness for tensor-parallel LoRA scenarios where the LoRA contribution also needs to be reduced.

Confirm that adding LoRA after the fused allreduce produces correct results, especially when LoRA weights are sharded across tensor-parallel ranks.
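
A tiny numeric illustration of the concern, assuming the LoRA contribution is itself sharded across ranks (pure Python, not the module's code):

# Two ranks, each holding a partial GEMM output and a partial LoRA output.
gemm_partials = [1.0, 2.0]
lora_partials = [0.1, 0.3]

# Standard path: LoRA is added before the all-reduce, so it is reduced too.
standard = sum(g + l for g, l in zip(gemm_partials, lora_partials))   # 3.4

# If LoRA were added only locally after a fused GEMM+all-reduce, each rank
# would see a different, under-counted result.
fused_rank0 = sum(gemm_partials) + lora_partials[0]                   # 3.1

assert standard != fused_rank0   # equivalent only if LoRA output is not sharded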

Comment on lines 1413 to 1459
    def apply_linear_allreduce(self, module: Linear, input: torch.Tensor,
                               bias: Optional[torch.Tensor], tp_rank: int,
                               tp_group: List[int], *args, **kwargs):
        raise NotImplementedError

    def apply_linear_allreduce(self, module: Linear, input: torch.Tensor,
                               bias: Optional[torch.Tensor], tp_rank: int,
                               tp_group: List[int], *args, **kwargs):
        raise NotImplementedError
Contributor

⚠️ Potential issue | 🟡 Minor

Remove duplicate method definition.

The apply_linear_allreduce method is defined twice consecutively in W4A8MXFP4FP8LinearMethod. The second definition shadows the first and creates dead code.

     def apply_linear_allreduce(self, module: Linear, input: torch.Tensor,
                                bias: Optional[torch.Tensor], tp_rank: int,
                                tp_group: List[int], *args, **kwargs):
         raise NotImplementedError
-
-    def apply_linear_allreduce(self, module: Linear, input: torch.Tensor,
-                               bias: Optional[torch.Tensor], tp_rank: int,
-                               tp_group: List[int], *args, **kwargs):
-        raise NotImplementedError
🧰 Tools
🪛 Ruff (0.14.7)

1418-1418: Redefinition of unused apply_linear_allreduce from line 1413

(F811)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/linear.py around lines 1413 to 1421, there are
two identical consecutive definitions of apply_linear_allreduce in
W4A8MXFP4FP8LinearMethod; remove the duplicate so the method is defined only
once (delete the second definition block), ensuring any references or decorators
remain intact and preserving the intended implementation and signature of the
remaining method.

Comment on lines 2132 to 2160
        self.use_fused_gemm_allreduce = use_fused_gemm_allreduce and self.quant_config.layer_quant_mode.has_nvfp4(
        )
Contributor

⚠️ Potential issue | 🟠 Major

Potential AttributeError if quant_config is None.

When use_fused_gemm_allreduce=True is passed but quant_config is None, accessing self.quant_config.layer_quant_mode.has_nvfp4() will raise AttributeError.

-        self.use_fused_gemm_allreduce = use_fused_gemm_allreduce and self.quant_config.layer_quant_mode.has_nvfp4(
-        )
+        self.use_fused_gemm_allreduce = (
+            use_fused_gemm_allreduce
+            and self.quant_config is not None
+            and self.quant_config.layer_quant_mode.has_nvfp4()
+        )
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/linear.py around lines 2132 to 2133, the code
unconditionally calls self.quant_config.layer_quant_mode.has_nvfp4() which will
raise AttributeError when quant_config is None; update the assignment to guard
the call by first checking that self.quant_config is not None (and that
layer_quant_mode exists) and only call has_nvfp4() when present, otherwise treat
the result as False (i.e., set use_fused_gemm_allreduce =
use_fused_gemm_allreduce and bool(self.quant_config and
getattr(self.quant_config, "layer_quant_mode", None) and
self.quant_config.layer_quant_mode.has_nvfp4())).

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from 80485c7 to cb163d0 Compare December 5, 2025 08:18
@yihwang-nv
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #27098 [ run ] triggered by Bot. Commit: cb163d0

@tensorrt-cicd
Collaborator

PR_Github #27098 [ run ] completed with state FAILURE. Commit: cb163d0
/LLM/main/L0_MergeRequest_PR pipeline #20672 completed with status: 'FAILURE'

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from cb163d0 to 74d7798 Compare December 5, 2025 08:58
@yihwang-nv
Collaborator

/bot run

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Dec 5, 2025
@yihwang-nv
Collaborator

/bot run

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from 74d7798 to e11bbc1 Compare December 8, 2025 06:59
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #27263 [ run ] triggered by Bot. Commit: e11bbc1

@tensorrt-cicd
Collaborator

PR_Github #27263 [ run ] completed with state FAILURE. Commit: e11bbc1
/LLM/main/L0_MergeRequest_PR pipeline #20818 completed with status: 'FAILURE'

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from e11bbc1 to d813fba Compare December 8, 2025 07:58
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #27268 [ run ] triggered by Bot. Commit: d813fba

@tensorrt-cicd
Collaborator

PR_Github #27268 [ run ] completed with state SUCCESS. Commit: d813fba
/LLM/main/L0_MergeRequest_PR pipeline #20822 completed with status: 'FAILURE'

        output = F.linear(input, module.weight, bias)
        return output

    def apply_linear_allreduce(self, module: Linear, input: torch.Tensor,
Collaborator

We don't need to repeat this method in all LinearMethodBase subclasses. We have already defined it in LinearMethodBase.

Collaborator Author

In LinearMethodBase, it is an abstractmethod.

Collaborator

We don't need to mark it as an abstractmethod if we have a proper default implementation, which in this case is to raise NotImplementedError.

Collaborator Author

good point, let me try.
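
For reference, a minimal sketch of the suggested approach, a plain default implementation on the base class instead of an abstractmethod (class names here are placeholders; the signature follows the snippets quoted above):

from typing import List, Optional

import torch


class LinearMethodBaseSketch:
    # Plain default: subclasses that do not support the fused path simply
    # inherit this instead of redeclaring an identical stub.
    def apply_linear_allreduce(self, module, input: torch.Tensor,
                               bias: Optional[torch.Tensor], tp_rank: int,
                               tp_group: List[int], *args, **kwargs):
        raise NotImplementedError


class FusedNvfp4MethodSketch(LinearMethodBaseSketch):
    # Only methods that actually support fused GEMM + all-reduce override it.
    def apply_linear_allreduce(self, module, input, bias, tp_rank, tp_group,
                               *args, **kwargs):
        ...  # dispatch to the fused FP4 GEMM + all-reduce custom op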

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from d813fba to 2ae463d Compare December 11, 2025 10:07
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #27841 [ run ] triggered by Bot. Commit: 2ae463d

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch 2 times, most recently from b19e134 to 3c42ff0 Compare December 11, 2025 10:30
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #29867 [ run ] triggered by Bot. Commit: cfc6143

@tensorrt-cicd
Collaborator

PR_Github #29868 [ run ] triggered by Bot. Commit: cfc6143

@tensorrt-cicd
Collaborator

PR_Github #29868 [ run ] completed with state SUCCESS. Commit: cfc6143
/LLM/main/L0_MergeRequest_PR pipeline #22969 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #29937 [ run ] triggered by Bot. Commit: cfc6143

@tensorrt-cicd
Collaborator

PR_Github #29937 [ run ] completed with state SUCCESS. Commit: cfc6143
/LLM/main/L0_MergeRequest_PR pipeline #23025 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from cfc6143 to 8186180 Compare December 30, 2025 06:45
@benzh-2025
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30141 [ run ] triggered by Bot. Commit: 8186180

@tensorrt-cicd
Collaborator

PR_Github #30141 [ run ] completed with state SUCCESS. Commit: 8186180
/LLM/main/L0_MergeRequest_PR pipeline #23194 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from 8186180 to e68b10e Compare December 31, 2025 05:49
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30241 [ run ] triggered by Bot. Commit: e68b10e

@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30241 [ run ] completed with state SUCCESS. Commit: e68b10e
/LLM/main/L0_MergeRequest_PR pipeline #23283 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@tensorrt-cicd
Collaborator

PR_Github #30254 [ run ] triggered by Bot. Commit: e68b10e

@tensorrt-cicd
Collaborator

PR_Github #30254 [ run ] completed with state SUCCESS. Commit: e68b10e
/LLM/main/L0_MergeRequest_PR pipeline #23292 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from e68b10e to 32c2545 Compare January 2, 2026 01:30
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30357 [ run ] triggered by Bot. Commit: 32c2545

@tensorrt-cicd
Collaborator

PR_Github #30357 [ run ] completed with state SUCCESS. Commit: 32c2545
/LLM/main/L0_MergeRequest_PR pipeline #23388 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30369 [ run ] triggered by Bot. Commit: 32c2545

@benzh-2025 benzh-2025 force-pushed the benzh/add_gemm_ar_pytorch branch from 32c2545 to 289a4e5 Compare January 2, 2026 07:20
@benzh-2025
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30378 [ run ] triggered by Bot. Commit: 289a4e5

@tensorrt-cicd
Collaborator

PR_Github #30378 [ run ] completed with state SUCCESS. Commit: 289a4e5
/LLM/main/L0_MergeRequest_PR pipeline #23407 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again
