
[None][ci] Revert [TRTLLM-8821]#10549

Closed
QiJune wants to merge 1 commit into NVIDIA:main from QiJune:revert_1

Conversation

@QiJune
Collaborator

@QiJune QiJune commented Jan 8, 2026

Revert "[TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. (#8531)"

This reverts commit d272f1a.

Summary by CodeRabbit

Release Notes

  • Refactor

    • Simplified AllReduce operations by removing the autotuning mechanism and replacing it with direct invocation, improving predictability and reducing overhead.
    • Replaced environment-variable-driven workspace configuration with deterministic sizing based on world size.
    • Streamlined distributed operation invocation across models and benchmarks.
  • Tests

    • Updated AllReduce-related tests to reflect simplified operation flow.


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@QiJune
Collaborator Author

QiJune commented Jan 8, 2026

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

@QiJune QiJune changed the title Revert "[TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strat… [None][ci] Revert [TRTLLM-8821] Jan 8, 2026
@coderabbitai
Contributor

coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

This PR removes environment-variable-driven all-reduce configuration and autotuning infrastructure, replacing them with deterministic heuristic-based strategy selection. It simplifies all-reduce dispatch to use default torch operators, updates pattern compilation to eliminate parameterization, and refactors related tests and benchmarks to remove distributed state management.
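The dispatch simplification described in the walkthrough can be sketched with stand-in Python functions. Every name below is hypothetical; the real code dispatches torch custom ops such as torch.ops.trtllm.allreduce.default, and this is only a shape-of-the-change sketch, not the actual implementation:

```python
# Before: pick an implementation by profiling candidates (autotuning).
# After: call the default op directly. All functions here are stand-ins.
import time

def allreduce_default(rank_inputs):
    # Stand-in for the default all-reduce op: elementwise sum across "ranks".
    return [sum(vals) for vals in zip(*rank_inputs)]

def allreduce_alt(rank_inputs):
    # A second hypothetical candidate implementation with identical results.
    return [sum(vals) for vals in zip(*rank_inputs)]

def autotuned_forward(rank_inputs):
    # Old shape of the code path: time each candidate, keep the fastest.
    best_fn, best_t = None, float("inf")
    for fn in (allreduce_default, allreduce_alt):
        t0 = time.perf_counter()
        fn(rank_inputs)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best_fn, best_t = fn, dt
    return best_fn(rank_inputs)

def direct_forward(rank_inputs):
    # New shape of the code path: direct invocation, no profiling.
    return allreduce_default(rank_inputs)

ranks = [[1, 2, 3], [10, 20, 30]]
print(direct_forward(ranks))  # [11, 22, 33]
```

The revert trades the (potentially faster) profiled choice for predictability: the direct path has no warm-up profiling cost and no cross-rank synchronization during tuning.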

Changes

Cohort / File(s) / Summary

  • All-reduce strategy heuristics — cpp/tensorrt_llm/common/customAllReduceUtils.h
    Replaced environment-variable-driven workspace sizing with deterministic logic (16 MB for world size ≤ 2, 8 MB otherwise). Introduced HeuristicThresholdLP mapping and SelectStrategyLP() function for ONESHOT/TWOSHOT strategy selection based on sequence length, hidden size, world size, and SM version.
  • Autotuner and distributed state removal — tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py, tensorrt_llm/_torch/autotuner.py
    Removed AutoTuner initialization from executor setup. Eliminated NVTX range instrumentation around profiling blocks and dropped the distributed-merge synchronization barrier; reduced inline documentation.
  • All-reduce operator and dispatch — tensorrt_llm/_torch/custom_ops/torch_custom_ops.py, tensorrt_llm/_torch/distributed/ops.py
    Deleted the AllReduceRunner class and tunable_allreduce custom op. Simplified AllReduce.forward() from a conditional autotuning path to direct all_reduce_op invocation; added symmetric memory and workspace initialization logic in __init__.
  • Pattern compilation refactoring — tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py, tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
    Refactored all-reduce pattern registration to use the default torch.ops.trtllm.allreduce.default directly; dropped allreduce_func parameters from registration functions. Removed type annotations from the allreduce and allreduce_pg function signatures in cpp custom ops.
  • Configuration and model integration — tensorrt_llm/_torch/model_config.py, tensorrt_llm/_torch/models/modeling_llama.py
    Minor syntax adjustment (removed trailing comma in the NCCL_SYMMETRIC entry). Removed the strategy parameter from AllReduce initialization calls in the LLaMA model.
  • Tests and benchmarks — tests/microbenchmarks/all_reduce.py, tests/unittest/_torch/multi_gpu/test_allreduce.py, tests/unittest/_torch/multi_gpu/test_user_buffers.py, tests/scripts/allreduce_perf/allreduce_perf_viz.py
    Removed distributed state classes (AutoTuner, MPIDist, TorchDist) and replaced them with direct MPI barriers. Eliminated autotuning infrastructure and simplified the all-reduce path to a single invocation. Updated expected fusion match counts in assertions. Modified visualization directory structure and path handling in performance scripts.
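The deterministic sizing and strategy selection summarized for customAllReduceUtils.h can be illustrated with a minimal Python sketch. The 16 MB / 8 MB split comes from the summary above; the function names and the numel threshold are illustrative, not the actual C++ values:

```python
# Illustrative sketch (not the real C++ code): deterministic workspace
# sizing by world size, plus a ONESHOT/TWOSHOT pick by message size.
MB = 1024 * 1024

def workspace_size_bytes(world_size: int) -> int:
    # 16 MB for world size <= 2, 8 MB otherwise (per the PR summary).
    return 16 * MB if world_size <= 2 else 8 * MB

def select_strategy(seq_len: int, hidden_size: int,
                    two_shot_numel_threshold: int = 1 << 20) -> str:
    # Hypothetical rule: small messages favor ONESHOT (one fused kernel),
    # large messages favor TWOSHOT (reduce-scatter + all-gather). The real
    # heuristic also keys on world size and SM version.
    numel = seq_len * hidden_size
    return "ONESHOT" if numel < two_shot_numel_threshold else "TWOSHOT"

print(workspace_size_bytes(2), workspace_size_bytes(8))
print(select_strategy(8, 4096), select_strategy(4096, 8192))
```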

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • yilin-void
  • zongfeijing
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning. The PR description is incomplete and lacks substantive content about the revert, only containing the repository template with minimal explanation. Resolution: Fill the Description section with details about why the revert was necessary and the impact of reverting this AutoTuner AllReduce change.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 9.09%, which is insufficient; the required threshold is 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed. The title '[None][ci] Revert [TRTLLM-8821]' clearly identifies this as a revert of a previous commit and specifies the ticket involved.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tests/scripts/allreduce_perf/allreduce_perf_viz.py (1)

571-585: Undefined fusion_op and non‑idempotent directory creation

  • Line 573 uses fusion_op before it’s defined; this will raise NameError on the first run when viz doesn’t exist.
  • os.makedirs(os.path.join(args.data_dir, "viz", fusion_op)) inside the loop (line 584) will raise FileExistsError on subsequent runs because exist_ok=True isn’t used.

Recommend:

  • Drop the pre-loop creation of viz/fusion_op, and
  • Make the per‑fusion directory creation idempotent.
Proposed fix for directory handling
-    if not os.path.exists(os.path.join(args.data_dir, "viz")):
-        os.makedirs(os.path.join(args.data_dir, "viz"))
-        os.makedirs(os.path.join(args.data_dir, "viz", fusion_op))
+    if not os.path.exists(os.path.join(args.data_dir, "viz")):
+        os.makedirs(os.path.join(args.data_dir, "viz"))
@@
-        for fusion_op in fusion_op_list:
-            os.makedirs(os.path.join(args.data_dir, "viz", fusion_op))
+        for fusion_op in fusion_op_list:
+            os.makedirs(
+                os.path.join(args.data_dir, "viz", fusion_op),
+                exist_ok=True,
+            )
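The exist_ok fix suggested in the diff above can be verified in isolation: creating the same per-fusion directory twice must not raise FileExistsError. The directory names here are made up for the demo:

```python
# Demonstrates the exist_ok=True fix: re-running directory creation is a no-op.
import os
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    for fusion_op in ["RESIDUAL_RMS_NORM", "NONE"]:  # illustrative names
        for _ in range(2):  # second pass simulates re-running the script
            os.makedirs(os.path.join(data_dir, "viz", fusion_op), exist_ok=True)
    created = sorted(os.listdir(os.path.join(data_dir, "viz")))
    print(created)  # ['NONE', 'RESIDUAL_RMS_NORM']
```

Without exist_ok=True, the second os.makedirs call would raise FileExistsError, which is exactly the non-idempotency the review flags.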
tensorrt_llm/_torch/models/modeling_llama.py (1)

656-661: LlamaDecoderLayer inconsistently omits allreduce_strategy parameter

The AllReduce instantiation at line 660 in LlamaDecoderLayer omits the strategy argument, causing model_config.allreduce_strategy to be silently ignored. In contrast, Llama4DecoderLayer (line 457) and most other decoder layers throughout the codebase explicitly pass this parameter.

Consider aligning with the pattern used in Llama4DecoderLayer:

Proposed fix
-        self.all_reduce = AllReduce(mapping=model_config.mapping)
+        self.all_reduce = AllReduce(
+            mapping=model_config.mapping,
+            strategy=model_config.allreduce_strategy,
+        )
cpp/tensorrt_llm/common/customAllReduceUtils.h (1)

37-87: Heuristic map should be const and structured binding has unused element

The new heuristic code has a few issues to address:

  • In SelectStrategyLP at line 41, the structured binding declares nccl_num_token_threshold but never uses it; this will generate an unused-variable warning.
  • HeuristicThresholdLP should be declared const and renamed kHeuristicThresholdLp per the coding guidelines, since it is read-only configuration data that should not be mutated.
  • Accessing HeuristicThresholdLP[sm_major][world_size] for unsupported world_size values will synthesize a {0, 0} entry, causing the function to use zero thresholds (effectively forcing TWOSHOT strategy). Adding bounds checking would be safer.

Consider the suggested cleanup below to align with coding standards and prevent potential issues:

Suggested fix
-// (SM major_version, TP_size) -> (NCCL_num_token_threshold, TWO_SHOT_numel_threshold)
-inline std::unordered_map<int, std::unordered_map<int, std::pair<size_t, size_t>>> HeuristicThresholdLP{
+// (SM major_version, TP_size) -> (NCCL_num_token_threshold, TWO_SHOT_numel_threshold)
+inline const std::unordered_map<int, std::unordered_map<int, std::pair<size_t, size_t>>> kHeuristicThresholdLp{

And in SelectStrategyLP:

-    auto const [nccl_num_token_threshold, two_shot_numel_threshold] = HeuristicThresholdLP[sm_major][world_size];
+    auto itSm = kHeuristicThresholdLp.find(sm_major);
+    if (itSm == kHeuristicThresholdLp.end())
+    {
+        return AllReduceStrategyType::ONESHOT;
+    }
+    auto itTp = itSm->second.find(world_size);
+    if (itTp == itSm->second.end())
+    {
+        return AllReduceStrategyType::ONESHOT;
+    }
+    auto const two_shot_numel_threshold = itTp->second.second;
🤖 Fix all issues with AI agents
In @tests/scripts/allreduce_perf/allreduce_perf_viz.py:
- Around line 601-605: The heatmap save path for viz_path_diff incorrectly
prefixes os.path.dirname(__file__) causing inconsistent/incorrect paths when
args.data_dir is absolute; update the viz_path_diff construction to use the same
form as the other visualizations (start with f"{args.data_dir}/...") so the path
is built consistently, then call visualize_strategy_difference_heatmaps(df,
fusion_op, save_path=viz_path_diff) with the corrected viz_path_diff.

In @tests/unittest/_torch/multi_gpu/test_allreduce.py:
- Around line 241-256: The assertion error message uses an unnecessary f-string
and the 1% mismatch allowance is undocumented; change the raised assertion to
use a plain string (replace f"Large mismatched elements encountered" with "Large
mismatched elements encountered") and add a brief inline comment next to the
mismatch_percentage check (or tighten the threshold) explaining why
mismatch_percentage < 0.01 is acceptable for calc_output_tensor vs
ref_output_tensor given rtol/atol and cross-GPU allreduce variability; ensure
the symbols rtol, atol, mismatched, and mismatch_percentage remain intact.
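The mismatch-percentage check discussed above can be sketched in plain Python. The real test compares torch tensors produced by a cross-GPU allreduce; the tolerance values and data here are invented for illustration:

```python
# Sketch of a mismatch-fraction tolerance check: count elements outside
# an atol + rtol * |ref| band and compare the fraction to a threshold.
def mismatch_fraction(calc, ref, rtol=0.05, atol=0.15):
    mismatched = sum(
        1 for c, r in zip(calc, ref)
        if abs(c - r) > atol + rtol * abs(r)
    )
    return mismatched / len(ref)

calc = [1.00, 2.00, 3.00, 99.0]
ref = [1.01, 2.02, 2.97, 3.00]
frac = mismatch_fraction(calc, ref)
# One element (99.0 vs 3.00) exceeds the tolerance band here, so the
# fraction is 0.25; the real test asserts the fraction stays below 0.01.
print(frac)
```

The inline comment the review asks for would document why a small nonzero fraction is tolerated: reduction order differs across GPUs, so bitwise equality cannot be expected.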
🧹 Nitpick comments (3)
tensorrt_llm/_torch/autotuner.py (1)

870-878: Unused min_time from _profile_runners return tuple

The min_time component from _profile_runners is unpacked but never used, which is flagged by Ruff and slightly obscures intent.

Consider marking it as intentionally ignored:

Small clean‑up for unused value
-                    best_runner_id, best_tactic, min_time, has_tuning_failure_occurred = self._profile_runners(
+                    best_runner_id, best_tactic, _min_time, has_tuning_failure_occurred = self._profile_runners(
                         custom_op, runners, tensors, p, tuning_config, **kwargs)
tests/unittest/_torch/multi_gpu/test_allreduce.py (2)

135-137: Unused parameter in function signature.

The res parameter is not used in the function body. While this maintains signature consistency with other calc functions (like calc_fused_allreduce), consider documenting why it's present or using _ prefix to indicate it's intentionally unused.

♻️ Optional refactor to clarify intent
-    def calc_allreduce(x, res):
+    def calc_allreduce(x, _res):
         linear_out = linear(x)
         return [linear_out]

240-240: Consider adding strict=True to zip() if Python 3.10+ is required.

Static analysis suggests adding an explicit strict= parameter to catch potential length mismatches between calc_output and ref_output. However, note that this feature is only available in Python 3.10+, and the coding guidelines indicate Python 3.8+ support.

♻️ Optional modernization (Python 3.10+ only)
-    for calc_output_tensor, ref_output_tensor in zip(calc_output, ref_output):
+    for calc_output_tensor, ref_output_tensor in zip(calc_output, ref_output, strict=True):

Only apply this change if the project has moved to Python 3.10+ as the minimum version.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e033129 and b22021a.

📒 Files selected for processing (13)
  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tests/microbenchmarks/all_reduce.py
  • tests/scripts/allreduce_perf/allreduce_perf_viz.py
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tests/unittest/_torch/multi_gpu/test_user_buffers.py
💤 Files with no reviewable changes (2)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., some_file.py)
Python classes should use PascalCase (e.g., class SomeClass)
Python functions and methods should use snake_case (e.g., def my_awesome_function():)
Python local variables should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL)
Python constants should use upper snake_case (e.g., MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format """<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic

Files:

  • tests/unittest/_torch/multi_gpu/test_user_buffers.py
  • tests/microbenchmarks/all_reduce.py
  • tests/scripts/allreduce_perf/allreduce_perf_viz.py
  • tensorrt_llm/_torch/autotuner.py
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tensorrt_llm/_torch/distributed/ops.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification

Files:

  • tests/unittest/_torch/multi_gpu/test_user_buffers.py
  • tests/microbenchmarks/all_reduce.py
  • tests/scripts/allreduce_perf/allreduce_perf_viz.py
  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tensorrt_llm/_torch/autotuner.py
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tensorrt_llm/_torch/distributed/ops.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible
A variable that is not modified after its initialization should be declared as const
For naming of constants in C++, use uppercase snakecase with prefix 'k' (e.g., kDIGIT_NUM)
Except for 0, nullptr, true, and false, all other literals should only be used for variable initialization and not in comparisons or expressions
Use Allman indentation style for brace notation in C++ code
Put the semicolon for an empty for or while loop in a new line
The statement forming the body of a switch, while, do..while, or for statement must be a compound statement (use brace-delimited statements)
If and else statements should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camelCase with first letter lowercase (e.g., thisIsAFilename.cpp)
All types (including class names) in C++ should use PascalCase with uppercase first letter (e.g., FooBarClass)
Local variables, methods, and namespaces in C++ should use camelCase with first letter lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camelCase prefixed with 'g' (e.g., gDontUseGlobalFoos)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camelCase prefixed with 's' (e.g., sMutableStaticGlobal)
Locally visible static variables should use camelCase with 's' as the first letter (e.g., static std::once_flag sFlag;)
Public, private, and protected class member variables should use camelCase prefixed with 'm' (e.g., mNbFooValues)
Do not use Hungarian notation in C++ except for 'apps hungarian' (e.g., 'nb' to indicate count: mNbLayers)
If a constructor parameter name conflicts with a public me...

Files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
**/*.{h,hpp,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hxx}: Follow Doxygen rules for documenting new C++ class interfaces and function prototypes. Use //! for C++-style single-line comments and //!< for class members
Use a preprocessor guard in C++ header files with the format TRTLLM_<FILENAME>_H, where the filename is in uppercase with no underscores, no prefix underscores, and no trailing underscores

Files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
**/*.{h,hpp,hxx,cpp,cc,cxx,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All C++ class templates, function templates, class template member functions, and class template static members must be instantiated at least once

Files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
🧠 Learnings (24)
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_user_buffers.py
  • tests/microbenchmarks/all_reduce.py
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tests/microbenchmarks/all_reduce.py
  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-09-24T03:31:28.908Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7520
File: tensorrt_llm/_torch/pyexecutor/resource_manager.py:605-613
Timestamp: 2025-09-24T03:31:28.908Z
Learning: In TensorRT-LLM Ray orchestrator mode, ProcessGroups are initialized with both Gloo and NCCL backends (e.g., "cuda:nccl,cpu:gloo"), allowing PyTorch distributed to automatically route CPU tensors through Gloo and GPU tensors through NCCL. This eliminates the need for manual device placement when performing allreduce operations on base types.

Applied to files:

  • tests/microbenchmarks/all_reduce.py
  • tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Weight-Decomposed Low-Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/models/modeling_llama.py
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.

Applied to files:

  • tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_allreduce.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_allreduce.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
📚 Learning: 2025-08-27T16:22:10.695Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 7227
File: tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py:94-100
Timestamp: 2025-08-27T16:22:10.695Z
Learning: When there are inconsistent operator detection methods (like custom_op() vs target_op()), removing one method and standardizing on the other is often cleaner than supporting both methods simultaneously.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
📚 Learning: 2025-12-12T10:07:31.564Z
Learnt from: lirundong
Repo: NVIDIA/TensorRT-LLM PR: 9725
File: tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py:110-178
Timestamp: 2025-12-12T10:07:31.564Z
Learning: In PyTorch custom operators registered with torch.library.custom_op, mutable operators that return None and specify mutates_args do not require a register_fake decorator. Mutation tracking is handled automatically without needing a FakeTensor kernel. This applies to Python custom op definitions in tensorrt_llm/_torch/custom_ops that use mutates_args and return None; verify you are not relying on register_fake in these cases.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
Repo: NVIDIA/TensorRT-LLM PR: 7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
🧬 Code graph analysis (7)
tests/microbenchmarks/all_reduce.py (2)
tensorrt_llm/functional.py (1)
  • AllReduceFusionOp (3887-3896)
tensorrt_llm/_utils.py (4)
  • local_mpi_rank (564-573)
  • local_mpi_size (576-577)
  • nvtx_range (891-910)
  • mpi_barrier (589-591)
cpp/tensorrt_llm/common/customAllReduceUtils.h (1)
cpp/tensorrt_llm/thop/allreduceOp.cpp (8)
  • seq_len (1182-1231)
  • seq_len (1182-1182)
  • seq_len (1233-1242)
  • seq_len (1233-1233)
  • op (1291-1291)
  • op (1330-1330)
  • message_size (1244-1257)
  • message_size (1244-1244)
tests/unittest/_torch/multi_gpu/test_allreduce.py (1)
tensorrt_llm/functional.py (4)
  • AllReduceFusionOp (3887-3896)
  • AllReduceParams (3899-3938)
  • AllReduceStrategy (3874-3884)
  • MoEAllReduceParams (3941-3977)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2)
cpp/tensorrt_llm/thop/allreduceOp.cpp (16)
  • input (281-315)
  • input (281-284)
  • input (335-413)
  • input (335-337)
  • input (415-446)
  • input (415-417)
  • input (448-563)
  • input (448-450)
  • input (565-630)
  • input (565-567)
  • input (632-815)
  • input (632-636)
  • input (817-883)
  • input (817-820)
  • op (1291-1291)
  • op (1330-1330)
cpp/tensorrt_llm/thop/reducescatterOp.cpp (6)
  • input (124-127)
  • input (124-124)
  • op (227-227)
  • op (245-245)
  • op (263-263)
  • op (282-282)
tensorrt_llm/_torch/model_config.py (1)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3874-3884)
tensorrt_llm/_torch/models/modeling_llama.py (1)
tensorrt_llm/_torch/distributed/ops.py (1)
  • AllReduce (637-850)
tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py (2)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
  • allreduce (17-60)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3874-3884)
🪛 Ruff (0.14.10)
tests/scripts/allreduce_perf/allreduce_perf_viz.py

573-573: Undefined name fusion_op

(F821)

tensorrt_llm/_torch/autotuner.py

874-874: Unpacked variable min_time is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tests/unittest/_torch/multi_gpu/test_allreduce.py

135-135: Unused function argument: res

(ARG001)


240-240: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


256-256: f-string without any placeholders

Remove extraneous f prefix

(F541)

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py

19-19: Unused function argument: residual

(ARG001)


20-20: Unused function argument: norm_weight

(ARG001)


21-21: Unused function argument: scale

(ARG001)


22-22: Unused function argument: bias

(ARG001)


23-23: Unused function argument: workspace

(ARG001)


24-24: Unused function argument: group

(ARG001)


25-25: Unused function argument: strategy

(ARG001)


27-27: Unused function argument: eps

(ARG001)


28-28: Unused function argument: trigger_completion_at_end

(ARG001)


71-71: Unused function argument: rank

(ARG001)


72-72: Unused function argument: pg

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (12)
tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py (6)

2-2: LGTM - Consistent function signature updates

All registration function signatures have been consistently updated to remove the allreduce_func parameter, replaced with hard-coded usage of torch.ops.trtllm.allreduce.default. The import changes are also correct.

Also applies to: 17-18, 113-114, 190-191, 266-267, 338-339, 409-410


20-24: LGTM - Pattern definitions correctly updated

All CallFunction pattern definitions have been consistently updated to use torch.ops.trtllm.allreduce.default instead of the parameterized allreduce_func.default. This ensures pattern matching works with the hard-coded operator.

Also applies to: 117-117, 194-194, 270-270, 342-342, 416-420, 670-670


59-64: LGTM - Target patterns correctly implemented

All target pattern implementations have been updated to directly call torch.ops.trtllm.allreduce instead of the parameterized allreduce_func. The logic remains correct and consistent with the pattern definitions.

Also applies to: 168-173, 245-249, 316-321, 388-392


104-110: LGTM - Strategy handling preserved correctly

The strategy-based selection logic has been correctly preserved. The check_non_ub_strategy function ensures FP8/FP4 quantization patterns are only applied for non-UB strategies, and the UB conversion pattern correctly checks for AUTO strategy. This maintains the strategy selection mechanism while removing the autotuner functionality.

Also applies to: 413-464


1-732: Revert appears complete and correct

The revert successfully removes the autotuner functionality by replacing the parameterized allreduce_func with hard-coded torch.ops.trtllm.allreduce.default throughout the file. All changes are consistent and the logic remains sound. The strategy-based selection is preserved, ensuring that different AllReduce strategies can still be used - just without the autotuning capability.

Key changes verified:

  • All function signatures updated consistently
  • All pattern definitions updated to use fixed operator
  • All target implementations updated correctly
  • All internal call sites updated
  • Strategy handling logic preserved
  • No syntax or logic errors introduced

719-732: External call sites have been properly updated.

The external call to register_ar_fusions in tensorrt_llm/_torch/compilation/backend.py:79-80 correctly passes only three parameters (cls._custom_pass_instances, mapping, ub_enabled) matching the updated function signature. No remaining references to allreduce_func were found in the codebase. The refactoring is complete.

tests/unittest/_torch/multi_gpu/test_user_buffers.py (3)

463-463: Add NVIDIA copyright header to the file.

This file is missing the required NVIDIA copyright header. Per coding guidelines, all TensorRT-LLM source files (.py and other source files) must include an NVIDIA copyright header with the year of latest meaningful modification. Add the header at the top of the file before the imports.

Additionally, verify the test passes with the updated match counts at lines 463, 762, and 996.

⛔ Skipped due to learnings
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.

762-762: Add NVIDIA copyright header to file.

The file is missing the required NVIDIA copyright header. Per the coding guidelines, all TensorRT-LLM source files (.py, .cpp, .h, .cu, etc.) must contain an NVIDIA copyright header with the year of latest meaningful modification. Add the header at the top of the file before any imports.

⛔ Skipped due to learnings
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2026-01-06T03:07:15.754Z
Learning: Applies to **/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py} : All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification
Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: tburt-nv
Repo: NVIDIA/TensorRT-LLM PR: 9881
File: cpp/kernels/fmha_v2/train_ops/train_setup.py:35-36
Timestamp: 2025-12-10T19:14:58.432Z
Learning: In cpp/kernels/fmha_v2/train_ops/train_setup.py, the embedded C++ template copyright headers (fmha_dgrad_v2_flash_attention_template and fmha_fprop_v2_flash_attention_template) use copyright dates "2011-2023" because the file was imported without meaningful changes since its original creation. The original copyright dates are preserved for historical accuracy rather than updated to the current year.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

996-996: Add NVIDIA copyright header to file and verify test passes with updated match counts.

The assertion at line 996 is correctly consistent with line 463, as both tests validate similar fusion scenarios with different quantization types (FP4 vs FP8). However, the file is missing the required NVIDIA copyright header at the top. Per coding guidelines, all TensorRT-LLM Python source files must contain an SPDX-FileCopyrightText: Copyright (c) <year> NVIDIA CORPORATION & AFFILIATES header.

⛔ Skipped due to learnings
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2026-01-06T03:07:15.754Z
Learning: Applies to **/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py} : All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
tensorrt_llm/_torch/model_config.py (1)

154-167: AllReduce strategy string mapping looks consistent

The NCCL_SYMMETRIC entry and overall mapping in get_all_reduce_strategy remain correct; the syntactic tweak (trailing comma removal) is no-op semantically.
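A name-to-enum mapping of the kind `get_all_reduce_strategy` implements can be sketched as below. The enum members and fallback behavior are illustrative assumptions, not the real `AllReduceStrategy` definition from `tensorrt_llm/functional.py`.

```python
from enum import IntEnum

# Hypothetical stand-in for tensorrt_llm.functional.AllReduceStrategy;
# the real enum has different members/values.
class AllReduceStrategy(IntEnum):
    AUTO = 0
    NCCL = 1
    ONESHOT = 2
    TWOSHOT = 3
    NCCL_SYMMETRIC = 4

_STRATEGY_BY_NAME = {s.name: s for s in AllReduceStrategy}

def get_all_reduce_strategy(name: str) -> AllReduceStrategy:
    # Unknown strategy names fall back to AUTO instead of raising,
    # an assumption for this sketch.
    return _STRATEGY_BY_NAME.get(name.upper(), AllReduceStrategy.AUTO)
```

Deriving the mapping from the enum itself avoids the drift that a hand-written string table (like the one the comment reviews) can accumulate.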

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)

16-30: Fake allreduce op signatures remain structurally correct; verify schema compatibility

The trtllm::allreduce and trtllm::allreduce_pg fake kernels keep the same positional parameters but drop Python type annotations and forward allreduce_pg through allreduce for shape inference. That’s fine from a behavior/shape standpoint, and the unused arguments are required to keep the signature aligned with the C++ ops despite Ruff’s ARG001 noise.

Please double‑check that the argument names/order here still exactly match the registered C++ schemas for trtllm::allreduce and trtllm::allreduce_pg, so torch.library.register_fake and torch.ops.trtllm.* don’t hit a schema mismatch at import time.

Also applies to: 62-79

tensorrt_llm/_torch/distributed/ops.py (1)

694-753: AllReduce wiring and fallbacks look sound; verify op schemas and PG availability

The updated AllReduce:

  • Chooses between torch.ops.trtllm.allreduce and allreduce_pg via self.all_reduce_op, with rank/pg only passed in the PG path.
  • Initializes SymmetricMemoryAllReduce with a clean fallback to AllReduceStrategy.AUTO if unsupported or failing, and allocates a fusion workspace for all strategies except UB/NCCL/NCCL_SYMMETRIC.
  • Tries symm_mem (for fusion_op == NONE) and then MNNVLAllReduce before falling back to the generic C++ allreduce op with strategy normalized away from MNNVL/SYMM_MEM.

The call into self.all_reduce_op uses keyword arguments whose names/order match the Python fake registrations in cpp_custom_ops.py, so the Python → C++ boundary looks correct.

Two things to explicitly confirm:

  1. The C++ schemas for trtllm::allreduce and trtllm::allreduce_pg still match these keyword names and argument ordering (especially around group, rank, and pg).
  2. In environments where mpi_disabled() is true, mapping.tp_group_pg and torch.distributed have been initialized before AllReduce.forward is first called; otherwise the PG path will assert or get_rank() will fail.

Also applies to: 818-848

Comment on lines 601 to 605
# Create strategy difference heatmaps and save to data/viz directory
viz_path_diff = f"{args.data_dir}/viz/{fusion_op}/{case_name}_strategy_difference_heatmap.png"
viz_path_diff = f"{os.path.dirname(__file__)}/{args.data_dir}/viz/{fusion_op}/{case_name}_strategy_difference_heatmap.png"
visualize_strategy_difference_heatmaps(df,
fusion_op,
save_path=viz_path_diff)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

wc -l tests/scripts/allreduce_perf/allreduce_perf_viz.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 117


🏁 Script executed:

sed -n '590,620p' tests/scripts/allreduce_perf/allreduce_perf_viz.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1165


🏁 Script executed:

# Search for all visualization path constructions in the file
rg "viz_path|save_path" tests/scripts/allreduce_perf/allreduce_perf_viz.py -n -B 2 -A 2

Repository: NVIDIA/TensorRT-LLM

Length of output: 3581


🏁 Script executed:

# Also check how args.data_dir is used throughout
rg "args\.data_dir" tests/scripts/allreduce_perf/allreduce_perf_viz.py -n -B 1 -A 1

Repository: NVIDIA/TensorRT-LLM

Length of output: 1505


Remove inconsistent os.path.dirname(__file__) prefix from strategy-difference heatmap path

Line 602 uses f"{os.path.dirname(__file__)}/{args.data_dir}..." while lines 592 and 596 use f"{args.data_dir}..." for the other two visualizations. This inconsistency breaks relative paths and creates issues when args.data_dir is absolute. Align line 602 with the other paths by removing the os.path.dirname(__file__)/ prefix.

🤖 Prompt for AI Agents
In @tests/scripts/allreduce_perf/allreduce_perf_viz.py around lines 601 - 605,
The heatmap save path for viz_path_diff incorrectly prefixes
os.path.dirname(__file__) causing inconsistent/incorrect paths when
args.data_dir is absolute; update the viz_path_diff construction to use the same
form as the other visualizations (start with f"{args.data_dir}/...") so the path
is built consistently, then call visualize_strategy_difference_heatmaps(df,
fusion_op, save_path=viz_path_diff) with the corrected viz_path_diff.
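The consistent path construction the comment asks for can be sketched with `os.path.join`, which also handles absolute `data_dir` values cleanly. The helper name is illustrative; the script builds the string inline.

```python
import os

def heatmap_path(data_dir: str, fusion_op: str, case_name: str) -> str:
    """Build the save path the same way as the other two visualizations:
    relative to data_dir only, with no os.path.dirname(__file__) prefix,
    so both relative and absolute data_dir values work."""
    return os.path.join(data_dir, "viz", fusion_op,
                        f"{case_name}_strategy_difference_heatmap.png")

# Relative data_dir stays relative; absolute data_dir stays absolute:
assert heatmap_path("/abs/data", "residual_rms_norm", "tp8").startswith("/abs/data/")
```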

Comment on lines +241 to +256
rtol, atol = 0.05, 0.15
try:
torch.testing.assert_close(
calc_output_tensor,
ref_output_tensor,
rtol=rtol,
atol=atol,
)
except AssertionError:
# Calculate percentage of mismatched elements
mismatched = torch.abs(calc_output_tensor - ref_output_tensor) > (
rtol * torch.abs(ref_output_tensor) + atol)
mismatch_percentage = (mismatched.sum() / mismatched.numel())

# If more than 1% elements mismatch, raise the error
assert mismatch_percentage < 0.01, f"Large mismatched elements encountered"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Read the file to see the actual code and context
head -n 260 tests/unittest/_torch/multi_gpu/test_allreduce.py | tail -n 50

Repository: NVIDIA/TensorRT-LLM

Length of output: 2363


🏁 Script executed:

# Check git history for this file to understand the change
git log --oneline tests/unittest/_torch/multi_gpu/test_allreduce.py | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 161


🏁 Script executed:

# Look for any comments in the file that explain the mismatch threshold
grep -n "mismatch\|tolerance\|threshold\|precision" tests/unittest/_torch/multi_gpu/test_allreduce.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 446


🏁 Script executed:

# Check the entire test function context to understand why this relaxation might be needed
sed -n '200,260p' tests/unittest/_torch/multi_gpu/test_allreduce.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2993


🏁 Script executed:

# Check git blame to see when this specific code was added
git blame -L 241,256 tests/unittest/_torch/multi_gpu/test_allreduce.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1494


🏁 Script executed:

# Search for similar mismatch percentage checks in other tests
rg "mismatch_percentage\|torch.testing.assert_close" tests/ -A 5 -B 2

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Check the file header for copyright
head -n 20 tests/unittest/_torch/multi_gpu/test_allreduce.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 813


Fix the unnecessary f-string and document or reconsider the 1% mismatch threshold.

The f-string at line 256 contains no placeholders and should be converted to a regular string per Python style guidelines.

Regarding the assertion logic: The try/except pattern that allows up to 1% element mismatch is valid for handling numerical precision variations across different allreduce strategies, but this design choice lacks documentation. Consider either adding an inline comment explaining why 1% is appropriate for these operations, or tightening the threshold if 1% is too permissive.

Suggested fix
-            assert mismatch_percentage < 0.01, f"Large mismatched elements encountered"
+            assert mismatch_percentage < 0.01, "Large mismatched elements encountered"

For the 1% threshold, consider adding a comment:

             # If more than 1% elements mismatch, raise the error
+            # (Allows for numerical precision variations in allreduce fusion operations)
             assert mismatch_percentage < 0.01, "Large mismatched elements encountered"
🧰 Tools
🪛 Ruff (0.14.10)

256-256: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In @tests/unittest/_torch/multi_gpu/test_allreduce.py around lines 241 - 256,
The assertion error message uses an unnecessary f-string and the 1% mismatch
allowance is undocumented; change the raised assertion to use a plain string
(replace f"Large mismatched elements encountered" with "Large mismatched
elements encountered") and add a brief inline comment next to the
mismatch_percentage check (or tighten the threshold) explaining why
mismatch_percentage < 0.01 is acceptable for calc_output_tensor vs
ref_output_tensor given rtol/atol and cross-GPU allreduce variability; ensure
the symbols rtol, atol, mismatched, and mismatch_percentage remain intact.
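The fallback pattern the comment describes, a strict rtol/atol check relaxed to a 1% element-mismatch budget, can be sketched without torch. This is a plain-list illustration of the test's logic, not the test code itself.

```python
# Stdlib sketch of the test's two-stage check: count elements outside the
# rtol/atol envelope and tolerate up to 1% of them before failing.
# Plain lists stand in for the torch tensors used in the real test.
def assert_mostly_close(calc, ref, rtol=0.05, atol=0.15, max_mismatch=0.01):
    mismatched = sum(
        abs(c - r) > (rtol * abs(r) + atol) for c, r in zip(calc, ref)
    )
    mismatch_percentage = mismatched / len(ref)
    # Allows for numerical precision variation across allreduce strategies
    # and GPU architectures; note the plain (non-f) string message.
    assert mismatch_percentage < max_mismatch, \
        "Large mismatched elements encountered"

assert_mostly_close([1.0, 2.0, 3.0], [1.01, 2.0, 3.1])  # within tolerance
```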

@tensorrt-cicd
Collaborator

PR_Github #31067 [ run ] triggered by Bot. Commit: b22021a

@QiJune
Collaborator Author

QiJune commented Jan 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #31086 [ run ] triggered by Bot. Commit: b22021a

Collaborator

@brb-nv brb-nv left a comment


LGTM.

@tensorrt-cicd
Collaborator

PR_Github #31086 [ run ] completed with state SUCCESS. Commit: b22021a
/LLM/main/L0_MergeRequest_PR pipeline #24006 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@tburt-nv
Collaborator

tburt-nv commented Jan 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #31123 [ run ] triggered by Bot. Commit: b22021a

@yuxianq yuxianq requested a review from hyukn January 9, 2026 02:05
@chzblych
Collaborator

chzblych commented Jan 9, 2026

/bot kill

@QiJune QiJune closed this Jan 9, 2026
@tensorrt-cicd
Collaborator

PR_Github #31181 [ ] completed with state FAILURE. Commit: b22021a
Not allowed on merged PR

@tensorrt-cicd
Collaborator

PR_Github #31123 [ run ] completed with state ABORTED. Commit: b22021a
LLM/main/L0_MergeRequest_PR #24040 (Blue Ocean) completed with status: ABORTED
