
[None][ci] Revert [TRTLLM-8821]#10549

Closed
QiJune wants to merge 1 commit into NVIDIA:main from QiJune:revert_1

Conversation

@QiJune
Collaborator

@QiJune QiJune commented Jan 8, 2026

Revert "[TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. (#8531)"

This reverts commit d272f1a.

Summary by CodeRabbit

Release Notes

  • Refactor

    • Simplified AllReduce operations by removing the autotuning mechanism and replacing it with direct invocation, improving predictability and reducing overhead.
    • Replaced environment-variable-driven workspace configuration with deterministic sizing based on world size.
    • Streamlined distributed operation invocation across models and benchmarks.
  • Tests

    • Updated AllReduce-related tests to reflect simplified operation flow.


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@QiJune
Collaborator Author

QiJune commented Jan 8, 2026

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

@QiJune QiJune changed the title Revert "[TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strat… [None][ci] Revert [TRTLLM-8821] Jan 8, 2026
@coderabbitai
Contributor

coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

This PR removes environment-variable-driven all-reduce configuration and autotuning infrastructure, replacing them with deterministic heuristic-based strategy selection. It simplifies all-reduce dispatch to use default torch operators, updates pattern compilation to eliminate parameterization, and refactors related tests and benchmarks to remove distributed state management.
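The dispatch simplification described in the walkthrough can be sketched with stand-in Python functions. Every name below is hypothetical; the real code dispatches torch custom ops such as torch.ops.trtllm.allreduce.default, and this is only a shape-of-the-change sketch, not the actual implementation:

```python
# Before: pick an implementation by profiling candidates (autotuning).
# After: call the default op directly. All functions here are stand-ins.
import time

def allreduce_default(rank_inputs):
    # Stand-in for the default all-reduce op: elementwise sum across "ranks".
    return [sum(vals) for vals in zip(*rank_inputs)]

def allreduce_alt(rank_inputs):
    # A second hypothetical candidate implementation with identical results.
    return [sum(vals) for vals in zip(*rank_inputs)]

def autotuned_forward(rank_inputs):
    # Old shape of the code path: time each candidate, keep the fastest.
    best_fn, best_t = None, float("inf")
    for fn in (allreduce_default, allreduce_alt):
        t0 = time.perf_counter()
        fn(rank_inputs)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best_fn, best_t = fn, dt
    return best_fn(rank_inputs)

def direct_forward(rank_inputs):
    # New shape of the code path: direct invocation, no profiling.
    return allreduce_default(rank_inputs)

ranks = [[1, 2, 3], [10, 20, 30]]
print(direct_forward(ranks))  # [11, 22, 33]
```

The revert trades the (potentially faster) profiled choice for predictability: the direct path has no warm-up profiling cost and no cross-rank synchronization during tuning.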

Changes

Cohort / File(s) / Summary

  • All-reduce strategy heuristics — cpp/tensorrt_llm/common/customAllReduceUtils.h
    Replaced environment-variable-driven workspace sizing with deterministic logic (16 MB for world size ≤ 2, 8 MB otherwise). Introduced HeuristicThresholdLP mapping and SelectStrategyLP() function for ONESHOT/TWOSHOT strategy selection based on sequence length, hidden size, world size, and SM version.
  • Autotuner and distributed state removal — tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py, tensorrt_llm/_torch/autotuner.py
    Removed AutoTuner initialization from executor setup. Eliminated NVTX range instrumentation around profiling blocks and dropped the distributed-merge synchronization barrier; reduced inline documentation.
  • All-reduce operator and dispatch — tensorrt_llm/_torch/custom_ops/torch_custom_ops.py, tensorrt_llm/_torch/distributed/ops.py
    Deleted the AllReduceRunner class and tunable_allreduce custom op. Simplified AllReduce.forward() from a conditional autotuning path to direct all_reduce_op invocation; added symmetric memory and workspace initialization logic in __init__.
  • Pattern compilation refactoring — tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py, tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
    Refactored all-reduce pattern registration to use the default torch.ops.trtllm.allreduce.default directly; dropped allreduce_func parameters from registration functions. Removed type annotations from the allreduce and allreduce_pg function signatures in cpp custom ops.
  • Configuration and model integration — tensorrt_llm/_torch/model_config.py, tensorrt_llm/_torch/models/modeling_llama.py
    Minor syntax adjustment (removed trailing comma in the NCCL_SYMMETRIC entry). Removed the strategy parameter from AllReduce initialization calls in the LLaMA model.
  • Tests and benchmarks — tests/microbenchmarks/all_reduce.py, tests/unittest/_torch/multi_gpu/test_allreduce.py, tests/unittest/_torch/multi_gpu/test_user_buffers.py, tests/scripts/allreduce_perf/allreduce_perf_viz.py
    Removed distributed state classes (AutoTuner, MPIDist, TorchDist) and replaced them with direct MPI barriers. Eliminated autotuning infrastructure and simplified the all-reduce path to a single invocation. Updated expected fusion match counts in assertions. Modified visualization directory structure and path handling in performance scripts.
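The deterministic sizing and strategy selection summarized for customAllReduceUtils.h can be illustrated with a minimal Python sketch. The 16 MB / 8 MB split comes from the summary above; the function names and the numel threshold are illustrative, not the actual C++ values:

```python
# Illustrative sketch (not the real C++ code): deterministic workspace
# sizing by world size, plus a ONESHOT/TWOSHOT pick by message size.
MB = 1024 * 1024

def workspace_size_bytes(world_size: int) -> int:
    # 16 MB for world size <= 2, 8 MB otherwise (per the PR summary).
    return 16 * MB if world_size <= 2 else 8 * MB

def select_strategy(seq_len: int, hidden_size: int,
                    two_shot_numel_threshold: int = 1 << 20) -> str:
    # Hypothetical rule: small messages favor ONESHOT (one fused kernel),
    # large messages favor TWOSHOT (reduce-scatter + all-gather). The real
    # heuristic also keys on world size and SM version.
    numel = seq_len * hidden_size
    return "ONESHOT" if numel < two_shot_numel_threshold else "TWOSHOT"

print(workspace_size_bytes(2), workspace_size_bytes(8))
print(select_strategy(8, 4096), select_strategy(4096, 8192))
```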

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • yilin-void
  • zongfeijing
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning. The PR description is incomplete and lacks substantive content about the revert, only containing the repository template with minimal explanation. Resolution: Fill the Description section with details about why the revert was necessary and the impact of reverting this AutoTuner AllReduce change.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 9.09%, which is insufficient; the required threshold is 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed. The title '[None][ci] Revert [TRTLLM-8821]' clearly identifies this as a revert of a previous commit and specifies the ticket involved.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tests/scripts/allreduce_perf/allreduce_perf_viz.py (1)

571-585: Undefined fusion_op and non‑idempotent directory creation

  • Line 573 uses fusion_op before it’s defined; this will raise NameError on the first run when viz doesn’t exist.
  • os.makedirs(os.path.join(args.data_dir, "viz", fusion_op)) inside the loop (line 584) will raise FileExistsError on subsequent runs because exist_ok=True isn’t used.

Recommend:

  • Drop the pre-loop creation of viz/fusion_op, and
  • Make the per‑fusion directory creation idempotent.
Proposed fix for directory handling
-    if not os.path.exists(os.path.join(args.data_dir, "viz")):
-        os.makedirs(os.path.join(args.data_dir, "viz"))
-        os.makedirs(os.path.join(args.data_dir, "viz", fusion_op))
+    if not os.path.exists(os.path.join(args.data_dir, "viz")):
+        os.makedirs(os.path.join(args.data_dir, "viz"))
@@
-        for fusion_op in fusion_op_list:
-            os.makedirs(os.path.join(args.data_dir, "viz", fusion_op))
+        for fusion_op in fusion_op_list:
+            os.makedirs(
+                os.path.join(args.data_dir, "viz", fusion_op),
+                exist_ok=True,
+            )
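The exist_ok fix suggested in the diff above can be verified in isolation: creating the same per-fusion directory twice must not raise FileExistsError. The directory names here are made up for the demo:

```python
# Demonstrates the exist_ok=True fix: re-running directory creation is a no-op.
import os
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    for fusion_op in ["RESIDUAL_RMS_NORM", "NONE"]:  # illustrative names
        for _ in range(2):  # second pass simulates re-running the script
            os.makedirs(os.path.join(data_dir, "viz", fusion_op), exist_ok=True)
    created = sorted(os.listdir(os.path.join(data_dir, "viz")))
    print(created)  # ['NONE', 'RESIDUAL_RMS_NORM']
```

Without exist_ok=True, the second os.makedirs call would raise FileExistsError, which is exactly the non-idempotency the review flags.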
tensorrt_llm/_torch/models/modeling_llama.py (1)

656-661: LlamaDecoderLayer inconsistently omits allreduce_strategy parameter

The AllReduce instantiation at line 660 in LlamaDecoderLayer omits the strategy argument, causing model_config.allreduce_strategy to be silently ignored. In contrast, Llama4DecoderLayer (line 457) and most other decoder layers throughout the codebase explicitly pass this parameter.

Consider aligning with the pattern used in Llama4DecoderLayer:

Proposed fix
-        self.all_reduce = AllReduce(mapping=model_config.mapping)
+        self.all_reduce = AllReduce(
+            mapping=model_config.mapping,
+            strategy=model_config.allreduce_strategy,
+        )
cpp/tensorrt_llm/common/customAllReduceUtils.h (1)

37-87: Heuristic map should be const and structured binding has unused element

The new heuristic code has a few issues to address:

  • In SelectStrategyLP at line 41, the structured binding declares nccl_num_token_threshold but never uses it; this will generate an unused-variable warning.
  • HeuristicThresholdLP should be declared const and renamed kHeuristicThresholdLp per the coding guidelines, since it is read-only configuration data that should not be mutated.
  • Accessing HeuristicThresholdLP[sm_major][world_size] for unsupported world_size values will synthesize a {0, 0} entry, causing the function to use zero thresholds (effectively forcing TWOSHOT strategy). Adding bounds checking would be safer.

Consider the suggested cleanup below to align with coding standards and prevent potential issues:

Suggested fix
-// (SM major_version, TP_size) -> (NCCL_num_token_threshold, TWO_SHOT_numel_threshold)
-inline std::unordered_map<int, std::unordered_map<int, std::pair<size_t, size_t>>> HeuristicThresholdLP{
+// (SM major_version, TP_size) -> (NCCL_num_token_threshold, TWO_SHOT_numel_threshold)
+inline const std::unordered_map<int, std::unordered_map<int, std::pair<size_t, size_t>>> kHeuristicThresholdLp{

And in SelectStrategyLP:

-    auto const [nccl_num_token_threshold, two_shot_numel_threshold] = HeuristicThresholdLP[sm_major][world_size];
+    auto itSm = kHeuristicThresholdLp.find(sm_major);
+    if (itSm == kHeuristicThresholdLp.end())
+    {
+        return AllReduceStrategyType::ONESHOT;
+    }
+    auto itTp = itSm->second.find(world_size);
+    if (itTp == itSm->second.end())
+    {
+        return AllReduceStrategyType::ONESHOT;
+    }
+    auto const two_shot_numel_threshold = itTp->second.second;
🤖 Fix all issues with AI agents
In @tests/scripts/allreduce_perf/allreduce_perf_viz.py:
- Around line 601-605: The heatmap save path for viz_path_diff incorrectly
prefixes os.path.dirname(__file__) causing inconsistent/incorrect paths when
args.data_dir is absolute; update the viz_path_diff construction to use the same
form as the other visualizations (start with f"{args.data_dir}/...") so the path
is built consistently, then call visualize_strategy_difference_heatmaps(df,
fusion_op, save_path=viz_path_diff) with the corrected viz_path_diff.

In @tests/unittest/_torch/multi_gpu/test_allreduce.py:
- Around line 241-256: The assertion error message uses an unnecessary f-string
and the 1% mismatch allowance is undocumented; change the raised assertion to
use a plain string (replace f"Large mismatched elements encountered" with "Large
mismatched elements encountered") and add a brief inline comment next to the
mismatch_percentage check (or tighten the threshold) explaining why
mismatch_percentage < 0.01 is acceptable for calc_output_tensor vs
ref_output_tensor given rtol/atol and cross-GPU allreduce variability; ensure
the symbols rtol, atol, mismatched, and mismatch_percentage remain intact.
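The mismatch-percentage check discussed above can be sketched in plain Python. The real test compares torch tensors produced by a cross-GPU allreduce; the tolerance values and data here are invented for illustration:

```python
# Sketch of a mismatch-fraction tolerance check: count elements outside
# an atol + rtol * |ref| band and compare the fraction to a threshold.
def mismatch_fraction(calc, ref, rtol=0.05, atol=0.15):
    mismatched = sum(
        1 for c, r in zip(calc, ref)
        if abs(c - r) > atol + rtol * abs(r)
    )
    return mismatched / len(ref)

calc = [1.00, 2.00, 3.00, 99.0]
ref = [1.01, 2.02, 2.97, 3.00]
frac = mismatch_fraction(calc, ref)
# One element (99.0 vs 3.00) exceeds the tolerance band here, so the
# fraction is 0.25; the real test asserts the fraction stays below 0.01.
print(frac)
```

The inline comment the review asks for would document why a small nonzero fraction is tolerated: reduction order differs across GPUs, so bitwise equality cannot be expected.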
🧹 Nitpick comments (3)
tensorrt_llm/_torch/autotuner.py (1)

870-878: Unused min_time from _profile_runners return tuple

The min_time component from _profile_runners is unpacked but never used, which is flagged by Ruff and slightly obscures intent.

Consider marking it as intentionally ignored:

Small clean‑up for unused value
-                    best_runner_id, best_tactic, min_time, has_tuning_failure_occurred = self._profile_runners(
+                    best_runner_id, best_tactic, _min_time, has_tuning_failure_occurred = self._profile_runners(
                         custom_op, runners, tensors, p, tuning_config, **kwargs)
tests/unittest/_torch/multi_gpu/test_allreduce.py (2)

135-137: Unused parameter in function signature.

The res parameter is not used in the function body. While this maintains signature consistency with other calc functions (like calc_fused_allreduce), consider documenting why it's present or using _ prefix to indicate it's intentionally unused.

♻️ Optional refactor to clarify intent
-    def calc_allreduce(x, res):
+    def calc_allreduce(x, _res):
         linear_out = linear(x)
         return [linear_out]

240-240: Consider adding strict=True to zip() if Python 3.10+ is required.

Static analysis suggests adding an explicit strict= parameter to catch potential length mismatches between calc_output and ref_output. However, note that this feature is only available in Python 3.10+, and the coding guidelines indicate Python 3.8+ support.

♻️ Optional modernization (Python 3.10+ only)
-    for calc_output_tensor, ref_output_tensor in zip(calc_output, ref_output):
+    for calc_output_tensor, ref_output_tensor in zip(calc_output, ref_output, strict=True):

Only apply this change if the project has moved to Python 3.10+ as the minimum version.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e033129 and b22021a.

📒 Files selected for processing (13)
  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tests/microbenchmarks/all_reduce.py
  • tests/scripts/allreduce_perf/allreduce_perf_viz.py
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tests/unittest/_torch/multi_gpu/test_user_buffers.py
💤 Files with no reviewable changes (2)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., some_file.py)
Python classes should use PascalCase (e.g., class SomeClass)
Python functions and methods should use snake_case (e.g., def my_awesome_function():)
Python local variables should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL)
Python constants should use upper snake_case (e.g., MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format """<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic

Files:

  • tests/unittest/_torch/multi_gpu/test_user_buffers.py
  • tests/microbenchmarks/all_reduce.py
  • tests/scripts/allreduce_perf/allreduce_perf_viz.py
  • tensorrt_llm/_torch/autotuner.py
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tensorrt_llm/_torch/distributed/ops.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification

Files:

  • tests/unittest/_torch/multi_gpu/test_user_buffers.py
  • tests/microbenchmarks/all_reduce.py
  • tests/scripts/allreduce_perf/allreduce_perf_viz.py
  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tensorrt_llm/_torch/autotuner.py
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tensorrt_llm/_torch/distributed/ops.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible
A variable that is not modified after its initialization should be declared as const
For naming of constants in C++, use uppercase snakecase with prefix 'k' (e.g., kDIGIT_NUM)
Except for 0, nullptr, true, and false, all other literals should only be used for variable initialization and not in comparisons or expressions
Use Allman indentation style for brace notation in C++ code
Put the semicolon for an empty for or while loop in a new line
The statement forming the body of a switch, while, do..while, or for statement must be a compound statement (use brace-delimited statements)
If and else statements should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camelCase with first letter lowercase (e.g., thisIsAFilename.cpp)
All types (including class names) in C++ should use PascalCase with uppercase first letter (e.g., FooBarClass)
Local variables, methods, and namespaces in C++ should use camelCase with first letter lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camelCase prefixed with 'g' (e.g., gDontUseGlobalFoos)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camelCase prefixed with 's' (e.g., sMutableStaticGlobal)
Locally visible static variables should use camelCase with 's' as the first letter (e.g., static std::once_flag sFlag;)
Public, private, and protected class member variables should use camelCase prefixed with 'm' (e.g., mNbFooValues)
Do not use Hungarian notation in C++ except for 'apps hungarian' (e.g., 'nb' to indicate count: mNbLayers)
If a constructor parameter name conflicts with a public me...

Files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
**/*.{h,hpp,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hxx}: Follow Doxygen rules for documenting new C++ class interfaces and function prototypes. Use //! for C++-style single-line comments and //!< for class members
Use a preprocessor guard in C++ header files with the format TRTLLM_<FILENAME>_H, where the filename is in uppercase with no underscores, no prefix underscores, and no trailing underscores

Files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
**/*.{h,hpp,hxx,cpp,cc,cxx,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All C++ class templates, function templates, class template member functions, and class template static members must be instantiated at least once

Files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
🧠 Learnings (24)
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_user_buffers.py
  • tests/microbenchmarks/all_reduce.py
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tests/microbenchmarks/all_reduce.py
  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-09-24T03:31:28.908Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7520
File: tensorrt_llm/_torch/pyexecutor/resource_manager.py:605-613
Timestamp: 2025-09-24T03:31:28.908Z
Learning: In TensorRT-LLM Ray orchestrator mode, ProcessGroups are initialized with both Gloo and NCCL backends (e.g., "cuda:nccl,cpu:gloo"), allowing PyTorch distributed to automatically route CPU tensors through Gloo and GPU tensors through NCCL. This eliminates the need for manual device placement when performing allreduce operations on base types.

Applied to files:

  • tests/microbenchmarks/all_reduce.py
  • tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Weight-Decomposed Low-Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/models/modeling_llama.py
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.

Applied to files:

  • tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_allreduce.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/_torch/multi_gpu/test_allreduce.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
📚 Learning: 2025-08-27T16:22:10.695Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 7227
File: tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py:94-100
Timestamp: 2025-08-27T16:22:10.695Z
Learning: When there are inconsistent operator detection methods (like custom_op() vs target_op()), removing one method and standardizing on the other is often cleaner than supporting both methods simultaneously.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
📚 Learning: 2025-12-12T10:07:31.564Z
Learnt from: lirundong
Repo: NVIDIA/TensorRT-LLM PR: 9725
File: tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py:110-178
Timestamp: 2025-12-12T10:07:31.564Z
Learning: In PyTorch custom operators registered with torch.library.custom_op, mutable operators that return None and specify mutates_args do not require a register_fake decorator. Mutation tracking is handled automatically without needing a FakeTensor kernel. This applies to Python custom op definitions in tensorrt_llm/_torch/custom_ops that use mutates_args and return None; verify you are not relying on register_fake in these cases.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
Repo: NVIDIA/TensorRT-LLM PR: 7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
🧬 Code graph analysis (7)
tests/microbenchmarks/all_reduce.py (2)
tensorrt_llm/functional.py (1)
  • AllReduceFusionOp (3887-3896)
tensorrt_llm/_utils.py (4)
  • local_mpi_rank (564-573)
  • local_mpi_size (576-577)
  • nvtx_range (891-910)
  • mpi_barrier (589-591)
cpp/tensorrt_llm/common/customAllReduceUtils.h (1)
cpp/tensorrt_llm/thop/allreduceOp.cpp (8)
  • seq_len (1182-1231)
  • seq_len (1182-1182)
  • seq_len (1233-1242)
  • seq_len (1233-1233)
  • op (1291-1291)
  • op (1330-1330)
  • message_size (1244-1257)
  • message_size (1244-1244)
tests/unittest/_torch/multi_gpu/test_allreduce.py (1)
tensorrt_llm/functional.py (4)
  • AllReduceFusionOp (3887-3896)
  • AllReduceParams (3899-3938)
  • AllReduceStrategy (3874-3884)
  • MoEAllReduceParams (3941-3977)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2)
cpp/tensorrt_llm/thop/allreduceOp.cpp (16)
  • input (281-315)
  • input (281-284)
  • input (335-413)
  • input (335-337)
  • input (415-446)
  • input (415-417)
  • input (448-563)
  • input (448-450)
  • input (565-630)
  • input (565-567)
  • input (632-815)
  • input (632-636)
  • input (817-883)
  • input (817-820)
  • op (1291-1291)
  • op (1330-1330)
cpp/tensorrt_llm/thop/reducescatterOp.cpp (6)
  • input (124-127)
  • input (124-124)
  • op (227-227)
  • op (245-245)
  • op (263-263)
  • op (282-282)
tensorrt_llm/_torch/model_config.py (1)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3874-3884)
tensorrt_llm/_torch/models/modeling_llama.py (1)
tensorrt_llm/_torch/distributed/ops.py (1)
  • AllReduce (637-850)
tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py (2)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
  • allreduce (17-60)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3874-3884)
🪛 Ruff (0.14.10)
tests/scripts/allreduce_perf/allreduce_perf_viz.py

573-573: Undefined name fusion_op

(F821)

tensorrt_llm/_torch/autotuner.py

874-874: Unpacked variable min_time is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tests/unittest/_torch/multi_gpu/test_allreduce.py

135-135: Unused function argument: res

(ARG001)


240-240: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


256-256: f-string without any placeholders

Remove extraneous f prefix

(F541)

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py

19-19: Unused function argument: residual

(ARG001)


20-20: Unused function argument: norm_weight

(ARG001)


21-21: Unused function argument: scale

(ARG001)


22-22: Unused function argument: bias

(ARG001)


23-23: Unused function argument: workspace

(ARG001)


24-24: Unused function argument: group

(ARG001)


25-25: Unused function argument: strategy

(ARG001)


27-27: Unused function argument: eps

(ARG001)


28-28: Unused function argument: trigger_completion_at_end

(ARG001)


71-71: Unused function argument: rank

(ARG001)


72-72: Unused function argument: pg

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (12)
tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py (6)

2-2: LGTM - Consistent function signature updates

All registration function signatures have been consistently updated to remove the allreduce_func parameter, replaced with hard-coded usage of torch.ops.trtllm.allreduce.default. The import changes are also correct.

Also applies to: 17-18, 113-114, 190-191, 266-267, 338-339, 409-410


20-24: LGTM - Pattern definitions correctly updated

All CallFunction pattern definitions have been consistently updated to use torch.ops.trtllm.allreduce.default instead of the parameterized allreduce_func.default. This ensures pattern matching works with the hard-coded operator.

Also applies to: 117-117, 194-194, 270-270, 342-342, 416-420, 670-670


59-64: LGTM - Target patterns correctly implemented

All target pattern implementations have been updated to directly call torch.ops.trtllm.allreduce instead of the parameterized allreduce_func. The logic remains correct and consistent with the pattern definitions.

Also applies to: 168-173, 245-249, 316-321, 388-392


104-110: LGTM - Strategy handling preserved correctly

The strategy-based selection logic has been correctly preserved. The check_non_ub_strategy function ensures FP8/FP4 quantization patterns are only applied for non-UB strategies, and the UB conversion pattern correctly checks for AUTO strategy. This maintains the strategy selection mechanism while removing the autotuner functionality.

Also applies to: 413-464


1-732: Revert appears complete and correct

The revert successfully removes the autotuner functionality by replacing the parameterized allreduce_func with hard-coded torch.ops.trtllm.allreduce.default throughout the file. All changes are consistent and the logic remains sound. The strategy-based selection is preserved, ensuring that different AllReduce strategies can still be used - just without the autotuning capability.

Key changes verified:

  • All function signatures updated consistently
  • All pattern definitions updated to use fixed operator
  • All target implementations updated correctly
  • All internal call sites updated
  • Strategy handling logic preserved
  • No syntax or logic errors introduced

719-732: External call sites have been properly updated.

The external call to register_ar_fusions in tensorrt_llm/_torch/compilation/backend.py:79-80 correctly passes only three parameters (cls._custom_pass_instances, mapping, ub_enabled) matching the updated function signature. No remaining references to allreduce_func were found in the codebase. The refactoring is complete.

tests/unittest/_torch/multi_gpu/test_user_buffers.py (3)

463-463: Add NVIDIA copyright header to the file.

This file is missing the required NVIDIA copyright header. Per coding guidelines, all TensorRT-LLM source files (.py and other source files) must include an NVIDIA copyright header with the year of latest meaningful modification. Add the header at the top of the file before the imports.

Additionally, verify the test passes with the updated match counts at lines 463, 762, and 996.

⛔ Skipped due to learnings
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.

762-762: Add NVIDIA copyright header to file.

The file is missing the required NVIDIA copyright header. Per the coding guidelines, all TensorRT-LLM source files (.py, .cpp, .h, .cu, etc.) must contain an NVIDIA copyright header with the year of latest meaningful modification. Add the header at the top of the file before any imports.

⛔ Skipped due to learnings
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2026-01-06T03:07:15.754Z
Learning: Applies to **/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py} : All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification
Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: tburt-nv
Repo: NVIDIA/TensorRT-LLM PR: 9881
File: cpp/kernels/fmha_v2/train_ops/train_setup.py:35-36
Timestamp: 2025-12-10T19:14:58.432Z
Learning: In cpp/kernels/fmha_v2/train_ops/train_setup.py, the embedded C++ template copyright headers (fmha_dgrad_v2_flash_attention_template and fmha_fprop_v2_flash_attention_template) use copyright dates "2011-2023" because the file was imported without meaningful changes since its original creation. The original copyright dates are preserved for historical accuracy rather than updated to the current year.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

996-996: Add NVIDIA copyright header to file and verify test passes with updated match counts.

The assertion at line 996 is correctly consistent with line 463, as both tests validate similar fusion scenarios with different quantization types (FP4 vs FP8). However, the file is missing the required NVIDIA copyright header at the top. Per coding guidelines, all TensorRT-LLM Python source files must contain an SPDX-FileCopyrightText: Copyright (c) <year> NVIDIA CORPORATION & AFFILIATES header.

⛔ Skipped due to learnings
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2026-01-06T03:07:15.754Z
Learning: Applies to **/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py} : All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
tensorrt_llm/_torch/model_config.py (1)

154-167: AllReduce strategy string mapping looks consistent

The NCCL_SYMMETRIC entry and overall mapping in get_all_reduce_strategy remain correct; the syntactic tweak (trailing comma removal) is no-op semantically.
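A name-to-enum mapping of the kind `get_all_reduce_strategy` implements can be sketched as below. The enum members and fallback behavior are illustrative assumptions, not the real `AllReduceStrategy` definition from `tensorrt_llm/functional.py`.

```python
from enum import IntEnum

# Hypothetical stand-in for tensorrt_llm.functional.AllReduceStrategy;
# the real enum has different members/values.
class AllReduceStrategy(IntEnum):
    AUTO = 0
    NCCL = 1
    ONESHOT = 2
    TWOSHOT = 3
    NCCL_SYMMETRIC = 4

_STRATEGY_BY_NAME = {s.name: s for s in AllReduceStrategy}

def get_all_reduce_strategy(name: str) -> AllReduceStrategy:
    # Unknown strategy names fall back to AUTO instead of raising,
    # an assumption for this sketch.
    return _STRATEGY_BY_NAME.get(name.upper(), AllReduceStrategy.AUTO)
```

Deriving the mapping from the enum itself avoids the drift that a hand-written string table (like the one the comment reviews) can accumulate.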

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)

16-30: Fake allreduce op signatures remain structurally correct; verify schema compatibility

The trtllm::allreduce and trtllm::allreduce_pg fake kernels keep the same positional parameters but drop Python type annotations and forward allreduce_pg through allreduce for shape inference. That’s fine from a behavior/shape standpoint, and the unused arguments are required to keep the signature aligned with the C++ ops despite Ruff’s ARG001 noise.

Please double‑check that the argument names/order here still exactly match the registered C++ schemas for trtllm::allreduce and trtllm::allreduce_pg, so torch.library.register_fake and torch.ops.trtllm.* don’t hit a schema mismatch at import time.

Also applies to: 62-79

tensorrt_llm/_torch/distributed/ops.py (1)

694-753: AllReduce wiring and fallbacks look sound; verify op schemas and PG availability

The updated AllReduce:

  • Chooses between torch.ops.trtllm.allreduce and allreduce_pg via self.all_reduce_op, with rank/pg only passed in the PG path.
  • Initializes SymmetricMemoryAllReduce with a clean fallback to AllReduceStrategy.AUTO if unsupported or failing, and allocates a fusion workspace for all strategies except UB/NCCL/NCCL_SYMMETRIC.
  • Tries symm_mem (for fusion_op == NONE) and then MNNVLAllReduce before falling back to the generic C++ allreduce op with strategy normalized away from MNNVL/SYMM_MEM.

The call into self.all_reduce_op uses keyword arguments whose names/order match the Python fake registrations in cpp_custom_ops.py, so the Python → C++ boundary looks correct.

Two things to explicitly confirm:

  1. The C++ schemas for trtllm::allreduce and trtllm::allreduce_pg still match these keyword names and argument ordering (especially around group, rank, and pg).
  2. In environments where mpi_disabled() is true, mapping.tp_group_pg and torch.distributed have been initialized before AllReduce.forward is first called; otherwise the PG path will assert or get_rank() will fail.

Also applies to: 818-848

Comment on lines 601 to 605
# Create strategy difference heatmaps and save to data/viz directory
viz_path_diff = f"{args.data_dir}/viz/{fusion_op}/{case_name}_strategy_difference_heatmap.png"
viz_path_diff = f"{os.path.dirname(__file__)}/{args.data_dir}/viz/{fusion_op}/{case_name}_strategy_difference_heatmap.png"
visualize_strategy_difference_heatmaps(df,
fusion_op,
save_path=viz_path_diff)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

wc -l tests/scripts/allreduce_perf/allreduce_perf_viz.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 117


🏁 Script executed:

sed -n '590,620p' tests/scripts/allreduce_perf/allreduce_perf_viz.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1165


🏁 Script executed:

# Search for all visualization path constructions in the file
rg "viz_path|save_path" tests/scripts/allreduce_perf/allreduce_perf_viz.py -n -B 2 -A 2

Repository: NVIDIA/TensorRT-LLM

Length of output: 3581


🏁 Script executed:

# Also check how args.data_dir is used throughout
rg "args\.data_dir" tests/scripts/allreduce_perf/allreduce_perf_viz.py -n -B 1 -A 1

Repository: NVIDIA/TensorRT-LLM

Length of output: 1505


Remove inconsistent os.path.dirname(__file__) prefix from strategy-difference heatmap path

Line 602 uses f"{os.path.dirname(__file__)}/{args.data_dir}..." while lines 592 and 596 use f"{args.data_dir}..." for the other two visualizations. This inconsistency breaks relative paths and creates issues when args.data_dir is absolute. Align line 602 with the other paths by removing the os.path.dirname(__file__)/ prefix.

🤖 Prompt for AI Agents
In @tests/scripts/allreduce_perf/allreduce_perf_viz.py around lines 601 - 605,
The heatmap save path for viz_path_diff incorrectly prefixes
os.path.dirname(__file__) causing inconsistent/incorrect paths when
args.data_dir is absolute; update the viz_path_diff construction to use the same
form as the other visualizations (start with f"{args.data_dir}/...") so the path
is built consistently, then call visualize_strategy_difference_heatmaps(df,
fusion_op, save_path=viz_path_diff) with the corrected viz_path_diff.
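The consistent path construction the comment asks for can be sketched with `os.path.join`, which also handles absolute `data_dir` values cleanly. The helper name is illustrative; the script builds the string inline.

```python
import os

def heatmap_path(data_dir: str, fusion_op: str, case_name: str) -> str:
    """Build the save path the same way as the other two visualizations:
    relative to data_dir only, with no os.path.dirname(__file__) prefix,
    so both relative and absolute data_dir values work."""
    return os.path.join(data_dir, "viz", fusion_op,
                        f"{case_name}_strategy_difference_heatmap.png")

# Relative data_dir stays relative; absolute data_dir stays absolute:
assert heatmap_path("/abs/data", "residual_rms_norm", "tp8").startswith("/abs/data/")
```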

Comment on lines +241 to +256
rtol, atol = 0.05, 0.15
try:
torch.testing.assert_close(
calc_output_tensor,
ref_output_tensor,
rtol=rtol,
atol=atol,
)
except AssertionError:
# Calculate percentage of mismatched elements
mismatched = torch.abs(calc_output_tensor - ref_output_tensor) > (
rtol * torch.abs(ref_output_tensor) + atol)
mismatch_percentage = (mismatched.sum() / mismatched.numel())

# If more than 1% elements mismatch, raise the error
assert mismatch_percentage < 0.01, f"Large mismatched elements encountered"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Read the file to see the actual code and context
head -n 260 tests/unittest/_torch/multi_gpu/test_allreduce.py | tail -n 50

Repository: NVIDIA/TensorRT-LLM

Length of output: 2363


🏁 Script executed:

# Check git history for this file to understand the change
git log --oneline tests/unittest/_torch/multi_gpu/test_allreduce.py | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 161


🏁 Script executed:

# Look for any comments in the file that explain the mismatch threshold
grep -n "mismatch\|tolerance\|threshold\|precision" tests/unittest/_torch/multi_gpu/test_allreduce.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 446


🏁 Script executed:

# Check the entire test function context to understand why this relaxation might be needed
sed -n '200,260p' tests/unittest/_torch/multi_gpu/test_allreduce.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2993


🏁 Script executed:

# Check git blame to see when this specific code was added
git blame -L 241,256 tests/unittest/_torch/multi_gpu/test_allreduce.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1494


🏁 Script executed:

# Search for similar mismatch percentage checks in other tests
rg "mismatch_percentage\|torch.testing.assert_close" tests/ -A 5 -B 2

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Check the file header for copyright
head -n 20 tests/unittest/_torch/multi_gpu/test_allreduce.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 813


Fix the unnecessary f-string and document or reconsider the 1% mismatch threshold.

The f-string at line 256 contains no placeholders and should be converted to a regular string per Python style guidelines.

Regarding the assertion logic: The try/except pattern that allows up to 1% element mismatch is valid for handling numerical precision variations across different allreduce strategies, but this design choice lacks documentation. Consider either adding an inline comment explaining why 1% is appropriate for these operations, or tightening the threshold if 1% is too permissive.

Suggested fix
-            assert mismatch_percentage < 0.01, f"Large mismatched elements encountered"
+            assert mismatch_percentage < 0.01, "Large mismatched elements encountered"

For the 1% threshold, consider adding a comment:

             # If more than 1% elements mismatch, raise the error
+            # (Allows for numerical precision variations in allreduce fusion operations)
             assert mismatch_percentage < 0.01, "Large mismatched elements encountered"
🧰 Tools
🪛 Ruff (0.14.10)

256-256: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In @tests/unittest/_torch/multi_gpu/test_allreduce.py around lines 241 - 256,
The assertion error message uses an unnecessary f-string and the 1% mismatch
allowance is undocumented; change the raised assertion to use a plain string
(replace f"Large mismatched elements encountered" with "Large mismatched
elements encountered") and add a brief inline comment next to the
mismatch_percentage check (or tighten the threshold) explaining why
mismatch_percentage < 0.01 is acceptable for calc_output_tensor vs
ref_output_tensor given rtol/atol and cross-GPU allreduce variability; ensure
the symbols rtol, atol, mismatched, and mismatch_percentage remain intact.
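The fallback pattern the comment describes, a strict rtol/atol check relaxed to a 1% element-mismatch budget, can be sketched without torch. This is a plain-list illustration of the test's logic, not the test code itself.

```python
# Stdlib sketch of the test's two-stage check: count elements outside the
# rtol/atol envelope and tolerate up to 1% of them before failing.
# Plain lists stand in for the torch tensors used in the real test.
def assert_mostly_close(calc, ref, rtol=0.05, atol=0.15, max_mismatch=0.01):
    mismatched = sum(
        abs(c - r) > (rtol * abs(r) + atol) for c, r in zip(calc, ref)
    )
    mismatch_percentage = mismatched / len(ref)
    # Allows for numerical precision variation across allreduce strategies
    # and GPU architectures; note the plain (non-f) string message.
    assert mismatch_percentage < max_mismatch, \
        "Large mismatched elements encountered"

assert_mostly_close([1.0, 2.0, 3.0], [1.01, 2.0, 3.1])  # within tolerance
```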

@tensorrt-cicd
Collaborator

PR_Github #31067 [ run ] triggered by Bot. Commit: b22021a

@QiJune
Collaborator Author

QiJune commented Jan 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #31086 [ run ] triggered by Bot. Commit: b22021a

Collaborator

@brb-nv brb-nv left a comment


LGTM.

@tensorrt-cicd
Collaborator

PR_Github #31086 [ run ] completed with state SUCCESS. Commit: b22021a
/LLM/main/L0_MergeRequest_PR pipeline #24006 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@tburt-nv
Collaborator

tburt-nv commented Jan 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #31123 [ run ] triggered by Bot. Commit: b22021a

@yuxianq yuxianq requested a review from hyukn January 9, 2026 02:05
@chzblych
Collaborator

chzblych commented Jan 9, 2026

/bot kill

@QiJune QiJune closed this Jan 9, 2026
@tensorrt-cicd
Collaborator

PR_Github #31181 [ ] completed with state FAILURE. Commit: b22021a
Not allowed on merged PR

@tensorrt-cicd
Collaborator

PR_Github #31123 [ run ] completed with state ABORTED. Commit: b22021a
LLM/main/L0_MergeRequest_PR #24040 (Blue Ocean) completed with status: ABORTED
