[None][chore] Weekly mass integration of release/1.1 #8508
chzblych merged 11 commits into NVIDIA:main from
Conversation
📝 Walkthrough

This PR introduces heuristic-based and lookup-table-driven AllReduce strategy selection mechanisms, adds SM100f GPU architecture support utilities, optimizes GPU kernel synchronization with inline PTX assembly, centralizes MPI session management through global executors and singleton patterns, and expands test coverage with new integration tests, configurations, and benchmarking tools.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant selectImplementation
    participant SelectStrategyLP
    participant selectStrategyLookUpTable
    participant AllReduceBestStrategyTable
    User->>selectImplementation: seq_len, hidden_size
    alt Auto/LP strategy
        selectImplementation->>SelectStrategyLP: seq_len, hidden_size, world_size, op
        SelectStrategyLP->>SelectStrategyLP: Compare message_size to thresholds
        SelectStrategyLP-->>selectImplementation: ONESHOT or TWOSHOT
    else Fallback
        selectImplementation->>selectStrategyLookUpTable: num_tokens, hidden_size, op, tp_size
        selectStrategyLookUpTable->>AllReduceBestStrategyTable: Lookup by SM, TP, op, hidden_size, tokens
        AllReduceBestStrategyTable-->>selectStrategyLookUpTable: Strategy index
        selectStrategyLookUpTable-->>selectImplementation: NCCL (default) or TWOSHOT/ONESHOT
    end
    selectImplementation-->>User: AllReduceStrategyType
```
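The selection flow in the diagram can be sketched in Python. This is an illustrative model only: the names (`select_implementation`, `Strategy`), the threshold value, and the table entries are assumptions for the sketch, not the actual TensorRT-LLM implementation.

```python
from enum import Enum


class Strategy(Enum):
    NCCL = 0
    ONESHOT = 1
    TWOSHOT = 2


# Hypothetical heuristic threshold: small messages favor the one-shot kernel.
ONESHOT_MAX_BYTES = 64 * 1024

# Hypothetical lookup table keyed by (tp_size, hidden_size) -> best strategy,
# standing in for the SM/TP/op/hidden/tokens-indexed table in the diagram.
BEST_STRATEGY_TABLE = {
    (8, 4096): Strategy.TWOSHOT,
    (4, 2048): Strategy.ONESHOT,
}


def select_implementation(num_tokens, hidden_size, tp_size,
                          dtype_bytes=2, use_heuristic=True):
    """Pick an AllReduce strategy: heuristic first, lookup table as fallback."""
    if use_heuristic:
        # Compare message size against a threshold, as SelectStrategyLP does.
        message_size = num_tokens * hidden_size * dtype_bytes
        return Strategy.ONESHOT if message_size <= ONESHOT_MAX_BYTES else Strategy.TWOSHOT
    # Fallback path: table lookup, defaulting to NCCL for uncovered shapes.
    return BEST_STRATEGY_TABLE.get((tp_size, hidden_size), Strategy.NCCL)
```

The key design point the diagram encodes is the safe default: any shape the table does not cover falls back to NCCL rather than an untested custom kernel.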
```mermaid
sequenceDiagram
    participant App
    participant RemoteMpiCommSessionClient as Client (Singleton)
    participant _global_instance
    participant RemoteMpiCommSessionServer as Server
    App->>Client: new Client(addr)
    activate Client
    Client->>Client: Check _global_instance_lock
    alt Instance not cached
        Client->>_global_instance: Create new instance
        Client->>Client: Set _initialized flag
    else Instance cached
        Client->>_global_instance: Return existing instance
    end
    deactivate Client
    Client-->>App: Singleton instance
    App->>Client: submit_sync(task)
    Client->>Server: Send task
    Server->>Server: Append future to pending_futures
    Server->>Server: Wait for prior pending_futures
    Server->>Server: Execute task
    Server-->>Client: Result
    Client-->>App: Result
```
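The client-side caching shown in the diagram is a standard lock-guarded singleton with an `_initialized` guard. A minimal sketch, assuming a generic class (`SingletonClient` and its attributes are illustrative names, not the actual `RemoteMpiCommSessionClient` code):

```python
import threading


class SingletonClient:
    """Cache one instance per process; __new__ returns the cached object."""

    _global_instance = None
    _global_instance_lock = threading.Lock()

    def __new__(cls, addr):
        # Serialize instance creation so concurrent callers share one object.
        with cls._global_instance_lock:
            if cls._global_instance is None:
                cls._global_instance = super().__new__(cls)
        return cls._global_instance

    def __init__(self, addr):
        # __init__ runs on every construction attempt, so guard it with a
        # flag to avoid resetting state on repeated instantiation.
        if getattr(self, "_initialized", False):
            return
        self.addr = addr
        self._initialized = True
```

A lifecycle note on this pattern: later constructor arguments are silently ignored once an instance is cached, which is exactly the kind of singleton lifecycle concern the review effort rationale calls out.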
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: The PR spans multiple heterogeneous domains—GPU kernel assembly optimizations, AllReduce strategy selection mechanics (heuristic + lookup tables), MPI session management with singleton patterns, executor refactoring, and extensive test infrastructure. While individual changes are localized, they require separate reasoning for correctness (PTX semantics, strategy thresholds, singleton lifecycle, test configuration). The lookup tables and heuristic logic are dense; GPU kernel changes involve low-level synchronization semantics; MPI/singleton management introduces lifecycle concerns. The test additions are substantial but largely homogeneous.

Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 27
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tests/microbenchmarks/all_reduce.py (1)

1-1: Update copyright year to include 2025.

The copyright header should include the current year (2025) per the coding guidelines.

Apply this diff:

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

Based on coding guidelines.
cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (1)

196-199: Use system-scope fence before programmatic completion on Ampere path.

`__threadfence()` may be insufficient for inter-GPU/userbuffer visibility when paired with `cudaTriggerProgrammaticLaunchCompletion()`. Prefer `__threadfence_system()` for release semantics.

Apply:

```diff
-    if (threadIdx.x == 0)
-        __threadfence();
+    if (threadIdx.x == 0)
+        __threadfence_system();
```

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
1132-1145: Prepare resources after the post-connector can_queue recheck.

In `_executor_loop`, `prepare_resources()` runs before recomputing `can_queue` when `kv_connector_manager` may mutate `scheduled_batch`. If the batch becomes empty, you've done unnecessary prep and may transiently over-allocate.

Move `prepare_resources()` (and first-token handling) below the recheck block. Same comment applies to `_executor_loop_overlap`. Example:

```diff
-            if can_queue:
-                if self.kv_cache_transceiver:
-                    self._prepare_disagg_gen_transmission_complete(scheduled_batch)
-                self._handle_first_token_response(scheduled_batch)
-                self.resource_manager.prepare_resources(scheduled_batch)
-                self._kv_connector_start_batch(scheduled_batch)
+            if can_queue:
+                if self.kv_cache_transceiver:
+                    self._prepare_disagg_gen_transmission_complete(scheduled_batch)
+                self._handle_first_token_response(scheduled_batch)
+                self.resource_manager.prepare_resources(scheduled_batch)
+                self._kv_connector_start_batch(scheduled_batch)
             # if using a kv connector, we need to call can_queue again since scheduled_batch might have changed
             if self.kv_connector_manager:
                 can_queue = self._can_queue(scheduled_batch)
-            if can_queue:
+            if can_queue:
                 ...
```

And mirror this reorder in `_executor_loop_overlap`.

Also applies to: 1145-1150
1298-1313: Mirror the same resource-prep reorder in overlap loop.

Defer `prepare_resources()` until after the second `can_queue` evaluation when a kv connector is present to avoid prepping an empty batch.

Also applies to: 1309-1311
🧹 Nitpick comments (25)
tests/unittest/llmapi/_run_multi_llm_tasks.py (2)
19-19: Consider validating GPU availability for tensor_parallel_size=2.

The script requires 2 GPUs but doesn't validate availability upfront. If fewer GPUs are available, the failure will occur later with a less clear error message.

Add a check at the start of the script or function:

```python
import torch

# At module level or in run_llm_tp2
if torch.cuda.device_count() < 2:
    raise RuntimeError(
        f"This script requires 2 GPUs, but only {torch.cuda.device_count()} available")
```
32-33: Consider adding top-level error handling.

The script has no error handling at the entry point. Adding a try/except block would provide clearer error messages if the script fails.

Wrap the call in error handling:

```diff
 if __name__ == "__main__":
-    run_multi_llm_tasks()
+    try:
+        run_multi_llm_tasks()
+    except Exception as e:
+        print_colored(f"Error: {e}\n", "red")
+        sys.exit(1)
```

tests/microbenchmarks/all_reduce.py (2)
52-52: Consider moving logger configuration out of the profiling function.

Setting the logger level inside `profile_allreduce` means it's called repeatedly during benchmarking. Consider setting it once in `allreduce_benchmark` (line 127) or at module level.

Apply this diff:

```diff
 def profile_allreduce(
     mapping: Mapping,
     enable_cudagraph: bool = False,
     inner_loop=200,
     outer_loop=10,
     strategy=AllReduceStrategy.NCCL,
     fusion=AllReduceFusionOp.NONE,
     input=None,
     residual=None,
     norm=None,
     scale=None,
     bias=None,
 ):
-    tllm.logger.set_level('error')
     allreduce_params = AllReduceParams(
```

And ensure it's set once in `allreduce_benchmark` at line 127 (which already exists).
39-51: Consider adding a docstring to document the profiling function.

A Google-style docstring would help document the purpose, parameters, and return value of this new public function.

Example:

```python
def profile_allreduce(
    mapping: Mapping,
    enable_cudagraph: bool = False,
    inner_loop=200,
    outer_loop=10,
    strategy=AllReduceStrategy.NCCL,
    fusion=AllReduceFusionOp.NONE,
    input=None,
    residual=None,
    norm=None,
    scale=None,
    bias=None,
):
    """Profile a single AllReduce configuration.

    Args:
        mapping: Tensor parallelism mapping configuration.
        enable_cudagraph: Whether to use CUDA graph capture for profiling.
        inner_loop: Number of iterations per timing measurement.
        outer_loop: Number of timing measurements to compute median.
        strategy: AllReduce strategy to benchmark.
        fusion: Fusion operation to apply with AllReduce.
        input: Input tensor for AllReduce.
        residual: Optional residual tensor for fusion.
        norm: Optional RMSNorm module for fusion.
        scale: Optional scale tensor for quantization fusion.
        bias: Optional bias tensor for fusion.

    Returns:
        float: Median runtime in milliseconds per iteration.
    """
```

Based on coding guidelines.
tests/integration/defs/test_e2e.py (2)
2351-2353: Use the existing constant for KV cache fraction.

Keeps CLI composition consistent with the rest of the file.

```diff
-        "--kv_cache_fraction=0.5",
+        f"--kv_cache_fraction={_MEM_FRACTION_50}",
```

2359-2359: Ensure failures surface consistently.

Prefer the helper used elsewhere so non-zero exits fail the test.

```diff
-    llm_venv.run_cmd(cmd)
+    venv_check_call(llm_venv, cmd)
```

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
376-397: Validate assumptions and consider adding a docstring.

The new `from_model_config` classmethod provides a convenient alternative constructor, but several concerns warrant attention:

Homogeneous KV heads assumption (line 390): `model_config.num_kv_heads(0)` assumes layer 0 exists and is representative of all layers. If `num_attention_layers` returns 0 or layers have heterogeneous KV heads, this will produce incorrect results or fail silently. Consider validating that `num_attention_layers > 0` before accessing `num_kv_heads(0)`.

Missing docstring: Add a docstring explaining the method's purpose, when to use it vs. the regular constructor, and documenting the homogeneous KV cache assumption.

Limited parameter exposure: Several optional `__init__` parameters (`spec_config`, `layer_mask`, `max_num_tokens`, `max_beam_width`, `is_draft`, `kv_connector_manager`) default to None/0/False, which may limit the method's utility for more complex configurations. Consider whether these should be exposed or documented as limitations.

Consider adding validation:

```diff
     @classmethod
     def from_model_config(cls,
                           model_config: ModelConfigCpp,
                           kv_cache_config: KvCacheConfig,
                           mapping: Mapping,
                           kv_cache_type: CacheTypeCpp = CacheTypeCpp.SELF,
                           dtype: DataType = DataType.HALF) -> "KVCacheManager":
+        """
+        Construct a KVCacheManager from model and KV cache configurations.
+
+        Assumes homogeneous KV cache (all layers have the same number of KV heads).
+        For more complex configurations (e.g., speculative decoding, heterogeneous layers),
+        use the standard __init__ constructor.
+        """
+        num_layers = model_config.num_attention_layers(mapping.pp_size)
+        if num_layers <= 0:
+            raise ValueError(f"num_attention_layers must be > 0, got {num_layers}")
         return cls(
             kv_cache_config,
             kv_cache_type,
-            num_layers=model_config.num_attention_layers(mapping.pp_size),
+            num_layers=num_layers,
             # NOTE: this preserves existing behavior in KV cache manager.
             # But we should change this to pass a list at some point.
             # We're assuming the KV cache is homogeneous here.
             num_kv_heads=model_config.num_kv_heads(0),
             head_dim=model_config.size_per_head,
             tokens_per_block=model_config.tokens_per_block,
             max_seq_len=model_config.max_seq_len,
             max_batch_size=model_config.max_batch_size,
             mapping=mapping,
             dtype=dtype)
```

cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (1)
691-706: Handshake toggle consistency for oneshot variants.

Non-oneshot kernels advance reduce_id twice (`next_flag(*reduceidptr)` then `next_flag(reduce_id)`) before the trailing barrier; oneshot variants skip the second toggle. This may deadlock subsequent launches depending on the consumer's expectation.

Consider adding the second toggle after `multi_gpu_block_barrier(...)`, mirroring other kernels:

```diff
@@
-    multi_gpu_block_barrier(reduce_id, (int volatile*) &myptr[targetgpu]);
+    multi_gpu_block_barrier(reduce_id, (int volatile*) &myptr[targetgpu]);
+    // Advance to the next phase for trailing handshake
+    reduce_id = next_flag(reduce_id);
```

Please confirm the intended protocol; if a single-phase handshake is by design for oneshot, document it inline to avoid regressions.

Also applies to: 890-905, 1084-1097
tests/unittest/_torch/multi_gpu/test_allreduce.py (1)
120-129: Harmonize strategy between Linear and separate AllReduce to avoid confounds.

Linear uses NCCL, while AllReduce() defaults to AUTO; this can mask issues or introduce noise in assertions.

Apply:

```diff
-    linear = Linear(
+    linear = Linear(
         ...
-        allreduce_strategy=AllReduceStrategy.NCCL,
+        allreduce_strategy=AllReduceStrategy.NCCL,
     ).cuda()
-    allreduce = AllReduce(mapping=mapping)
+    allreduce = AllReduce(mapping=mapping, strategy=AllReduceStrategy.NCCL)
```
- linear = Linear( + linear = Linear( ... - allreduce_strategy=AllReduceStrategy.NCCL, + allreduce_strategy=AllReduceStrategy.NCCL, ).cuda() - allreduce = AllReduce(mapping=mapping) + allreduce = AllReduce(mapping=mapping, strategy=AllReduceStrategy.NCCL)If you intend to validate AUTO vs NCCL equivalence, add a parametrized strategy to test both explicitly.
Also applies to: 140-153
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1015-1019: Minor: avoid object all-gather overhead.

`tp_allgather` on a Python int uses `all_gather_object`. Consider sending a 0-d CUDA tensor (or CPU tensor) to use tensor all-gather and avoid pickling. Optional micro-opt.

tests/scripts/allreduce_perf/allreduce_perf_viz.py (3)
584-584: Make directory creation idempotent.

`os.makedirs(..., exist_ok=True)` avoids failures on re-runs. Apply here and where directories are created.
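The idempotent pattern, for reference (the `data/viz` path here is just an example directory, not the script's actual output layout):

```python
import os
import tempfile

base = tempfile.mkdtemp()
out_dir = os.path.join(base, "data", "viz")

# Safe to call repeatedly; nested parents are created as needed.
os.makedirs(out_dir, exist_ok=True)
os.makedirs(out_dir, exist_ok=True)  # second call does not raise FileExistsError
```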
149-149: Remove extraneous f-string prefixes.

These prints have no placeholders. Drop the `f` to satisfy linters.

```diff
-    print(f"\n2D Heatmap Statistics:")
+    print("\n2D Heatmap Statistics:")
 ...
-    print(f"\nBest Strategy Heatmap Statistics:")
+    print("\nBest Strategy Heatmap Statistics:")
 ...
-    print(f"\nStrategy distribution:")
+    print("\nStrategy distribution:")
 ...
-    print(f"\nStrategy Difference Heatmap Statistics:")
+    print("\nStrategy Difference Heatmap Statistics:")
 ...
-    print(f"Note: Positive values indicate slower than best strategy")
+    print("Note: Positive values indicate slower than best strategy")
```

Also applies to: 303-303, 313-313, 535-535, 538-538
133-134: Optional: match colorbar label to selected time column.

If `time_col` is `'time_ms'`, update the colorbar label to "Time (ms)".

tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py (4)
36-41: Time column should be detected dynamically.

Some CSVs use `time_ms`. Mirror the viz script fallback.

```diff
 def find_best_strategy(df: pd.DataFrame):
     """Find the best strategy for each combination of parameters."""
-    return df.groupby([
-        'world_size', 'fusion', 'hidden_size', 'num_tokens'
-    ]).apply(lambda group: group.loc[group['time (us)'].idxmin(), 'strategy'])
+    time_col = 'time (us)' if 'time (us)' in df.columns else 'time_ms'
+    return df.groupby(['world_size', 'fusion', 'hidden_size', 'num_tokens']).apply(
+        lambda g: g.loc[g[time_col].idxmin(), 'strategy'])
```
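The "best strategy per group" reduction this function performs can be shown dependency-free for clarity. Column names follow the benchmark CSV; the row data below is made up for illustration:

```python
from collections import defaultdict

# Fabricated benchmark rows: two strategies measured per shape.
rows = [
    {"world_size": 8, "fusion": "NONE", "hidden_size": 4096, "num_tokens": 1,
     "strategy": "ONESHOT", "time (us)": 9.0},
    {"world_size": 8, "fusion": "NONE", "hidden_size": 4096, "num_tokens": 1,
     "strategy": "TWOSHOT", "time (us)": 14.0},
    {"world_size": 8, "fusion": "NONE", "hidden_size": 4096, "num_tokens": 512,
     "strategy": "TWOSHOT", "time (us)": 30.0},
    {"world_size": 8, "fusion": "NONE", "hidden_size": 4096, "num_tokens": 512,
     "strategy": "ONESHOT", "time (us)": 45.0},
]


def find_best_strategy(rows, time_col="time (us)"):
    """Return {group_key: strategy with the minimal time} per parameter combo."""
    groups = defaultdict(list)
    for r in rows:
        key = (r["world_size"], r["fusion"], r["hidden_size"], r["num_tokens"])
        groups[key].append(r)
    return {k: min(g, key=lambda r: r[time_col])["strategy"]
            for k, g in groups.items()}


best = find_best_strategy(rows)
```

The pandas `groupby(...).apply(... idxmin ...)` in the script computes exactly this mapping, just vectorized over the DataFrame.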
125-131: Drop unnecessary f-string.

The header line has no placeholders.

```diff
-    cpp_code = f"// AllReduce lookup: [tp][fusion][hidden][tokens] = strategy\n"
+    cpp_code = "// AllReduce lookup: [tp][fusion][hidden][tokens] = strategy\n"
```
206-221: Harden subprocess calls.

Fail fast if the benchmark invocation fails.

```diff
-    subprocess.run(
-        cmd,
-        env=os.environ,
-    )
+    subprocess.run(cmd, env=os.environ, check=True)
```
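`check=True` turns a silently ignored non-zero exit into an exception, which is the whole point of the suggestion. A self-contained illustration using a child process that deliberately exits with status 3:

```python
import subprocess
import sys

# A command that exits non-zero; without check=True the failure is silent.
cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]

result = subprocess.run(cmd)          # returns normally; result.returncode == 3
try:
    subprocess.run(cmd, check=True)   # raises CalledProcessError instead
    raised = False
except subprocess.CalledProcessError as e:
    raised = (e.returncode == 3)
```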
232-232: Style: prefer `assert not df.empty`.

```diff
-    assert df.empty == False, "Benchmark data is empty"
+    assert not df.empty, "Benchmark data is empty"
```

tests/scripts/allreduce_perf/README.md (1)
142-154: Add language to fenced code block.

Specify a language to satisfy markdownlint, e.g., `text`.

````diff
-```
+```text
 data/
 ├── viz/
 ...
````

tensorrt_llm/llmapi/mpi_session.py (1)
477-481: Narrow exception handling.

Catching broad `Exception` hides programming errors. Catch specific exceptions from `future.result()` if needed, or re-raise after logging.

tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py (1)
6-8: Prefer module-namespace imports per guidelines.

Import the module, then reference symbols via the module to keep namespaces clean (tests can be lighter, but consistency helps).

Example:

```diff
-from tensorrt_llm.llmapi.mpi_session import RemoteMpiCommSessionClient
+from tensorrt_llm.llmapi import mpi_session
 ...
-    client = RemoteMpiCommSessionClient(...)
+    client = mpi_session.RemoteMpiCommSessionClient(...)
```

As per coding guidelines.
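On the earlier point about narrowing exception handling around `future.result()`: the stdlib distinguishes timeout, cancellation, and task failure, so each can be caught specifically. A generic sketch (not the mpi_session code; `run_task` and its behavior are assumptions for illustration):

```python
import concurrent.futures


def run_task(fn, *args, timeout=5.0):
    """Run fn in a worker and surface failures with specific exception types."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            # The wait expired; the task may still be running.
            raise RuntimeError("task timed out") from None
        except concurrent.futures.CancelledError:
            raise RuntimeError("task was cancelled") from None
        # Any other exception was raised inside the task body itself;
        # letting it propagate keeps programming errors visible.


ok = run_task(lambda x: x * 2, 21)
```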
cpp/tensorrt_llm/common/customAllReduceUtils.h (2)
99-103: Inconsistent dimension constant.

Tokens list has 15 buckets (1..16384), but `kNumTokensChoice` is 14.

Apply:

```diff
-constexpr int kNumTokensChoice = 14;
+constexpr int kNumTokensChoice = 15;
```
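The 15-bucket count follows from the token buckets being the powers of two 1 through 16384, i.e. 2^0 through 2^14 inclusive. A quick arithmetic check (assuming that bucket layout, as described in the comment):

```python
# Token buckets as described: powers of two from 1 up to 16384.
token_buckets = [2**i for i in range(15)]

assert token_buckets[0] == 1 and token_buckets[-1] == 16384
k_num_tokens_choice = len(token_buckets)  # 15, not 14
```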
120-124: `extern` forward decl for an inline variable is unnecessary.

You define `AllReduceBestStrategyTable` as `inline` later. The `extern` declaration can be dropped.

```diff
-extern const std::unordered_map<int, AllReduceBestStrategyTableType> AllReduceBestStrategyTable;
+// defined below as an inline variable
```

tests/unittest/llmapi/test_mpi_session.py (2)
121-133: Use the parameterized `task_script` (fix unused-arg and test intent).

`task_script` is never used; the test always runs `_run_multi_llm_tasks.py`. Use the param to pick the script.

```diff
-    test_file = os.path.join(cur_dir, "_run_multi_llm_tasks.py")
+    test_file = os.path.join(cur_dir, task_script)
```
136-143: Document/silence subprocess lint (S603) in test context.

Command is a fixed list (not user-controlled). If you want to quiet S603, add a per-call noqa.

```diff
-    with Popen(command,
+    with Popen(command,  # noqa: S603
               env=os.environ,
               stdout=PIPE,
               stderr=PIPE,
```

tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)
46-46: Add a timeout to prevent CI hangs for the new mpirun test.

Other entries use explicit TIMEOUTs. Recommend adding one here.

```diff
-  - unittest/llmapi/test_mpi_session.py::test_llmapi_launch_multiple_tasks
+  - unittest/llmapi/test_mpi_session.py::test_llmapi_launch_multiple_tasks TIMEOUT (120)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (33)
- cpp/tensorrt_llm/common/customAllReduceUtils.h (2 hunks)
- cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu (1 hunks)
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu (1 hunks)
- cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (20 hunks)
- cpp/tensorrt_llm/thop/allreduceOp.cpp (3 hunks)
- requirements.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py (2 hunks)
- tensorrt_llm/_torch/modules/linear.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (6 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
- tensorrt_llm/llmapi/mpi_session.py (9 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (3 hunks)
- tests/integration/defs/conftest.py (2 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_deepseek_v3_lite_empty_batch.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_disaggregated.py (5 hunks)
- tests/integration/defs/perf/test_perf.py (0 hunks)
- tests/integration/defs/test_e2e.py (1 hunks)
- tests/integration/test_lists/qa/llm_function_core.txt (2 hunks)
- tests/integration/test_lists/qa/llm_function_core_sanity.txt (1 hunks)
- tests/integration/test_lists/qa/llm_function_nim.txt (5 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml (2 hunks)
- tests/microbenchmarks/all_reduce.py (3 hunks)
- tests/scripts/allreduce_perf/README.md (1 hunks)
- tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py (1 hunks)
- tests/scripts/allreduce_perf/allreduce_perf_viz.py (1 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (0 hunks)
- tests/unittest/_torch/multi_gpu/test_allreduce.py (2 hunks)
- tests/unittest/llmapi/_run_multi_llm_tasks.py (1 hunks)
- tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (1 hunks)
- tests/unittest/llmapi/test_mpi_session.py (2 hunks)
💤 Files with no reviewable changes (2)
- tests/unittest/_torch/modules/test_fused_moe.py
- tests/integration/defs/perf/test_perf.py
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py
- cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- tests/unittest/_torch/multi_gpu/test_allreduce.py
- tensorrt_llm/_torch/modules/linear.py
- tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
- tests/scripts/allreduce_perf/allreduce_perf_viz.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/disaggregated/test_disaggregated.py
- tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py
- cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu
- tests/integration/defs/test_e2e.py
- tests/integration/defs/conftest.py
- tensorrt_llm/llmapi/mpi_session.py
- tests/microbenchmarks/all_reduce.py
- tests/unittest/llmapi/_run_multi_llm_tasks.py
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- tests/unittest/llmapi/test_mpi_session.py
- cpp/tensorrt_llm/thop/allreduceOp.cpp
- cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py
- tests/unittest/_torch/multi_gpu/test_allreduce.py
- tensorrt_llm/_torch/modules/linear.py
- tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
- tests/scripts/allreduce_perf/allreduce_perf_viz.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/disaggregated/test_disaggregated.py
- tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py
- tests/integration/defs/test_e2e.py
- tests/integration/defs/conftest.py
- tensorrt_llm/llmapi/mpi_session.py
- tests/microbenchmarks/all_reduce.py
- tests/unittest/llmapi/_run_multi_llm_tasks.py
- tests/unittest/llmapi/test_mpi_session.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py
- cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- tests/unittest/_torch/multi_gpu/test_allreduce.py
- tensorrt_llm/_torch/modules/linear.py
- tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
- tests/scripts/allreduce_perf/allreduce_perf_viz.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/disaggregated/test_disaggregated.py
- tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py
- cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu
- tests/integration/defs/test_e2e.py
- tests/integration/defs/conftest.py
- tensorrt_llm/llmapi/mpi_session.py
- tests/microbenchmarks/all_reduce.py
- tests/unittest/llmapi/_run_multi_llm_tasks.py
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- tests/unittest/llmapi/test_mpi_session.py
- cpp/tensorrt_llm/thop/allreduceOp.cpp
- cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- cpp/tensorrt_llm/thop/allreduceOp.cpp
- cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- cpp/tensorrt_llm/thop/allreduceOp.cpp
- cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.
Files:
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- cpp/tensorrt_llm/common/customAllReduceUtils.h
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- cpp/tensorrt_llm/thop/allreduceOp.cpp
**/*.{h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).
Files:
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- cpp/tensorrt_llm/common/customAllReduceUtils.h
🧠 Learnings (3)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
- tests/integration/test_lists/qa/llm_function_nim.txt
- tests/integration/test_lists/qa/llm_function_core.txt
- tests/integration/test_lists/qa/llm_function_core_sanity.txt
- tests/unittest/llmapi/_run_multi_llm_tasks.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
- tests/integration/test_lists/qa/llm_function_nim.txt
- tests/integration/test_lists/qa/llm_function_core.txt
- tests/integration/test_lists/qa/llm_function_core_sanity.txt
- tests/unittest/llmapi/_run_multi_llm_tasks.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- cpp/tensorrt_llm/thop/allreduceOp.cpp
🧬 Code graph analysis (21)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
- tensorrt_llm/llmapi/llm_args.py (1): KvCacheConfig (1199-1333)
- tensorrt_llm/mapping.py (1): Mapping (348-507)
- tensorrt_llm/runtime/generation.py (5): kv_cache_type (1208-1209), dtype (854-855), dtype (1257-1258), num_layers (1173-1176), tokens_per_block (1216-1217)
- cpp/include/tensorrt_llm/executor/types.h (1): DataType (73-658)
tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py (1)
- tests/integration/defs/conftest.py (1): is_sm_100f (1896-1899)
cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1)
- cpp/tensorrt_llm/thop/allreduceOp.cpp (2): op (1029-1029), op (1068-1068)
tests/unittest/_torch/multi_gpu/test_allreduce.py (2)
- tensorrt_llm/functional.py (2): AllReduceParams (3900-3939), AllReduceStrategy (3876-3885)
- tensorrt_llm/_torch/distributed/ops.py (1): AllReduce (455-617)
tensorrt_llm/_torch/modules/linear.py (1)
- tensorrt_llm/_torch/distributed/communicator.py (1): tp_size (63-64)
tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py (2)
- cpp/tensorrt_llm/common/customAllReduceUtils.h (1): tensorrt_llm (28-296)
- tests/scripts/allreduce_perf/allreduce_perf_viz.py (1): main (553-605)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tests/integration/defs/conftest.py (2)
get_sm_version(1890-1893)is_sm_100f(1896-1899)tensorrt_llm/_utils.py (1)
local_mpi_size(557-558)
tests/scripts/allreduce_perf/allreduce_perf_viz.py (2)
cpp/tensorrt_llm/common/customAllReduceUtils.h (1)
tensorrt_llm(28-296)tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py (1)
main(168-251)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
tests/integration/defs/conftest.py (2)
get_sm_version(1890-1893)is_sm_100f(1896-1899)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
tensorrt_llm/_torch/distributed/communicator.py (2)
tp_allgather(394-395)tp_allgather(653-666)tensorrt_llm/_torch/pyexecutor/scheduler.py (1)
batch_size(35-36)
tests/integration/defs/disaggregated/test_disaggregated.py (4)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
env(61-68)tests/integration/defs/trt_test_alternative.py (1)
check_call(250-258)tests/integration/defs/conftest.py (2)
disaggregated_example_root(285-290)llm_venv(702-719)tests/integration/defs/local_venv.py (1)
get_working_directory(43-49)
tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py (3)
tensorrt_llm/executor/utils.py (1)
LlmLauncherEnvs(22-29)tensorrt_llm/llmapi/mpi_session.py (9)
RemoteMpiCommSessionClient(300-405)submit(91-93)submit(148-153)submit(215-231)submit(344-358)submit_sync(96-97)submit_sync(155-160)submit_sync(233-235)submit_sync(362-378)tensorrt_llm/llmapi/utils.py (1)
print_colored(47-63)
cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (1)
cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu (14)
void(35-53)void(60-69)void(71-76)void(90-95)void(98-112)void(130-138)void(140-144)void(179-193)void(195-232)void(279-294)void(296-333)void(389-415)void(417-461)void(463-537)
tests/integration/defs/test_e2e.py (1)
tests/integration/defs/conftest.py (3)
llm_root(192-193)llm_venv(702-719)llm_models_root(80-94)
tests/integration/defs/conftest.py (3)
tensorrt_llm/_utils.py (2)
is_sm_100f(739-742)get_sm_version(733-735)tests/integration/defs/utils/periodic_junit.py (2)
PeriodicJUnitXML(41-342)pytest_configure(120-137)tests/integration/defs/trt_test_alternative.py (2)
print_info(300-306)print_warning(309-315)
tensorrt_llm/llmapi/mpi_session.py (3)
tensorrt_llm/executor/proxy.py (1)
shutdown(369-416)tensorrt_llm/executor/utils.py (1)
shutdown(103-104)tensorrt_llm/llmapi/utils.py (2)
print_colored_debug(66-70)print_colored(47-63)
tests/microbenchmarks/all_reduce.py (4)
tensorrt_llm/_utils.py (3)
local_mpi_rank(553-554)local_mpi_size(557-558)nvtx_range(904-923)tensorrt_llm/functional.py (2)
AllReduceParams(3900-3939)AllReduceStrategy(3876-3885)tensorrt_llm/plugin/plugin.py (2)
CustomAllReduceHelper(542-718)max_workspace_size_auto(582-590)tensorrt_llm/_torch/distributed/ops.py (1)
AllReduce(455-617)
tests/unittest/llmapi/_run_multi_llm_tasks.py (1)
tensorrt_llm/llmapi/utils.py (1)
print_colored(47-63)
cpp/tensorrt_llm/common/customAllReduceUtils.h (2)
cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1)
AllReduceFusionOp(69-195)cpp/tensorrt_llm/thop/allreduceOp.cpp (8)
seq_len(927-970)seq_len(927-927)seq_len(972-981)seq_len(972-972)op(1029-1029)op(1068-1068)message_size(983-996)message_size(983-983)
tests/unittest/llmapi/test_mpi_session.py (2)
tensorrt_llm/llmapi/mpi_session.py (4)
MPINodeState(30-58)MpiPoolSession(136-180)RemoteMpiCommSessionClient(300-405)split_mpi_env(558-599)tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
env(61-68)
cpp/tensorrt_llm/thop/allreduceOp.cpp (2)
cpp/include/tensorrt_llm/common/dataType.h (1)
getDTypeSize(26-44)cpp/tensorrt_llm/common/customAllReduceUtils.h (2)
getMaxRequiredWorkspaceSize(34-45)selectStrategyLookUpTable(125-159)
🪛 Clang (14.0.6)
cpp/tensorrt_llm/common/customAllReduceUtils.h
[error] 19-19: 'tensorrt_llm/common/cudaUtils.h' file not found
(clang-diagnostic-error)
🪛 markdownlint-cli2 (0.18.1)
tests/scripts/allreduce_perf/README.md
142-142: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.0)
tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py
22-22: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
23-23: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
24-27: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
28-28: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
29-33: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
125-125: f-string without any placeholders
Remove extraneous f prefix
(F541)
218-218: subprocess call: check for execution of untrusted input
(S603)
232-232: Avoid equality comparisons to `False`; use `if not df.empty:` for false checks
Replace with `not df.empty`
(E712)
tests/scripts/allreduce_perf/allreduce_perf_viz.py
149-149: f-string without any placeholders
Remove extraneous f prefix
(F541)
239-239: Unpacked variable fig is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
303-303: f-string without any placeholders
Remove extraneous f prefix
(F541)
313-313: f-string without any placeholders
Remove extraneous f prefix
(F541)
535-535: f-string without any placeholders
Remove extraneous f prefix
(F541)
538-538: f-string without any placeholders
Remove extraneous f prefix
(F541)
573-573: Undefined name fusion_op
(F821)
tensorrt_llm/llmapi/mpi_session.py
279-279: f-string without any placeholders
Remove extraneous f prefix
(F541)
310-310: Unused static method argument: hmac_key
(ARG004)
401-401: f-string without any placeholders
Remove extraneous f prefix
(F541)
404-404: Unused method argument: grace
(ARG002)
404-404: Unused method argument: reason
(ARG002)
477-477: Do not catch blind exception: Exception
(BLE001)
tests/microbenchmarks/all_reduce.py
124-124: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
140-140: Avoid specifying long messages outside the exception class
(TRY003)
tests/unittest/llmapi/test_mpi_session.py
123-123: Unused function argument: task_script
(ARG001)
136-136: subprocess call: check for execution of untrusted input
(S603)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (29)
tests/integration/test_lists/qa/llm_function_nim.txt (1)
1-459: Test list update looks good. This is a configuration file with no production code. The changes reflect intentional test suite expansion and reorganization for the NIM category per the mass integration objectives. All test identifiers follow proper pytest syntax, and the addition of new model variants (Llama4 Maverick/Scout, Qwen3, DeepSeek, Nemotron) and quantization strategies (fp8, nvfp4, fp4, etc.) is consistent with the PR summary.
tests/unittest/llmapi/_run_multi_llm_tasks.py (1)
1-13: Add required NVIDIA Apache-2.0 copyright header. The file is missing the mandatory NVIDIA Apache-2.0 copyright header that must be prepended to all Python source files.
As per coding guidelines.
Add the copyright header at the top of the file:

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import os
 import sys
```

⛔ Skipped due to learnings

Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

tests/microbenchmarks/all_reduce.py (5)
17-17: LGTM! Imports are appropriate. The new imports (`product`, `pandas`, additional utilities from `_utils`, and `CustomAllReduceHelper`) are all used in the refactored benchmarking logic.
Also applies to: 21-21, 32-36
65-116: LGTM! Profiling logic is sound. The function correctly implements:
- CUDA graph capture when enabled with proper warmup
- Event-based timing with median calculation to reduce noise
- MPI barrier for cross-rank synchronization
- Delay kernel to mitigate host overhead
- Correctness verification for the non-fused case
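The median-of-samples timing idea generalizes beyond CUDA events; here is a host-side sketch with hypothetical helper names (the benchmark itself times with CUDA events and a delay kernel, not wall-clock time):

```python
import statistics
import time

def benchmark_median(fn, warmup=3, iters=10):
    """Run fn a few times untimed (warmup), then report the median of
    per-iteration wall-clock times; the median is robust to occasional
    outlier iterations, which is why the benchmark prefers it to the mean."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

For example, `benchmark_median(lambda: sum(range(10_000)))` returns a single representative latency in seconds.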
147-169: LGTM! Shape generation logic is correct. The code properly handles both 2D exploration (using `product` of predefined lists) and linear exploration (geometric progression from `test_range`). The assertion at line 169 ensures the size calculation remains consistent.
Optional: Consider adding validation for the `test_range` format at line 158 to provide clearer error messages if an invalid format is passed via CLI.
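The two exploration modes can be sketched as follows; the default lists and the "min,max,ratio" range format here are illustrative assumptions, not the script's exact values:

```python
from itertools import product

def gen_shapes(explore_2d, seq_lens=None, hidden_sizes=None,
               test_range="256,25600,2"):
    """Yield (seq_len, hidden_size) pairs to benchmark.

    2D mode: cross-product of the predefined lists.
    Linear mode: geometric progression of total sizes parsed from a
    'min,max,ratio' string, reported against a fixed hidden size of 1.
    """
    if explore_2d:
        yield from product(seq_lens or [1, 4, 16],
                           hidden_sizes or [1024, 4096])
    else:
        lo, hi, ratio = (int(x) for x in test_range.split(","))
        size = lo
        while size <= hi:
            yield (size, 1)
            size *= ratio
```

A malformed `test_range` string raises a bare `ValueError` from `int()`, which is exactly the case the optional validation suggestion above would make friendlier.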
171-240: LGTM! Benchmark loop is well-structured with appropriate gating. The logic correctly:
- Validates message size against workspace limits (line 188)
- Creates appropriate test tensors for each configuration
- Applies sensible gating (TWOSHOT requires sufficient tokens, NVFP4 requires SM100+, AUTO is opt-in)
- Collects results only on rank 0
Optional: For better performance, consider collecting results in a list of dictionaries and creating the DataFrame once after the loop instead of using `pd.concat` repeatedly (line 228), which can be slow for large iteration counts.
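The suggested accumulation pattern, sketched with pandas (the loop bounds and column names are made up for illustration):

```python
import pandas as pd

rows = []
for seq_len in (1, 16, 256):  # stand-in for the benchmark sweep
    rows.append({"seq_len": seq_len, "latency_ms": 0.1 * seq_len})

# Build the frame once at the end; calling pd.concat inside the loop
# re-copies all previously collected rows on every iteration, making
# the whole sweep O(n^2) in the number of configurations.
df = pd.DataFrame(rows)
```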
242-251: LGTM! Output handling and CLI integration are correct. The changes properly:
- Configure pandas to display all results
- Save to CSV when requested (rank 0 only)
- Add CLI arguments for all new features
- Pass arguments through to the benchmark function
Also applies to: 254-276
tensorrt_llm/_torch/modules/linear.py (2)
1369-1369: LGTM! Corrected error message to use function parameter. The error message now correctly references the `in_features` function parameter instead of `self.in_features`. This is more accurate since `create_weights` is a method of the linear method class, not the `Linear` module itself.
1468-1469: LGTM! Fixed distributed weight loading for fused QKV. The updated call now passes `tp_size`, `tp_rank`, and `tp_mode` to `load_weight_scales`, ensuring proper tensor-parallel sharding of weight scales for fused QKV layers. This aligns with how other quantization methods (e.g., `NVFP4LinearMethod`) handle distributed weight loading.
cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (3)
62-65: Entry grid-dependency wait looks correct. `cudaGridDependencySynchronize()` gated on `CUDA_ARCH >= 900` is appropriate at kernel entry. No issues.
Also applies to: 143-146
129-133: End-of-kernel programmatic completion is consistent. Triggering completion after the final flag write is aligned with PDL expectations (with the fences added as suggested above). Looks good.
Ensure callers set cudaLaunchAttributeProgrammaticStreamSerializationAllowed (already done in LaunchConfig) whenever these kernels are launched via callranks/callranksMC.
Also applies to: 239-241, 876-877, 1068-1069, 1162-1163, 1219-1220
368-379: Graceful handling for SM < 90 paths. Explicit `brkpt` with a clear message is fine for unsupported architectures.
Also applies to: 1222-1277
tests/integration/test_lists/qa/llm_function_core.txt (1)
698-701: New Llama-3.3-70B FP8 PP-enabled entries look fine. Duplication across lists is acceptable for different execution contexts. No action needed.
Based on learnings
tests/unittest/_torch/multi_gpu/test_allreduce.py (1)
27-30: Import updates align with new API surface. AllReduceStrategy/Params usage is consistent with functional changes.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1013-1022: Good centralization of queuing logic. Consolidating the ADP gating into `_can_queue()` improves readability and parity across loops.
tests/scripts/allreduce_perf/README.md (1)
105-112: Docs/code mismatch: no logarithmic scaling implemented. Either remove "logarithmic scaling" from the docs or add `norm=LogNorm()` to the heatmaps.
tensorrt_llm/llmapi/mpi_session.py (3)
265-277: Reusing a global MPICommExecutor is reasonable; ensure predictable teardown. The COMM_WORLD path holds a process-global executor/pool without calling `__exit__()`. Confirm expectations at process shutdown (e.g., an atexit hook) so ranks don't hang on interpreter teardown.
Also applies to: 281-285
396-406: No-op shutdown can leak resources. If LLM instances are created/destroyed repeatedly, the PAIR socket may persist. Consider making `shutdown()` idempotent and reference-counted per address, or document the lifecycle assumptions.
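A reference-counted, idempotent shutdown per address could be sketched like this (illustrative class and method names, not the module's actual API):

```python
import threading

class SocketRegistry:
    """Track one shared resource per address with a refcount. The real
    resource (e.g., a ZMQ PAIR socket) is closed only when the last
    user calls shutdown(); extra shutdown() calls are no-ops, which
    makes the method idempotent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._refs = {}  # address -> refcount

    def acquire(self, address):
        with self._lock:
            self._refs[address] = self._refs.get(address, 0) + 1

    def shutdown(self, address):
        """Return True only when this call actually released the resource."""
        with self._lock:
            count = self._refs.get(address, 0)
            if count == 0:
                return False  # already closed: no-op
            if count == 1:
                del self._refs[address]
                return True   # last user: close the underlying socket here
            self._refs[address] = count - 1
            return False
```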
468-485: Nice: wait for in-flight futures before next task. This barrier avoids interleaving tasks across ranks and reduces synchronization issues.
Also applies to: 503-503
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
371-381: SM100f gate change looks good; keep a fallback path. Switching to `is_sm_100f()` for DeepGemm vs Cutlass selection is correct; ensure `_utils.is_sm_100f` is public and unit-tested.
8-9: The import of `is_sm_100f` is valid—the function exists in `tensorrt_llm._utils`. Verification confirms `is_sm_100f` is defined at line 739 of `tensorrt_llm/_utils.py`. The import in `fused_moe_wide_ep.py` is correct and requires no changes.
Likely an incorrect or invalid review comment.
cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1)
109-131: Stringification helpers for AllReduceStrategyType — LGTM. Consistent with existing `AllReduceFusionOp` helpers; aids logging/debug.
tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py (2)
225-233: Selector logic update — LGTM. Using `is_sm_100f()` centralizes the SM100f check; behavior unchanged otherwise.
21-21: No action required; dependency already properly established. The import `from tensorrt_llm._utils import is_sm_100f` at line 21 of moe_op.py is valid. The function `is_sm_100f` is already defined in tensorrt_llm/_utils.py at line 739, and both moe_op.py (line 227) and WideEPMoE (fused_moe_wide_ep.py line 377) successfully import and use this single-source dependency. No runtime ImportError risk exists, and no duplication or missing exports are present.
149-149: DeepSeek V3-Lite bf16 empty-batch test entry looks good. No issues from this addition.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
19-19: Import of `is_sm_100f` is correct and aligns with conftest. Looks good.
2282-2285: Switch to `is_sm_100f()` for SM100f gating. Appropriate replacement for previous ad-hoc checks; keeps intent explicit.
2333-2341: SM100f branch logic LGTM. Defaulting MoE backend to DEEPGEMM and tuning memory when `is_sm_100f()` is true is consistent with the new helper.
tests/integration/defs/disaggregated/test_configs/disagg_config_deepseek_v3_lite_empty_batch.yaml (1)
1-61: Config looks consistent for the empty-batch DeepSeek V3-Lite bf16 scenario. Values and splits (ctx/gen, ports, kv cache fractions) are reasonable.
3837379 to b50566a
/bot run --disable-fail-fast
PR_Github #21926 [ run ] triggered by Bot. Commit:
PR_Github #21926 [ run ] completed with state
b50566a to cd195b6
/bot run --disable-fail-fast
PR_Github #22064 [ run ] triggered by Bot. Commit:
cd195b6 to ca5705d
/bot run --disable-fail-fast
PR_Github #22068 [ run ] triggered by Bot. Commit:
PR_Github #22064 [ run ] completed with state
PR_Github #22068 [ run ] completed with state
ca5705d to 55ea317
/bot run --disable-fail-fast
PR_Github #22211 [ run ] triggered by Bot. Commit:
PR_Github #22211 [ run ] completed with state
55ea317 to e251c7c
/bot run --disable-fail-fast
PR_Github #22311 [ run ] triggered by Bot. Commit:
PR_Github #23083 [ run ] triggered by Bot. Commit:
PR_Github #23083 [ run ] completed with state
a078bf3 to 590e173
/bot run --disable-fail-fast
PR_Github #23204 [ run ] triggered by Bot. Commit:
590e173 to 2185600
/bot run --disable-fail-fast
PR_Github #23210 [ run ] triggered by Bot. Commit:
PR_Github #23204 [ run ] completed with state
PR_Github #23210 [ run ] completed with state
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…VIDIA#8357) Signed-off-by: Stanley Sun <stsun@nvidia.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…ention DP with disagg (NVIDIA#8372) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…acy issue (NVIDIA#8318) Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…VIDIA#7870) Because we encountered some perf regression from using a one-shot kernel instead of NCCL on A100/H100, it is beneficial to have solid benchmarking of the allreduce op and to analyze the data collected from it.
Implemented new AllreduceOp heuristics:
- Added a linear-programming-based heuristic implementation.
- Added a LUT-based heuristic implementation and the corresponding code-generation script.
Minor AllreduceOp fixes:
- Fixed a minor issue in AllreduceOp where the strategy could not be overridden when ONESHOT or TWOSHOT is set.
- Fixed a minor TWOSHOT kernel perf issue.
- Cleaned up dispatching code in AllReduceOp.
This PR fixes the perf gaps reported in https://nvbugspro.nvidia.com/bug/5517023. For DeepSeek-R1, it shows a performance gain of about 3-4% at concurrency levels of 256 and 512. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
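At its core, the LP-based heuristic reduces to comparing the allreduce message size against tuned thresholds. A sketch with illustrative cutoffs (the shipped thresholds are derived per SM architecture and TP size from benchmark data; all names and values here are hypothetical):

```python
def select_strategy(seq_len, hidden_size, dtype_bytes=2, world_size=8,
                    oneshot_max=256 * 1024, twoshot_max=16 * 1024 * 1024):
    """Pick an allreduce strategy from the message size in bytes.
    Threshold values are made up for illustration, not the tuned ones."""
    message_size = seq_len * hidden_size * dtype_bytes
    if message_size <= oneshot_max:
        return "ONESHOT"   # small, latency-bound: single-phase custom kernel
    if message_size <= twoshot_max and seq_len >= world_size:
        return "TWOSHOT"   # bandwidth-bound: reduce-scatter + all-gather
    return "NCCL"          # large messages: fall back to NCCL
```

The TWOSHOT branch also requires enough tokens to split across ranks, mirroring the gating the benchmark applies (TWOSHOT needs sufficient tokens).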
…st.py (NVIDIA#8388) Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com> Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…les in W4A16 AWQ (NVIDIA#8432) Signed-off-by: Daniel Afrimi <dafrimi@nvidia.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…VIDIA#8455) Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2185600 to a906068
/bot reuse-pipeline
PR_Github #23410 [ reuse-pipeline ] triggered by Bot. Commit:
PR_Github #23410 [ reuse-pipeline ] completed with state
Description
Another batch of mass-integration commits from the release/1.1 branch.
Test Coverage
N/A
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.

Details

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Performance Improvements