
[None][chore] Weekly mass integration of release/1.1 #8508

Merged
chzblych merged 11 commits into NVIDIA:main from mikeiovine:mass-integrate-1.1
Nov 4, 2025
Conversation


@mikeiovine mikeiovine commented Oct 20, 2025

Description

Another mass integration of commits from the release/1.1 branch.

Test Coverage

N/A

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevents the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.
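For reference, typical invocations posted as PR comments might look like the following (all flags are taken from the help text above; the stage names, GPU types, and comment text are illustrative):

```text
/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe"
/bot run --stage-list "A10-PyTorch-1" --test-backend "pytorch"
/bot run --reuse-test --post-merge
/bot skip --comment "Docs-only change; CI not required"
/bot reuse-pipeline
/bot kill
```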

Summary by CodeRabbit

  • New Features

    • Added support for DeepSeek-V3-Lite model configurations
    • Implemented adaptive AllReduce strategy selection for improved communication efficiency
    • Enhanced MPI session management with global executor pooling and singleton patterns
  • Performance Improvements

    • Optimized GPU synchronization primitives for collective communication
    • Added lookup-table driven strategy selection for AllReduce operations based on tensor parallel size and model parameters

@mikeiovine mikeiovine requested review from a team as code owners October 20, 2025 14:44
@mikeiovine mikeiovine requested review from QiJune and syuoni October 20, 2025 14:44

coderabbitai bot commented Oct 20, 2025

📝 Walkthrough

This PR introduces heuristic-based and lookup-table-driven AllReduce strategy selection mechanisms, adds SM100f GPU architecture support utilities, optimizes GPU kernel synchronization with inline PTX assembly, centralizes MPI session management through global executors and singleton patterns, and expands test coverage with new integration tests, configurations, and benchmarking tools.

Changes

Cohort / File(s) Summary
AllReduce Strategy Selection
cpp/tensorrt_llm/common/customAllReduceUtils.h
Introduces HeuristicThresholdLP (SM version/world size to threshold mapping), SelectStrategyLP (heuristic-based TWOSHOT/ONESHOT selection), and AllReduceBestStrategyTable lookup mechanism with pre-computed 4D tables for SM90/SM100, enabling strategy selection via selectStrategyLookUpTable based on token count, hidden size, fusion op, and TP size.
GPU Kernel Assembly Optimizations
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu, cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu
Replaces software-flag stores with inline PTX assembly (st.global.relaxed.sys.b32), adds release fence (fence.release.sys), and introduces griddepcontrol.wait/launch_dependents assembly for refined GPU grid dependency management and synchronization.
AllReduce Kernel Helpers
cpp/tensorrt_llm/kernels/customAllReduceKernels.h
Adds operator<< and toString overloads for AllReduceStrategyType string representation.
SM100f Architecture Support
tests/integration/defs/conftest.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py, tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py, tests/integration/defs/accuracy/test_llm_api_pytorch.py
Introduces is_sm_100f() utility returning true for SM 100/103; replaces direct get_sm_version() checks with is_sm_100f() calls for DeepGemm and FP8 block scales path selection.
MPI Session Global Management
tensorrt_llm/llmapi/mpi_session.py
Adds MPINodeState with _global_comm_executor/_global_mpi_pool for session reuse, RemoteMpiCommSessionClient singleton pattern (_global_instance, _global_instance_lock, __new__) for per-address instances, pending_futures coordination in server paths, and conditional pool shutdown based on owns_mpi_pool flag.
Executor and Resource Refactoring
tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py
Centralizes can-queue logic into _can_queue helper, adds KVCacheManager.from_model_config classmethod for construction from model config.
Linear Module Updates
tensorrt_llm/_torch/modules/linear.py
Fixes validation error message to use function argument in_features; extends load_weights_fused_qkv_linear to pass tp_size, tp_rank, tp_mode to load_weight_scales.
AllReduce Op Refactoring
cpp/tensorrt_llm/thop/allreduceOp.cpp
Replaces getRuntimeStrategy/logRunTimeStrategy with selectImplementation, integrates hidden_size into strategy decision logic, updates MIN_LATENCY path to consider hidden_size for oneshot/twoshot selection.
GPU Kernel Synchronization
cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu
Replaces cudaTriggerProgrammaticLaunchCompletion with cudaGridDependencySynchronize in multiple kernel paths, restructures synchronization timing and placement.
New Integration Tests
tests/integration/defs/disaggregated/test_configs/disagg_config_deepseek_v3_lite_empty_batch.yaml, tests/integration/defs/disaggregated/test_disaggregated.py, tests/integration/defs/test_e2e.py, tests/integration/defs/perf/test_perf.py
Adds DeepSeek-V3-Lite bf16 configuration, test_disaggregated_deepseek_v3_lite_bf16_empty_batch with configurable parameters (num_ranks, concurrency, lengths), test_ptp_quickstart_advanced_pp_enabled for PP/TP sizing and CUDA graph modes.
AllReduce Performance Tools
tests/scripts/allreduce_perf/README.md, tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py, tests/scripts/allreduce_perf/allreduce_perf_viz.py, tests/microbenchmarks/all_reduce.py
Introduces code generation tool for AllReduce lookup tables from benchmark data, visualization scripts for 2D heatmaps/strategy comparisons, profile_allreduce function encapsulating benchmark workflow.
Test Lists and Configuration
tests/integration/test_lists/qa/llm_function_core.txt, tests/integration/test_lists/qa/llm_function_core_sanity.txt, tests/integration/test_lists/qa/llm_function_nim.txt, tests/integration/test_lists/test-db/l0_dgx_h100.yml
Adds new test entries for FP8, TP/PP variants, DeepSeek-V3-Lite, removes obsolete tests, reorganizes test suite composition.
Unit Test Updates
tests/unittest/_torch/modules/test_fused_moe.py, tests/unittest/_torch/multi_gpu/test_allreduce.py, tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py, tests/unittest/llmapi/test_mpi_session.py
Re-enables skipped FP8 test, adds allreduce_strategy parameter to Linear, introduces multi-task MPI/LLM test via subprocess orchestration.
New Test Utilities
tests/unittest/llmapi/_run_multi_llm_tasks.py, tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py
Adds scripts for multi-LLM runs with TP2 and multi-task MPI submissions (sync/async).
Dependencies
requirements.txt
Adds numba-cuda>=0.19.0 with WAR comment for nvbugs/5501820.
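The table-driven strategy selection summarized above for customAllReduceUtils.h can be sketched in Python. Everything here is illustrative: the bucket boundaries, table contents, and names such as BEST_STRATEGY_TABLE are stand-ins for the tuned, pre-computed SM90/SM100 tables in the actual header; only the overall shape follows the walkthrough (bucket the token count and hidden size, index a per-(SM, TP, fusion) table, fall back to NCCL when no tuned entry exists).

```python
import bisect

# Illustrative bucket boundaries; the real header pre-computes tables per SM arch.
TOKEN_BUCKETS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384]
HIDDEN_BUCKETS = [128, 256, 512, 1024, 2048, 4096, 8192]

NCCL, ONESHOT, TWOSHOT = 0, 1, 2

# table[(sm, tp_size, fusion_op)] -> grid indexed by [hidden_bucket][token_bucket].
# A single toy entry here; the real tables cover SM90/SM100 and several TP sizes.
BEST_STRATEGY_TABLE = {
    (90, 8, "NONE"): [[ONESHOT] * 8 + [TWOSHOT] * 7 for _ in HIDDEN_BUCKETS],
}

def bucket_index(value, buckets):
    """Map a value to the largest bucket <= value, clamped to the table bounds."""
    i = bisect.bisect_right(buckets, value) - 1
    return max(0, min(i, len(buckets) - 1))

def select_strategy(sm, tp_size, fusion_op, num_tokens, hidden_size):
    """Pick a strategy from the lookup table, defaulting to NCCL if untuned."""
    table = BEST_STRATEGY_TABLE.get((sm, tp_size, fusion_op))
    if table is None:
        return NCCL  # no tuned entry for this configuration
    h = bucket_index(hidden_size, HIDDEN_BUCKETS)
    t = bucket_index(num_tokens, TOKEN_BUCKETS)
    return table[h][t]
```

With this toy table, small token counts map to ONESHOT and large ones to TWOSHOT, while any untabulated (SM, TP, fusion) combination falls back to NCCL, matching the fallback path shown in the sequence diagram below.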

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant selectImplementation
    participant SelectStrategyLP
    participant selectStrategyLookUpTable
    participant AllReduceBestStrategyTable

    User->>selectImplementation: seq_len, hidden_size
    alt Auto/LP strategy
        selectImplementation->>SelectStrategyLP: seq_len, hidden_size, world_size, op
        SelectStrategyLP->>SelectStrategyLP: Compare message_size to thresholds
        SelectStrategyLP-->>selectImplementation: ONESHOT or TWOSHOT
    else Fallback
        selectImplementation->>selectStrategyLookUpTable: num_tokens, hidden_size, op, tp_size
        selectStrategyLookUpTable->>AllReduceBestStrategyTable: Lookup by SM, TP, op, hidden_size, tokens
        AllReduceBestStrategyTable-->>selectStrategyLookUpTable: Strategy index
        selectStrategyLookUpTable-->>selectImplementation: NCCL (default) or TWOSHOT/ONESHOT
    end
    selectImplementation-->>User: AllReduceStrategyType
sequenceDiagram
    participant App
    participant RemoteMpiCommSessionClient as Client (Singleton)
    participant _global_instance
    participant RemoteMpiCommSessionServer as Server

    App->>Client: new Client(addr)
    activate Client
    Client->>Client: Check _global_instance_lock
    alt Instance not cached
        Client->>_global_instance: Create new instance
        Client->>Client: Set _initialized flag
    else Instance cached
        Client->>_global_instance: Return existing instance
    end
    deactivate Client
    Client-->>App: Singleton instance
    
    App->>Client: submit_sync(task)
    Client->>Server: Send task
    Server->>Server: Append future to pending_futures
    Server->>Server: Wait for prior pending_futures
    Server->>Server: Execute task
    Server-->>Client: Result
    Client-->>App: Result
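The singleton flow in the diagram above can be sketched as follows. This is a simplified stand-in, not the actual mpi_session.py code: the real class is RemoteMpiCommSessionClient and details such as socket setup and task submission are omitted; only the _global_instance / _global_instance_lock / __new__ / _initialized pattern from the walkthrough is mirrored (here with a per-address dict).

```python
import threading

class RemoteClient:
    """Toy per-address singleton mirroring the __new__-based caching above."""
    _global_instance = {}            # address -> cached instance
    _global_instance_lock = threading.Lock()

    def __new__(cls, address):
        with cls._global_instance_lock:
            inst = cls._global_instance.get(address)
            if inst is None:
                inst = super().__new__(cls)
                inst._initialized = False
                cls._global_instance[address] = inst
            return inst

    def __init__(self, address):
        # __init__ runs on every construction; the flag keeps setup one-time.
        if self._initialized:
            return
        self.address = address
        self._initialized = True
```

Constructing the class twice with the same address returns the cached instance, while a different address yields a distinct one.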

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: The PR spans multiple heterogeneous domains—GPU kernel assembly optimizations, AllReduce strategy selection mechanics (heuristic + lookup tables), MPI session management with singleton patterns, executor refactoring, and extensive test infrastructure. While individual changes are localized, they require separate reasoning for correctness (PTX semantics, strategy thresholds, singleton lifecycle, test configuration). The lookup tables and heuristic logic are dense; GPU kernel changes involve low-level synchronization semantics; MPI/singleton management introduces lifecycle concerns. The test additions are substantial but largely homogeneous.

Suggested reviewers

  • byshiue
  • yizhang-nv
  • nv-guomingz
  • syuoni
  • tongyuantongyu

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 17.20%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check ✅ Passed: The PR title "[None][chore] Weekly mass integration of release/1.1" accurately describes the primary nature of the changeset. This is a mass integration commit series from the release/1.1 branch into main, and the title correctly captures that intent. While the title doesn't enumerate the specific technical changes across the many files (which is appropriate for a broad integration PR), it clearly identifies what the changeset represents: a periodic integration of changes from the release/1.1 branch. The title is concise, avoids vague terminology, and gives reviewers a clear understanding of the PR's scope at a glance.
  • Description Check ✅ Passed: The PR description follows the required template structure with all major sections present: Description, Test Coverage, and PR Checklist. The Description section, while minimal, appropriately characterizes this as another mass integration of commits from the release/1.1 branch, which aligns with the PR's nature. The Test Coverage section correctly marks tests as "N/A" since mass integrations typically reuse testing from the source branch rather than introducing new test cases. The PR Checklist is complete, with all review items listed and the final confirmation checkbox marked. For a mass integration PR, this level of detail is appropriate and sufficient.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 27

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tests/microbenchmarks/all_reduce.py (1)

1-1: Update copyright year to include 2025.

The copyright header should include the current year (2025) per the coding guidelines.

Apply this diff:

-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Based on coding guidelines.

cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (1)

196-199: Use system-scope fence before programmatic completion on Ampere path.

__threadfence() may be insufficient for inter-GPU/userbuffer visibility when paired with cudaTriggerProgrammaticLaunchCompletion(). Prefer __threadfence_system() for release semantics.

Apply:

-    if (threadIdx.x == 0)
-        __threadfence();
+    if (threadIdx.x == 0)
+        __threadfence_system();
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

1132-1145: Prepare resources after the post-connector can_queue recheck.

In _executor_loop, prepare_resources() runs before recomputing can_queue when kv_connector_manager may mutate scheduled_batch. If the batch becomes empty, you’ve done unnecessary prep and may transiently over-allocate.

Move prepare_resources() (and first-token handling) below the recheck block. Same comment applies to _executor_loop_overlap. Example:

- if can_queue:
-     if self.kv_cache_transceiver:
-         self._prepare_disagg_gen_transmission_complete(scheduled_batch)
-         self._handle_first_token_response(scheduled_batch)
-     self.resource_manager.prepare_resources(scheduled_batch)
-     self._kv_connector_start_batch(scheduled_batch)
  # if using a kv connector, we need to call can_queue again since scheduled_batch might have changed
  if self.kv_connector_manager:
      can_queue = self._can_queue(scheduled_batch)
  if can_queue:
+     if self.kv_cache_transceiver:
+         self._prepare_disagg_gen_transmission_complete(scheduled_batch)
+         self._handle_first_token_response(scheduled_batch)
+     self.resource_manager.prepare_resources(scheduled_batch)
+     self._kv_connector_start_batch(scheduled_batch)
      ...

And mirror this reorder in _executor_loop_overlap.

Also applies to: 1145-1150


1298-1313: Mirror the same resource-prep reorder in overlap loop.

Defer prepare_resources() until after the second can_queue evaluation when a kv connector is present to avoid prepping an empty batch.

Also applies to: 1309-1311

🧹 Nitpick comments (25)
tests/unittest/llmapi/_run_multi_llm_tasks.py (2)

19-19: Consider validating GPU availability for tensor_parallel_size=2.

The script requires 2 GPUs but doesn't validate availability upfront. If fewer GPUs are available, the failure will occur later with a less clear error message.

Add a check at the start of the script or function:

import torch

# At module level or in run_llm_tp2
if torch.cuda.device_count() < 2:
    raise RuntimeError(f"This script requires 2 GPUs, but only {torch.cuda.device_count()} available")

32-33: Consider adding top-level error handling.

The script has no error handling at the entry point. Adding a try/except block would provide clearer error messages if the script fails.

Wrap the call in error handling:

 if __name__ == "__main__":
-    run_multi_llm_tasks()
+    try:
+        run_multi_llm_tasks()
+    except Exception as e:
+        print_colored(f"Error: {e}\n", "red")
+        sys.exit(1)
tests/microbenchmarks/all_reduce.py (2)

52-52: Consider moving logger configuration out of the profiling function.

Setting the logger level inside profile_allreduce means it's called repeatedly during benchmarking. Consider setting it once in allreduce_benchmark (line 127) or at module level.

Apply this diff:

 def profile_allreduce(
     mapping: Mapping,
     enable_cudagraph: bool = False,
     inner_loop=200,
     outer_loop=10,
     strategy=AllReduceStrategy.NCCL,
     fusion=AllReduceFusionOp.NONE,
     input=None,
     residual=None,
     norm=None,
     scale=None,
     bias=None,
 ):
-    tllm.logger.set_level('error')
-
     allreduce_params = AllReduceParams(

And ensure it's set once in allreduce_benchmark at line 127 (which already exists).


39-51: Consider adding a docstring to document the profiling function.

A Google-style docstring would help document the purpose, parameters, and return value of this new public function.

Example:

def profile_allreduce(
    mapping: Mapping,
    enable_cudagraph: bool = False,
    inner_loop=200,
    outer_loop=10,
    strategy=AllReduceStrategy.NCCL,
    fusion=AllReduceFusionOp.NONE,
    input=None,
    residual=None,
    norm=None,
    scale=None,
    bias=None,
):
    """Profile a single AllReduce configuration.
    
    Args:
        mapping: Tensor parallelism mapping configuration.
        enable_cudagraph: Whether to use CUDA graph capture for profiling.
        inner_loop: Number of iterations per timing measurement.
        outer_loop: Number of timing measurements to compute median.
        strategy: AllReduce strategy to benchmark.
        fusion: Fusion operation to apply with AllReduce.
        input: Input tensor for AllReduce.
        residual: Optional residual tensor for fusion.
        norm: Optional RMSNorm module for fusion.
        scale: Optional scale tensor for quantization fusion.
        bias: Optional bias tensor for fusion.
        
    Returns:
        float: Median runtime in milliseconds per iteration.
    """

Based on coding guidelines.

tests/integration/defs/test_e2e.py (2)

2351-2353: Use the existing constant for KV cache fraction

Keeps CLI composition consistent with the rest of the file.

-        "--kv_cache_fraction=0.5",
+        f"--kv_cache_fraction={_MEM_FRACTION_50}",

2359-2359: Ensure failures surface consistently

Prefer the helper used elsewhere so non‑zero exits fail the test.

-    llm_venv.run_cmd(cmd)
+    venv_check_call(llm_venv, cmd)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

376-397: Validate assumptions and consider adding docstring.

The new from_model_config classmethod provides a convenient alternative constructor, but several concerns warrant attention:

  1. Homogeneous KV heads assumption (line 390): model_config.num_kv_heads(0) assumes layer 0 exists and is representative of all layers. If num_attention_layers returns 0 or layers have heterogeneous KV heads, this will produce incorrect results or fail silently. Consider validating that num_attention_layers > 0 before accessing num_kv_heads(0).

  2. Missing docstring: Add a docstring explaining the method's purpose, when to use it vs. the regular constructor, and documenting the homogeneous KV cache assumption.

  3. Limited parameter exposure: Several optional __init__ parameters (spec_config, layer_mask, max_num_tokens, max_beam_width, is_draft, kv_connector_manager) default to None/0/False, which may limit the method's utility for more complex configurations. Consider whether these should be exposed or documented as limitations.

Consider adding validation:

 @classmethod
 def from_model_config(cls,
                       model_config: ModelConfigCpp,
                       kv_cache_config: KvCacheConfig,
                       mapping: Mapping,
                       kv_cache_type: CacheTypeCpp = CacheTypeCpp.SELF,
                       dtype: DataType = DataType.HALF) -> "KVCacheManager":
+    """
+    Construct a KVCacheManager from model and KV cache configurations.
+    
+    Assumes homogeneous KV cache (all layers have the same number of KV heads).
+    For more complex configurations (e.g., speculative decoding, heterogeneous layers),
+    use the standard __init__ constructor.
+    """
+    num_layers = model_config.num_attention_layers(mapping.pp_size)
+    if num_layers <= 0:
+        raise ValueError(f"num_attention_layers must be > 0, got {num_layers}")
     return cls(
         kv_cache_config,
         kv_cache_type,
-        num_layers=model_config.num_attention_layers(mapping.pp_size),
+        num_layers=num_layers,
         # NOTE: this preserves existing behavior in KV cache manager.
         # But we should change this to pass a list at some point.
         # We're assuming the KV cache is homogeneous here.
         num_kv_heads=model_config.num_kv_heads(0),
         head_dim=model_config.size_per_head,
         tokens_per_block=model_config.tokens_per_block,
         max_seq_len=model_config.max_seq_len,
         max_batch_size=model_config.max_batch_size,
         mapping=mapping,
         dtype=dtype)
cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (1)

691-706: Handshake toggle consistency for oneshot variants.

Non-oneshot kernels advance reduce_id twice (next_flag(*reduceidptr) then next_flag(reduce_id)) before the trailing barrier; oneshot variants skip the second toggle. This may deadlock subsequent launches depending on the consumer’s expectation.

Consider adding the second toggle after multi_gpu_block_barrier(...), mirroring other kernels:

@@
-        multi_gpu_block_barrier(reduce_id, (int volatile*) &myptr[targetgpu]);
+        multi_gpu_block_barrier(reduce_id, (int volatile*) &myptr[targetgpu]);
+        // Advance to the next phase for trailing handshake
+        reduce_id = next_flag(reduce_id);

Please confirm the intended protocol; if a single-phase handshake is by design for oneshot, document it inline to avoid regressions.

Also applies to: 890-905, 1084-1097

tests/unittest/_torch/multi_gpu/test_allreduce.py (1)

120-129: Harmonize strategy between Linear and separate AllReduce to avoid confounds.

Linear uses NCCL, while AllReduce() defaults to AUTO; this can mask issues or introduce noise in assertions.

Apply:

     linear = Linear(
         ...
         allreduce_strategy=AllReduceStrategy.NCCL,
     ).cuda()
-    allreduce = AllReduce(mapping=mapping)
+    allreduce = AllReduce(mapping=mapping, strategy=AllReduceStrategy.NCCL)

If you intend to validate AUTO vs NCCL equivalence, add a parametrized strategy to test both explicitly.

Also applies to: 140-153

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1015-1019: Minor: avoid object all-gather overhead.

tp_allgather on a Python int uses all_gather_object. Consider sending a 0-d CUDA tensor (or CPU tensor) to use tensor all-gather and avoid pickling. Optional micro‑opt.

tests/scripts/allreduce_perf/allreduce_perf_viz.py (3)

584-584: Make directory creation idempotent.

os.makedirs(..., exist_ok=True) avoids failures on re‑runs. Apply here and where directories are created.


149-149: Remove extraneous f‑string prefixes.

These prints have no placeholders. Drop the f to satisfy linters.

- print(f"\n2D Heatmap Statistics:")
+ print("\n2D Heatmap Statistics:")
...
- print(f"\nBest Strategy Heatmap Statistics:")
+ print("\nBest Strategy Heatmap Statistics:")
...
- print(f"\nStrategy distribution:")
+ print("\nStrategy distribution:")
...
- print(f"\nStrategy Difference Heatmap Statistics:")
+ print("\nStrategy Difference Heatmap Statistics:")
...
- print(f"Note: Positive values indicate slower than best strategy")
+ print("Note: Positive values indicate slower than best strategy")

Also applies to: 303-303, 313-313, 535-535, 538-538


133-134: Optional: match colorbar label to selected time column.

If time_col is 'time_ms', update the colorbar label to “Time (ms)”.

tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py (4)

36-41: Time column should be detected dynamically.

Some CSVs use time_ms. Mirror the viz script fallback.

-def find_best_strategy(df: pd.DataFrame):
+def find_best_strategy(df: pd.DataFrame):
     """Find the best strategy for each combination of parameters."""
-    return df.groupby([
-        'world_size', 'fusion', 'hidden_size', 'num_tokens'
-    ]).apply(lambda group: group.loc[group['time (us)'].idxmin(), 'strategy'])
+    time_col = 'time (us)' if 'time (us)' in df.columns else 'time_ms'
+    return df.groupby(['world_size', 'fusion', 'hidden_size', 'num_tokens']).apply(
+        lambda g: g.loc[g[time_col].idxmin(), 'strategy'])

125-131: Drop unnecessary f‑string.

The header line has no placeholders.

-    cpp_code = f"// AllReduce lookup: [tp][fusion][hidden][tokens] = strategy\n"
+    cpp_code = "// AllReduce lookup: [tp][fusion][hidden][tokens] = strategy\n"

206-221: Harden subprocess calls.

Fail fast if the benchmark invocation fails.

-            subprocess.run(
-                cmd,
-                env=os.environ,
-            )
+            subprocess.run(cmd, env=os.environ, check=True)

232-232: Style: prefer assert not df.empty.

-    assert df.empty == False, "Benchmark data is empty"
+    assert not df.empty, "Benchmark data is empty"
tests/scripts/allreduce_perf/README.md (1)

142-154: Add language to fenced code block.

Specify a language to satisfy markdownlint, e.g., text.

-```
+```text
 data/
 ├── viz/
 ...
tensorrt_llm/llmapi/mpi_session.py (1)

477-481: Narrow exception handling.

Catching broad Exception hides programming errors. Catch specific exceptions from future.result() if needed, or re‑raise after logging.

tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py (1)

6-8: Prefer module-namespace imports per guidelines.

Import the module, then reference symbols via the module to keep namespaces clean (tests can be lighter, but consistency helps).

Example:

-from tensorrt_llm.llmapi.mpi_session import RemoteMpiCommSessionClient
+from tensorrt_llm.llmapi import mpi_session
...
-    client = RemoteMpiCommSessionClient(...)
+    client = mpi_session.RemoteMpiCommSessionClient(...)

As per coding guidelines.

cpp/tensorrt_llm/common/customAllReduceUtils.h (2)

99-103: Inconsistent dimension constant.

Tokens list has 15 buckets (1..16384), but kNumTokensChoice is 14.

Apply:

-constexpr int kNumTokensChoice = 14;
+constexpr int kNumTokensChoice = 15;

120-124: extern forward decl for an inline variable is unnecessary.

You define AllReduceBestStrategyTable as inline later. The extern declaration can be dropped.

-extern const std::unordered_map<int, AllReduceBestStrategyTableType> AllReduceBestStrategyTable;
+// defined below as an inline variable
tests/unittest/llmapi/test_mpi_session.py (2)

121-133: Use the parameterized task_script (fix unused-arg and test intent).

task_script is never used; the test always runs _run_multi_llm_tasks.py. Use the param to pick the script.

-    test_file = os.path.join(cur_dir, "_run_multi_llm_tasks.py")
+    test_file = os.path.join(cur_dir, task_script)

136-143: Document/silence subprocess lint (S603) in test context.

Command is a fixed list (not user-controlled). If you want to quiet S603, add a per-call noqa.

-    with Popen(command,
+    with Popen(command,  # noqa: S603
                env=os.environ,
                stdout=PIPE,
                stderr=PIPE,
tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)

46-46: Add a timeout to prevent CI hangs for the new mpirun test.

Other entries use explicit TIMEOUTs. Recommend adding one here.

-  - unittest/llmapi/test_mpi_session.py::test_llmapi_launch_multiple_tasks
+  - unittest/llmapi/test_mpi_session.py::test_llmapi_launch_multiple_tasks TIMEOUT (120)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0663e1 and de27532.

📒 Files selected for processing (33)
  • cpp/tensorrt_llm/common/customAllReduceUtils.h (2 hunks)
  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (20 hunks)
  • cpp/tensorrt_llm/thop/allreduceOp.cpp (3 hunks)
  • requirements.txt (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py (2 hunks)
  • tensorrt_llm/_torch/modules/linear.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (6 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
  • tensorrt_llm/llmapi/mpi_session.py (9 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (3 hunks)
  • tests/integration/defs/conftest.py (2 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_deepseek_v3_lite_empty_batch.yaml (1 hunks)
  • tests/integration/defs/disaggregated/test_disaggregated.py (5 hunks)
  • tests/integration/defs/perf/test_perf.py (0 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/integration/test_lists/qa/llm_function_core.txt (2 hunks)
  • tests/integration/test_lists/qa/llm_function_core_sanity.txt (1 hunks)
  • tests/integration/test_lists/qa/llm_function_nim.txt (5 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml (2 hunks)
  • tests/microbenchmarks/all_reduce.py (3 hunks)
  • tests/scripts/allreduce_perf/README.md (1 hunks)
  • tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py (1 hunks)
  • tests/scripts/allreduce_perf/allreduce_perf_viz.py (1 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (0 hunks)
  • tests/unittest/_torch/multi_gpu/test_allreduce.py (2 hunks)
  • tests/unittest/llmapi/_run_multi_llm_tasks.py (1 hunks)
  • tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py (1 hunks)
  • tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (1 hunks)
  • tests/unittest/llmapi/test_mpi_session.py (2 hunks)
💤 Files with no reviewable changes (2)
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tests/integration/defs/perf/test_perf.py
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py
  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
  • cpp/tensorrt_llm/kernels/customAllReduceKernels.h
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tensorrt_llm/_torch/modules/linear.py
  • tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tests/scripts/allreduce_perf/allreduce_perf_viz.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
  • tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py
  • cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu
  • tests/integration/defs/test_e2e.py
  • tests/integration/defs/conftest.py
  • tensorrt_llm/llmapi/mpi_session.py
  • tests/microbenchmarks/all_reduce.py
  • tests/unittest/llmapi/_run_multi_llm_tasks.py
  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tests/unittest/llmapi/test_mpi_session.py
  • cpp/tensorrt_llm/thop/allreduceOp.cpp
  • cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tensorrt_llm/_torch/modules/linear.py
  • tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tests/scripts/allreduce_perf/allreduce_perf_viz.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
  • tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py
  • tests/integration/defs/test_e2e.py
  • tests/integration/defs/conftest.py
  • tensorrt_llm/llmapi/mpi_session.py
  • tests/microbenchmarks/all_reduce.py
  • tests/unittest/llmapi/_run_multi_llm_tasks.py
  • tests/unittest/llmapi/test_mpi_session.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py
  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
  • cpp/tensorrt_llm/kernels/customAllReduceKernels.h
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
  • tensorrt_llm/_torch/modules/linear.py
  • tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tests/scripts/allreduce_perf/allreduce_perf_viz.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
  • tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py
  • cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu
  • tests/integration/defs/test_e2e.py
  • tests/integration/defs/conftest.py
  • tensorrt_llm/llmapi/mpi_session.py
  • tests/microbenchmarks/all_reduce.py
  • tests/unittest/llmapi/_run_multi_llm_tasks.py
  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • tests/unittest/llmapi/test_mpi_session.py
  • cpp/tensorrt_llm/thop/allreduceOp.cpp
  • cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
  • cpp/tensorrt_llm/kernels/customAllReduceKernels.h
  • cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu
  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • cpp/tensorrt_llm/thop/allreduceOp.cpp
  • cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
  • cpp/tensorrt_llm/kernels/customAllReduceKernels.h
  • cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu
  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • cpp/tensorrt_llm/thop/allreduceOp.cpp
  • cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu
**/*.{h,hpp,hh,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.

Files:

  • cpp/tensorrt_llm/kernels/customAllReduceKernels.h
  • cpp/tensorrt_llm/common/customAllReduceUtils.h
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tensorrt_llm/kernels/customAllReduceKernels.h
  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • cpp/tensorrt_llm/thop/allreduceOp.cpp
**/*.{h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).

Files:

  • cpp/tensorrt_llm/kernels/customAllReduceKernels.h
  • cpp/tensorrt_llm/common/customAllReduceUtils.h
🧠 Learnings (3)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_nim.txt
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/qa/llm_function_core_sanity.txt
  • tests/unittest/llmapi/_run_multi_llm_tasks.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_nim.txt
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/qa/llm_function_core_sanity.txt
  • tests/unittest/llmapi/_run_multi_llm_tasks.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

  • cpp/tensorrt_llm/common/customAllReduceUtils.h
  • cpp/tensorrt_llm/thop/allreduceOp.cpp
🧬 Code graph analysis (21)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
tensorrt_llm/llmapi/llm_args.py (1)
  • KvCacheConfig (1199-1333)
tensorrt_llm/mapping.py (1)
  • Mapping (348-507)
tensorrt_llm/runtime/generation.py (5)
  • kv_cache_type (1208-1209)
  • dtype (854-855)
  • dtype (1257-1258)
  • num_layers (1173-1176)
  • tokens_per_block (1216-1217)
cpp/include/tensorrt_llm/executor/types.h (1)
  • DataType (73-658)
tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py (1)
tests/integration/defs/conftest.py (1)
  • is_sm_100f (1896-1899)
cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1)
cpp/tensorrt_llm/thop/allreduceOp.cpp (2)
  • op (1029-1029)
  • op (1068-1068)
tests/unittest/_torch/multi_gpu/test_allreduce.py (2)
tensorrt_llm/functional.py (2)
  • AllReduceParams (3900-3939)
  • AllReduceStrategy (3876-3885)
tensorrt_llm/_torch/distributed/ops.py (1)
  • AllReduce (455-617)
tensorrt_llm/_torch/modules/linear.py (1)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • tp_size (63-64)
tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py (2)
cpp/tensorrt_llm/common/customAllReduceUtils.h (1)
  • tensorrt_llm (28-296)
tests/scripts/allreduce_perf/allreduce_perf_viz.py (1)
  • main (553-605)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tests/integration/defs/conftest.py (2)
  • get_sm_version (1890-1893)
  • is_sm_100f (1896-1899)
tensorrt_llm/_utils.py (1)
  • local_mpi_size (557-558)
tests/scripts/allreduce_perf/allreduce_perf_viz.py (2)
cpp/tensorrt_llm/common/customAllReduceUtils.h (1)
  • tensorrt_llm (28-296)
tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py (1)
  • main (168-251)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
tests/integration/defs/conftest.py (2)
  • get_sm_version (1890-1893)
  • is_sm_100f (1896-1899)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • tp_allgather (394-395)
  • tp_allgather (653-666)
tensorrt_llm/_torch/pyexecutor/scheduler.py (1)
  • batch_size (35-36)
tests/integration/defs/disaggregated/test_disaggregated.py (4)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
  • env (61-68)
tests/integration/defs/trt_test_alternative.py (1)
  • check_call (250-258)
tests/integration/defs/conftest.py (2)
  • disaggregated_example_root (285-290)
  • llm_venv (702-719)
tests/integration/defs/local_venv.py (1)
  • get_working_directory (43-49)
tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py (3)
tensorrt_llm/executor/utils.py (1)
  • LlmLauncherEnvs (22-29)
tensorrt_llm/llmapi/mpi_session.py (9)
  • RemoteMpiCommSessionClient (300-405)
  • submit (91-93)
  • submit (148-153)
  • submit (215-231)
  • submit (344-358)
  • submit_sync (96-97)
  • submit_sync (155-160)
  • submit_sync (233-235)
  • submit_sync (362-378)
tensorrt_llm/llmapi/utils.py (1)
  • print_colored (47-63)
cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (1)
cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu (14)
  • void (35-53)
  • void (60-69)
  • void (71-76)
  • void (90-95)
  • void (98-112)
  • void (130-138)
  • void (140-144)
  • void (179-193)
  • void (195-232)
  • void (279-294)
  • void (296-333)
  • void (389-415)
  • void (417-461)
  • void (463-537)
tests/integration/defs/test_e2e.py (1)
tests/integration/defs/conftest.py (3)
  • llm_root (192-193)
  • llm_venv (702-719)
  • llm_models_root (80-94)
tests/integration/defs/conftest.py (3)
tensorrt_llm/_utils.py (2)
  • is_sm_100f (739-742)
  • get_sm_version (733-735)
tests/integration/defs/utils/periodic_junit.py (2)
  • PeriodicJUnitXML (41-342)
  • pytest_configure (120-137)
tests/integration/defs/trt_test_alternative.py (2)
  • print_info (300-306)
  • print_warning (309-315)
tensorrt_llm/llmapi/mpi_session.py (3)
tensorrt_llm/executor/proxy.py (1)
  • shutdown (369-416)
tensorrt_llm/executor/utils.py (1)
  • shutdown (103-104)
tensorrt_llm/llmapi/utils.py (2)
  • print_colored_debug (66-70)
  • print_colored (47-63)
tests/microbenchmarks/all_reduce.py (4)
tensorrt_llm/_utils.py (3)
  • local_mpi_rank (553-554)
  • local_mpi_size (557-558)
  • nvtx_range (904-923)
tensorrt_llm/functional.py (2)
  • AllReduceParams (3900-3939)
  • AllReduceStrategy (3876-3885)
tensorrt_llm/plugin/plugin.py (2)
  • CustomAllReduceHelper (542-718)
  • max_workspace_size_auto (582-590)
tensorrt_llm/_torch/distributed/ops.py (1)
  • AllReduce (455-617)
tests/unittest/llmapi/_run_multi_llm_tasks.py (1)
tensorrt_llm/llmapi/utils.py (1)
  • print_colored (47-63)
cpp/tensorrt_llm/common/customAllReduceUtils.h (2)
cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1)
  • AllReduceFusionOp (69-195)
cpp/tensorrt_llm/thop/allreduceOp.cpp (8)
  • seq_len (927-970)
  • seq_len (927-927)
  • seq_len (972-981)
  • seq_len (972-972)
  • op (1029-1029)
  • op (1068-1068)
  • message_size (983-996)
  • message_size (983-983)
tests/unittest/llmapi/test_mpi_session.py (2)
tensorrt_llm/llmapi/mpi_session.py (4)
  • MPINodeState (30-58)
  • MpiPoolSession (136-180)
  • RemoteMpiCommSessionClient (300-405)
  • split_mpi_env (558-599)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
  • env (61-68)
cpp/tensorrt_llm/thop/allreduceOp.cpp (2)
cpp/include/tensorrt_llm/common/dataType.h (1)
  • getDTypeSize (26-44)
cpp/tensorrt_llm/common/customAllReduceUtils.h (2)
  • getMaxRequiredWorkspaceSize (34-45)
  • selectStrategyLookUpTable (125-159)
🪛 Clang (14.0.6)
cpp/tensorrt_llm/common/customAllReduceUtils.h

[error] 19-19: 'tensorrt_llm/common/cudaUtils.h' file not found

(clang-diagnostic-error)

🪛 markdownlint-cli2 (0.18.1)
tests/scripts/allreduce_perf/README.md

142-142: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.0)
tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py

22-22: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


23-23: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


24-27: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


28-28: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


29-33: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


125-125: f-string without any placeholders

Remove extraneous f prefix

(F541)


218-218: subprocess call: check for execution of untrusted input

(S603)


232-232: Avoid equality comparisons to False; use not df.empty: for false checks

Replace with not df.empty

(E712)

tests/scripts/allreduce_perf/allreduce_perf_viz.py

149-149: f-string without any placeholders

Remove extraneous f prefix

(F541)


239-239: Unpacked variable fig is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


303-303: f-string without any placeholders

Remove extraneous f prefix

(F541)


313-313: f-string without any placeholders

Remove extraneous f prefix

(F541)


535-535: f-string without any placeholders

Remove extraneous f prefix

(F541)


538-538: f-string without any placeholders

Remove extraneous f prefix

(F541)


573-573: Undefined name fusion_op

(F821)

tensorrt_llm/llmapi/mpi_session.py

279-279: f-string without any placeholders

Remove extraneous f prefix

(F541)


310-310: Unused static method argument: hmac_key

(ARG004)


401-401: f-string without any placeholders

Remove extraneous f prefix

(F541)


404-404: Unused method argument: grace

(ARG002)


404-404: Unused method argument: reason

(ARG002)


477-477: Do not catch blind exception: Exception

(BLE001)

tests/microbenchmarks/all_reduce.py

124-124: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


140-140: Avoid specifying long messages outside the exception class

(TRY003)

tests/unittest/llmapi/test_mpi_session.py

123-123: Unused function argument: task_script

(ARG001)


136-136: subprocess call: check for execution of untrusted input

(S603)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (29)
tests/integration/test_lists/qa/llm_function_nim.txt (1)

1-459: Test list update looks good.

This is a configuration file with no production code. The changes reflect intentional test suite expansion and reorganization for the NIM category per the mass integration objectives. All test identifiers follow proper pytest syntax, and the addition of new model variants (Llama4 Maverick/Scout, Qwen3, DeepSeek, Nemotron) and quantization strategies (fp8, nvfp4, fp4, etc.) is consistent with the PR summary.

tests/unittest/llmapi/_run_multi_llm_tasks.py (1)

1-13: Add required NVIDIA Apache-2.0 copyright header.

The file is missing the mandatory NVIDIA Apache-2.0 copyright header that must be prepended to all Python source files.

As per coding guidelines.

Add the copyright header at the top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import os
 import sys
⛔ Skipped due to learnings
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
tests/microbenchmarks/all_reduce.py (5)

17-17: LGTM! Imports are appropriate.

The new imports (product, pandas, additional utilities from _utils, and CustomAllReduceHelper) are all used in the refactored benchmarking logic.

Also applies to: 21-21, 32-36


65-116: LGTM! Profiling logic is sound.

The function correctly implements:

  • CUDA graph capture when enabled with proper warmup
  • Event-based timing with median calculation to reduce noise
  • MPI barrier for cross-rank synchronization
  • Delay kernel to mitigate host overhead
  • Correctness verification for the non-fused case
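The median-based timing called out above is a good noise-reduction pattern; a host-side analogue of it (the real benchmark times with CUDA events rather than `time.perf_counter`, so this is only an illustrative sketch) looks like:

```python
import statistics
import time


def time_median(fn, iters=10):
    """Return the median wall-clock duration of `fn` over `iters` runs.

    Medians resist outlier noise (e.g. a stray OS scheduling hiccup) better
    than means. Illustrative only: the benchmark under review uses CUDA
    event pairs around captured graphs instead of perf_counter.
    """
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```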

147-169: LGTM! Shape generation logic is correct.

The code properly handles both 2D exploration (using product of predefined lists) and linear exploration (geometric progression from test_range). The assertion at line 169 ensures the size calculation remains consistent.

Optional: Consider adding validation for the test_range format at line 158 to provide clearer error messages if an invalid format is passed via CLI.
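One possible shape for that validation, assuming a `"min,max,ratio"` string format (the field layout here is an assumption for illustration; adapt it to whatever the benchmark's CLI actually accepts):

```python
def parse_test_range(test_range):
    """Parse a 'min,max,ratio' range string, failing with a clear message.

    The 'min,max,ratio' layout is assumed for this sketch, not taken from
    the benchmark's real CLI contract.
    """
    parts = test_range.split(",")
    if len(parts) != 3:
        raise ValueError(
            f"Invalid test_range {test_range!r}: expected 'min,max,ratio'")
    try:
        lo, hi, ratio = (int(p) for p in parts)
    except ValueError as e:
        raise ValueError(
            f"Invalid test_range {test_range!r}: fields must be integers") from e
    if lo <= 0 or hi < lo or ratio < 2:
        raise ValueError(
            f"Invalid test_range {test_range!r}: need 0 < min <= max and ratio >= 2")
    return lo, hi, ratio
```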


171-240: LGTM! Benchmark loop is well-structured with appropriate gating.

The logic correctly:

  • Validates message size against workspace limits (line 188)
  • Creates appropriate test tensors for each configuration
  • Applies sensible gating (TWOSHOT requires sufficient tokens, NVFP4 requires SM100+, AUTO is opt-in)
  • Collects results only on rank 0

Optional: For better performance, consider collecting results in a list of dictionaries and creating the DataFrame once after the loop instead of using pd.concat repeatedly (line 228), which can be slow for large iteration counts.
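The row-accumulation pattern suggested above can be sketched as follows (the column names are stand-ins, not the benchmark's actual result fields):

```python
import pandas as pd

rows = []  # accumulate plain dicts inside the benchmark loop
for size in (1024, 2048, 4096):  # stand-in for the real shape sweep
    rows.append({"message_size": size, "latency_us": size * 0.01})

# Build the DataFrame once, after the loop. Repeated pd.concat copies the
# accumulated frame on every iteration (O(n^2) overall); constructing from
# a list of records is a single pass.
df = pd.DataFrame(rows)
```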


242-251: LGTM! Output handling and CLI integration are correct.

The changes properly:

  • Configure pandas to display all results
  • Save to CSV when requested (rank 0 only)
  • Add CLI arguments for all new features
  • Pass arguments through to the benchmark function

Also applies to: 254-276

tensorrt_llm/_torch/modules/linear.py (2)

1369-1369: LGTM! Corrected error message to use function parameter.

The error message now correctly references the in_features function parameter instead of self.in_features. This is more accurate since create_weights is a method of the linear method class, not the Linear module itself.


1468-1469: LGTM! Fixed distributed weight loading for fused QKV.

The updated call now passes tp_size, tp_rank, and tp_mode to load_weight_scales, ensuring proper tensor-parallel sharding of weight scales for fused QKV layers. This aligns with how other quantization methods (e.g., NVFP4LinearMethod) handle distributed weight loading.

cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (3)

62-65: Entry grid-dependency wait looks correct.

cudaGridDependencySynchronize() gated on CUDA_ARCH >= 900 is appropriate at kernel entry. No issues.

Also applies to: 143-146


129-133: End-of-kernel programmatic completion is consistent.

Triggering completion after the final flag write is aligned with PDL expectations (with the fences added as suggested above). Looks good.

Ensure callers set cudaLaunchAttributeProgrammaticStreamSerializationAllowed (already done in LaunchConfig) whenever these kernels are launched via callranks/callranksMC.

Also applies to: 239-241, 876-877, 1068-1069, 1162-1163, 1219-1220


368-379: Graceful handling for SM < 90 paths.

Explicit brkpt with clear message is fine for unsupported architectures.

Also applies to: 1222-1277

tests/integration/test_lists/qa/llm_function_core.txt (1)

698-701: New Llama-3.3-70B FP8 PP-enabled entries look fine.

Duplication across lists is acceptable for different execution contexts. No action needed.

Based on learnings

tests/unittest/_torch/multi_gpu/test_allreduce.py (1)

27-30: Import updates align with new API surface.

AllReduceStrategy/Params usage is consistent with functional changes.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1013-1022: Good centralization of queuing logic.

Consolidating the ADP gating into _can_queue() improves readability and parity across loops.
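The centralization pattern praised here can be sketched minimally; the attribute names (`adp_enabled`, `active_batch_size`, `max_batch_size`) are illustrative and do not mirror the real `py_executor` state:

```python
class SchedulerLoop:
    """Sketch of keeping queue-admission gating behind one predicate.

    Every loop asks _can_queue() instead of re-implementing the gating
    inline, so all code paths stay in parity when the rule changes.
    """

    def __init__(self, max_batch_size, adp_enabled=False):
        self.max_batch_size = max_batch_size
        self.adp_enabled = adp_enabled
        self.active_batch_size = 0

    def _can_queue(self):
        # Single source of truth for the admission rule.
        if self.adp_enabled and self.active_batch_size >= self.max_batch_size:
            return False
        return True

    def maybe_enqueue(self, request, queue):
        if self._can_queue():
            queue.append(request)
            self.active_batch_size += 1
            return True
        return False
```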

tests/scripts/allreduce_perf/README.md (1)

105-112: Docs/code mismatch: no logarithmic scaling implemented.

Either remove “logarithmic scaling” from docs or add norm=LogNorm() to the heatmaps.

tensorrt_llm/llmapi/mpi_session.py (3)

265-277: Reusing a global MPICommExecutor is reasonable; ensure predictable teardown.

The COMM_WORLD path holds a process‑global executor/pool without calling __exit__(). Confirm expectations at process shutdown (e.g., atexit hook) so ranks don’t hang on interpreter teardown.

Also applies to: 281-285


396-406: No‑op shutdown can leak resources.

If LLM instances are created/destroyed repeatedly, the PAIR socket may persist. Consider making shutdown() idempotent and reference‑counted per address, or document the lifecycle assumptions.
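A hedged sketch of the idempotent, per-address reference-counted lifecycle the comment suggests (the names are illustrative; the real `RemoteMpiCommSessionClient` would hold a ZMQ PAIR socket where this holds any object with a `close()` method):

```python
import threading


class RefCountedSocketRegistry:
    """Share one socket per address; close it when the last user releases it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._refs = {}     # address -> refcount
        self._sockets = {}  # address -> socket-like object

    def acquire(self, address, factory):
        """Return the shared socket for `address`, creating it on first use."""
        with self._lock:
            if address not in self._sockets:
                self._sockets[address] = factory()
            self._refs[address] = self._refs.get(address, 0) + 1
            return self._sockets[address]

    def release(self, address):
        """Idempotent: releasing an unknown or already-closed address is a no-op."""
        with self._lock:
            if address not in self._refs:
                return
            self._refs[address] -= 1
            if self._refs[address] == 0:
                self._sockets.pop(address).close()
                del self._refs[address]
```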


468-485: Nice: wait for in‑flight futures before next task.

This barrier avoids interleaving tasks across ranks and reduces synchronization issues.

Also applies to: 503-503

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

371-381: SM100f gate change looks good; keep a fallback path.

Switching to is_sm_100f() for DeepGemm vs Cutlass selection is correct; ensure _utils.is_sm_100f is public and unit-tested.


8-9: The import of is_sm_100f is valid—the function exists in tensorrt_llm._utils.

Verification confirms is_sm_100f is defined at line 739 of tensorrt_llm/_utils.py. The import in fused_moe_wide_ep.py is correct and requires no changes.

Likely an incorrect or invalid review comment.

cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1)

109-131: Stringification helpers for AllReduceStrategyType — LGTM.

Consistent with existing AllReduceFusionOp helpers; aids logging/debug.

tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py (2)

225-233: Selector logic update — LGTM.

Using is_sm_100f() centralizes the SM100f check; behavior unchanged otherwise.


21-21: No action required; dependency already properly established.

The import from tensorrt_llm._utils import is_sm_100f at line 21 of moe_op.py is valid. The function is_sm_100f is already defined in tensorrt_llm/_utils.py at line 739, and both moe_op.py (line 227) and WideEPMoE (fused_moe_wide_ep.py line 377) successfully import and use this single-source dependency. No runtime ImportError risk exists, and no duplication or missing exports are present.

tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)

149-149: DeepSeek V3-Lite bf16 empty-batch test entry looks good.

No issues from this addition.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)

19-19: Import of is_sm_100f is correct and aligns with conftest.

Looks good.


2282-2285: Switch to is_sm_100f() for SM100f gating.

Appropriate replacement for previous ad-hoc checks; keeps intent explicit.


2333-2341: SM100f branch logic LGTM.

Defaulting MoE backend to DEEPGEMM and tuning memory when is_sm_100f() is true is consistent with the new helper.

tests/integration/defs/disaggregated/test_configs/disagg_config_deepseek_v3_lite_empty_batch.yaml (1)

1-61: Config looks consistent for the empty-batch DeepSeek V3-Lite bf16 scenario.

Values and splits (ctx/gen, ports, kv cache fractions) are reasonable.

@mikeiovine mikeiovine force-pushed the mass-integrate-1.1 branch 2 times, most recently from 3837379 to b50566a Compare October 20, 2025 15:45
@mikeiovine
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #21926 [ run ] triggered by Bot. Commit: b50566a

@tensorrt-cicd
Collaborator

PR_Github #21926 [ run ] completed with state SUCCESS. Commit: b50566a
/LLM/main/L0_MergeRequest_PR pipeline #16528 completed with status: 'FAILURE'

@mikeiovine
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #22064 [ run ] triggered by Bot. Commit: cd195b6

@mikeiovine
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #22068 [ run ] triggered by Bot. Commit: ca5705d

@tensorrt-cicd
Collaborator

PR_Github #22064 [ run ] completed with state ABORTED. Commit: cd195b6
LLM/main/L0_MergeRequest_PR #16637 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #22068 [ run ] completed with state SUCCESS. Commit: ca5705d
/LLM/main/L0_MergeRequest_PR pipeline #16641 completed with status: 'FAILURE'

@mikeiovine
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22211 [ run ] triggered by Bot. Commit: 55ea317

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22211 [ run ] completed with state SUCCESS. Commit: 55ea317
/LLM/main/L0_MergeRequest_PR pipeline #16748 completed with status: 'FAILURE'

@mikeiovine
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22311 [ run ] triggered by Bot. Commit: e251c7c

@tensorrt-cicd
Copy link
Collaborator

PR_Github #23083 [ run ] triggered by Bot. Commit: a078bf3

@tensorrt-cicd
Copy link
Collaborator

PR_Github #23083 [ run ] completed with state SUCCESS. Commit: a078bf3
/LLM/main/L0_MergeRequest_PR pipeline #17408 completed with status: 'FAILURE'

@mikeiovine
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #23204 [ run ] triggered by Bot. Commit: 590e173

@mikeiovine
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #23210 [ run ] triggered by Bot. Commit: 2185600

@tensorrt-cicd
Copy link
Collaborator

PR_Github #23204 [ run ] completed with state ABORTED. Commit: 590e173
LLM/main/L0_MergeRequest_PR #17490 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #23210 [ run ] completed with state SUCCESS. Commit: 2185600
/LLM/main/L0_MergeRequest_PR pipeline #17495 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

VALLIS-NERIA and others added 11 commits November 3, 2025 09:43
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…VIDIA#8357)

Signed-off-by: Stanley Sun <stsun@nvidia.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…ention DP with disagg (NVIDIA#8372)

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…acy issue (NVIDIA#8318)

Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…VIDIA#7870)

Because we have encountered perf regressions from using a one-shot kernel instead of NCCL on A100/H100, it is beneficial to have solid benchmarking of the allreduce op and to analyze the data collected from it.

Implemented a new AllreduceOp heuristic:
- Added a linear-programming-based heuristic implementation.
- Added a LUT-based heuristic implementation and the corresponding code-generation script.

AllreduceOp minor fixes:
- Fixed a minor issue in AllreduceOp where the strategy could not be overridden when ONESHOT or TWOSHOT is set.
- Fixed a minor TWOSHOT kernel perf issue.
- Cleaned up the dispatching code in AllreduceOp.

This PR fixes the perf gaps reported in:
https://nvbugspro.nvidia.com/bug/5517023

For DeepSeek-R1, it shows a performance gain of about 3-4% at concurrency levels of 256 and 512.
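A LUT-based strategy heuristic like the one described above can be sketched as a table keyed on world size and message-size bucket. This is a hedged illustration: the strategy names mirror the commit message, but the table entries, bucket boundaries, and function signature are hypothetical, not the tuned values from this PR.

```python
# Hypothetical sketch of a LUT-based allreduce strategy heuristic: map
# (world_size, message-size bucket) to ONESHOT / TWOSHOT / NCCL. All table
# values below are illustrative, not the tuned values from the PR.
from bisect import bisect_right
from typing import Optional

_SIZE_BUCKETS = [1 << 14, 1 << 18, 1 << 22]  # bytes: 16 KiB, 256 KiB, 4 MiB
_LUT = {
    # world_size: strategy per size bucket (smallest ... largest)
    2: ["ONESHOT", "ONESHOT", "TWOSHOT", "NCCL"],
    4: ["ONESHOT", "TWOSHOT", "TWOSHOT", "NCCL"],
    8: ["ONESHOT", "TWOSHOT", "NCCL", "NCCL"],
}

def pick_allreduce_strategy(world_size: int, msg_bytes: int,
                            override: Optional[str] = None) -> str:
    # An explicit ONESHOT/TWOSHOT override always wins (the bug fixed in
    # this PR was that such overrides could be ignored).
    if override is not None:
        return override
    row = _LUT.get(world_size)
    if row is None:
        return "NCCL"  # fall back to NCCL for untuned world sizes
    # bisect_right picks the bucket index for this message size.
    return row[bisect_right(_SIZE_BUCKETS, msg_bytes)]
```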

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…st.py (NVIDIA#8388)

Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com>
Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…les in W4A16 AWQ (NVIDIA#8432)

Signed-off-by: Daniel Afrimi <dafrimi@nvidia.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…VIDIA#8455)

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
@mikeiovine
Copy link
Collaborator Author

/bot reuse-pipeline

@tensorrt-cicd
Copy link
Collaborator

PR_Github #23410 [ reuse-pipeline ] triggered by Bot. Commit: a906068

@tensorrt-cicd
Copy link
Collaborator

PR_Github #23410 [ reuse-pipeline ] completed with state SUCCESS. Commit: a906068
Reusing PR_Github #23210 for commit a906068

@chzblych chzblych merged commit 23717cd into NVIDIA:main Nov 4, 2025
5 checks passed
@mikeiovine mikeiovine deleted the mass-integrate-1.1 branch November 4, 2025 17:55