[None][chore] Weekly mass integration of release/1.1 #8508
chzblych merged 11 commits into NVIDIA:main from
Conversation
📝 Walkthrough

This PR introduces heuristic-based and lookup-table-driven AllReduce strategy selection mechanisms, adds SM100f GPU architecture support utilities, optimizes GPU kernel synchronization with inline PTX assembly, centralizes MPI session management through global executors and singleton patterns, and expands test coverage with new integration tests, configurations, and benchmarking tools.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant selectImplementation
    participant SelectStrategyLP
    participant selectStrategyLookUpTable
    participant AllReduceBestStrategyTable
    User->>selectImplementation: seq_len, hidden_size
    alt Auto/LP strategy
        selectImplementation->>SelectStrategyLP: seq_len, hidden_size, world_size, op
        SelectStrategyLP->>SelectStrategyLP: Compare message_size to thresholds
        SelectStrategyLP-->>selectImplementation: ONESHOT or TWOSHOT
    else Fallback
        selectImplementation->>selectStrategyLookUpTable: num_tokens, hidden_size, op, tp_size
        selectStrategyLookUpTable->>AllReduceBestStrategyTable: Lookup by SM, TP, op, hidden_size, tokens
        AllReduceBestStrategyTable-->>selectStrategyLookUpTable: Strategy index
        selectStrategyLookUpTable-->>selectImplementation: NCCL (default) or TWOSHOT/ONESHOT
    end
    selectImplementation-->>User: AllReduceStrategyType
```
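The selection flow in the diagram can be sketched in Python. This is an illustrative model only: the names (`select_implementation`, `Strategy`), the threshold value, and the table entries are assumptions for the sketch, not the actual TensorRT-LLM implementation.

```python
from enum import Enum


class Strategy(Enum):
    NCCL = 0
    ONESHOT = 1
    TWOSHOT = 2


# Hypothetical heuristic threshold: small messages favor the one-shot kernel.
ONESHOT_MAX_BYTES = 64 * 1024

# Hypothetical lookup table keyed by (tp_size, hidden_size) -> best strategy,
# standing in for the SM/TP/op/hidden/tokens-indexed table in the diagram.
BEST_STRATEGY_TABLE = {
    (8, 4096): Strategy.TWOSHOT,
    (4, 2048): Strategy.ONESHOT,
}


def select_implementation(num_tokens, hidden_size, tp_size,
                          dtype_bytes=2, use_heuristic=True):
    """Pick an AllReduce strategy: heuristic first, lookup table as fallback."""
    if use_heuristic:
        # Compare message size against a threshold, as SelectStrategyLP does.
        message_size = num_tokens * hidden_size * dtype_bytes
        return Strategy.ONESHOT if message_size <= ONESHOT_MAX_BYTES else Strategy.TWOSHOT
    # Fallback path: table lookup, defaulting to NCCL for uncovered shapes.
    return BEST_STRATEGY_TABLE.get((tp_size, hidden_size), Strategy.NCCL)
```

The key design point the diagram encodes is the safe default: any shape the table does not cover falls back to NCCL rather than an untested custom kernel.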
```mermaid
sequenceDiagram
    participant App
    participant RemoteMpiCommSessionClient as Client (Singleton)
    participant _global_instance
    participant RemoteMpiCommSessionServer as Server
    App->>Client: new Client(addr)
    activate Client
    Client->>Client: Check _global_instance_lock
    alt Instance not cached
        Client->>_global_instance: Create new instance
        Client->>Client: Set _initialized flag
    else Instance cached
        Client->>_global_instance: Return existing instance
    end
    deactivate Client
    Client-->>App: Singleton instance
    App->>Client: submit_sync(task)
    Client->>Server: Send task
    Server->>Server: Append future to pending_futures
    Server->>Server: Wait for prior pending_futures
    Server->>Server: Execute task
    Server-->>Client: Result
    Client-->>App: Result
```
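The client-side caching shown in the diagram is a standard lock-guarded singleton with an `_initialized` guard. A minimal sketch, assuming a generic class (`SingletonClient` and its attributes are illustrative names, not the actual `RemoteMpiCommSessionClient` code):

```python
import threading


class SingletonClient:
    """Cache one instance per process; __new__ returns the cached object."""

    _global_instance = None
    _global_instance_lock = threading.Lock()

    def __new__(cls, addr):
        # Serialize instance creation so concurrent callers share one object.
        with cls._global_instance_lock:
            if cls._global_instance is None:
                cls._global_instance = super().__new__(cls)
        return cls._global_instance

    def __init__(self, addr):
        # __init__ runs on every construction attempt, so guard it with a
        # flag to avoid resetting state on repeated instantiation.
        if getattr(self, "_initialized", False):
            return
        self.addr = addr
        self._initialized = True
```

A lifecycle note on this pattern: later constructor arguments are silently ignored once an instance is cached, which is exactly the kind of singleton lifecycle concern the review effort rationale calls out.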
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: The PR spans multiple heterogeneous domains—GPU kernel assembly optimizations, AllReduce strategy selection mechanics (heuristic + lookup tables), MPI session management with singleton patterns, executor refactoring, and extensive test infrastructure. While individual changes are localized, they require separate reasoning for correctness (PTX semantics, strategy thresholds, singleton lifecycle, test configuration). The lookup tables and heuristic logic are dense; GPU kernel changes involve low-level synchronization semantics; MPI/singleton management introduces lifecycle concerns. The test additions are substantial but largely homogeneous.

Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 27
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tests/microbenchmarks/all_reduce.py (1)

1-1: Update copyright year to include 2025.

The copyright header should include the current year (2025) per the coding guidelines.

Apply this diff:

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

Based on coding guidelines.
cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (1)

196-199: Use system-scope fence before programmatic completion on Ampere path.

`__threadfence()` may be insufficient for inter-GPU/userbuffer visibility when paired with `cudaTriggerProgrammaticLaunchCompletion()`. Prefer `__threadfence_system()` for release semantics.

Apply:

```diff
-    if (threadIdx.x == 0)
-        __threadfence();
+    if (threadIdx.x == 0)
+        __threadfence_system();
```

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
1132-1145: Prepare resources after the post-connector can_queue recheck.

In `_executor_loop`, `prepare_resources()` runs before recomputing `can_queue` when `kv_connector_manager` may mutate `scheduled_batch`. If the batch becomes empty, you've done unnecessary prep and may transiently over-allocate.

Move `prepare_resources()` (and first-token handling) below the recheck block. Same comment applies to `_executor_loop_overlap`. Example:

```diff
-            if can_queue:
-                if self.kv_cache_transceiver:
-                    self._prepare_disagg_gen_transmission_complete(scheduled_batch)
-                self._handle_first_token_response(scheduled_batch)
-                self.resource_manager.prepare_resources(scheduled_batch)
-                self._kv_connector_start_batch(scheduled_batch)
+            if can_queue:
+                if self.kv_cache_transceiver:
+                    self._prepare_disagg_gen_transmission_complete(scheduled_batch)
+                self._handle_first_token_response(scheduled_batch)
+                self.resource_manager.prepare_resources(scheduled_batch)
+                self._kv_connector_start_batch(scheduled_batch)
             # if using a kv connector, we need to call can_queue again since scheduled_batch might have changed
             if self.kv_connector_manager:
                 can_queue = self._can_queue(scheduled_batch)
-            if can_queue:
+            if can_queue:
                 ...
```

And mirror this reorder in `_executor_loop_overlap`.

Also applies to: 1145-1150
1298-1313: Mirror the same resource-prep reorder in overlap loop.

Defer `prepare_resources()` until after the second `can_queue` evaluation when a kv connector is present to avoid prepping an empty batch.

Also applies to: 1309-1311
🧹 Nitpick comments (25)
tests/unittest/llmapi/_run_multi_llm_tasks.py (2)
19-19: Consider validating GPU availability for tensor_parallel_size=2.

The script requires 2 GPUs but doesn't validate availability upfront. If fewer GPUs are available, the failure will occur later with a less clear error message.

Add a check at the start of the script or function:

```python
import torch

# At module level or in run_llm_tp2
if torch.cuda.device_count() < 2:
    raise RuntimeError(
        f"This script requires 2 GPUs, but only {torch.cuda.device_count()} available")
```
32-33: Consider adding top-level error handling.

The script has no error handling at the entry point. Adding a try/except block would provide clearer error messages if the script fails.

Wrap the call in error handling:

```diff
 if __name__ == "__main__":
-    run_multi_llm_tasks()
+    try:
+        run_multi_llm_tasks()
+    except Exception as e:
+        print_colored(f"Error: {e}\n", "red")
+        sys.exit(1)
```

tests/microbenchmarks/all_reduce.py (2)
52-52: Consider moving logger configuration out of the profiling function.

Setting the logger level inside `profile_allreduce` means it's called repeatedly during benchmarking. Consider setting it once in `allreduce_benchmark` (line 127) or at module level.

Apply this diff:

```diff
 def profile_allreduce(
     mapping: Mapping,
     enable_cudagraph: bool = False,
     inner_loop=200,
     outer_loop=10,
     strategy=AllReduceStrategy.NCCL,
     fusion=AllReduceFusionOp.NONE,
     input=None,
     residual=None,
     norm=None,
     scale=None,
     bias=None,
 ):
-    tllm.logger.set_level('error')
     allreduce_params = AllReduceParams(
```

And ensure it's set once in `allreduce_benchmark` at line 127 (which already exists).
39-51: Consider adding a docstring to document the profiling function.

A Google-style docstring would help document the purpose, parameters, and return value of this new public function.

Example:

```python
def profile_allreduce(
    mapping: Mapping,
    enable_cudagraph: bool = False,
    inner_loop=200,
    outer_loop=10,
    strategy=AllReduceStrategy.NCCL,
    fusion=AllReduceFusionOp.NONE,
    input=None,
    residual=None,
    norm=None,
    scale=None,
    bias=None,
):
    """Profile a single AllReduce configuration.

    Args:
        mapping: Tensor parallelism mapping configuration.
        enable_cudagraph: Whether to use CUDA graph capture for profiling.
        inner_loop: Number of iterations per timing measurement.
        outer_loop: Number of timing measurements to compute median.
        strategy: AllReduce strategy to benchmark.
        fusion: Fusion operation to apply with AllReduce.
        input: Input tensor for AllReduce.
        residual: Optional residual tensor for fusion.
        norm: Optional RMSNorm module for fusion.
        scale: Optional scale tensor for quantization fusion.
        bias: Optional bias tensor for fusion.

    Returns:
        float: Median runtime in milliseconds per iteration.
    """
```

Based on coding guidelines.
tests/integration/defs/test_e2e.py (2)
2351-2353: Use the existing constant for KV cache fraction.

Keeps CLI composition consistent with the rest of the file.

```diff
-        "--kv_cache_fraction=0.5",
+        f"--kv_cache_fraction={_MEM_FRACTION_50}",
```

2359-2359: Ensure failures surface consistently.

Prefer the helper used elsewhere so non-zero exits fail the test.

```diff
-    llm_venv.run_cmd(cmd)
+    venv_check_call(llm_venv, cmd)
```

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
376-397: Validate assumptions and consider adding a docstring.

The new `from_model_config` classmethod provides a convenient alternative constructor, but several concerns warrant attention:

Homogeneous KV heads assumption (line 390): `model_config.num_kv_heads(0)` assumes layer 0 exists and is representative of all layers. If `num_attention_layers` returns 0 or layers have heterogeneous KV heads, this will produce incorrect results or fail silently. Consider validating that `num_attention_layers > 0` before accessing `num_kv_heads(0)`.

Missing docstring: Add a docstring explaining the method's purpose, when to use it vs. the regular constructor, and documenting the homogeneous KV cache assumption.

Limited parameter exposure: Several optional `__init__` parameters (`spec_config`, `layer_mask`, `max_num_tokens`, `max_beam_width`, `is_draft`, `kv_connector_manager`) default to None/0/False, which may limit the method's utility for more complex configurations. Consider whether these should be exposed or documented as limitations.

Consider adding validation:

```diff
     @classmethod
     def from_model_config(cls,
                           model_config: ModelConfigCpp,
                           kv_cache_config: KvCacheConfig,
                           mapping: Mapping,
                           kv_cache_type: CacheTypeCpp = CacheTypeCpp.SELF,
                           dtype: DataType = DataType.HALF) -> "KVCacheManager":
+        """
+        Construct a KVCacheManager from model and KV cache configurations.
+
+        Assumes homogeneous KV cache (all layers have the same number of KV heads).
+        For more complex configurations (e.g., speculative decoding, heterogeneous layers),
+        use the standard __init__ constructor.
+        """
+        num_layers = model_config.num_attention_layers(mapping.pp_size)
+        if num_layers <= 0:
+            raise ValueError(f"num_attention_layers must be > 0, got {num_layers}")
         return cls(
             kv_cache_config,
             kv_cache_type,
-            num_layers=model_config.num_attention_layers(mapping.pp_size),
+            num_layers=num_layers,
             # NOTE: this preserves existing behavior in KV cache manager.
             # But we should change this to pass a list at some point.
             # We're assuming the KV cache is homogeneous here.
             num_kv_heads=model_config.num_kv_heads(0),
             head_dim=model_config.size_per_head,
             tokens_per_block=model_config.tokens_per_block,
             max_seq_len=model_config.max_seq_len,
             max_batch_size=model_config.max_batch_size,
             mapping=mapping,
             dtype=dtype)
```

cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (1)
691-706: Handshake toggle consistency for oneshot variants.

Non-oneshot kernels advance reduce_id twice (`next_flag(*reduceidptr)` then `next_flag(reduce_id)`) before the trailing barrier; oneshot variants skip the second toggle. This may deadlock subsequent launches depending on the consumer's expectation.

Consider adding the second toggle after `multi_gpu_block_barrier(...)`, mirroring other kernels:

```diff
@@
-    multi_gpu_block_barrier(reduce_id, (int volatile*) &myptr[targetgpu]);
+    multi_gpu_block_barrier(reduce_id, (int volatile*) &myptr[targetgpu]);
+    // Advance to the next phase for trailing handshake
+    reduce_id = next_flag(reduce_id);
```

Please confirm the intended protocol; if a single-phase handshake is by design for oneshot, document it inline to avoid regressions.

Also applies to: 890-905, 1084-1097
tests/unittest/_torch/multi_gpu/test_allreduce.py (1)
120-129: Harmonize strategy between Linear and separate AllReduce to avoid confounds.

Linear uses NCCL, while AllReduce() defaults to AUTO; this can mask issues or introduce noise in assertions.

Apply:

```diff
-    linear = Linear(
+    linear = Linear(
         ...
-        allreduce_strategy=AllReduceStrategy.NCCL,
+        allreduce_strategy=AllReduceStrategy.NCCL,
     ).cuda()
-    allreduce = AllReduce(mapping=mapping)
+    allreduce = AllReduce(mapping=mapping, strategy=AllReduceStrategy.NCCL)
```
- linear = Linear( + linear = Linear( ... - allreduce_strategy=AllReduceStrategy.NCCL, + allreduce_strategy=AllReduceStrategy.NCCL, ).cuda() - allreduce = AllReduce(mapping=mapping) + allreduce = AllReduce(mapping=mapping, strategy=AllReduceStrategy.NCCL)If you intend to validate AUTO vs NCCL equivalence, add a parametrized strategy to test both explicitly.
Also applies to: 140-153
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1015-1019: Minor: avoid object all-gather overhead.

`tp_allgather` on a Python int uses `all_gather_object`. Consider sending a 0-d CUDA tensor (or CPU tensor) to use tensor all-gather and avoid pickling. Optional micro-opt.

tests/scripts/allreduce_perf/allreduce_perf_viz.py (3)
584-584: Make directory creation idempotent.

`os.makedirs(..., exist_ok=True)` avoids failures on re-runs. Apply here and where directories are created.
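The idempotent pattern, for reference (the `data/viz` path here is just an example directory, not the script's actual output layout):

```python
import os
import tempfile

base = tempfile.mkdtemp()
out_dir = os.path.join(base, "data", "viz")

# Safe to call repeatedly; nested parents are created as needed.
os.makedirs(out_dir, exist_ok=True)
os.makedirs(out_dir, exist_ok=True)  # second call does not raise FileExistsError
```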
149-149: Remove extraneous f-string prefixes.

These prints have no placeholders. Drop the `f` to satisfy linters.

```diff
-    print(f"\n2D Heatmap Statistics:")
+    print("\n2D Heatmap Statistics:")
 ...
-    print(f"\nBest Strategy Heatmap Statistics:")
+    print("\nBest Strategy Heatmap Statistics:")
 ...
-    print(f"\nStrategy distribution:")
+    print("\nStrategy distribution:")
 ...
-    print(f"\nStrategy Difference Heatmap Statistics:")
+    print("\nStrategy Difference Heatmap Statistics:")
 ...
-    print(f"Note: Positive values indicate slower than best strategy")
+    print("Note: Positive values indicate slower than best strategy")
```

Also applies to: 303-303, 313-313, 535-535, 538-538
133-134: Optional: match colorbar label to selected time column.

If `time_col` is `'time_ms'`, update the colorbar label to "Time (ms)".

tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py (4)
36-41: Time column should be detected dynamically.

Some CSVs use `time_ms`. Mirror the viz script fallback.

```diff
 def find_best_strategy(df: pd.DataFrame):
     """Find the best strategy for each combination of parameters."""
-    return df.groupby([
-        'world_size', 'fusion', 'hidden_size', 'num_tokens'
-    ]).apply(lambda group: group.loc[group['time (us)'].idxmin(), 'strategy'])
+    time_col = 'time (us)' if 'time (us)' in df.columns else 'time_ms'
+    return df.groupby(['world_size', 'fusion', 'hidden_size', 'num_tokens']).apply(
+        lambda g: g.loc[g[time_col].idxmin(), 'strategy'])
```
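The "best strategy per group" reduction this function performs can be shown dependency-free for clarity. Column names follow the benchmark CSV; the row data below is made up for illustration:

```python
from collections import defaultdict

# Fabricated benchmark rows: two strategies measured per shape.
rows = [
    {"world_size": 8, "fusion": "NONE", "hidden_size": 4096, "num_tokens": 1,
     "strategy": "ONESHOT", "time (us)": 9.0},
    {"world_size": 8, "fusion": "NONE", "hidden_size": 4096, "num_tokens": 1,
     "strategy": "TWOSHOT", "time (us)": 14.0},
    {"world_size": 8, "fusion": "NONE", "hidden_size": 4096, "num_tokens": 512,
     "strategy": "TWOSHOT", "time (us)": 30.0},
    {"world_size": 8, "fusion": "NONE", "hidden_size": 4096, "num_tokens": 512,
     "strategy": "ONESHOT", "time (us)": 45.0},
]


def find_best_strategy(rows, time_col="time (us)"):
    """Return {group_key: strategy with the minimal time} per parameter combo."""
    groups = defaultdict(list)
    for r in rows:
        key = (r["world_size"], r["fusion"], r["hidden_size"], r["num_tokens"])
        groups[key].append(r)
    return {k: min(g, key=lambda r: r[time_col])["strategy"]
            for k, g in groups.items()}


best = find_best_strategy(rows)
```

The pandas `groupby(...).apply(... idxmin ...)` in the script computes exactly this mapping, just vectorized over the DataFrame.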
125-131: Drop unnecessary f-string.

The header line has no placeholders.

```diff
-    cpp_code = f"// AllReduce lookup: [tp][fusion][hidden][tokens] = strategy\n"
+    cpp_code = "// AllReduce lookup: [tp][fusion][hidden][tokens] = strategy\n"
```
206-221: Harden subprocess calls.

Fail fast if the benchmark invocation fails.

```diff
-    subprocess.run(
-        cmd,
-        env=os.environ,
-    )
+    subprocess.run(cmd, env=os.environ, check=True)
```
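`check=True` turns a silently ignored non-zero exit into an exception, which is the whole point of the suggestion. A self-contained illustration using a child process that deliberately exits with status 3:

```python
import subprocess
import sys

# A command that exits non-zero; without check=True the failure is silent.
cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]

result = subprocess.run(cmd)          # returns normally; result.returncode == 3
try:
    subprocess.run(cmd, check=True)   # raises CalledProcessError instead
    raised = False
except subprocess.CalledProcessError as e:
    raised = (e.returncode == 3)
```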
232-232: Style: prefer `assert not df.empty`.

```diff
-    assert df.empty == False, "Benchmark data is empty"
+    assert not df.empty, "Benchmark data is empty"
```

tests/scripts/allreduce_perf/README.md (1)
142-154: Add language to fenced code block.

Specify a language to satisfy markdownlint, e.g., `text`.

````diff
-```
+```text
 data/
 ├── viz/
 ...
````

tensorrt_llm/llmapi/mpi_session.py (1)
477-481: Narrow exception handling.

Catching broad `Exception` hides programming errors. Catch specific exceptions from `future.result()` if needed, or re-raise after logging.

tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py (1)
6-8: Prefer module-namespace imports per guidelines.

Import the module, then reference symbols via the module to keep namespaces clean (tests can be lighter, but consistency helps).

Example:

```diff
-from tensorrt_llm.llmapi.mpi_session import RemoteMpiCommSessionClient
+from tensorrt_llm.llmapi import mpi_session
 ...
-    client = RemoteMpiCommSessionClient(...)
+    client = mpi_session.RemoteMpiCommSessionClient(...)
```

As per coding guidelines.
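On the earlier point about narrowing exception handling around `future.result()`: the stdlib distinguishes timeout, cancellation, and task failure, so each can be caught specifically. A generic sketch (not the mpi_session code; `run_task` and its behavior are assumptions for illustration):

```python
import concurrent.futures


def run_task(fn, *args, timeout=5.0):
    """Run fn in a worker and surface failures with specific exception types."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            # The wait expired; the task may still be running.
            raise RuntimeError("task timed out") from None
        except concurrent.futures.CancelledError:
            raise RuntimeError("task was cancelled") from None
        # Any other exception was raised inside the task body itself;
        # letting it propagate keeps programming errors visible.


ok = run_task(lambda x: x * 2, 21)
```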
cpp/tensorrt_llm/common/customAllReduceUtils.h (2)
99-103: Inconsistent dimension constant.

Tokens list has 15 buckets (1..16384), but `kNumTokensChoice` is 14.

Apply:

```diff
-constexpr int kNumTokensChoice = 14;
+constexpr int kNumTokensChoice = 15;
```
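The 15-bucket count follows from the token buckets being the powers of two 1 through 16384, i.e. 2^0 through 2^14 inclusive. A quick arithmetic check (assuming that bucket layout, as described in the comment):

```python
# Token buckets as described: powers of two from 1 up to 16384.
token_buckets = [2**i for i in range(15)]

assert token_buckets[0] == 1 and token_buckets[-1] == 16384
k_num_tokens_choice = len(token_buckets)  # 15, not 14
```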
120-124: `extern` forward decl for an inline variable is unnecessary.

You define `AllReduceBestStrategyTable` as `inline` later. The `extern` declaration can be dropped.

```diff
-extern const std::unordered_map<int, AllReduceBestStrategyTableType> AllReduceBestStrategyTable;
+// defined below as an inline variable
```

tests/unittest/llmapi/test_mpi_session.py (2)
121-133: Use the parameterized `task_script` (fix unused-arg and test intent).

`task_script` is never used; the test always runs `_run_multi_llm_tasks.py`. Use the param to pick the script.

```diff
-    test_file = os.path.join(cur_dir, "_run_multi_llm_tasks.py")
+    test_file = os.path.join(cur_dir, task_script)
```
136-143: Document/silence subprocess lint (S603) in test context.

Command is a fixed list (not user-controlled). If you want to quiet S603, add a per-call noqa.

```diff
-    with Popen(command,
+    with Popen(command,  # noqa: S603
               env=os.environ,
               stdout=PIPE,
               stderr=PIPE,
```

tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)
46-46: Add a timeout to prevent CI hangs for the new mpirun test.

Other entries use explicit TIMEOUTs. Recommend adding one here.

```diff
-  - unittest/llmapi/test_mpi_session.py::test_llmapi_launch_multiple_tasks
+  - unittest/llmapi/test_mpi_session.py::test_llmapi_launch_multiple_tasks TIMEOUT (120)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (33)
- cpp/tensorrt_llm/common/customAllReduceUtils.h (2 hunks)
- cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu (1 hunks)
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu (1 hunks)
- cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (20 hunks)
- cpp/tensorrt_llm/thop/allreduceOp.cpp (3 hunks)
- requirements.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py (2 hunks)
- tensorrt_llm/_torch/modules/linear.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (6 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
- tensorrt_llm/llmapi/mpi_session.py (9 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (3 hunks)
- tests/integration/defs/conftest.py (2 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_deepseek_v3_lite_empty_batch.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_disaggregated.py (5 hunks)
- tests/integration/defs/perf/test_perf.py (0 hunks)
- tests/integration/defs/test_e2e.py (1 hunks)
- tests/integration/test_lists/qa/llm_function_core.txt (2 hunks)
- tests/integration/test_lists/qa/llm_function_core_sanity.txt (1 hunks)
- tests/integration/test_lists/qa/llm_function_nim.txt (5 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml (2 hunks)
- tests/microbenchmarks/all_reduce.py (3 hunks)
- tests/scripts/allreduce_perf/README.md (1 hunks)
- tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py (1 hunks)
- tests/scripts/allreduce_perf/allreduce_perf_viz.py (1 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (0 hunks)
- tests/unittest/_torch/multi_gpu/test_allreduce.py (2 hunks)
- tests/unittest/llmapi/_run_multi_llm_tasks.py (1 hunks)
- tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py (1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (1 hunks)
- tests/unittest/llmapi/test_mpi_session.py (2 hunks)
💤 Files with no reviewable changes (2)
- tests/unittest/_torch/modules/test_fused_moe.py
- tests/integration/defs/perf/test_perf.py
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py
- cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- tests/unittest/_torch/multi_gpu/test_allreduce.py
- tensorrt_llm/_torch/modules/linear.py
- tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
- tests/scripts/allreduce_perf/allreduce_perf_viz.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/disaggregated/test_disaggregated.py
- tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py
- cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu
- tests/integration/defs/test_e2e.py
- tests/integration/defs/conftest.py
- tensorrt_llm/llmapi/mpi_session.py
- tests/microbenchmarks/all_reduce.py
- tests/unittest/llmapi/_run_multi_llm_tasks.py
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- tests/unittest/llmapi/test_mpi_session.py
- cpp/tensorrt_llm/thop/allreduceOp.cpp
- cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py
- tests/unittest/_torch/multi_gpu/test_allreduce.py
- tensorrt_llm/_torch/modules/linear.py
- tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
- tests/scripts/allreduce_perf/allreduce_perf_viz.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/disaggregated/test_disaggregated.py
- tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py
- tests/integration/defs/test_e2e.py
- tests/integration/defs/conftest.py
- tensorrt_llm/llmapi/mpi_session.py
- tests/microbenchmarks/all_reduce.py
- tests/unittest/llmapi/_run_multi_llm_tasks.py
- tests/unittest/llmapi/test_mpi_session.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py
- cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- tests/unittest/_torch/multi_gpu/test_allreduce.py
- tensorrt_llm/_torch/modules/linear.py
- tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
- tests/scripts/allreduce_perf/allreduce_perf_viz.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/disaggregated/test_disaggregated.py
- tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py
- cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu
- tests/integration/defs/test_e2e.py
- tests/integration/defs/conftest.py
- tensorrt_llm/llmapi/mpi_session.py
- tests/microbenchmarks/all_reduce.py
- tests/unittest/llmapi/_run_multi_llm_tasks.py
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- tests/unittest/llmapi/test_mpi_session.py
- cpp/tensorrt_llm/thop/allreduceOp.cpp
- cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- cpp/tensorrt_llm/thop/allreduceOp.cpp
- cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- cpp/tensorrt_llm/thop/allreduceOp.cpp
- cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.
Files:
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- cpp/tensorrt_llm/common/customAllReduceUtils.h
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- cpp/tensorrt_llm/thop/allreduceOp.cpp
**/*.{h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).
Files:
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h
- cpp/tensorrt_llm/common/customAllReduceUtils.h
🧠 Learnings (3)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
- tests/integration/test_lists/qa/llm_function_nim.txt
- tests/integration/test_lists/qa/llm_function_core.txt
- tests/integration/test_lists/qa/llm_function_core_sanity.txt
- tests/unittest/llmapi/_run_multi_llm_tasks.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
- tests/integration/test_lists/qa/llm_function_nim.txt
- tests/integration/test_lists/qa/llm_function_core.txt
- tests/integration/test_lists/qa/llm_function_core_sanity.txt
- tests/unittest/llmapi/_run_multi_llm_tasks.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- cpp/tensorrt_llm/thop/allreduceOp.cpp
🧬 Code graph analysis (21)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
- tensorrt_llm/llmapi/llm_args.py (1): KvCacheConfig (1199-1333)
- tensorrt_llm/mapping.py (1): Mapping (348-507)
- tensorrt_llm/runtime/generation.py (5): kv_cache_type (1208-1209), dtype (854-855), dtype (1257-1258), num_layers (1173-1176), tokens_per_block (1216-1217)
- cpp/include/tensorrt_llm/executor/types.h (1): DataType (73-658)
tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py (1)
- tests/integration/defs/conftest.py (1): is_sm_100f (1896-1899)
cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1)
- cpp/tensorrt_llm/thop/allreduceOp.cpp (2): op (1029-1029), op (1068-1068)
tests/unittest/_torch/multi_gpu/test_allreduce.py (2)
- tensorrt_llm/functional.py (2): AllReduceParams (3900-3939), AllReduceStrategy (3876-3885)
- tensorrt_llm/_torch/distributed/ops.py (1): AllReduce (455-617)
tensorrt_llm/_torch/modules/linear.py (1)
- tensorrt_llm/_torch/distributed/communicator.py (1): tp_size (63-64)
tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py (2)
- cpp/tensorrt_llm/common/customAllReduceUtils.h (1): tensorrt_llm (28-296)
- tests/scripts/allreduce_perf/allreduce_perf_viz.py (1): main (553-605)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tests/integration/defs/conftest.py (2)
get_sm_version(1890-1893)is_sm_100f(1896-1899)tensorrt_llm/_utils.py (1)
local_mpi_size(557-558)
tests/scripts/allreduce_perf/allreduce_perf_viz.py (2)
cpp/tensorrt_llm/common/customAllReduceUtils.h (1)
tensorrt_llm(28-296)tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py (1)
main(168-251)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
tests/integration/defs/conftest.py (2)
get_sm_version(1890-1893)is_sm_100f(1896-1899)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
tensorrt_llm/_torch/distributed/communicator.py (2)
tp_allgather(394-395)tp_allgather(653-666)tensorrt_llm/_torch/pyexecutor/scheduler.py (1)
batch_size(35-36)
tests/integration/defs/disaggregated/test_disaggregated.py (4)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
env(61-68)tests/integration/defs/trt_test_alternative.py (1)
check_call(250-258)tests/integration/defs/conftest.py (2)
disaggregated_example_root(285-290)llm_venv(702-719)tests/integration/defs/local_venv.py (1)
get_working_directory(43-49)
tests/unittest/llmapi/_run_multi_mpi_comm_tasks.py (3)
tensorrt_llm/executor/utils.py (1)
LlmLauncherEnvs(22-29)tensorrt_llm/llmapi/mpi_session.py (9)
RemoteMpiCommSessionClient(300-405)submit(91-93)submit(148-153)submit(215-231)submit(344-358)submit_sync(96-97)submit_sync(155-160)submit_sync(233-235)submit_sync(362-378)tensorrt_llm/llmapi/utils.py (1)
print_colored(47-63)
cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (1)
cpp/tensorrt_llm/kernels/dsv3MinLatencyKernels/dsv3FusedAGemm.cu (14)
void(35-53)void(60-69)void(71-76)void(90-95)void(98-112)void(130-138)void(140-144)void(179-193)void(195-232)void(279-294)void(296-333)void(389-415)void(417-461)void(463-537)
tests/integration/defs/test_e2e.py (1)
tests/integration/defs/conftest.py (3)
llm_root(192-193)llm_venv(702-719)llm_models_root(80-94)
tests/integration/defs/conftest.py (3)
tensorrt_llm/_utils.py (2)
is_sm_100f(739-742)get_sm_version(733-735)tests/integration/defs/utils/periodic_junit.py (2)
PeriodicJUnitXML(41-342)pytest_configure(120-137)tests/integration/defs/trt_test_alternative.py (2)
print_info(300-306)print_warning(309-315)
tensorrt_llm/llmapi/mpi_session.py (3)
tensorrt_llm/executor/proxy.py (1)
shutdown(369-416)tensorrt_llm/executor/utils.py (1)
shutdown(103-104)tensorrt_llm/llmapi/utils.py (2)
print_colored_debug(66-70)print_colored(47-63)
tests/microbenchmarks/all_reduce.py (4)
tensorrt_llm/_utils.py (3)
local_mpi_rank(553-554)local_mpi_size(557-558)nvtx_range(904-923)tensorrt_llm/functional.py (2)
AllReduceParams(3900-3939)AllReduceStrategy(3876-3885)tensorrt_llm/plugin/plugin.py (2)
CustomAllReduceHelper(542-718)max_workspace_size_auto(582-590)tensorrt_llm/_torch/distributed/ops.py (1)
AllReduce(455-617)
tests/unittest/llmapi/_run_multi_llm_tasks.py (1)
tensorrt_llm/llmapi/utils.py (1)
print_colored(47-63)
cpp/tensorrt_llm/common/customAllReduceUtils.h (2)
cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1)
AllReduceFusionOp(69-195)cpp/tensorrt_llm/thop/allreduceOp.cpp (8)
seq_len(927-970)seq_len(927-927)seq_len(972-981)seq_len(972-972)op(1029-1029)op(1068-1068)message_size(983-996)message_size(983-983)
tests/unittest/llmapi/test_mpi_session.py (2)
tensorrt_llm/llmapi/mpi_session.py (4)
MPINodeState(30-58)MpiPoolSession(136-180)RemoteMpiCommSessionClient(300-405)split_mpi_env(558-599)tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
env(61-68)
cpp/tensorrt_llm/thop/allreduceOp.cpp (2)
cpp/include/tensorrt_llm/common/dataType.h (1)
getDTypeSize(26-44)cpp/tensorrt_llm/common/customAllReduceUtils.h (2)
getMaxRequiredWorkspaceSize(34-45)selectStrategyLookUpTable(125-159)
🪛 Clang (14.0.6)
cpp/tensorrt_llm/common/customAllReduceUtils.h
[error] 19-19: 'tensorrt_llm/common/cudaUtils.h' file not found
(clang-diagnostic-error)
🪛 markdownlint-cli2 (0.18.1)
tests/scripts/allreduce_perf/README.md
142-142: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.0)
tests/scripts/allreduce_perf/allreduce_heuristic_code_gen.py
22-22: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
23-23: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
24-27: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
28-28: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
29-33: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
125-125: f-string without any placeholders
Remove extraneous f prefix
(F541)
218-218: subprocess call: check for execution of untrusted input
(S603)
232-232: Avoid equality comparisons to `False`; use `if not df.empty:` for false checks
Replace with `not df.empty`
(E712)
tests/scripts/allreduce_perf/allreduce_perf_viz.py
149-149: f-string without any placeholders
Remove extraneous f prefix
(F541)
239-239: Unpacked variable fig is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
303-303: f-string without any placeholders
Remove extraneous f prefix
(F541)
313-313: f-string without any placeholders
Remove extraneous f prefix
(F541)
535-535: f-string without any placeholders
Remove extraneous f prefix
(F541)
538-538: f-string without any placeholders
Remove extraneous f prefix
(F541)
573-573: Undefined name fusion_op
(F821)
tensorrt_llm/llmapi/mpi_session.py
279-279: f-string without any placeholders
Remove extraneous f prefix
(F541)
310-310: Unused static method argument: hmac_key
(ARG004)
401-401: f-string without any placeholders
Remove extraneous f prefix
(F541)
404-404: Unused method argument: grace
(ARG002)
404-404: Unused method argument: reason
(ARG002)
477-477: Do not catch blind exception: Exception
(BLE001)
tests/microbenchmarks/all_reduce.py
124-124: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
140-140: Avoid specifying long messages outside the exception class
(TRY003)
tests/unittest/llmapi/test_mpi_session.py
123-123: Unused function argument: task_script
(ARG001)
136-136: subprocess call: check for execution of untrusted input
(S603)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (29)
tests/integration/test_lists/qa/llm_function_nim.txt (1)
1-459: Test list update looks good. This is a configuration file with no production code. The changes reflect intentional test suite expansion and reorganization for the NIM category per the mass integration objectives. All test identifiers follow proper pytest syntax, and the addition of new model variants (Llama4 Maverick/Scout, Qwen3, DeepSeek, Nemotron) and quantization strategies (fp8, nvfp4, fp4, etc.) is consistent with the PR summary.
tests/unittest/llmapi/_run_multi_llm_tasks.py (1)
1-13: Add required NVIDIA Apache-2.0 copyright header. The file is missing the mandatory NVIDIA Apache-2.0 copyright header that must be prepended to all Python source files.
As per coding guidelines.
Add the copyright header at the top of the file:

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import os
 import sys
```

⛔ Skipped due to learnings

Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

tests/microbenchmarks/all_reduce.py (5)
17-17: LGTM! Imports are appropriate. The new imports (`product`, `pandas`, additional utilities from `_utils`, and `CustomAllReduceHelper`) are all used in the refactored benchmarking logic.
Also applies to: 21-21, 32-36
65-116: LGTM! Profiling logic is sound. The function correctly implements:
- CUDA graph capture when enabled with proper warmup
- Event-based timing with median calculation to reduce noise
- MPI barrier for cross-rank synchronization
- Delay kernel to mitigate host overhead
- Correctness verification for the non-fused case
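The median-of-samples timing idea generalizes beyond CUDA events; here is a host-side sketch with hypothetical helper names (the benchmark itself times with CUDA events and a delay kernel, not wall-clock time):

```python
import statistics
import time

def benchmark_median(fn, warmup=3, iters=10):
    """Run fn a few times untimed (warmup), then report the median of
    per-iteration wall-clock times; the median is robust to occasional
    outlier iterations, which is why the benchmark prefers it to the mean."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

For example, `benchmark_median(lambda: sum(range(10_000)))` returns a single representative latency in seconds.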
147-169: LGTM! Shape generation logic is correct. The code properly handles both 2D exploration (using `product` of predefined lists) and linear exploration (geometric progression from `test_range`). The assertion at line 169 ensures the size calculation remains consistent.
Optional: Consider adding validation for the `test_range` format at line 158 to provide clearer error messages if an invalid format is passed via CLI.
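The two exploration modes can be sketched as follows; the default lists and the "min,max,ratio" range format here are illustrative assumptions, not the script's exact values:

```python
from itertools import product

def gen_shapes(explore_2d, seq_lens=None, hidden_sizes=None,
               test_range="256,25600,2"):
    """Yield (seq_len, hidden_size) pairs to benchmark.

    2D mode: cross-product of the predefined lists.
    Linear mode: geometric progression of total sizes parsed from a
    'min,max,ratio' string, reported against a fixed hidden size of 1.
    """
    if explore_2d:
        yield from product(seq_lens or [1, 4, 16],
                           hidden_sizes or [1024, 4096])
    else:
        lo, hi, ratio = (int(x) for x in test_range.split(","))
        size = lo
        while size <= hi:
            yield (size, 1)
            size *= ratio
```

A malformed `test_range` string raises a bare `ValueError` from `int()`, which is exactly the case the optional validation suggestion above would make friendlier.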
171-240: LGTM! Benchmark loop is well-structured with appropriate gating. The logic correctly:
- Validates message size against workspace limits (line 188)
- Creates appropriate test tensors for each configuration
- Applies sensible gating (TWOSHOT requires sufficient tokens, NVFP4 requires SM100+, AUTO is opt-in)
- Collects results only on rank 0
Optional: For better performance, consider collecting results in a list of dictionaries and creating the DataFrame once after the loop instead of using `pd.concat` repeatedly (line 228), which can be slow for large iteration counts.
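The suggested accumulation pattern, sketched with pandas (the loop bounds and column names are made up for illustration):

```python
import pandas as pd

rows = []
for seq_len in (1, 16, 256):  # stand-in for the benchmark sweep
    rows.append({"seq_len": seq_len, "latency_ms": 0.1 * seq_len})

# Build the frame once at the end; calling pd.concat inside the loop
# re-copies all previously collected rows on every iteration, making
# the whole sweep O(n^2) in the number of configurations.
df = pd.DataFrame(rows)
```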
242-251: LGTM! Output handling and CLI integration are correct. The changes properly:
- Configure pandas to display all results
- Save to CSV when requested (rank 0 only)
- Add CLI arguments for all new features
- Pass arguments through to the benchmark function
Also applies to: 254-276
tensorrt_llm/_torch/modules/linear.py (2)
1369-1369: LGTM! Corrected error message to use function parameter. The error message now correctly references the `in_features` function parameter instead of `self.in_features`. This is more accurate since `create_weights` is a method of the linear method class, not the `Linear` module itself.
1468-1469: LGTM! Fixed distributed weight loading for fused QKV. The updated call now passes `tp_size`, `tp_rank`, and `tp_mode` to `load_weight_scales`, ensuring proper tensor-parallel sharding of weight scales for fused QKV layers. This aligns with how other quantization methods (e.g., `NVFP4LinearMethod`) handle distributed weight loading.
cpp/tensorrt_llm/kernels/userbuffers/userbuffers.cu (3)
62-65: Entry grid-dependency wait looks correct. `cudaGridDependencySynchronize()` gated on `CUDA_ARCH >= 900` is appropriate at kernel entry. No issues.
Also applies to: 143-146
129-133: End-of-kernel programmatic completion is consistent. Triggering completion after the final flag write is aligned with PDL expectations (with the fences added as suggested above). Looks good.
Ensure callers set cudaLaunchAttributeProgrammaticStreamSerializationAllowed (already done in LaunchConfig) whenever these kernels are launched via callranks/callranksMC.
Also applies to: 239-241, 876-877, 1068-1069, 1162-1163, 1219-1220
368-379: Graceful handling for SM < 90 paths. Explicit `brkpt` with a clear message is fine for unsupported architectures.
Also applies to: 1222-1277
tests/integration/test_lists/qa/llm_function_core.txt (1)
698-701: New Llama-3.3-70B FP8 PP-enabled entries look fine. Duplication across lists is acceptable for different execution contexts. No action needed.
Based on learnings
tests/unittest/_torch/multi_gpu/test_allreduce.py (1)
27-30: Import updates align with new API surface. AllReduceStrategy/Params usage is consistent with functional changes.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1013-1022: Good centralization of queuing logic. Consolidating the ADP gating into `_can_queue()` improves readability and parity across loops.
tests/scripts/allreduce_perf/README.md (1)
105-112: Docs/code mismatch: no logarithmic scaling implemented. Either remove "logarithmic scaling" from the docs or add `norm=LogNorm()` to the heatmaps.
tensorrt_llm/llmapi/mpi_session.py (3)
265-277: Reusing a global MPICommExecutor is reasonable; ensure predictable teardown. The COMM_WORLD path holds a process-global executor/pool without calling `__exit__()`. Confirm expectations at process shutdown (e.g., an atexit hook) so ranks don't hang on interpreter teardown.
Also applies to: 281-285
396-406: No-op shutdown can leak resources. If LLM instances are created/destroyed repeatedly, the PAIR socket may persist. Consider making `shutdown()` idempotent and reference-counted per address, or document the lifecycle assumptions.
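A reference-counted, idempotent shutdown per address could be sketched like this (illustrative class and method names, not the module's actual API):

```python
import threading

class SocketRegistry:
    """Track one shared resource per address with a refcount. The real
    resource (e.g., a ZMQ PAIR socket) is closed only when the last
    user calls shutdown(); extra shutdown() calls are no-ops, which
    makes the method idempotent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._refs = {}  # address -> refcount

    def acquire(self, address):
        with self._lock:
            self._refs[address] = self._refs.get(address, 0) + 1

    def shutdown(self, address):
        """Return True only when this call actually released the resource."""
        with self._lock:
            count = self._refs.get(address, 0)
            if count == 0:
                return False  # already closed: no-op
            if count == 1:
                del self._refs[address]
                return True   # last user: close the underlying socket here
            self._refs[address] = count - 1
            return False
```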
468-485: Nice: wait for in-flight futures before next task. This barrier avoids interleaving tasks across ranks and reduces synchronization issues.
Also applies to: 503-503
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
371-381: SM100f gate change looks good; keep a fallback path. Switching to `is_sm_100f()` for DeepGemm vs Cutlass selection is correct; ensure `_utils.is_sm_100f` is public and unit-tested.
8-9: The import of `is_sm_100f` is valid—the function exists in `tensorrt_llm._utils`. Verification confirms `is_sm_100f` is defined at line 739 of `tensorrt_llm/_utils.py`. The import in `fused_moe_wide_ep.py` is correct and requires no changes.
Likely an incorrect or invalid review comment.
cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1)
109-131: Stringification helpers for AllReduceStrategyType — LGTM. Consistent with existing `AllReduceFusionOp` helpers; aids logging/debug.
tensorrt_llm/_torch/modules/fused_moe/ops/moe_op.py (2)
225-233: Selector logic update — LGTM. Using `is_sm_100f()` centralizes the SM100f check; behavior unchanged otherwise.
21-21: No action required; dependency already properly established. The import `from tensorrt_llm._utils import is_sm_100f` at line 21 of moe_op.py is valid. The function `is_sm_100f` is already defined in tensorrt_llm/_utils.py at line 739, and both moe_op.py (line 227) and WideEPMoE (fused_moe_wide_ep.py line 377) successfully import and use this single-source dependency. No runtime ImportError risk exists, and no duplication or missing exports are present.
149-149: DeepSeek V3-Lite bf16 empty-batch test entry looks good. No issues from this addition.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
19-19: Import of `is_sm_100f` is correct and aligns with conftest. Looks good.
2282-2285: Switch to `is_sm_100f()` for SM100f gating. Appropriate replacement for previous ad-hoc checks; keeps intent explicit.
2333-2341: SM100f branch logic LGTM. Defaulting MoE backend to DEEPGEMM and tuning memory when `is_sm_100f()` is true is consistent with the new helper.
tests/integration/defs/disaggregated/test_configs/disagg_config_deepseek_v3_lite_empty_batch.yaml (1)
1-61: Config looks consistent for the empty-batch DeepSeek V3-Lite bf16 scenario. Values and splits (ctx/gen, ports, kv cache fractions) are reasonable.
3837379 to b50566a
/bot run --disable-fail-fast
PR_Github #21926 [ run ] triggered by Bot. Commit:
PR_Github #21926 [ run ] completed with state
b50566a to cd195b6
/bot run --disable-fail-fast
PR_Github #22064 [ run ] triggered by Bot. Commit:
cd195b6 to ca5705d
/bot run --disable-fail-fast
PR_Github #22068 [ run ] triggered by Bot. Commit:
PR_Github #22064 [ run ] completed with state
PR_Github #22068 [ run ] completed with state
ca5705d to 55ea317
/bot run --disable-fail-fast
PR_Github #22211 [ run ] triggered by Bot. Commit:
PR_Github #22211 [ run ] completed with state
55ea317 to e251c7c
/bot run --disable-fail-fast
PR_Github #22311 [ run ] triggered by Bot. Commit:
PR_Github #23083 [ run ] triggered by Bot. Commit:
PR_Github #23083 [ run ] completed with state
a078bf3 to 590e173
/bot run --disable-fail-fast
PR_Github #23204 [ run ] triggered by Bot. Commit:
590e173 to 2185600
/bot run --disable-fail-fast
PR_Github #23210 [ run ] triggered by Bot. Commit:
PR_Github #23204 [ run ] completed with state
PR_Github #23210 [ run ] completed with state
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…VIDIA#8357) Signed-off-by: Stanley Sun <stsun@nvidia.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…ention DP with disagg (NVIDIA#8372) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…acy issue (NVIDIA#8318) Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…VIDIA#7870) Because we encountered some perf regression from using a one-shot kernel instead of NCCL on A100/H100, it is beneficial to have solid benchmarking of the allreduce op and to analyze the data collected from it.
Implemented new AllreduceOp heuristics:
- Added a linear-programming-based heuristic implementation.
- Added a LUT-based heuristic implementation and the corresponding code-generation script.
Minor AllreduceOp fixes:
- Fixed a minor issue in AllreduceOp where the strategy could not be overridden when ONESHOT or TWOSHOT is set.
- Fixed a minor TWOSHOT kernel perf issue.
- Cleaned up dispatching code in AllReduceOp.
This PR fixes the perf gaps reported in https://nvbugspro.nvidia.com/bug/5517023. For DeepSeek-R1, it shows a performance gain of about 3-4% at concurrency levels of 256 and 512. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
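At its core, the LP-based heuristic reduces to comparing the allreduce message size against tuned thresholds. A sketch with illustrative cutoffs (the shipped thresholds are derived per SM architecture and TP size from benchmark data; all names and values here are hypothetical):

```python
def select_strategy(seq_len, hidden_size, dtype_bytes=2, world_size=8,
                    oneshot_max=256 * 1024, twoshot_max=16 * 1024 * 1024):
    """Pick an allreduce strategy from the message size in bytes.
    Threshold values are made up for illustration, not the tuned ones."""
    message_size = seq_len * hidden_size * dtype_bytes
    if message_size <= oneshot_max:
        return "ONESHOT"   # small, latency-bound: single-phase custom kernel
    if message_size <= twoshot_max and seq_len >= world_size:
        return "TWOSHOT"   # bandwidth-bound: reduce-scatter + all-gather
    return "NCCL"          # large messages: fall back to NCCL
```

The TWOSHOT branch also requires enough tokens to split across ranks, mirroring the gating the benchmark applies (TWOSHOT needs sufficient tokens).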
…st.py (NVIDIA#8388) Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com> Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…les in W4A16 AWQ (NVIDIA#8432) Signed-off-by: Daniel Afrimi <dafrimi@nvidia.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…VIDIA#8455) Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2185600 to a906068
/bot reuse-pipeline
PR_Github #23410 [ reuse-pipeline ] triggered by Bot. Commit:
PR_Github #23410 [ reuse-pipeline ] completed with state
Description
Another batch of mass-integration commits from the release/1.1 branch.
Test Coverage
N/A
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.

Details

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Performance Improvements