Conversation
…egy tuning. (NVIDIA#8531)" This reverts commit d272f1a.
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1"
📝 Walkthrough

This PR removes environment-variable-driven all-reduce configuration and autotuning infrastructure, replacing them with deterministic heuristic-based strategy selection. It simplifies all-reduce dispatch to use default torch operators, updates pattern compilation to eliminate parameterization, and refactors related tests and benchmarks to remove distributed state management.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tests/scripts/allreduce_perf/allreduce_perf_viz.py (1)

571-585: Undefined `fusion_op` and non-idempotent directory creation

- Line 573 uses `fusion_op` before it's defined; this will raise `NameError` on the first run when `viz` doesn't exist.
- `os.makedirs(os.path.join(args.data_dir, "viz", fusion_op))` inside the loop (line 584) will raise `FileExistsError` on subsequent runs because `exist_ok=True` isn't used.

Recommend:

- Drop the pre-loop creation of `viz/fusion_op`, and
- Make the per-fusion directory creation idempotent.

Proposed fix for directory handling

```diff
-if not os.path.exists(os.path.join(args.data_dir, "viz")):
-    os.makedirs(os.path.join(args.data_dir, "viz"))
-    os.makedirs(os.path.join(args.data_dir, "viz", fusion_op))
+if not os.path.exists(os.path.join(args.data_dir, "viz")):
+    os.makedirs(os.path.join(args.data_dir, "viz"))
@@
-for fusion_op in fusion_op_list:
-    os.makedirs(os.path.join(args.data_dir, "viz", fusion_op))
+for fusion_op in fusion_op_list:
+    os.makedirs(
+        os.path.join(args.data_dir, "viz", fusion_op),
+        exist_ok=True,
+    )
```

tensorrt_llm/_torch/models/modeling_llama.py (1)
656-661: LlamaDecoderLayer inconsistently omits the `allreduce_strategy` parameter

The `AllReduce` instantiation at line 660 in `LlamaDecoderLayer` omits the `strategy` argument, causing `model_config.allreduce_strategy` to be silently ignored. In contrast, `Llama4DecoderLayer` (line 457) and most other decoder layers throughout the codebase explicitly pass this parameter.

Consider aligning with the pattern used in Llama4DecoderLayer:

Proposed fix

```diff
-        self.all_reduce = AllReduce(mapping=model_config.mapping)
+        self.all_reduce = AllReduce(
+            mapping=model_config.mapping,
+            strategy=model_config.allreduce_strategy,
+        )
```

cpp/tensorrt_llm/common/customAllReduceUtils.h (1)
37-87: Heuristic map should be const, and the structured binding has an unused element

The new heuristic code has a few issues to address:

- In `SelectStrategyLP` at line 41, the structured binding declares `nccl_num_token_threshold` but never uses it; this will generate an unused-variable warning.
- `HeuristicThresholdLP` should be declared `const` and renamed `kHeuristicThresholdLp` per the coding guidelines, since it is read-only configuration data that should not be mutated.
- Accessing `HeuristicThresholdLP[sm_major][world_size]` for unsupported `world_size` values will synthesize a `{0, 0}` entry, causing the function to use zero thresholds (effectively forcing the `TWOSHOT` strategy). Adding bounds checking would be safer.

Consider the suggested cleanup below to align with coding standards and prevent potential issues:

Suggested fix

```diff
-// (SM major_version, TP_size) -> (NCCL_num_token_threshold, TWO_SHOT_numel_threshold)
-inline std::unordered_map<int, std::unordered_map<int, std::pair<size_t, size_t>>> HeuristicThresholdLP{
+// (SM major_version, TP_size) -> (NCCL_num_token_threshold, TWO_SHOT_numel_threshold)
+inline const std::unordered_map<int, std::unordered_map<int, std::pair<size_t, size_t>>> kHeuristicThresholdLp{
```

And in `SelectStrategyLP`:

```diff
-    auto const [nccl_num_token_threshold, two_shot_numel_threshold] = HeuristicThresholdLP[sm_major][world_size];
+    auto itSm = kHeuristicThresholdLp.find(sm_major);
+    if (itSm == kHeuristicThresholdLp.end())
+    {
+        return AllReduceStrategyType::ONESHOT;
+    }
+    auto itTp = itSm->second.find(world_size);
+    if (itTp == itSm->second.end())
+    {
+        return AllReduceStrategyType::ONESHOT;
+    }
+    auto const two_shot_numel_threshold = itTp->second.second;
```
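The bounds-check idea behind the suggestion above can be sketched in Python (a hypothetical analog of the C++ map; the threshold values and function name are illustrative, not from the actual header): guarded lookups return an explicit fallback instead of silently synthesizing zero-valued entries the way C++ `operator[]` does.

```python
# Hypothetical analog of the (sm_major, world_size) -> thresholds map.
# Values are illustrative placeholders, not the real heuristic numbers.
HEURISTIC_THRESHOLD_LP = {
    9: {2: (128, 65536), 4: (128, 65536), 8: (64, 32768)},
}


def select_two_shot_threshold(sm_major: int, world_size: int, fallback=None):
    """Return the TWO_SHOT numel threshold, or `fallback` for unsupported keys.

    Unlike C++ std::unordered_map::operator[], dict.get() never inserts a
    default-constructed entry, so unsupported (sm_major, world_size) pairs
    surface explicitly instead of acting like zero thresholds.
    """
    per_sm = HEURISTIC_THRESHOLD_LP.get(sm_major)
    if per_sm is None:
        return fallback
    entry = per_sm.get(world_size)
    if entry is None:
        return fallback
    return entry[1]
```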
🤖 Fix all issues with AI agents
In @tests/scripts/allreduce_perf/allreduce_perf_viz.py:
- Around line 601-605: The heatmap save path for viz_path_diff incorrectly
prefixes os.path.dirname(__file__) causing inconsistent/incorrect paths when
args.data_dir is absolute; update the viz_path_diff construction to use the same
form as the other visualizations (start with f"{args.data_dir}/...") so the path
is built consistently, then call visualize_strategy_difference_heatmaps(df,
fusion_op, save_path=viz_path_diff) with the corrected viz_path_diff.
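The `os.path.join` behavior that makes the prefixed form fragile can be seen in isolation (paths below are made up for illustration): whenever a later component is absolute, every earlier component is silently discarded, so a `dirname(__file__)` prefix behaves differently depending on whether `args.data_dir` is absolute or relative.

```python
import os.path

# If a later component is absolute, os.path.join drops everything before it.
# A script-directory prefix is therefore silently ignored for absolute
# data_dir values and wrongly applied for relative ones.
joined_relative = os.path.join("/repo/tests", "results", "viz")
joined_absolute = os.path.join("/repo/tests", "/abs/data", "viz")

print(joined_relative)  # /repo/tests/results/viz
print(joined_absolute)  # /abs/data/viz  (the "/repo/tests" prefix is gone)
```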
In @tests/unittest/_torch/multi_gpu/test_allreduce.py:
- Around line 241-256: The assertion error message uses an unnecessary f-string
and the 1% mismatch allowance is undocumented; change the raised assertion to
use a plain string (replace f"Large mismatched elements encountered" with "Large
mismatched elements encountered") and add a brief inline comment next to the
mismatch_percentage check (or tighten the threshold) explaining why
mismatch_percentage < 0.01 is acceptable for calc_output_tensor vs
ref_output_tensor given rtol/atol and cross-GPU allreduce variability; ensure
the symbols rtol, atol, mismatched, and mismatch_percentage remain intact.
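The mismatch-percentage pattern referenced above can be sketched framework-free (pure Python with `math.isclose`; the tolerance values and helper name are illustrative — the real test operates on torch tensors, whose `isclose` semantics differ slightly):

```python
import math


def mismatch_percentage(calc, ref, rtol=0.05, atol=0.15):
    """Fraction of element pairs that are not close under rtol/atol.

    Uses math.isclose semantics (max of relative and absolute tolerance),
    which differ slightly from torch.isclose's atol + rtol * |ref| band.
    """
    mismatched = sum(
        not math.isclose(a, b, rel_tol=rtol, abs_tol=atol)
        for a, b in zip(calc, ref)
    )
    return mismatched / len(ref)


calc = [1.00, 2.00, 3.00, 10.0]
ref = [1.01, 2.02, 3.00, 20.0]
# Only the last pair falls outside the tolerance band -> 25% mismatch.
assert mismatch_percentage(calc, ref) == 0.25
```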
🧹 Nitpick comments (3)
tensorrt_llm/_torch/autotuner.py (1)
870-878: Unused `min_time` from the `_profile_runners` return tuple

The `min_time` component from `_profile_runners` is unpacked but never used, which is flagged by Ruff and slightly obscures intent. Consider marking it as intentionally ignored:

Small clean-up for unused value

```diff
-        best_runner_id, best_tactic, min_time, has_tuning_failure_occurred = self._profile_runners(
+        best_runner_id, best_tactic, _min_time, has_tuning_failure_occurred = self._profile_runners(
             custom_op, runners, tensors, p, tuning_config, **kwargs)
```

tests/unittest/_torch/multi_gpu/test_allreduce.py (2)
135-137: Unused parameter in function signature.

The `res` parameter is not used in the function body. While this maintains signature consistency with other calc functions (like `calc_fused_allreduce`), consider documenting why it's present or using a `_` prefix to indicate it's intentionally unused.

♻️ Optional refactor to clarify intent

```diff
-    def calc_allreduce(x, res):
+    def calc_allreduce(x, _res):
         linear_out = linear(x)
         return [linear_out]
```
240-240: Consider adding `strict=True` to zip() if Python 3.10+ is required.

Static analysis suggests adding an explicit `strict=` parameter to catch potential length mismatches between `calc_output` and `ref_output`. However, note that this feature is only available in Python 3.10+, and the coding guidelines indicate Python 3.8+ support.

♻️ Optional modernization (Python 3.10+ only)

```diff
-    for calc_output_tensor, ref_output_tensor in zip(calc_output, ref_output):
+    for calc_output_tensor, ref_output_tensor in zip(calc_output, ref_output, strict=True):
```

Only apply this change if the project has moved to Python 3.10+ as the minimum version.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)

- cpp/tensorrt_llm/common/customAllReduceUtils.h
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
- tensorrt_llm/_torch/autotuner.py
- tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- tensorrt_llm/_torch/distributed/ops.py
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/_torch/models/modeling_llama.py
- tests/microbenchmarks/all_reduce.py
- tests/scripts/allreduce_perf/allreduce_perf_viz.py
- tests/unittest/_torch/multi_gpu/test_allreduce.py
- tests/unittest/_torch/multi_gpu/test_user_buffers.py
💤 Files with no reviewable changes (2)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., `some_file.py`)
Python classes should use PascalCase (e.g., `class SomeClass`)
Python functions and methods should use snake_case (e.g., `def my_awesome_function():`)
Python local variables should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format `"""<type>: Description"""`
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic
Files:
- tests/unittest/_torch/multi_gpu/test_user_buffers.py
- tests/microbenchmarks/all_reduce.py
- tests/scripts/allreduce_perf/allreduce_perf_viz.py
- tensorrt_llm/_torch/autotuner.py
- tests/unittest/_torch/multi_gpu/test_allreduce.py
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/_torch/models/modeling_llama.py
- tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
- tensorrt_llm/_torch/distributed/ops.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification
Files:
- tests/unittest/_torch/multi_gpu/test_user_buffers.py
- tests/microbenchmarks/all_reduce.py
- tests/scripts/allreduce_perf/allreduce_perf_viz.py
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- tensorrt_llm/_torch/autotuner.py
- tests/unittest/_torch/multi_gpu/test_allreduce.py
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/_torch/models/modeling_llama.py
- tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
- tensorrt_llm/_torch/distributed/ops.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., `} // namespace foo`)
Prefer `const` or `constexpr` variables over `#define`s whenever possible
A variable that is not modified after its initialization should be declared as `const`
For naming of constants in C++, use uppercase snakecase with prefix 'k' (e.g., `kDIGIT_NUM`)
Except for `0`, `nullptr`, `true`, and `false`, all other literals should only be used for variable initialization and not in comparisons or expressions
Use Allman indentation style for brace notation in C++ code
Put the semicolon for an empty `for` or `while` loop in a new line
The statement forming the body of a `switch`, `while`, `do..while`, or `for` statement must be a compound statement (use brace-delimited statements)
`if` and `else` statements should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camelCase with first letter lowercase (e.g., `thisIsAFilename.cpp`)
All types (including class names) in C++ should use PascalCase with uppercase first letter (e.g., `FooBarClass`)
Local variables, methods, and namespaces in C++ should use camelCase with first letter lowercase (e.g., `localFooBar`)
Non-magic-number global variables that are non-static and not defined in an anonymous namespace should use camelCase prefixed with 'g' (e.g., `gDontUseGlobalFoos`)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camelCase prefixed with 's' (e.g., `sMutableStaticGlobal`)
Locally visible static variables should use camelCase with 's' as the first letter (e.g., `static std::once_flag sFlag;`)
Public, private, and protected class member variables should use camelCase prefixed with 'm' (e.g., `mNbFooValues`)
Do not use Hungarian notation in C++ except for 'apps hungarian' (e.g., 'nb' to indicate count: `mNbLayers`)
If a constructor parameter name conflicts with a public me...
Files:
cpp/tensorrt_llm/common/customAllReduceUtils.h
**/*.{h,hpp,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hxx}: Follow Doxygen rules for documenting new C++ class interfaces and function prototypes. Use `//!` for C++-style single-line comments and `//!<` for class members
Use a preprocessor guard in C++ header files with the format `TRTLLM_<FILENAME>_H`, where the filename is in uppercase with no underscores, no prefix underscores, and no trailing underscores
Files:
cpp/tensorrt_llm/common/customAllReduceUtils.h
**/*.{h,hpp,hxx,cpp,cc,cxx,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All C++ class templates, function templates, class template member functions, and class template static members must be instantiated at least once
Files:
cpp/tensorrt_llm/common/customAllReduceUtils.h
🧠 Learnings (24)
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
Applied to files:
- tests/unittest/_torch/multi_gpu/test_user_buffers.py
- tests/microbenchmarks/all_reduce.py
- tests/unittest/_torch/multi_gpu/test_allreduce.py
- tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
- tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
- tests/microbenchmarks/all_reduce.py
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- tests/unittest/_torch/multi_gpu/test_allreduce.py
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/_torch/models/modeling_llama.py
- tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
- tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-09-24T03:31:28.908Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7520
File: tensorrt_llm/_torch/pyexecutor/resource_manager.py:605-613
Timestamp: 2025-09-24T03:31:28.908Z
Learning: In TensorRT-LLM Ray orchestrator mode, ProcessGroups are initialized with both Gloo and NCCL backends (e.g., "cuda:nccl,cpu:gloo"), allowing PyTorch distributed to automatically route CPU tensors through Gloo and GPU tensors through NCCL. This eliminates the need for manual device placement when performing allreduce operations on base types.
Applied to files:
- tests/microbenchmarks/all_reduce.py
- tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- tests/unittest/_torch/multi_gpu/test_allreduce.py
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
- tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
- tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Applied to files:
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- tensorrt_llm/_torch/autotuner.py
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/_torch/distributed/ops.py
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.
Applied to files:
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Applied to files:
- cpp/tensorrt_llm/common/customAllReduceUtils.h
- tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.
Applied to files:
cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/common/customAllReduceUtils.h
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
- tensorrt_llm/_torch/autotuner.py
- tensorrt_llm/_torch/models/modeling_llama.py
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.
Applied to files:
tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/autotuner.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.
Applied to files:
tests/unittest/_torch/multi_gpu/test_allreduce.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/multi_gpu/test_allreduce.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.
Applied to files:
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
📚 Learning: 2025-08-27T16:22:10.695Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 7227
File: tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py:94-100
Timestamp: 2025-08-27T16:22:10.695Z
Learning: When there are inconsistent operator detection methods (like custom_op() vs target_op()), removing one method and standardizing on the other is often cleaner than supporting both methods simultaneously.
Applied to files:
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
📚 Learning: 2025-12-12T10:07:31.564Z
Learnt from: lirundong
Repo: NVIDIA/TensorRT-LLM PR: 9725
File: tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py:110-178
Timestamp: 2025-12-12T10:07:31.564Z
Learning: In PyTorch custom operators registered with torch.library.custom_op, mutable operators that return None and specify mutates_args do not require a register_fake decorator. Mutation tracking is handled automatically without needing a FakeTensor kernel. This applies to Python custom op definitions in tensorrt_llm/_torch/custom_ops that use mutates_args and return None; verify you are not relying on register_fake in these cases.
Applied to files:
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
Repo: NVIDIA/TensorRT-LLM PR: 7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
Applied to files:
tensorrt_llm/_torch/model_config.py
🧬 Code graph analysis (7)
tests/microbenchmarks/all_reduce.py (2)
tensorrt_llm/functional.py (1)
AllReduceFusionOp (3887-3896)

tensorrt_llm/_utils.py (4)

local_mpi_rank (564-573), local_mpi_size (576-577), nvtx_range (891-910), mpi_barrier (589-591)
cpp/tensorrt_llm/common/customAllReduceUtils.h (1)
cpp/tensorrt_llm/thop/allreduceOp.cpp (8)
seq_len (1182-1231), seq_len (1182-1182), seq_len (1233-1242), seq_len (1233-1233), op (1291-1291), op (1330-1330), message_size (1244-1257), message_size (1244-1244)
tests/unittest/_torch/multi_gpu/test_allreduce.py (1)
tensorrt_llm/functional.py (4)
AllReduceFusionOp (3887-3896), AllReduceParams (3899-3938), AllReduceStrategy (3874-3884), MoEAllReduceParams (3941-3977)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2)
cpp/tensorrt_llm/thop/allreduceOp.cpp (16)
input (281-315), input (281-284), input (335-413), input (335-337), input (415-446), input (415-417), input (448-563), input (448-450), input (565-630), input (565-567), input (632-815), input (632-636), input (817-883), input (817-820), op (1291-1291), op (1330-1330)

cpp/tensorrt_llm/thop/reducescatterOp.cpp (6)

input (124-127), input (124-124), op (227-227), op (245-245), op (263-263), op (282-282)
tensorrt_llm/_torch/model_config.py (1)
tensorrt_llm/functional.py (1)
AllReduceStrategy(3874-3884)
tensorrt_llm/_torch/models/modeling_llama.py (1)
tensorrt_llm/_torch/distributed/ops.py (1)
AllReduce(637-850)
tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py (2)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
allreduce (17-60)

tensorrt_llm/functional.py (1)
AllReduceStrategy(3874-3884)
🪛 Ruff (0.14.10)
tests/scripts/allreduce_perf/allreduce_perf_viz.py
573-573: Undefined name fusion_op
(F821)
tensorrt_llm/_torch/autotuner.py
874-874: Unpacked variable min_time is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
tests/unittest/_torch/multi_gpu/test_allreduce.py
135-135: Unused function argument: res
(ARG001)
240-240: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
256-256: f-string without any placeholders
Remove extraneous f prefix
(F541)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
19-19: Unused function argument: residual
(ARG001)
20-20: Unused function argument: norm_weight
(ARG001)
21-21: Unused function argument: scale
(ARG001)
22-22: Unused function argument: bias
(ARG001)
23-23: Unused function argument: workspace
(ARG001)
24-24: Unused function argument: group
(ARG001)
25-25: Unused function argument: strategy
(ARG001)
27-27: Unused function argument: eps
(ARG001)
28-28: Unused function argument: trigger_completion_at_end
(ARG001)
71-71: Unused function argument: rank
(ARG001)
72-72: Unused function argument: pg
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py (6)
2-2: LGTM - Consistent function signature updates

All registration function signatures have been consistently updated to remove the `allreduce_func` parameter, replaced with hard-coded usage of `torch.ops.trtllm.allreduce.default`. The import changes are also correct.

Also applies to: 17-18, 113-114, 190-191, 266-267, 338-339, 409-410
20-24: LGTM - Pattern definitions correctly updated. All `CallFunction` pattern definitions have been consistently updated to use `torch.ops.trtllm.allreduce.default` instead of the parameterized `allreduce_func.default`. This ensures pattern matching works with the hard-coded operator.
Also applies to: 117-117, 194-194, 270-270, 342-342, 416-420, 670-670
59-64: LGTM - Target patterns correctly implemented. All target pattern implementations have been updated to directly call `torch.ops.trtllm.allreduce` instead of the parameterized `allreduce_func`. The logic remains correct and consistent with the pattern definitions.
Also applies to: 168-173, 245-249, 316-321, 388-392
104-110: LGTM - Strategy handling preserved correctly. The strategy-based selection logic has been correctly preserved. The `check_non_ub_strategy` function ensures FP8/FP4 quantization patterns are only applied for non-UB strategies, and the UB conversion pattern correctly checks for AUTO strategy. This maintains the strategy selection mechanism while removing the autotuner functionality.
Also applies to: 413-464
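The gating described in this comment can be condensed into a small sketch; the strategy names and pattern identifiers below are illustrative assumptions, not the repository's actual values:

```python
# Illustrative sketch of strategy-gated pattern registration: quantized
# (FP8/FP4) fusion patterns register only for non-UB strategies, and the
# UB conversion pattern applies only under AUTO.
NON_UB = {"AUTO", "NCCL", "ONESHOT", "TWOSHOT"}  # assumed set


def check_non_ub_strategy(strategy: str) -> bool:
    # True when the strategy is not a UserBuffers (UB) strategy.
    return strategy in NON_UB


def patterns_to_register(strategy: str, ub_enabled: bool) -> list:
    pats = ["ar_residual_norm"]  # base fusion always registered
    if check_non_ub_strategy(strategy):
        # FP8/FP4 quantized variants only for non-UB strategies.
        pats += ["ar_residual_norm_fp8", "ar_residual_norm_fp4"]
    if ub_enabled and strategy == "AUTO":
        # UB conversion only applies under the AUTO strategy.
        pats.append("ub_conversion")
    return pats
```

This keeps the selection deterministic: the registered pattern set depends only on the configured strategy and the UB flag, not on any runtime tuning.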
1-732: Revert appears complete and correct. The revert successfully removes the autotuner functionality by replacing the parameterized `allreduce_func` with hard-coded `torch.ops.trtllm.allreduce.default` throughout the file. All changes are consistent and the logic remains sound. The strategy-based selection is preserved, ensuring that different AllReduce strategies can still be used, just without the autotuning capability.
Key changes verified:
- All function signatures updated consistently
- All pattern definitions updated to use fixed operator
- All target implementations updated correctly
- All internal call sites updated
- Strategy handling logic preserved
- No syntax or logic errors introduced
719-732: External call sites have been properly updated. The external call to `register_ar_fusions` in `tensorrt_llm/_torch/compilation/backend.py:79-80` correctly passes only three parameters (`cls._custom_pass_instances`, `mapping`, `ub_enabled`), matching the updated function signature. No remaining references to `allreduce_func` were found in the codebase. The refactoring is complete.
tests/unittest/_torch/multi_gpu/test_user_buffers.py (3)
463-463: Add NVIDIA copyright header to the file. This file is missing the required NVIDIA copyright header. Per coding guidelines, all TensorRT-LLM source files (.py and other source files) must include an NVIDIA copyright header with the year of latest meaningful modification. Add the header at the top of the file before the imports.
Additionally, verify the test passes with the updated match counts at lines 463, 762, and 996.
⛔ Skipped due to learnings
- Learnt from: galagam (Repo: NVIDIA/TensorRT-LLM, PR: 6487, File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12, Timestamp: 2025-08-06T13:58:07.506Z). Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
- Learnt from: xinhe-nv (Repo: NVIDIA/TensorRT-LLM, PR: 8534, File: scripts/format_test_list.py:1-6, Timestamp: 2025-10-22T06:53:47.017Z). Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.
762-762: Add NVIDIA copyright header to file. The file is missing the required NVIDIA copyright header. Per the coding guidelines, all TensorRT-LLM source files (.py, .cpp, .h, .cu, etc.) must contain an NVIDIA copyright header with the year of latest meaningful modification. Add the header at the top of the file before any imports.
⛔ Skipped due to learnings
- Learnt from: CR (Repo: NVIDIA/TensorRT-LLM, PR: 0, File: CODING_GUIDELINES.md:0-0, Timestamp: 2026-01-06T03:07:15.754Z). Learning: Applies to **/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
- Learnt from: xinhe-nv (Repo: NVIDIA/TensorRT-LLM, PR: 8534, File: scripts/format_test_list.py:1-6, Timestamp: 2025-10-22T06:53:47.017Z). Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.
- Learnt from: galagam (Repo: NVIDIA/TensorRT-LLM, PR: 6487, File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12, Timestamp: 2025-08-06T13:58:07.506Z). Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
- Learnt from: tburt-nv (Repo: NVIDIA/TensorRT-LLM, PR: 9881, File: cpp/kernels/fmha_v2/train_ops/train_setup.py:35-36, Timestamp: 2025-12-10T19:14:58.432Z). Learning: In cpp/kernels/fmha_v2/train_ops/train_setup.py, the embedded C++ template copyright headers (fmha_dgrad_v2_flash_attention_template and fmha_fprop_v2_flash_attention_template) use copyright dates "2011-2023" because the file was imported without meaningful changes since its original creation. The original copyright dates are preserved for historical accuracy rather than updated to the current year.
- Learnt from: djns99 (Repo: NVIDIA/TensorRT-LLM, PR: 6915, File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012, Timestamp: 2025-08-14T23:23:27.449Z). Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
996-996: Add NVIDIA copyright header to file and verify test passes with updated match counts. The assertion at line 996 is correctly consistent with line 463, as both tests validate similar fusion scenarios with different quantization types (FP4 vs FP8). However, the file is missing the required NVIDIA copyright header at the top. Per coding guidelines, all TensorRT-LLM Python source files must contain an `SPDX-FileCopyrightText: Copyright (c) <year> NVIDIA CORPORATION & AFFILIATES` header.
⛔ Skipped due to learnings
- Learnt from: galagam (Repo: NVIDIA/TensorRT-LLM, PR: 6487, File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12, Timestamp: 2025-08-06T13:58:07.506Z). Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
- Learnt from: xinhe-nv (Repo: NVIDIA/TensorRT-LLM, PR: 8534, File: scripts/format_test_list.py:1-6, Timestamp: 2025-10-22T06:53:47.017Z). Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.
- Learnt from: CR (Repo: NVIDIA/TensorRT-LLM, PR: 0, File: CODING_GUIDELINES.md:0-0, Timestamp: 2026-01-06T03:07:15.754Z). Learning: Applies to **/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
- Learnt from: djns99 (Repo: NVIDIA/TensorRT-LLM, PR: 6915, File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012, Timestamp: 2025-08-14T23:23:27.449Z). Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
tensorrt_llm/_torch/model_config.py (1)
154-167: AllReduce strategy string mapping looks consistent. The `NCCL_SYMMETRIC` entry and overall mapping in `get_all_reduce_strategy` remain correct; the syntactic tweak (trailing-comma removal) is semantically a no-op.
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
16-30: Fake allreduce op signatures remain structurally correct; verify schema compatibility. The `trtllm::allreduce` and `trtllm::allreduce_pg` fake kernels keep the same positional parameters but drop Python type annotations and forward `allreduce_pg` through `allreduce` for shape inference. That's fine from a behavior/shape standpoint, and the unused arguments are required to keep the signature aligned with the C++ ops despite Ruff's ARG001 noise.
Please double-check that the argument names/order here still exactly match the registered C++ schemas for `trtllm::allreduce` and `trtllm::allreduce_pg`, so `torch.library.register_fake` and `torch.ops.trtllm.*` don't hit a schema mismatch at import time.
Also applies to: 62-79
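One cheap way to guard against signature drift is to compare parameter lists directly. This is a sketch, not the repository's mechanism; the reference signature below is abbreviated from the fake kernel's arguments named in the Ruff output above, and `allreduce_schema` is a hypothetical stand-in for the real C++ schema:

```python
import inspect


# Hypothetical reference mirroring the C++ schema's parameter order.
def allreduce_schema(input, residual, norm_weight, scale, bias, workspace,
                     group, strategy, eps, trigger_completion_at_end):
    ...


# Fake kernel: only shape propagation matters, so most arguments go
# unused (hence the ARG001 noise flagged above).
def allreduce_fake(input, residual, norm_weight, scale, bias, workspace,
                   group, strategy, eps, trigger_completion_at_end):
    return [input]


def signatures_match(fake, schema) -> bool:
    # Compare parameter names and order so a drifted fake fails fast
    # at import time instead of surfacing as a schema mismatch later.
    return (list(inspect.signature(fake).parameters)
            == list(inspect.signature(schema).parameters))
```

A module-level `assert signatures_match(...)` next to the registration would catch a renamed or reordered argument before any dispatch happens.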
tensorrt_llm/_torch/distributed/ops.py (1)
694-753: AllReduce wiring and fallbacks look sound; verify op schemas and PG availability. The updated `AllReduce`:
- Chooses between `torch.ops.trtllm.allreduce` and `allreduce_pg` via `self.all_reduce_op`, with `rank`/`pg` only passed in the PG path.
- Initializes `SymmetricMemoryAllReduce` with a clean fallback to `AllReduceStrategy.AUTO` if unsupported or failing, and allocates a fusion workspace for all strategies except `UB`/`NCCL`/`NCCL_SYMMETRIC`.
- Tries `symm_mem` (for `fusion_op == NONE`) and then `MNNVLAllReduce` before falling back to the generic C++ allreduce op with `strategy` normalized away from `MNNVL`/`SYMM_MEM`.
The call into `self.all_reduce_op` uses keyword arguments whose names/order match the Python fake registrations in `cpp_custom_ops.py`, so the Python → C++ boundary looks correct.
Two things to explicitly confirm:
- The C++ schemas for `trtllm::allreduce` and `trtllm::allreduce_pg` still match these keyword names and argument ordering (especially around `group`, `rank`, and `pg`).
- In environments where `mpi_disabled()` is true, `mapping.tp_group_pg` and `torch.distributed` have been initialized before `AllReduce.forward` is first called; otherwise the PG path will assert or `get_rank()` will fail.
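The fallback order in the bullets above can be condensed into a small sketch. The strategy names come from the comment; the helper function and its boolean flags are illustrative, not the repository's API:

```python
from enum import Enum


class Strategy(Enum):
    AUTO = "AUTO"
    SYMM_MEM = "SYMM_MEM"
    MNNVL = "MNNVL"


def pick_allreduce(strategy: Strategy, symm_mem_ok: bool,
                   mnnvl_ok: bool, fusion_op_is_none: bool) -> str:
    # symm_mem is only attempted for the un-fused case (fusion_op == NONE).
    if strategy is Strategy.SYMM_MEM and fusion_op_is_none and symm_mem_ok:
        return "symm_mem"
    # MNNVL is attempted next when available.
    if strategy is Strategy.MNNVL and mnnvl_ok:
        return "mnnvl"
    # Generic C++ op path: MNNVL/SYMM_MEM are normalized away to AUTO.
    if strategy in (Strategy.SYMM_MEM, Strategy.MNNVL):
        strategy = Strategy.AUTO
    return f"cpp_allreduce:{strategy.value}"
```

The key property to verify is the last branch: a request for `SYMM_MEM` or `MNNVL` that cannot be satisfied never reaches the C++ op with that strategy name.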
```diff
  # Create strategy difference heatmaps and save to data/viz directory
- viz_path_diff = f"{args.data_dir}/viz/{fusion_op}/{case_name}_strategy_difference_heatmap.png"
+ viz_path_diff = f"{os.path.dirname(__file__)}/{args.data_dir}/viz/{fusion_op}/{case_name}_strategy_difference_heatmap.png"
  visualize_strategy_difference_heatmaps(df,
                                         fusion_op,
                                         save_path=viz_path_diff)
```
🧩 Analysis chain
🏁 Script executed: `wc -l tests/scripts/allreduce_perf/allreduce_perf_viz.py` (Repository: NVIDIA/TensorRT-LLM, Length of output: 117)
🏁 Script executed: `sed -n '590,620p' tests/scripts/allreduce_perf/allreduce_perf_viz.py` (Repository: NVIDIA/TensorRT-LLM, Length of output: 1165)
🏁 Script executed (search for all visualization path constructions in the file): `rg "viz_path|save_path" tests/scripts/allreduce_perf/allreduce_perf_viz.py -n -B 2 -A 2` (Repository: NVIDIA/TensorRT-LLM, Length of output: 3581)
🏁 Script executed (check how args.data_dir is used throughout): `rg "args\.data_dir" tests/scripts/allreduce_perf/allreduce_perf_viz.py -n -B 1 -A 1` (Repository: NVIDIA/TensorRT-LLM, Length of output: 1505)
Remove inconsistent `os.path.dirname(__file__)` prefix from strategy-difference heatmap path
Line 602 uses `f"{os.path.dirname(__file__)}/{args.data_dir}..."` while lines 592 and 596 use `f"{args.data_dir}..."` for the other two visualizations. This inconsistency breaks relative paths and creates issues when `args.data_dir` is absolute. Align line 602 with the other paths by removing the `os.path.dirname(__file__)/` prefix.
🤖 Prompt for AI Agents
```
In @tests/scripts/allreduce_perf/allreduce_perf_viz.py around lines 601-605,
the heatmap save path for viz_path_diff incorrectly prefixes
os.path.dirname(__file__), causing inconsistent/incorrect paths when
args.data_dir is absolute; update the viz_path_diff construction to use the
same form as the other visualizations (start with f"{args.data_dir}/...") so
the path is built consistently, then call
visualize_strategy_difference_heatmaps(df, fusion_op,
save_path=viz_path_diff) with the corrected viz_path_diff.
```
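A sketch of the consistent construction the fix asks for, using `os.path.join` so relative and absolute `data_dir` values behave the same; the helper name and its arguments are invented for illustration:

```python
import os


def viz_path(data_dir: str, fusion_op: str, case_name: str,
             suffix: str) -> str:
    # One helper for all three visualizations keeps the call sites from
    # drifting apart; os.path.join also handles absolute data_dir cleanly
    # (a leading absolute component simply anchors the result).
    return os.path.join(data_dir, "viz", fusion_op,
                        f"{case_name}_{suffix}.png")
```

Each call site then builds its path the same way, e.g. `viz_path(args.data_dir, fusion_op, case_name, "strategy_difference_heatmap")`.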
```python
rtol, atol = 0.05, 0.15
try:
    torch.testing.assert_close(
        calc_output_tensor,
        ref_output_tensor,
        rtol=rtol,
        atol=atol,
    )
except AssertionError:
    # Calculate percentage of mismatched elements
    mismatched = torch.abs(calc_output_tensor - ref_output_tensor) > (
        rtol * torch.abs(ref_output_tensor) + atol)
    mismatch_percentage = (mismatched.sum() / mismatched.numel())

    # If more than 1% elements mismatch, raise the error
    assert mismatch_percentage < 0.01, f"Large mismatched elements encountered"
```
🧩 Analysis chain
🏁 Script executed (read the file to see the actual code and context): `head -n 260 tests/unittest/_torch/multi_gpu/test_allreduce.py | tail -n 50` (Repository: NVIDIA/TensorRT-LLM, Length of output: 2363)
🏁 Script executed (check git history for this file to understand the change): `git log --oneline tests/unittest/_torch/multi_gpu/test_allreduce.py | head -20` (Repository: NVIDIA/TensorRT-LLM, Length of output: 161)
🏁 Script executed (look for any comments in the file that explain the mismatch threshold): `grep -n "mismatch\|tolerance\|threshold\|precision" tests/unittest/_torch/multi_gpu/test_allreduce.py` (Repository: NVIDIA/TensorRT-LLM, Length of output: 446)
🏁 Script executed (check the entire test function context to understand why this relaxation might be needed): `sed -n '200,260p' tests/unittest/_torch/multi_gpu/test_allreduce.py` (Repository: NVIDIA/TensorRT-LLM, Length of output: 2993)
🏁 Script executed (check git blame to see when this specific code was added): `git blame -L 241,256 tests/unittest/_torch/multi_gpu/test_allreduce.py` (Repository: NVIDIA/TensorRT-LLM, Length of output: 1494)
🏁 Script executed (search for similar mismatch percentage checks in other tests): `rg "mismatch_percentage\|torch.testing.assert_close" tests/ -A 5 -B 2` (Repository: NVIDIA/TensorRT-LLM, Length of output: 45)
🏁 Script executed (check the file header for copyright): `head -n 20 tests/unittest/_torch/multi_gpu/test_allreduce.py` (Repository: NVIDIA/TensorRT-LLM, Length of output: 813)
Fix the unnecessary f-string and document or reconsider the 1% mismatch threshold.
The f-string at line 256 contains no placeholders and should be converted to a regular string per Python style guidelines.
Regarding the assertion logic: The try/except pattern that allows up to 1% element mismatch is valid for handling numerical precision variations across different allreduce strategies, but this design choice lacks documentation. Consider either adding an inline comment explaining why 1% is appropriate for these operations, or tightening the threshold if 1% is too permissive.
Suggested fix
```diff
- assert mismatch_percentage < 0.01, f"Large mismatched elements encountered"
+ assert mismatch_percentage < 0.01, "Large mismatched elements encountered"
```
For the 1% threshold, consider adding a comment:
```diff
  # If more than 1% elements mismatch, raise the error
+ # (Allows for numerical precision variations in allreduce fusion operations)
  assert mismatch_percentage < 0.01, "Large mismatched elements encountered"
```
assert mismatch_percentage < 0.01, "Large mismatched elements encountered"🧰 Tools
🪛 Ruff (0.14.10)
256-256: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
```
In @tests/unittest/_torch/multi_gpu/test_allreduce.py around lines 241-256,
the assertion error message uses an unnecessary f-string and the 1% mismatch
allowance is undocumented; change the raised assertion to use a plain string
(replace f"Large mismatched elements encountered" with "Large mismatched
elements encountered") and add a brief inline comment next to the
mismatch_percentage check (or tighten the threshold) explaining why
mismatch_percentage < 0.01 is acceptable for calc_output_tensor vs
ref_output_tensor given rtol/atol and cross-GPU allreduce variability; ensure
the symbols rtol, atol, mismatched, and mismatch_percentage remain intact.
```
PR_Github #31067 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #31086 [ run ] triggered by Bot. Commit:
PR_Github #31086 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #31123 [ run ] triggered by Bot. Commit:
/bot kill
PR_Github #31181 [ ] completed with state
PR_Github #31123 [ run ] completed with state
…egy tuning. (#8531)"
This reverts commit d272f1a.
Summary by CodeRabbit
Release Notes
Refactor
Tests
✏️ Tip: You can customize this high-level summary in your review settings.
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...` Provide a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill
`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.