[https://nvbugs/5788127][fix] Use uint64_t as the dtype of lamport_buffer_size to avoid overflow #10499
base: main
Conversation
/bot run
📝 Walkthrough
The changes reorganize the Lamport synchronization workspace memory layout, shifting the clear pointer index from 4 to 3 and storing comm_size as a 64-bit value split across two int32 slots starting at index 4. Function signatures are updated to use 64-bit size types (size_t).
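For orientation, here is a minimal host-side sketch of the packing scheme the walkthrough describes. The buffer, value, and variable names are illustrative assumptions rather than code from the PR, and the readback assumes a little-endian host.

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
    // Illustrative flag buffer: 6 int32 slots, as described in the walkthrough.
    // Slots 0-2: Lamport flags, slot 3: clear pointer (moved from index 4),
    // slots 4-5: comm_size stored as a 64-bit value split into low/high 32-bit halves.
    alignas(8) std::int32_t flagBuffer[6] = {};

    std::uint64_t const commSize = 3ULL * 1024 * 1024 * 1024; // 3 GiB, would not fit in int32
    flagBuffer[4] = static_cast<std::int32_t>(commSize & 0xFFFFFFFFULL); // low word at even index
    flagBuffer[5] = static_cast<std::int32_t>(commSize >> 32);           // high word

    // Read it back as one 64-bit value (memcpy sidesteps aliasing concerns in this sketch).
    std::uint64_t readBack = 0;
    std::memcpy(&readBack, &flagBuffer[4], sizeof(readBack));
    std::cout << readBack << std::endl; // 3221225472 on little-endian hosts
    return 0;
}
```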
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu (1)
1-2: Update copyright year to 2025. The file has been modified but the copyright year still shows 2022-2024. As per coding guidelines, source files should contain an NVIDIA copyright header with the year of latest meaningful modification.
Proposed fix
/*
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.h (1)
1-2: Update copyright year to 2025. The file has been modified but the copyright year still shows 2022-2024.
Proposed fix
/*
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
🤖 Fix all issues with AI agents
In @cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu:
- Around line 36-37: The inline comment incorrectly states "int64_t" while the
code reads into a uint64_t; update the comment to match the actual type by
changing the comment before the read to "Read comm_size as uint64_t from two
int32 values (low and high)". Ensure this change is made alongside the existing
read into the variable comm_size (the line using reinterpret_cast to uint64_t
from workspace) so the comment and the variable type are consistent with the
usage in allReduceFusionKernels.cu.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.h
cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu
tensorrt_llm/plugin/plugin.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible
A variable that is not modified after its initialization should be declared as const
For naming of constants in C++, use uppercase snakecase with prefix 'k' (e.g., kDIGIT_NUM)
Except for 0, nullptr, true, and false, all other literals should only be used for variable initialization and not in comparisons or expressions
Use Allman indentation style for brace notation in C++ code
Put the semicolon for an empty for or while loop in a new line
The statement forming the body of a switch, while, do..while, or for statement must be a compound statement (use brace-delimited statements)
If and else statements should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camelCase with first letter lowercase (e.g., thisIsAFilename.cpp)
All types (including class names) in C++ should use PascalCase with uppercase first letter (e.g., FooBarClass)
Local variables, methods, and namespaces in C++ should use camelCase with first letter lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camelCase prefixed with 'g' (e.g., gDontUseGlobalFoos)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camelCase prefixed with 's' (e.g., sMutableStaticGlobal)
Locally visible static variables should use camelCase with 's' as the first letter (e.g., static std::once_flag sFlag;)
Public, private, and protected class member variables should use camelCase prefixed with 'm' (e.g., mNbFooValues)
Do not use Hungarian notation in C++ except for 'apps hungarian' (e.g., 'nb' to indicate count: mNbLayers)
If a constructor parameter name conflicts with a public me...
Files:
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.h
cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
**/*.{h,hpp,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hxx}: Follow Doxygen rules for documenting new C++ class interfaces and function prototypes. Use //! for C++-style single-line comments and //!< for class members
Use a preprocessor guard in C++ header files with the format TRTLLM_<FILENAME>_H, where the filename is in uppercase with no underscores, no prefix underscores, and no trailing underscores
Files:
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.h
**/*.{h,hpp,hxx,cpp,cc,cxx,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All C++ class templates, function templates, class template member functions, and class template static members must be instantiated at least once
Files:
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.h
cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification
Files:
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.h
tensorrt_llm/plugin/plugin.py
cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., some_file.py)
Python classes should use PascalCase (e.g., class SomeClass)
Python functions and methods should use snake_case (e.g., def my_awesome_function():)
Python local variables should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL)
Python constants should use upper snake_case (e.g., MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format"""<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic
Files:
tensorrt_llm/plugin/plugin.py
**/*.{cpp,cc,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cc,cxx,cu}: Use smart pointers for allocating objects on the heap in C++
Prefer unique_ptr for single resource ownership and shared_ptr for shared resource ownership in C++. Use weak_ptr only in exceptional cases
In C++ function calls where parameters are not obvious, use inline C comments to document the parameter (e.g., doSomeOperation(/* checkForErrors = */ false);)
Use the least forceful cast necessary in C++, or no cast if possible
Casting a pointer to void* in C++ should be implicit (except if removing const)
Casting in C++ should not remove any const or volatile qualification from the type of a pointer or reference
Do not use C-style casts (other than void casts) and functional notation casts (other than explicit constructor calls) in C++
Casting from void* to T* in C++ should be done with static_cast, not reinterpret_cast
Use reinterpret_cast in C++ as a last resort, where const_cast and static_cast won't work
Avoid dynamic_cast in C++
Do not use assignment operator in C++ subexpressions (e.g., x = y = z or if (x = y))
When practical, a C++ switch statement controlled by an enum should have a case for each enum value and not have a default clause
C++ switch statements should be well structured as structured multi-way branches, not as 'glorified gotos'
In C++ switch statements, prohibit fall-through except from one case label to another. Each case clause must be terminated with a break or throw
Do not end a C++ case clause with return; use break or throw instead
If a C++ switch clause is a compound statement, put the break inside the braces
Do not use C library functions in C++ whenever possible. Use C++ alternatives like brace initialization or std::fill_n() instead of memset()
Files:
cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
🧠 Learnings (21)
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.h
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
tensorrt_llm/plugin/plugin.py
cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/plugin/plugin.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Applied to files:
tensorrt_llm/plugin/plugin.py
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.
Applied to files:
tensorrt_llm/plugin/plugin.py
cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
tensorrt_llm/plugin/plugin.py
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Applied to files:
tensorrt_llm/plugin/plugin.py
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
tensorrt_llm/plugin/plugin.py
cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
tensorrt_llm/plugin/plugin.py
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
📚 Learning: 2025-12-19T06:31:54.973Z
Learnt from: nvyocox
Repo: NVIDIA/TensorRT-LLM PR: 10117
File: tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py:336-339
Timestamp: 2025-12-19T06:31:54.973Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, the cast to torch.float16 for qkv_node before creating the AttentionPlugin is intentional and required because DriveOS LLM expects float16 dtype specifically. This should not be changed to preserve original dtype or made configurable for bfloat16 models in the DriveOS LLM ONNX export path.
Applied to files:
tensorrt_llm/plugin/plugin.py
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
tensorrt_llm/plugin/plugin.py
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm(), along with setWorkspace(). Both stream and workspace are passed as arguments to run(), so there's no need to call setStream() explicitly in loraOp.cpp - this avoids redundancy and follows the intended architectural separation.
Applied to files:
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
📚 Learning: 2025-08-14T15:36:37.610Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/kernels/mlaKernels.cu:436-439
Timestamp: 2025-08-14T15:36:37.610Z
Learning: CUDA kernels prioritize performance and should avoid runtime bounds checking or conditional operations that cause branching/warp divergence. Input validation should be done at the host level before kernel launch, not per-thread in the kernel.
Applied to files:
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Applied to files:
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
📚 Learning: 2025-09-22T19:25:45.607Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp:170-179
Timestamp: 2025-09-22T19:25:45.607Z
Learning: In NCCLUserBufferAllocator::getNCCLDevComm(), multimem support is hard-coded to true because multimem is required for this function. The caller is responsible for ensuring multimem is available before calling this function - it should not be called if multimem is not supported.
Applied to files:
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
Applied to files:
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
🧬 Code graph analysis (1)
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.h (1)
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu (2)
lamport_initialize (32-36), lamport_initialize (32-32)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (6)
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu (1)
73-76: LGTM - Correct 64-bit comm_size handling. The changes correctly:
- Move clear_ptr to index 3 to make room for the 64-bit size
- Read comm_size as uint64_t from indices 4-5 (low/high)
- Place the read at index 4 (even), ensuring proper 8-byte alignment for the uint64_t reinterpret_cast
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.h (1)
46-46: LGTM - Correct use of size_t for byte count. Using size_t for the bytes parameter properly handles buffer sizes that may exceed the 2GB limit of int, which is the root cause of the overflow this PR fixes.
tensorrt_llm/plugin/plugin.py (1)
740-748: LGTM - Correct 64-bit size encoding for kernel consumption. The implementation correctly:
- Splits lamport_buffers_size into low/high 32-bit parts using proper bitmasks
- Places lamport_buffers_size_low at index 4 (even), ensuring 8-byte alignment for the kernel's uint64_t read
- Uses torch.int (32-bit) for the buffer elements, matching the kernel's expectation of two int32 values
This aligns with the C++ kernel code in allReduceFusionKernels.cu and moeAllReduceFusionKernels.cu.
cpp/tensorrt_llm/kernels/communicationKernels/allReduceWorkspace.cu (3)
24-35: LGTM - Proper size_t usage in kernel and launcher. The changes correctly:
- Use size_t for the kernel's size parameter and index calculation (line 26)
- Launch with 1024 threads per block for efficiency
- Compute grid size from byte count appropriately (see the sketch below)
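To make the launcher math concrete, here is a small host-side sketch of deriving a grid size from a byte count with 1024 threads per block. The 4-byte element width and the helper name gridSizeForBytes are assumptions for illustration, not the PR's actual launcher code.

```cpp
#include <cstddef>
#include <iostream>

// Illustrative only: assumes each thread handles one 4-byte element.
std::size_t gridSizeForBytes(std::size_t bytes, std::size_t threadsPerBlock = 1024, std::size_t bytesPerElem = 4)
{
    std::size_t const elems = (bytes + bytesPerElem - 1) / bytesPerElem; // ceil-div to elements
    return (elems + threadsPerBlock - 1) / threadsPerBlock;              // ceil-div to blocks
}

int main()
{
    std::size_t const bytes = 6ULL * 1024 * 1024 * 1024; // 6 GiB: fine in size_t, would not fit in int
    std::cout << gridSizeForBytes(bytes) << std::endl;   // 1572864 blocks
    return 0;
}
```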
48-52: LGTM - Size variables correctly widened to size_t. Using size_t for buffer_size, flag_size, lamport_comm_size, and lamport_buffer_size prevents the integer overflow that this PR addresses. The explicit static_cast<size_t>(tp_size) on line 51 ensures the multiplication is performed in 64-bit arithmetic.
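As a brief aside, here is a standalone sketch of the overflow this widening prevents, using made-up sizes and assuming a 64-bit size_t; none of the values come from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>

int main()
{
    // Hypothetical values for illustration only.
    int const tpSize = 8;
    int const perRankBytes = 768 * 1024 * 1024; // 768 MiB

    // 32-bit arithmetic: 8 * 768 MiB = 6 GiB does not fit in 32 bits.
    // Done in unsigned so the wraparound is well defined for this demonstration.
    std::uint32_t const wrapped =
        static_cast<std::uint32_t>(tpSize) * static_cast<std::uint32_t>(perRankBytes);

    // Widened arithmetic, mirroring static_cast<size_t>(tp_size) * ... in the fix.
    std::size_t const widened =
        static_cast<std::size_t>(tpSize) * static_cast<std::size_t>(perRankBytes);

    std::cout << wrapped << " vs " << widened << std::endl; // 2147483648 vs 6442450944
    return 0;
}
```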
72-81: LGTM - Consistent 64-bit size storage in flag buffer. The implementation correctly:
- Allocates 6 integers for the flag buffer (line 76)
- Splits lamport_comm_size into low/high 32-bit parts (lines 77-78)
- Places the low part at index 4 (even) for proper uint64_t alignment (line 80)
- Includes helpful comments documenting the layout (lines 72-75)
This matches the kernel read pattern in allReduceFusionKernels.cu and the Python allocation in plugin.py.
// Read comm_size as int64_t from two int32 values (low and high)
uint64_t comm_size = *reinterpret_cast<uint64_t*>(&reinterpret_cast<int*>(workspace[NRanks * 3])[4]);
Comment says "int64_t" but code uses uint64_t.
The comment should match the actual type used in the code for consistency with allReduceFusionKernels.cu.
Proposed fix
- // Read comm_size as int64_t from two int32 values (low and high)
+ // Read comm_size as uint64_t from two int32 values (low and high)
uint64_t comm_size = *reinterpret_cast<uint64_t*>(&reinterpret_cast<int*>(workspace[NRanks * 3])[4]);
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- // Read comm_size as int64_t from two int32 values (low and high)
- uint64_t comm_size = *reinterpret_cast<uint64_t*>(&reinterpret_cast<int*>(workspace[NRanks * 3])[4]);
+ // Read comm_size as uint64_t from two int32 values (low and high)
+ uint64_t comm_size = *reinterpret_cast<uint64_t*>(&reinterpret_cast<int*>(workspace[NRanks * 3])[4]);
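To illustrate the alignment argument behind placing the low word at an even index, here is a minimal host-side sketch; the buffer and offsets are hypothetical and stand in for the actual workspace layout.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>

int main()
{
    // A 6-slot int32 flag buffer with an 8-byte-aligned base, as the workspace provides.
    alignas(8) std::int32_t flags[6] = {};

    // Index 4 is even: base + 16 bytes, so the address stays 8-byte aligned and a
    // uint64_t load of flags[4]/flags[5] is safe. An odd index would only be 4-byte aligned.
    auto const addr = reinterpret_cast<std::uintptr_t>(&flags[4]);
    assert(addr % alignof(std::uint64_t) == 0);

    std::cout << "offset from base: " << addr - reinterpret_cast<std::uintptr_t>(flags) << std::endl; // 16
    return 0;
}
```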
🤖 Prompt for AI Agents
In @cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu
around lines 36 - 37, The inline comment incorrectly states "int64_t" while the
code reads into a uint64_t; update the comment to match the actual type by
changing the comment before the read to "Read comm_size as uint64_t from two
int32 values (low and high)". Ensure this change is made alongside the existing
read into the variable comm_size (the line using reinterpret_cast to uint64_t
from workspace) so the comment and the variable type are consistent with the
usage in allReduceFusionKernels.cu.
PR_Github #30890 [ run ] triggered by Bot. Commit:
PR_Github #30890 [ run ] completed with state
18f280b to 99af488 (Compare)
Signed-off-by: Yilin Zhang <[email protected]>
99af488 to c9356ce (Compare)
/bot run
PR_Github #31044 [ run ] triggered by Bot. Commit:
syuoni left a comment
LGTM. Thanks for the SoL fix! @yilin-void
PR_Github #31044 [ run ] completed with state
Summary by CodeRabbit
Bug Fixes
Chores
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional) pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional) pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.