[None][feat] Enable NCCL_SYMMETRIC as valid AllReduce tactic #10463
Conversation
/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #30770 [ run ] triggered by Bot. Commit:

PR_Github #30770 [ run ] completed with state

/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #30901 [ run ] triggered by Bot. Commit:
📝 Walkthrough

Enables NCCL_SYMMETRIC as a valid AllReduce tactic in the strategy selection by removing a TODO comment and explicitly including it in get_valid_tactics. Updates the fallback default strategy from NCCL to NCCL_SYMMETRIC when no tactic is specified.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py:
- Around lines 1721-1722: The TODO above the assignment to tactic is outdated and contradicts the documented design that AllReduceStrategy.NCCL_SYMMETRIC is the intentional default fallback. Next to the line setting tactic = AllReduceStrategy.NCCL_SYMMETRIC.value, either (a) delete the TODO if the NCCL_SYMMETRIC hanging concern is resolved, or (b) replace it with a short note explaining why NCCL_SYMMETRIC is the chosen default fallback despite the historical hanging concerns (mentioning any mitigation or testing that makes it safe), referencing the AllReduceStrategy.NCCL_SYMMETRIC symbol and the tactic variable for clarity.
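For illustration only, option (b) might read roughly like the sketch below; the -1 sentinel and the surrounding control flow are assumptions, not the file's actual code.

```python
if tactic == -1:
    # No tactic was supplied by the auto-tuner, so fall back to
    # AllReduceStrategy.NCCL_SYMMETRIC (the documented default). The historical
    # hanging concern is assumed to be mitigated by the workspace-size guard in
    # get_valid_tactics and by the multi-GPU CI runs on this PR.
    tactic = AllReduceStrategy.NCCL_SYMMETRIC.value
```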
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., some_file.py)
Python classes should use PascalCase (e.g., class SomeClass)
Python functions and methods should use snake_case (e.g., def my_awesome_function():)
Python local variables should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL)
Python constants should use upper snake_case (e.g., MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format """<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic (see the sketch after this list)
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
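Purely as an illustrative sketch (not code from the repository), the naming and try/except/else guidelines above could look like this in practice:

```python
# Illustrative only: names and values are invented for the example.
G_DEFAULT_TIMEOUT_S = 30  # global: upper snake_case with G prefix
MAX_RETRIES = 3           # constant: upper snake_case


def parse_latency_report(raw_values):
    """Parse raw latency strings into floats.

    Keeps the try body minimal, limits except to the narrowest expected
    error, and places the main logic in the else block.
    """
    parsed = []
    for value in raw_values:
        try:
            number = float(value)  # smallest possible try body
        except ValueError:
            continue  # skip non-numeric entries only
        else:
            parsed.append(number)  # main logic lives in else
    parsed.sort()
    # Local variable starting with a number gets the k prefix.
    k_99th_percentile = parsed[int(0.99 * (len(parsed) - 1))] if parsed else None
    return parsed, k_99th_percentile
```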
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification (an example header follows after this block)
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
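As a hedged example of the copyright-header guideline above, a header in a Python source file commonly takes roughly this SPDX form; the year and license identifier must match the file's real history and licensing, so treat this as an assumption rather than the canonical template:

```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```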
🧠 Learnings (6)
📓 Common learnings
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-12-12T10:07:31.564Z
Learnt from: lirundong
Repo: NVIDIA/TensorRT-LLM PR: 9725
File: tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py:110-178
Timestamp: 2025-12-12T10:07:31.564Z
Learning: In PyTorch custom operators registered with torch.library.custom_op, mutable operators that return None and specify mutates_args do not require a register_fake decorator. Mutation tracking is handled automatically without needing a FakeTensor kernel. This applies to Python custom op definitions in tensorrt_llm/_torch/custom_ops that use mutates_args and return None; verify you are not relying on register_fake in these cases. (A minimal sketch follows after this block.)
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
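A minimal sketch of the pattern described in the learning above, assuming PyTorch 2.4+ where torch.library.custom_op is available; the namespace and op name are invented for illustration:

```python
import torch


# Mutable custom op: scales `out` in place and returns None. Because it
# declares the mutation via mutates_args and has no outputs, no
# register_fake kernel is needed for FakeTensor tracing.
@torch.library.custom_op("example::scale_", mutates_args=("out",))
def scale_(out: torch.Tensor, factor: float) -> None:
    out.mul_(factor)


if __name__ == "__main__":
    t = torch.ones(4)
    torch.ops.example.scale_(t, 2.0)
    print(t)  # tensor([2., 2., 2., 2.])
```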
🧬 Code graph analysis (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
tensorrt_llm/functional.py (1)
AllReduceStrategy (lines 3874-3884)
🔇 Additional comments (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
1686-1712: LGTM: NCCL_SYMMETRIC strategy addition looks correct. The addition of NCCL_SYMMETRIC as the first valid strategy aligns with the fallback behavior change and enables auto-tuning to consider this strategy. The workspace size check appropriately protects against potential issues with large tensors.
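For context only, a rough sketch of how a tactic list with such a workspace guard might look; the signature, the comparison, and the set of other strategies listed are assumptions, not the repository's actual implementation:

```python
from typing import List

from tensorrt_llm.functional import AllReduceStrategy


def get_valid_tactics_sketch(message_size_bytes: int,
                             symmetric_workspace_bytes: int) -> List[int]:
    """Return candidate AllReduce strategies for the auto-tuner to profile."""
    tactics: List[int] = []
    # NCCL_SYMMETRIC is now a first-class candidate, guarded so that tensors
    # larger than the symmetric workspace do not select it.
    if message_size_bytes <= symmetric_workspace_bytes:
        tactics.append(AllReduceStrategy.NCCL_SYMMETRIC.value)
    # Existing candidates remain available.
    tactics.append(AllReduceStrategy.NCCL.value)
    tactics.append(AllReduceStrategy.ONESHOT.value)
    tactics.append(AllReduceStrategy.TWOSHOT.value)
    return tactics
```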
/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #30924 [ run ] triggered by Bot. Commit:

We may want to hold off merging this until this can be merged: #10517
hyukn left a comment
LGTM. But it looks like the latest CI still timed out on some tests. We might need further verification.
PR_Github #30924 [ run ] completed with state
I agree. But I am OK with holding off until we have further clarification.

a8d68f1 to 8ddb430 (force-pushed)
/bot run --stage-list "DGX_H100-2_GPUs-PyTorch-Others-1"

PR_Github #31092 [ run ] triggered by Bot. Commit:

PR_Github #31092 [ run ] completed with state
Holding off on this for now.
2d6c52b to 118c534 (force-pushed)
/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #31572 [ run ] triggered by Bot. Commit:

PR_Github #31572 [ run ] completed with state

/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #31717 [ run ] triggered by Bot. Commit:

PR_Github #31717 [ run ] completed with state
One failure was an unrelated timeout on SBSA multi, but the server is generating responses, so no hang.
/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #31800 [ run ] triggered by Bot. Commit:

PR_Github #31800 [ run ] completed with state

@hyukn CI is green now.

I ran the other waived tests here: #10517
b06242a to 2be814f (force-pushed)
/bot run --disable-fail-fast --add-multi-gpu-test
905a4e2 to 1703a3e (force-pushed)
/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #32316 [ run ] triggered by Bot. Commit:

PR_Github #32316 [ run ] completed with state

Failing tests appear to be unrelated to NCCL_SYMMETRIC.

/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #32365 [ run ] triggered by Bot. Commit:

PR_Github #32365 [ run ] completed with state

/bot run --disable-fail-fast --add-multi-gpu-test
Signed-off-by: Ludwig Schneider <[email protected]>
Signed-off-by: Ludwig Schneider <[email protected]>
It was added by mistake. Signed-off-by: Ludwig Schneider <[email protected]>
1703a3e to 686defe (force-pushed)
/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #32421 [ run ] triggered by Bot. Commit:

PR_Github #32421 [ run ] completed with state
Summary by CodeRabbit
Performance