Conversation

@nv-lschneider (Collaborator) commented Jan 6, 2026

Summary by CodeRabbit

Performance

  • Expanded the set of valid AllReduce tactics considered for collective communication operations, allowing the auto-tuner to select better-performing strategies (including NCCL_SYMMETRIC).
  • Changed the fallback default strategy from NCCL to NCCL_SYMMETRIC when no tactic is specified.


@nv-lschneider changed the title to [None][feat] @coderabbitai title on Jan 6, 2026
@nv-lschneider (Collaborator, Author):

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd: PR_Github #30770 [ run ] triggered by Bot. Commit: 67a647e

@tensorrt-cicd: PR_Github #30770 [ run ] completed with state SUCCESS. Commit: 67a647e
/LLM/main/L0_MergeRequest_PR pipeline #23753 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@hyukn (Collaborator) commented Jan 7, 2026

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd: PR_Github #30901 [ run ] triggered by Bot. Commit: 67a647e

@nv-lschneider self-assigned this Jan 7, 2026
@nv-lschneider marked this pull request as ready for review January 7, 2026 16:09
@nv-lschneider requested a review from a team as a code owner January 7, 2026 16:10
@nv-lschneider requested a review from hyukn January 7, 2026 16:10
@coderabbitai bot changed the title from [None][feat] @coderabbitai title to [None][feat] [feat] Enable NCCL_SYMMETRIC as valid AllReduce tactic on Jan 7, 2026
@coderabbitai bot (Contributor) commented Jan 7, 2026

📝 Walkthrough

Enables NCCL_SYMMETRIC as a valid AllReduce tactic in the strategy selection by removing a TODO comment and explicitly including it in get_valid_tactics. Updates the fallback default strategy from NCCL to NCCL_SYMMETRIC when no tactic is specified.

Changes

Cohort / File(s): AllReduce Tactic Strategy — tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
Summary: Adds NCCL_SYMMETRIC to the set of valid AllReduce tactics in get_valid_tactics() by removing the TODO comment that had kept it disabled. Changes the fallback default in AllReduceRunner.forward() from NCCL.value to NCCL_SYMMETRIC.value when tactic is -1.
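For orientation, a minimal sketch of what the get_valid_tactics() side of this change could look like; the method signature, ordering, and the other strategies listed are assumptions, not the verbatim diff:

```python
# Illustrative sketch only; not the verbatim change from torch_custom_ops.py.
from tensorrt_llm.functional import AllReduceStrategy


def get_valid_tactics(self, inputs, profile):  # signature assumed for illustration
    # NCCL_SYMMETRIC is now offered to the auto-tuner as a valid AllReduce
    # tactic instead of being left disabled behind a TODO comment.
    return [
        AllReduceStrategy.NCCL_SYMMETRIC.value,
        AllReduceStrategy.NCCL.value,
        # ... remaining strategies elided
    ]
```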

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
  • Description check — ⚠️ Warning. Explanation: The PR description is incomplete and does not follow the required template structure; the current description contains only the '@coderabbitai summary' placeholder. Resolution: Provide a complete PR description including a clear explanation of the issue/motivation (Description section), details on what test cases validate the changes (Test Coverage section), and confirmation of the PR Checklist items.
  • Docstring Coverage — ⚠️ Warning. Explanation: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)
  • Title Check — ✅ Passed. Explanation: Title check skipped as CodeRabbit has written the PR title.

@coderabbitai bot (Contributor) left a comment:

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py:
- Around lines 1721-1722: The TODO above the assignment to tactic is outdated and contradicts the documented design that AllReduceStrategy.NCCL_SYMMETRIC is the intentional default fallback. Remove or update that TODO comment next to the line setting tactic = AllReduceStrategy.NCCL_SYMMETRIC.value so that it is either (a) deleted, if the NCCL_SYMMETRIC hanging concern is resolved, or (b) replaced with a short clarifying note explaining why NCCL_SYMMETRIC is chosen as the default fallback despite historical hanging concerns (mentioning any mitigation or testing that makes it safe), referencing the AllReduceStrategy.NCCL_SYMMETRIC symbol and the tactic variable for clarity.
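For option (b), a minimal sketch of what the clarified fallback in AllReduceRunner.forward() could look like; the comment wording below is a suggestion, not repository text:

```python
if tactic == -1:
    # AllReduceStrategy.NCCL_SYMMETRIC is the intentional default fallback for
    # `tactic`: it is included in get_valid_tactics(), and the test that
    # previously hung with NCCL_SYMMETRIC is being unwaived in #10517.
    tactic = AllReduceStrategy.NCCL_SYMMETRIC.value
```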
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77be1b7 and 67a647e.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., some_file.py)
Python classes should use PascalCase (e.g., class SomeClass)
Python functions and methods should use snake_case (e.g., def my_awesome_function():)
Python local variables should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL)
Python constants should use upper snake_case (e.g., MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format """<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
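A quick, hypothetical illustration of several of the Python conventions listed above (this snippet is not code from the repository):

```python
MY_DEFAULT_SCALE = 2.0  # constant: upper snake_case


class TensorScaler:  # class name: PascalCase
    """Scales values by a fixed factor.

    Attributes:
        scale: float: Multiplicative factor applied by scale_value().
    """

    def __init__(self, scale: float = MY_DEFAULT_SCALE):
        self.scale = scale  # externally visible member initialized in the constructor

    def scale_value(self, value: float) -> float:  # method name: snake_case
        """Returns ``value`` multiplied by the configured scale."""
        return value * self.scale
```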
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-12-12T10:07:31.564Z
Learnt from: lirundong
Repo: NVIDIA/TensorRT-LLM PR: 9725
File: tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py:110-178
Timestamp: 2025-12-12T10:07:31.564Z
Learning: In PyTorch custom operators registered with torch.library.custom_op, mutable operators that return None and specify mutates_args do not require a register_fake decorator. Mutation tracking is handled automatically without needing a FakeTensor kernel. This applies to Python custom op definitions in tensorrt_llm/_torch/custom_ops that use mutates_args and return None; verify you are not relying on register_fake in these cases.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
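A minimal, self-contained illustration of that last learning, assuming PyTorch 2.4+ (the op name demo::inplace_scale is hypothetical, not an op from this repository):

```python
import torch


@torch.library.custom_op("demo::inplace_scale", mutates_args=("x",))
def inplace_scale(x: torch.Tensor, scale: float) -> None:
    # Mutates `x` in place and returns None; because mutates_args is declared,
    # mutation tracking is handled automatically and no register_fake
    # (FakeTensor) kernel is required.
    x.mul_(scale)


t = torch.ones(4)
inplace_scale(t, 2.0)  # t is now [2., 2., 2., 2.]
```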
🧬 Code graph analysis (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3874-3884)
🔇 Additional comments (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

1686-1712: LGTM: NCCL_SYMMETRIC strategy addition looks correct.

The addition of NCCL_SYMMETRIC as the first valid strategy aligns with the fallback behavior change and enables auto-tuning to consider this strategy. The workspace size check appropriately protects against potential issues with large tensors.
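The workspace-size check itself is not quoted above; a hedged sketch of the kind of guard being referred to (the names max_workspace_size and message_bytes are illustrative assumptions):

```python
def fits_nccl_symmetric_workspace(input_tensor, max_workspace_size):
    # Only offer NCCL_SYMMETRIC when the message fits the symmetric workspace;
    # larger tensors are left to the other AllReduce strategies.
    message_bytes = input_tensor.numel() * input_tensor.element_size()
    return message_bytes <= max_workspace_size
```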

@nv-lschneider (Collaborator, Author):

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd: PR_Github #30924 [ run ] triggered by Bot. Commit: a8d68f1

@nv-lschneider (Collaborator, Author):

We may want to hold off on merging this until #10517 can be merged.
It unwaives a test that was previously hanging with NCCL_SYMMETRIC.

@hyukn (Collaborator) left a comment:

LGTM. But it looks like the latest CI still timed out on some tests. We might need further verification.

@tensorrt-cicd: PR_Github #30924 [ run ] completed with state SUCCESS. Commit: a8d68f1
/LLM/main/L0_MergeRequest_PR pipeline #23886 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nv-lschneider (Collaborator, Author):

> LGTM. But it looks like the latest CI still timed out on some tests. We might need further verification.

I agree. I took a deep look into the logs of the failing tests, and there are two insights:
a) It mostly fails on x86, not SBSA.
b) Looking at the logs, the tests seem to time out while the servers are still actively outputting startup logs.
That doesn't look like a true hang to me, so it might be unrelated. I saw that an investigation of this is ongoing, but so far I still assume it is unrelated to NCCL_SYMMETRIC.

But I am OK with holding off until we have further clarification.

@nv-lschneider force-pushed the lschneider/autotune-nccl-symm branch from a8d68f1 to 8ddb430 on January 8, 2026 16:43
@nv-lschneider (Collaborator, Author):

/bot run --stage-list "DGX_H100-2_GPUs-PyTorch-Others-1"

@tensorrt-cicd: PR_Github #31092 [ run ] triggered by Bot. Commit: 8ddb430

@tensorrt-cicd: PR_Github #31092 [ run ] completed with state FAILURE. Commit: 8ddb430
/LLM/main/L0_MergeRequest_PR pipeline #24009 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nv-lschneider (Collaborator, Author):

Holding off on this for now.

@nv-lschneider force-pushed the lschneider/autotune-nccl-symm branch 3 times, most recently from 2d6c52b to 118c534 on January 12, 2026 13:29
@nv-lschneider (Collaborator, Author):

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd: PR_Github #31572 [ run ] triggered by Bot. Commit: 118c534

@tensorrt-cicd: PR_Github #31572 [ run ] completed with state SUCCESS. Commit: 118c534
/LLM/main/L0_MergeRequest_PR pipeline #24413 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@hyukn (Collaborator) commented Jan 13, 2026

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd: PR_Github #31717 [ run ] triggered by Bot. Commit: 118c534

@tensorrt-cicd: PR_Github #31717 [ run ] completed with state SUCCESS. Commit: 118c534
/LLM/main/L0_MergeRequest_PR pipeline #24541 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nv-lschneider (Collaborator, Author):

> PR_Github #31717 [ run ] completed with state SUCCESS. Commit: 118c534 /LLM/main/L0_MergeRequest_PR pipeline #24541 completed with status: 'FAILURE'

One failure was an unrelated timeout on SBSA multi-GPU, but the server is generating responses, so there is no hang.
The other failure is a worker restart failure, which might be unrelated.

@nv-lschneider (Collaborator, Author):

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd: PR_Github #31800 [ run ] triggered by Bot. Commit: 118c534

@tensorrt-cicd: PR_Github #31800 [ run ] completed with state SUCCESS. Commit: 118c534
/LLM/main/L0_MergeRequest_PR pipeline #24614 completed with status: 'SUCCESS'

@nv-lschneider (Collaborator, Author):

@hyukn CI is green now.
Should be good to merge, right?

@nv-lschneider (Collaborator, Author):

I ran the other waived tests here: #10517, and they passed. So I am doing full testing there, but that shouldn't necessarily hold us back here.

@nv-lschneider force-pushed the lschneider/autotune-nccl-symm branch from b06242a to 2be814f on January 15, 2026 13:55
@nv-lschneider (Collaborator, Author):

/bot run --disable-fail-fast --add-multi-gpu-test

@nv-lschneider force-pushed the lschneider/autotune-nccl-symm branch 2 times, most recently from 905a4e2 to 1703a3e on January 16, 2026 16:14
@nv-lschneider (Collaborator, Author):

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd: PR_Github #32316 [ run ] triggered by Bot. Commit: 1703a3e

@tensorrt-cicd: PR_Github #32316 [ run ] completed with state SUCCESS. Commit: 1703a3e
/LLM/main/L0_MergeRequest_PR pipeline #25043 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nv-lschneider (Collaborator, Author):

> PR_Github #32316 [ run ] completed with state SUCCESS. Commit: 1703a3e /LLM/main/L0_MergeRequest_PR pipeline #25043 completed with status: 'FAILURE'

The failing tests appear to be unrelated to NCCL_SYMMETRIC. Retrying.

@nv-lschneider (Collaborator, Author):

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd: PR_Github #32365 [ run ] triggered by Bot. Commit: 1703a3e

@tensorrt-cicd: PR_Github #32365 [ run ] completed with state SUCCESS. Commit: 1703a3e
/LLM/main/L0_MergeRequest_PR pipeline #25080 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nv-lschneider (Collaborator, Author):

/bot run --disable-fail-fast --add-multi-gpu-test

(Commits pushed to the branch, each signed off by Ludwig Schneider <[email protected]>; one commit message notes "It was added by mistake.")
@nv-lschneider force-pushed the lschneider/autotune-nccl-symm branch from 1703a3e to 686defe on January 17, 2026 22:32
@nv-lschneider (Collaborator, Author):

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd: PR_Github #32421 [ run ] triggered by Bot. Commit: 686defe

@tensorrt-cicd: PR_Github #32421 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 8 PM PST on 1/17.
