[None][fix] [fix] Make NCCL resource manager destructor exception-safe#10166
Conversation
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
📝 WalkthroughWalkthroughHardens the NCCL resource manager's destructor and cleanup path to be exception-safe during static destruction. Introduces a destruction-state flag, moves resource cleanup outside the mutex, wraps cleanup and logging calls in try-catch blocks, and prevents re-entrant cleanup. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro 📒 Files selected for processing (3)
🧰 Additional context used📓 Path-based instructions (3)**/*.{cpp,h,cu,cuh}📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Files:
**/*.h📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Files:
**/*.{cpp,h,cu,cuh,py}📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Files:
🧠 Learnings (11)📚 Learning: 2025-09-23T15:12:38.312ZApplied to files:
📚 Learning: 2025-09-23T15:01:00.070ZApplied to files:
📚 Learning: 2025-09-23T15:01:00.070ZApplied to files:
📚 Learning: 2025-09-16T09:30:09.716ZApplied to files:
📚 Learning: 2025-10-13T19:45:03.518ZApplied to files:
📚 Learning: 2025-09-23T15:12:38.312ZApplied to files:
📚 Learning: 2025-09-23T14:58:05.372ZApplied to files:
📚 Learning: 2025-08-20T06:56:02.889ZApplied to files:
📚 Learning: 2025-09-02T13:42:44.885ZApplied to files:
📚 Learning: 2025-08-21T09:41:49.347ZApplied to files:
📚 Learning: 2025-09-22T19:25:45.607ZApplied to files:
🧬 Code graph analysis (2)cpp/tensorrt_llm/common/ncclUtils.h (1)
cpp/tensorrt_llm/common/opUtils.cpp (1)
🪛 Cppcheck (2.18.0)cpp/tensorrt_llm/common/ncclUtils.cpp[error] 51-51: There is an unknown macro here somewhere. Configuration is required. If TRTLLM_NAMESPACE_BEGIN is a macro then please configure it. (unknownMacro) ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
🔇 Additional comments (8)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/bot run |
|
PR_Github #29190 [ run ] triggered by Bot. Commit: |
|
PR_Github #29190 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #29427 [ run ] triggered by Bot. Commit: |
|
PR_Github #29427 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
Reproduce steps:
|
|
PR_Github #29464 [ run ] triggered by Bot. Commit: |
|
PR_Github #29464 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #29466 [ run ] triggered by Bot. Commit: |
|
PR_Github #29466 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #29489 [ run ] triggered by Bot. Commit: |
|
PR_Github #29489 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #29658 [ run ] triggered by Bot. Commit: |
|
PR_Github #29658 [ run ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #30040 [ run ] triggered by Bot. Commit: |
|
PR_Github #30040 [ run ] completed with state |
|
Maybe we should add multi-gpu stages for more coverage? |
|
/bot run --only-multi-gpu-test |
|
PR_Github #30387 [ run ] triggered by Bot. Commit: |
|
PR_Github #30387 [ run ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #30402 [ run ] triggered by Bot. Commit: |
|
PR_Github #30402 [ run ] completed with state |
|
/bot run --only-multi-gpu-test |
|
PR_Github #30413 [ run ] triggered by Bot. Commit: |
|
PR_Github #30413 [ run ] completed with state |
|
/bot run --reuse-pipeline |
|
PR_Github #30450 Bot args parsing error: usage: /bot [-h] |
|
/bot reuse-pipeline |
|
PR_Github #30451 [ reuse-pipeline ] triggered by Bot. Commit: |
|
PR_Github #30451 [ reuse-pipeline ] completed with state |
NVIDIA#10166) Signed-off-by: Ludwig Schneider <lschneider@nvidia.com> Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.
Description
It was reported that the clean up of NCCL resources is not always clean and could end in segfaults.
I am struggling to reproduce this exact error.
But upon inspection of the code, I can see why they is a potential problem when tearing down the static ResourceManager.
In particular the unprotected access of the Logger is potentially the issue here.
I updated the code, to be a lot more conservative in the tear-down process to prevent problem in the future.
Test Coverage
No new test needed.
PR Checklist