[None][fix] [fix] Make NCCL resource manager destructor exception-safe by nv-lschneider · Pull Request #10166 · NVIDIA/TensorRT-LLM

nv-lschneider · 2025-12-19T22:15:10Z

Summary by CodeRabbit

Bug Fixes
- Improved stability during application shutdown by strengthening exception handling in resource cleanup sequences.
- Enhanced robustness of cleanup operations to prevent crashes in static destruction scenarios through safer locking patterns and guarded exception handling.
- Made logging operations during teardown more resilient to prevent failure propagation during shutdown procedures.

Description

It was reported that the cleanup of NCCL resources does not always complete cleanly and can end in segfaults.

I am struggling to reproduce this exact error.
But upon inspection of the code, I can see why there is a potential problem when tearing down the static ResourceManager.

In particular, the unprotected access to the Logger is potentially the issue here.

I updated the code to be a lot more conservative in the tear-down process, to prevent such problems in the future.

Test Coverage

No new test needed.

PR Checklist

Please check this after reviewing the above items as appropriate for this PR.

Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>

coderabbitai · 2025-12-19T22:18:09Z

📝 Walkthrough

Walkthrough

Hardens the NCCL resource manager's destructor and cleanup path to be exception-safe during static destruction. Introduces a destruction-state flag, moves resource cleanup outside the mutex, wraps cleanup and logging calls in try-catch blocks, and prevents re-entrant cleanup.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| NCCL Resource Manager Header `cpp/tensorrt_llm/common/ncclUtils.h` | Changed destructor from defaulted to explicit declaration; added private atomic flag `mIsDestroying` to track destruction state and prevent re-entrant cleanup. |
| NCCL Resource Manager Implementation `cpp/tensorrt_llm/common/ncclUtils.cpp` | Added custom destructor that safely cleans up registered resources outside the mutex; extended `cleanupResources` with destruction checks and exception handling; wrapped resource cleanup and logging in try-catch to prevent exceptions during static destruction. |
| Operation Utilities `cpp/tensorrt_llm/common/opUtils.cpp` | Protected logging call in NCCL comm destructor path with try-catch to prevent logging failures from propagating during static destruction. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Exception-handling patterns — Multiple similar try-catch blocks around logging and cleanup; verify exception swallowing is intentional and complete
Mutex and atomic interaction — Confirm proper lock acquire/release semantics and that the atomic flag prevents data races
Resource iteration during cleanup — Verify moving resources outside the lock preserves expected cleanup order and doesn't introduce use-after-free

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The title contains a placeholder (@coderabbitai title) that was not replaced with an actual title, leaving only the ticket reference and type tags. | Replace the @coderabbitai title placeholder with an actual descriptive title, such as '[None][fix] Harden NCCL resource cleanup during static destruction'. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The description covers the core issue, root cause analysis, and approach, with the PR checklist box checked. However, the description lacks detail on what 'more conservative' changes entail and test coverage explanation is minimal. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20b69a9 and f3000d8.

📒 Files selected for processing (3)

cpp/tensorrt_llm/common/ncclUtils.cpp (3 hunks)
cpp/tensorrt_llm/common/ncclUtils.h (2 hunks)
cpp/tensorrt_llm/common/opUtils.cpp (1 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

**/*.{cpp,h,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,h,cu,cuh}: Closing braces of namespaces should have a comment saying the namespace it closes: } // namespace foo
Prefer const or constexpr variables over #define whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as const
For naming of constants in C++, follow the naming section conventions
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization in C++
Use the Allman indentation style in C++
Put the semicolon for an empty for or while loop in a new line in C++
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements) in C++
If and else should always be followed by brace-delimited statements, even if empty or a single statement in C++
C++ filenames should use camel case with first letter lowercase: thisIsASubDir and thisIsAFilename.cpp
All files involved in the compilation of a compilation target (.exe/.so) must have filenames that are case-insensitive unique in C++
All types (including class names) in C++ should use camel case with uppercase first letter: FooBarClass
Local variables, methods and namespaces in C++ should use camel case with first letter lowercase: localFooBar
Non-magic-number global variables that are non-static and not defined in anonymous namespace in C++ should use camel case prefixed by a lower case 'g': gDontUseGlobalFoos
Non-magic-number global variables that are static or defined in an anonymous namespace in C++ should use camel case prefixed by a lower case 's': sMutableStaticGlobal
Locally visible static variables in C++ should use camel case with lowercase prefix 's' as the first letter: static std::once_flag sFlag;
Public, private and protected class member variables in C++ should use camel case prefi...

Files:

cpp/tensorrt_llm/common/ncclUtils.h
cpp/tensorrt_llm/common/ncclUtils.cpp
cpp/tensorrt_llm/common/opUtils.cpp

**/*.h

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.h: Use a preprocessor guard in C++ header files with the format TRTLLM_<FILENAME>_H derived from the filename in all caps
The preprocessor guard name in C++ must have prefix TRTLLM_ followed by the filename, all in caps. Only use the file name, not directory names
Do not use prefix with underscore in C++ preprocessor guard symbols as such symbols are reserved in C++ standard for compilers or implementation
Do not use trailing underscore in C++ preprocessor guard symbols (unlike Google C++ guideline)

Files:

cpp/tensorrt_llm/common/ncclUtils.h

**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

cpp/tensorrt_llm/common/ncclUtils.h
cpp/tensorrt_llm/common/ncclUtils.cpp
cpp/tensorrt_llm/common/opUtils.cpp

🧠 Learnings (11)

📚 Learning: 2025-09-23T15:12:38.312Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

cpp/tensorrt_llm/common/ncclUtils.h
cpp/tensorrt_llm/common/ncclUtils.cpp
cpp/tensorrt_llm/common/opUtils.cpp

📚 Learning: 2025-09-23T15:01:00.070Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

cpp/tensorrt_llm/common/ncclUtils.h
cpp/tensorrt_llm/common/ncclUtils.cpp
cpp/tensorrt_llm/common/opUtils.cpp

📚 Learning: 2025-09-23T15:01:00.070Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.

Applied to files:

cpp/tensorrt_llm/common/ncclUtils.h
cpp/tensorrt_llm/common/ncclUtils.cpp
cpp/tensorrt_llm/common/opUtils.cpp

📚 Learning: 2025-09-16T09:30:09.716Z

Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.

Applied to files:

cpp/tensorrt_llm/common/ncclUtils.h
cpp/tensorrt_llm/common/opUtils.cpp

📚 Learning: 2025-10-13T19:45:03.518Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.

Applied to files:

cpp/tensorrt_llm/common/ncclUtils.h
cpp/tensorrt_llm/common/ncclUtils.cpp

📚 Learning: 2025-09-23T15:12:38.312Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

cpp/tensorrt_llm/common/ncclUtils.h
cpp/tensorrt_llm/common/ncclUtils.cpp
cpp/tensorrt_llm/common/opUtils.cpp

📚 Learning: 2025-09-23T14:58:05.372Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

cpp/tensorrt_llm/common/ncclUtils.h
cpp/tensorrt_llm/common/opUtils.cpp

📚 Learning: 2025-08-20T06:56:02.889Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

cpp/tensorrt_llm/common/ncclUtils.h

📚 Learning: 2025-09-02T13:42:44.885Z

Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.

Applied to files:

cpp/tensorrt_llm/common/ncclUtils.h
cpp/tensorrt_llm/common/opUtils.cpp

📚 Learning: 2025-08-21T09:41:49.347Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

cpp/tensorrt_llm/common/ncclUtils.h

📚 Learning: 2025-09-22T19:25:45.607Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp:170-179
Timestamp: 2025-09-22T19:25:45.607Z
Learning: In NCCLUserBufferAllocator::getNCCLDevComm(), multimem support is hard-coded to true because multimem is required for this function. The caller is responsible for ensuring multimem is available before calling this function - it should not be called if multimem is not supported.

Applied to files:

cpp/tensorrt_llm/common/opUtils.cpp

🧬 Code graph analysis (2)

cpp/tensorrt_llm/common/ncclUtils.h (1)

cpp/tensorrt_llm/common/ncclUtils.cpp (1)

NcclCommResourceManager (40-78)

cpp/tensorrt_llm/common/opUtils.cpp (1)

cpp/tensorrt_llm/common/ncclUtils.cpp (6)

getInstance (34-38)

getInstance (34-34)

getInstance (217-221)

getInstance (217-217)

getInstance (345-349)

getInstance (345-345)

🪛 Cppcheck (2.18.0)

cpp/tensorrt_llm/common/ncclUtils.cpp

[error] 51-51: There is an unknown macro here somewhere. Configuration is required. If TRTLLM_NAMESPACE_BEGIN is a macro then please configure it.

(unknownMacro)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (8)

cpp/tensorrt_llm/common/ncclUtils.h (2)

29-29: LGTM!

The <atomic> include is correctly added to support the new mIsDestroying atomic member variable.

143-149: Well-designed destruction safety mechanism.

The custom destructor declaration and mIsDestroying atomic flag properly enable the destruction-safe cleanup pattern. The atomic ensures visibility across threads during the static destruction phase, and the implementation (in ncclUtils.cpp) correctly uses memory_order_release when setting and memory_order_acquire when loading for proper synchronization.

cpp/tensorrt_llm/common/opUtils.cpp (2)

126-129: LGTM!

The comment clarifies the destruction guard mechanism - cleanupResources will return early if the singleton is being destroyed, allowing the destructor to handle cleanup proactively.

135-143: Appropriate destruction-safe logging.

Wrapping the logging call in try-catch prevents potential segfaults if the Logger singleton is destroyed before this deleter runs during static destruction. This is consistent with the pattern applied in ncclUtils.cpp.
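
The guarded-logging idea in a deleter can be sketched as follows. This is a minimal illustration, not the code from `opUtils.cpp`: `DemoLogger`, `safeLog`, and `makeGuardedHandle` are hypothetical names, and the `int` handle stands in for a communicator. Note that a try-catch only intercepts C++ exceptions; it cannot recover from undefined behavior such as dereferencing an already destroyed object, so the sketch models the torn-down logger as one that throws.

```cpp
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a logger that may already be torn down.
struct DemoLogger
{
    bool alive = true;
    int messages = 0;

    void info(std::string const& /*msg*/)
    {
        if (!alive)
        {
            throw std::runtime_error("logger used after teardown");
        }
        ++messages;
    }
};

// Runs a logging callable and swallows anything it throws.
// Returns true if the call completed, false if it threw.
inline bool safeLog(std::function<void()> const& logCall) noexcept
{
    try
    {
        logCall();
        return true;
    }
    catch (...)
    {
        return false;
    }
}

// Builds a shared_ptr whose deleter logs defensively and always frees the handle.
inline std::shared_ptr<int> makeGuardedHandle(DemoLogger& logger, bool& logged)
{
    return std::shared_ptr<int>(new int(0),
        [&logger, &logged](int* handle)
        {
            logged = safeLog([&logger] { logger.info("destroying handle"); });
            delete handle; // release happens regardless of the logging outcome
        });
}
```

The point of the design is that the release of the handle never depends on whether the log statement succeeded.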

cpp/tensorrt_llm/common/ncclUtils.cpp (4)

40-78: Solid destruction-safe cleanup implementation.

The destructor correctly:

1. Sets mIsDestroying with memory_order_release before any cleanup
2. Moves resources out of the map under the lock
3. Clears the map while holding the lock
4. Performs cleanup outside the lock to avoid deadlocks
5. Wraps each cleanup call in try-catch to prevent exceptions from escaping the destructor

This ensures resources are cleaned up in a controlled manner during static destruction, avoiding races with cleanupResources calls from shared_ptr deleters.
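
The destructor sequence described above can be sketched roughly like this. It is an assumption-laden illustration rather than the actual `NcclCommResourceManager`: the class name `DemoResourceManager`, the `int` key, and the `std::function<void()>` callback type are all hypothetical; only the member name `mIsDestroying` comes from the review.

```cpp
#include <atomic>
#include <functional>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical manager; keys and callback types are illustrative only.
class DemoResourceManager
{
public:
    using Cleanup = std::function<void()>;

    void registerResource(int key, Cleanup cleanup)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mResources[key].push_back(std::move(cleanup));
    }

    ~DemoResourceManager()
    {
        // Publish the destruction state before touching anything else.
        mIsDestroying.store(true, std::memory_order_release);

        // Take ownership of the callbacks and empty the map under the lock.
        std::map<int, std::vector<Cleanup>> pending;
        try
        {
            std::lock_guard<std::mutex> lock(mMutex);
            pending = std::move(mResources);
            mResources.clear();
        }
        catch (...)
        {
            return; // locking failed; nothing safe left to do
        }

        // Run callbacks without holding the lock, isolating each failure.
        for (auto& entry : pending)
        {
            for (auto& cleanup : entry.second)
            {
                try
                {
                    cleanup();
                }
                catch (...)
                {
                    // Swallowed on purpose: a destructor must not throw.
                }
            }
        }
    }

private:
    std::mutex mMutex;
    std::map<int, std::vector<Cleanup>> mResources;
    std::atomic<bool> mIsDestroying{false};
};
```

Running the callbacks on the local `pending` copy means a callback that re-enters the manager cannot deadlock on `mMutex`.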

103-108: Correct early-return guard for destruction phase.

The memory_order_acquire load correctly synchronizes with the memory_order_release store in the destructor, ensuring that if we see mIsDestroying == true, we also see all prior writes made by the destructor. Returning early prevents double-cleanup and potential use-after-free.

113-152: Conservative but appropriate defensive coding for static destruction.

The try-catch around mutex acquisition handles the edge case where the mutex itself may be in an undefined state during static destruction. The double-check of mIsDestroying after acquiring the lock (lines 119-123) correctly handles the race where destruction starts between the first check and lock acquisition.

Moving resources out under the lock, then processing cleanup outside the lock, is the right pattern to avoid deadlocks if cleanup callbacks try to access the manager.
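
A rough sketch of the early-return guard with the second check under the lock, assuming a simplified manager. `DemoGuardedManager`, `beginDestruction` (a stand-in for the first statement of the destructor), the `int` key, and the returned callback count are hypothetical and only serve the illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical manager showing the check / lock / re-check sequence.
class DemoGuardedManager
{
public:
    using Cleanup = std::function<void()>;

    void registerResource(int key, Cleanup cleanup)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mResources[key].push_back(std::move(cleanup));
    }

    // Stand-in for the flag store at the top of the destructor.
    void beginDestruction() noexcept
    {
        mIsDestroying.store(true, std::memory_order_release);
    }

    // Returns the number of callbacks that ran to completion.
    std::size_t cleanupResources(int key) noexcept
    {
        // First check: skip all work if destruction has already started.
        if (mIsDestroying.load(std::memory_order_acquire))
        {
            return 0;
        }

        std::vector<Cleanup> pending;
        try
        {
            std::lock_guard<std::mutex> lock(mMutex);
            // Second check: destruction may have started while waiting for the lock.
            if (mIsDestroying.load(std::memory_order_acquire))
            {
                return 0;
            }
            auto it = mResources.find(key);
            if (it == mResources.end())
            {
                return 0;
            }
            pending = std::move(it->second);
            mResources.erase(it);
        }
        catch (...)
        {
            return 0; // lock acquisition failed; leave cleanup to the destructor
        }

        std::size_t ran = 0;
        for (auto& cleanup : pending)
        {
            try
            {
                cleanup();
                ++ran;
            }
            catch (...)
            {
                // Ignore and continue with the next callback.
            }
        }
        return ran;
    }

private:
    std::mutex mMutex;
    std::map<int, std::vector<Cleanup>> mResources;
    std::atomic<bool> mIsDestroying{false};
};
```

Erasing the entry before running the callbacks also makes a second `cleanupResources` call for the same key a no-op.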

157-196: Maximally defensive cleanup loop.

Each logging call is individually wrapped in try-catch, ensuring that:

- Logging failures don't prevent the actual cleanup() call
- Exceptions from one resource's cleanup don't prevent cleanup of subsequent resources
- Error logging failures during exception handling don't mask the original error

While verbose, this level of defensiveness is appropriate for static destruction code paths where the Logger may be destroyed at any point.
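
The three properties of the cleanup loop can be illustrated with a small free function. This is a sketch under assumptions: `runCleanups`, `tryLog`, `LogFn`, and `CleanupStats` are hypothetical names, and the injected `log` callable stands in for whatever logging facility the real code uses.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical result type for the illustration.
struct CleanupStats
{
    std::size_t succeeded = 0;
    std::size_t failed = 0;
};

using LogFn = std::function<void(std::string const&)>;

// Logs a message and swallows any failure, including string construction.
inline void tryLog(LogFn const& log, char const* msg) noexcept
{
    try
    {
        log(msg);
    }
    catch (...)
    {
        // A broken logger must not affect the caller.
    }
}

// Runs every callback; neither logging nor a failing callback stops the loop.
inline CleanupStats runCleanups(std::vector<std::function<void()>> const& cleanups, LogFn const& log) noexcept
{
    CleanupStats stats;
    for (auto const& cleanup : cleanups)
    {
        tryLog(log, "cleaning up resource"); // guarded separately from the cleanup itself
        try
        {
            cleanup();
            ++stats.succeeded;
        }
        catch (...)
        {
            ++stats.failed;
            tryLog(log, "resource cleanup failed"); // error logging is guarded too
        }
    }
    return stats;
}
```

Factoring the guard into one helper is a design choice of this sketch; it keeps the loop readable compared with repeating a try-catch around every log statement.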

nv-lschneider · 2025-12-19T23:48:58Z

/bot run

tensorrt-cicd · 2025-12-19T23:54:10Z

PR_Github #29190 [ run ] triggered by Bot. Commit: 2209ce0

tensorrt-cicd · 2025-12-20T02:32:33Z

PR_Github #29190 [ run ] completed with state SUCCESS. Commit: 2209ce0
/LLM/main/L0_MergeRequest_PR pipeline #22395 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

nv-lschneider · 2025-12-22T14:41:29Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-12-22T14:47:20Z

PR_Github #29427 [ run ] triggered by Bot. Commit: 2209ce0

tensorrt-cicd · 2025-12-22T17:09:27Z

PR_Github #29427 [ run ] completed with state SUCCESS. Commit: 2209ce0
/LLM/main/L0_MergeRequest_PR pipeline #22611 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

nv-lschneider · 2025-12-22T21:23:05Z

/bot run --disable-fail-fast

taylor-yb-lee · 2025-12-22T21:27:22Z

Reproduce steps:

- The original symptom happened only with CI's prebuilt wheel. With a wheel built from source locally, the symptom was not observed.
- CMD: for i in {1..10}; do mpirun -n 1 trtllm-bench --model /lustre/fsw/coreai_comparch_trtllm/yeonbokl/dev/models/Llama-3.3-70B-FP8 --model_path /lustre/fsw/coreai_comparch_trtllm/yeonbokl/dev/models/Llama-3.3-70B-FP8 throughput --max_batch_size 1 --max_num_tokens 1024 --kv_cache_free_gpu_mem_fraction 0.8 --tp 4 --concurrency 1 --num_requests 3 --extra_llm_api_options extra_llm_api_config.yml --dataset outputs_req3_trace/isl1024osl20_dataset.txt --warmup 0 --streaming --backend pytorch --max_seq_len 1044; done
- With prebuilt wheel from CI with main branch:
  !!!!!!! Segfault encountered !!!!!!!
  File "<unknown>", line 0, in tensorrt_llm::_v1::common::nccl_util::NcclCommResourceManager::cleanupResources(ncclComm*)
  File "<unknown>", line 0, in std::_Sp_counted_deleter<ncclComm**, tensorrt_llm::_v1::getComm(std::set<int, std::less<int>, std::allocator<int> > const&)::{lambda(ncclComm**)#1}, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
  File "<unknown>", line 0, in std::map<std::set<int, std::less<int>, std::allocator<int> >, std::shared_ptr<ncclComm*>, std::less<std::set<int, std::less<int>, std::allocator<int> > >, std::allocator<std::pair<std::set<int, std::less<int>, std::allocator<int> > const, std::shared_ptr<ncclComm*> > > >::~map()
  File "<unknown>", line 0, in exit
  File "<unknown>", line 0, in _start
  File "<unknown>", line 0, in 0xffffffffffffffff
- With prebuilt wheel from CI with this PR: no segfault observed.

tensorrt-cicd · 2025-12-22T21:29:35Z

PR_Github #29464 [ run ] triggered by Bot. Commit: 2209ce0

tensorrt-cicd · 2025-12-22T22:24:17Z

PR_Github #29464 [ run ] completed with state SUCCESS. Commit: 2209ce0
/LLM/main/L0_MergeRequest_PR pipeline #22646 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

nv-lschneider · 2025-12-22T22:32:13Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-12-22T22:38:40Z

PR_Github #29466 [ run ] triggered by Bot. Commit: 591739e

tensorrt-cicd · 2025-12-23T02:07:41Z

PR_Github #29466 [ run ] completed with state SUCCESS. Commit: 591739e
/LLM/main/L0_MergeRequest_PR pipeline #22649 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

nv-lschneider · 2025-12-23T02:25:00Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-12-23T02:31:08Z

PR_Github #29489 [ run ] triggered by Bot. Commit: 591739e

tensorrt-cicd · 2025-12-23T04:04:44Z

PR_Github #29489 [ run ] completed with state SUCCESS. Commit: 591739e
/LLM/main/L0_MergeRequest_PR pipeline #22670 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

nv-lschneider · 2025-12-23T21:30:39Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-12-23T21:37:05Z

PR_Github #29658 [ run ] triggered by Bot. Commit: 591739e

tensorrt-cicd · 2025-12-23T21:37:06Z

PR_Github #29658 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 12 PM PST on 12/23.

nv-lschneider · 2025-12-28T01:09:38Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-12-28T01:15:28Z

PR_Github #30040 [ run ] triggered by Bot. Commit: b80c07c

tensorrt-cicd · 2025-12-28T04:40:23Z

PR_Github #30040 [ run ] completed with state SUCCESS. Commit: b80c07c
/LLM/main/L0_MergeRequest_PR pipeline #23117 completed with status: 'SUCCESS'

hyukn · 2025-12-31T06:23:30Z

Maybe we should add multi-gpu stages for more coverage?

nv-lschneider · 2026-01-02T14:02:00Z

/bot run --only-multi-gpu-test

tensorrt-cicd · 2026-01-02T14:07:42Z

PR_Github #30387 [ run ] triggered by Bot. Commit: b80c07c

tensorrt-cicd · 2026-01-02T17:59:41Z

PR_Github #30387 [ run ] completed with state SUCCESS. Commit: b80c07c
/LLM/main/L0_MergeRequest_PR pipeline #23416 (Partly Tested) completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

nv-lschneider · 2026-01-02T18:12:49Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-01-02T18:18:32Z

PR_Github #30402 [ run ] triggered by Bot. Commit: ffe6e8c

tensorrt-cicd · 2026-01-02T21:43:47Z

PR_Github #30402 [ run ] completed with state SUCCESS. Commit: ffe6e8c
/LLM/main/L0_MergeRequest_PR pipeline #23431 completed with status: 'SUCCESS'

nv-lschneider · 2026-01-02T21:57:45Z

/bot run --only-multi-gpu-test

tensorrt-cicd · 2026-01-02T22:03:14Z

PR_Github #30413 [ run ] triggered by Bot. Commit: ffe6e8c

tensorrt-cicd · 2026-01-03T01:42:39Z

PR_Github #30413 [ run ] completed with state SUCCESS. Commit: ffe6e8c
/LLM/main/L0_MergeRequest_PR pipeline #23443 (Partly Tested) completed with status: 'SUCCESS'

Tabrizian · 2026-01-03T14:30:09Z

/bot run --reuse-pipeline

tensorrt-cicd · 2026-01-03T14:35:43Z

PR_Github #30450 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --reuse-pipeline

Tabrizian · 2026-01-03T14:40:45Z

/bot reuse-pipeline

tensorrt-cicd · 2026-01-03T14:46:16Z

PR_Github #30451 [ reuse-pipeline ] triggered by Bot. Commit: ffe6e8c

tensorrt-cicd · 2026-01-03T15:25:02Z

PR_Github #30451 [ reuse-pipeline ] completed with state SUCCESS. Commit: ffe6e8c
Reusing PR_Github #30413 (Partly Tested) for commit ffe6e8c

NVIDIA#10166) Signed-off-by: Ludwig Schneider <lschneider@nvidia.com> Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>

Adding multiple layers of static tear-down protection

f3000d8

Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>

coderabbitai bot changed the title ~~[None][fix] @coderabbitai title~~ [None][fix] [fix] Make NCCL resource manager destructor exception-safe Dec 19, 2025

Merge branch 'main' into lschneider/fix-static-nccl-tear-down

2209ce0

nv-lschneider self-assigned this Dec 22, 2025

Merge branch 'main' into lschneider/fix-static-nccl-tear-down

591739e

Merge branch 'main' into lschneider/fix-static-nccl-tear-down

b80c07c

hyukn approved these changes Dec 31, 2025

Merge branch 'main' into lschneider/fix-static-nccl-tear-down

ffe6e8c

Tabrizian enabled auto-merge (squash) January 3, 2026 01:36

Tabrizian merged commit 59045a0 into NVIDIA:main Jan 3, 2026
5 checks passed

nv-lschneider mentioned this pull request Jan 15, 2026

[https://nvbugs/5741392][fix] [chore] Remove test exemptions from waivers tile #10517

Merged

1 task
