ROCM-1896 fix shutdown ordering race condition #2161
+54
−2
Details
A race condition at shutdown causes a segmentation fault. This change registers an atexit handler that sets a global shutdown state, so other threads avoid use-after-free bugs during exit.
Work item: ROCM-1896
What were the changes?
Added an atexit handler that sets a shutdown flag, which prevents other threads from running into segmentation faults during process exit.
Why were the changes made?
Without these changes, an error condition in RCCL (such as running out of open files or HIP memory) could result in a segmentation fault instead of a graceful exit with an error reason.
How was the outcome achieved?
The atexit handler sets the global shutdown state when the process begins to exit, and other threads check that state so they do not touch resources that are being torn down.
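For readers unfamiliar with the pattern, the following is a minimal sketch of an atexit-driven shutdown flag, not the code in this PR. All names (`sketch::g_shuttingDown`, `markShutdown`, `installShutdownHandler`, `isShuttingDown`, `progressStep`) are hypothetical and chosen for illustration only; the actual RCCL change may be structured differently.

```cpp
#include <atomic>
#include <cstdlib>

// Illustrative sketch only; every name here is hypothetical, not RCCL's.
namespace sketch {

// Global flag that is flipped once the process starts exiting.
std::atomic<bool> g_shuttingDown{false};

// Handler run by the C runtime during exit(); only sets the flag.
void markShutdown() {
  g_shuttingDown.store(true, std::memory_order_release);
}

// Registers the handler. std::atexit returns 0 on success.
bool installShutdownHandler() {
  return std::atexit(markShutdown) == 0;
}

// Cheap check that background threads can call before touching globals.
bool isShuttingDown() {
  return g_shuttingDown.load(std::memory_order_acquire);
}

// One unit of background work. Returns false (and does nothing) once
// shutdown has begun, so it never touches state that may be torn down.
bool progressStep(int* sharedCounter) {
  if (isShuttingDown()) {
    return false;
  }
  ++*sharedCounter;
  return true;
}

}  // namespace sketch
```

A caller would invoke `sketch::installShutdownHandler()` once during initialization and have each background thread call `sketch::isShuttingDown()` at the top of its loop. Because atexit handlers and static destructors run in reverse order of registration, registering the handler after the relevant globals are constructed lets the flag be set before those globals are destroyed. A plain flag check still leaves a small window between the check and the use, so this sketch narrows the race rather than eliminating it in every case.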
Additional Documentation:
Cherry-picked from develop.