fix segmentation fault #3978

Open

songhappy wants to merge 1 commit into huggingface:main from songhappy:lora

Conversation

@songhappy

What does this PR do?

This PR fixes SIGSEGV (exit code 139) on XPU.

When using accelerate launch on Intel XPU, the parent process crashes with SIGSEGV (exit code 139) after all worker processes complete successfully. The root cause is a C++ static destruction order fiasco in libccl.so (Intel oneCCL).

The parent process loads libccl.so as a side effect of import torch (line 26 of launch.py), which registers roughly 30 file-scope std::map objects, a logger singleton, and other global C++ objects with non-trivial destructors. The parent never initializes or uses oneCCL — it only spawns workers via torch.distributed.run and waits for them. However, when the parent's Python interpreter shuts down, it triggers C++ static destructors across all loaded shared libraries, and the interdependencies among oneCCL's global objects cause a use-after-free crash. This does not occur on CUDA because libnccl.so's global objects have simpler destruction semantics.

The fix calls os._exit(0) in the success path of multi_gpu_launcher() when XPU is available, which immediately terminates the parent process without running Python finalizers or C++ static destructors. This is safe because the parent has no remaining work after workers exit, and os._exit() is the standard pattern for avoiding teardown-order crashes in launcher processes (e.g., multiprocessing forkserver uses the same approach).
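The mechanism the fix relies on — `os._exit()` skipping interpreter finalization entirely, and with it the C++ static destructors of loaded shared libraries — can be demonstrated in isolation with a small subprocess experiment. Nothing below is accelerate code; it only shows that `atexit` handlers (one kind of finalizer) never run after `os._exit()`:

```python
import subprocess
import sys

# Child program: registers an atexit handler, then hard-exits. If the
# interpreter shut down normally, "atexit ran" would appear on stdout.
code = """
import atexit, os, sys
atexit.register(lambda: print("atexit ran", flush=True))
sys.stdout.flush()  # flush explicitly: os._exit() also skips buffered-IO teardown
os._exit(0)         # hard exit: no finalizers, no atexit, no C++ static destructors
"""

result = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True)
print(result.returncode)               # 0: the child still reports success
print("atexit ran" in result.stdout)   # False: the handler never executed
```

The same skipping applies to `Py_FinalizeEx` and, transitively, to any C++ static destructors that would fire during normal process teardown — which is exactly the oneCCL code path the PR avoids.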

Contributor

Copilot AI left a comment


Pull request overview

Fixes a parent-process SIGSEGV seen when using accelerate launch on Intel XPU by avoiding Python/C++ teardown that triggers oneCCL static destructor order issues.

Changes:

  • Add an XPU-only success-path early termination in multi_gpu_launcher() after torch.distributed.run completes successfully.
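In reduced form, the change amounts to something like the following. This is a hypothetical sketch, not accelerate's actual `multi_gpu_launcher()` — the real function launches distributed workers before reaching this point, and `xpu_available` stands in for the `is_xpu_available()` check:

```python
import os

def launcher_success_path(xpu_available: bool) -> str:
    """Hypothetical tail of a launcher's success path (illustration only)."""
    # ... at this point, workers spawned via torch.distributed.run have exited 0 ...
    if xpu_available:
        # Hard exit: skip Python finalizers and C++ static destructors.
        # (oneCCL teardown in libccl.so is what crashes on XPU.)
        os._exit(0)  # never returns
    return "normal interpreter shutdown"

print(launcher_success_path(False))  # prints "normal interpreter shutdown"
```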


Context from launch.py around the changed lines:

```python
    else:
        raise
else:
    if is_xpu_available():
```

Copilot AI Mar 20, 2026


os._exit() bypasses normal shutdown (finally blocks, atexit handlers, logging/IO flushing). To reduce the chance of losing buffered output, consider explicitly flushing stdout/stderr (and/or logging handlers) immediately before the forced exit, and add a brief inline comment explaining that this is intentionally avoiding oneCCL static destructor teardown on XPU.

Suggested change:

```diff
 if is_xpu_available():
+    # Intentionally bypass normal Python shutdown on XPU to avoid oneCCL
+    # static destructor teardown; flush stdio and logging before forced exit.
+    sys.stdout.flush()
+    sys.stderr.flush()
+    logging.shutdown()
```

Comment on lines +1033 to +1034
```python
import os as _os
_os._exit(0)
```

Copilot AI Mar 20, 2026


Calling os._exit(0) here will unconditionally terminate the entire Python process whenever multi_gpu_launcher() completes successfully on an XPU system. That’s risky when launch_command()/multi_gpu_launcher() are invoked programmatically (e.g., from a larger Python process or test runner), since it can prematurely end the host process and skip remaining work/tests. Consider gating this behavior behind an explicit CLI-only signal (e.g., an env var set by the accelerate entrypoint, or a dedicated argument/flag) so library/test usage can return normally while the CLI path still avoids the oneCCL teardown crash.

Suggested change:

```diff
-import os as _os
-_os._exit(0)
+# On XPU systems, a hard process exit is used to avoid a known oneCCL teardown crash.
+# This can be disabled for library/test usage by setting the environment variable
+# ACCELERATE_DISABLE_XPU_PROCESS_EXIT to a truthy value (e.g. "1", "true", "yes").
+disable_exit = os.environ.get("ACCELERATE_DISABLE_XPU_PROCESS_EXIT", "").lower()
+if disable_exit not in {"1", "true", "yes"}:
+    import os as _os
+    _os._exit(0)
```
