Conversation
Pull request overview
Fixes a parent-process SIGSEGV seen when using accelerate launch on Intel XPU by avoiding Python/C++ teardown that triggers oneCCL static destructor order issues.
Changes:
- Add an XPU-only success-path early termination in `multi_gpu_launcher()` after `torch.distributed.run` completes successfully.
```python
else:
    raise
else:
    if is_xpu_available():
```
os._exit() bypasses normal shutdown (finally blocks, atexit handlers, logging/IO flushing). To reduce the chance of losing buffered output, consider explicitly flushing stdout/stderr (and/or logging handlers) immediately before the forced exit, and add a brief inline comment explaining that this is intentionally avoiding oneCCL static destructor teardown on XPU.
```diff
 if is_xpu_available():
+    # Intentionally bypass normal Python shutdown on XPU to avoid oneCCL
+    # static destructor teardown; flush stdio and logging before forced exit.
+    sys.stdout.flush()
+    sys.stderr.flush()
+    logging.shutdown()
```
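The data-loss risk that motivates the flushing suggestion can be demonstrated with a small standalone script (illustrative only, not accelerate code): output sitting in a block-buffered stdout pipe is silently dropped by `os._exit()` unless it is flushed first.

```python
import os
import subprocess
import sys
import textwrap

# Child process: writes to stdout (block-buffered when piped), then hard-exits.
# os._exit() skips the interpreter's normal stdio flushing on shutdown.
child = textwrap.dedent("""
    import os, sys
    sys.stdout.write("buffered line")  # no newline -> stays in the buffer
    if os.environ.get("DO_FLUSH") == "1":
        sys.stdout.flush()  # the explicit flush the review suggests
    os._exit(0)  # bypasses finally blocks, atexit handlers, and stdio flushing
""")

def run(do_flush):
    env = dict(os.environ, DO_FLUSH="1" if do_flush else "0")
    result = subprocess.run([sys.executable, "-c", child],
                            capture_output=True, env=env)
    return result.stdout.decode()

print(repr(run(do_flush=True)))   # flushed output survives the hard exit
print(repr(run(do_flush=False)))  # unflushed output is lost
```

Running this shows the flushed variant preserves the buffered text while the unflushed variant loses it, which is why flushing immediately before the forced exit matters.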
```python
import os as _os
_os._exit(0)
```
Calling os._exit(0) here will unconditionally terminate the entire Python process whenever multi_gpu_launcher() completes successfully on an XPU system. That’s risky when launch_command()/multi_gpu_launcher() are invoked programmatically (e.g., from a larger Python process or test runner), since it can prematurely end the host process and skip remaining work/tests. Consider gating this behavior behind an explicit CLI-only signal (e.g., an env var set by the accelerate entrypoint, or a dedicated argument/flag) so library/test usage can return normally while the CLI path still avoids the oneCCL teardown crash.
```diff
-import os as _os
-_os._exit(0)
+# On XPU systems, a hard process exit is used to avoid a known oneCCL teardown crash.
+# This can be disabled for library/test usage by setting the environment variable
+# ACCELERATE_DISABLE_XPU_PROCESS_EXIT to a truthy value (e.g. "1", "true", "yes").
+disable_exit = os.environ.get("ACCELERATE_DISABLE_XPU_PROCESS_EXIT", "").lower()
+if disable_exit not in {"1", "true", "yes"}:
+    import os as _os
+    _os._exit(0)
```
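The gating logic the reviewer proposes can be sketched as a small helper. Note that `ACCELERATE_DISABLE_XPU_PROCESS_EXIT` is the reviewer's suggested name, not an existing accelerate environment variable:

```python
import os

def xpu_hard_exit_disabled(environ=os.environ):
    """Return True when the forced os._exit(0) on XPU should be skipped.

    Hypothetical gate from the review suggestion: a truthy value lets
    library/test callers return normally instead of hard-exiting.
    """
    value = environ.get("ACCELERATE_DISABLE_XPU_PROCESS_EXIT", "").lower()
    return value in {"1", "true", "yes"}

print(xpu_hard_exit_disabled({"ACCELERATE_DISABLE_XPU_PROCESS_EXIT": "TRUE"}))  # True
print(xpu_hard_exit_disabled({}))                                               # False
print(xpu_hard_exit_disabled({"ACCELERATE_DISABLE_XPU_PROCESS_EXIT": "0"}))     # False
```

Lower-casing before the membership test means "TRUE" and "Yes" also count as truthy, while unset or "0" leaves the CLI hard-exit path in place.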
What does this PR do?
This PR fixes SIGSEGV (exit code 139) on XPU.
When using accelerate launch on Intel XPU, the parent process crashes with SIGSEGV (exit code 139) after all worker processes complete successfully.

The root cause is a C++ static destruction order fiasco in libccl.so (Intel oneCCL). The parent process loads libccl.so as a side effect of import torch (line 26 of launch.py), which registers ~30+ file-scope std::map objects, a logger singleton, and other global C++ objects with non-trivial destructors. The parent never initializes or uses oneCCL — it only spawns workers via torch.distributed.run and waits for them. However, when the parent's Python interpreter shuts down, it triggers C++ static destructors across all loaded shared libraries, and the interdependencies among oneCCL's global objects cause a use-after-free crash. This does not occur on CUDA because libnccl.so's global objects have simpler destruction semantics.

The fix calls os._exit(0) in the success path of multi_gpu_launcher() when XPU is available, which immediately terminates the parent process without running Python finalizers or C++ static destructors. This is safe because the parent has no remaining work after workers exit, and os._exit() is the standard pattern for avoiding teardown-order crashes in launcher processes (e.g., multiprocessing forkserver uses the same approach).