Commit 3b437e6

zhuhan0 authored and facebook-github-bot committed
Use spawn for multiprocessing start method (#3284)
Summary:
Pull Request resolved: #3284

CUDA context initialization is not fork-safe. If a CUDA context is created in a parent process and the process is then forked (using `os.fork()`), the child process may encounter errors or undefined behavior when using CUDA, because the CUDA driver and runtime are not designed to be safely duplicated via `fork()`. The recommended start methods are `spawn` or `forkserver`. Of the two, `forkserver` needs to be used carefully: specifically, it is recommended to call `multiprocessing.set_start_method('forkserver')` at the very start of the program, and the parent process also needs to avoid initializing the CUDA context.

When upgrading APS to CUDA 12.8, we encountered a test failure. The test apparently initializes the CUDA context before starting two child processes, and I suspect that caused the test to hang - [post](https://fb.workplace.com/groups/319878845696681/posts/1494595861558301). It's hard to avoid initializing the CUDA context early in this test, because the GPU count is checked in the test method's decorator - [code](https://fburl.com/code/27naz2eg).

Between `spawn` and `forkserver`, `spawn` is less efficient but the most robust. Let's switch to it to avoid any potential undefined behavior with CUDA 12.8 and multiprocessing.

Reviewed By: adamomainz, weifengpy

Differential Revision: D80305233

fbshipit-source-id: 228b09d7a40bfa8b4d7ee3c3d926db5c631fffcb
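For reference, here is a minimal sketch (not TorchRec code) of the `spawn`-based pattern the summary describes: each child starts a fresh Python interpreter and builds its own CUDA context, rather than inheriting a forked copy of the parent's.

```python
# Minimal sketch of launching CUDA-using workers with the "spawn" start method.
# Names here (_worker, the rank loop) are illustrative, not from this repo.
import torch
import torch.multiprocessing as mp


def _worker(rank: int) -> None:
    # Each spawned child initializes its own CUDA context from scratch.
    if torch.cuda.is_available():
        torch.cuda.set_device(rank % torch.cuda.device_count())
        x = torch.ones(4, device="cuda")
        print(f"rank {rank}: sum = {x.sum().item()}")


if __name__ == "__main__":
    # Touching CUDA in the parent (e.g. torch.cuda.device_count()) is fine
    # with spawn, but would make a later os.fork() unsafe.
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=_worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```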
1 parent: 8439419 · commit: 3b437e6

File tree

1 file changed: +8, -1 lines changed


torchrec/distributed/test_utils/multi_process.py

Lines changed: 8 additions & 1 deletion
```diff
@@ -107,9 +107,16 @@ def __init__(
     ) -> None:
         super().__init__(methodName)
 
+        # In CUDA 12.8 we're seeing hangs from using forkserver, so we're
+        # switching to spawn.
         # AMD's HIP runtime doesn't seem to work with forkserver; hipMalloc will fail
         # Therefore we use spawn for HIP runtime until AMD fixes the issue
-        self._mp_init_mode: str = mp_init_mode if torch.version.hip is None else "spawn"
+        if (
+            torch.version.cuda is not None and torch.version.cuda >= "12.8"
+        ) or torch.version.hip is not None:
+            self._mp_init_mode: str = "spawn"
+        else:
+            self._mp_init_mode: str = mp_init_mode
         logging.info(f"Using {self._mp_init_mode} for multiprocessing")
 
     @seed_and_log
```
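For context, a hypothetical sketch of how the selected `_mp_init_mode` might be consumed when launching test workers; the actual helpers in `multi_process.py` may differ.

```python
# Hypothetical harness (not the actual multi_process.py code): launch
# world_size workers using the configured multiprocessing start method.
import multiprocessing


def _run_multi_process_test(mp_init_mode: str, world_size: int, target) -> None:
    # get_context() scopes the start method to this launch instead of setting
    # it globally via multiprocessing.set_start_method().
    ctx = multiprocessing.get_context(mp_init_mode)
    processes = []
    for rank in range(world_size):
        p = ctx.Process(target=target, args=(rank, world_size))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
        assert p.exitcode == 0, f"worker exited with code {p.exitcode}"
```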
