Commit ff4dddf

pytorchbot and kwen2501 authored
[c10d] Turn off default non-blocking API mode to work around hang in NCCL 2.26 (pytorch#154085)
[c10d] Turn off default non-blocking API mode to work around hang in NCCL 2.26 (pytorch#154055)

Work around issues like pytorch#153960 and pytorch#152623.

NCCL 2.26 appears to introduce random hangs in non-blocking API mode. This PR opts out of non-blocking mode by default to work around them. Previously, torch turned non-blocking mode on by default in eager init (i.e. when `device_id` is passed) to avoid init overhead.

Pull Request resolved: pytorch#154055
Approved by: https://github.com/atalman

(cherry picked from commit 87fc5af)

Co-authored-by: Ke Wen <[email protected]>
1 parent e8f8a35 commit ff4dddf

1 file changed: +6 -3 lines changed

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

Lines changed: 6 additions & 3 deletions
@@ -1064,9 +1064,12 @@ bool ProcessGroupNCCL::useNonblocking() {
     useNonblocking_ = nbEnv;
   }
   // 3rd priority: automatically use nonblocking if we are in eager init mode
-  else if (getBoundDeviceId()) {
-    useNonblocking_ = true;
-  }
+  // Note: this automatic selection is disabled in torch 2.7.1 to work around a
+  // hang in NCCL 2.26 in non-blocking mode. We can revisit if NCCL fixes the
+  // bug. See https://github.com/pytorch/pytorch/issues/153960
+  // else if (getBoundDeviceId()) {
+  //   useNonblocking_ = true;
+  // }
   // 4th priority: otherwise, nonblocking = false to preserve old behavior
   else {
     useNonblocking_ = false;
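
The priority chain in the hunk above can be sketched as a standalone function to show what changes for callers. This is a minimal illustration, not the actual `ProcessGroupNCCL` code: `NonblockingInputs`, `resolveNonblocking`, and the `autoEnableInEagerInit` flag are hypothetical names, and the first-priority `userOption` field is an assumption, since the hunk only shows the second through fourth priorities.

```cpp
#include <optional>

// Hypothetical stand-in for the state the real method consults.
struct NonblockingInputs {
  // Assumed 1st priority (not visible in the hunk): an explicit per-group setting.
  std::optional<bool> userOption;
  // 2nd priority: a value parsed from the environment (`nbEnv` in the diff).
  std::optional<bool> envSetting;
  // True in eager init mode, i.e. when a device id is bound to the group.
  bool hasBoundDevice = false;
};

// Resolve the mode in priority order. `autoEnableInEagerInit` models the
// 3rd-priority branch: true for the old behavior, false after this commit.
inline bool resolveNonblocking(const NonblockingInputs& in,
                               bool autoEnableInEagerInit) {
  if (in.userOption.has_value()) {
    return *in.userOption;  // explicit setting wins
  }
  if (in.envSetting.has_value()) {
    return *in.envSetting;  // then the environment
  }
  if (autoEnableInEagerInit && in.hasBoundDevice) {
    return true;  // the branch this commit comments out
  }
  return false;  // 4th priority: blocking mode, the old default
}
```

Under these assumptions, a group created in eager init mode with no explicit setting resolves to non-blocking when `autoEnableInEagerInit` is true and to blocking when it is false, while an explicit option or environment value is honored either way. That matches the intent of the commit: only the automatic default changes, and opting in remains possible.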
