Commit 74878ac
[PGNCCL] Make sure we do not use split for P2P comm creation (pytorch#139013)
Resolves a review comment on pytorch#138527.
There was a split-vs-P2P bug:
When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` option that was meant for the preceding PG creation. The P2P comm creation may then use `ncclCommSplit` and hang, because not all ranks join this call. The bug has gone unnoticed until now because no CI test covers the following recipe: eager init + new group + P2P in that new group.
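The decision described above can be modelled in a few lines of plain Python. This is only an illustrative sketch, not the actual `ProcessGroupNCCL` code: `OpType`, `Options`, `choose_creation_path`, and the `guard_p2p` flag are hypothetical names introduced here, and the real condition changed by the commit is not shown in this page.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class OpType(Enum):
    """Hypothetical tag for the kind of communicator being requested."""
    COLLECTIVE = auto()
    P2P = auto()


@dataclass
class Options:
    """Hypothetical stand-in for the process-group options object."""
    # Name of the parent group to split from; None means "no split requested".
    split_from: Optional[str] = None


def choose_creation_path(options: Options, op_type: OpType, guard_p2p: bool = True) -> str:
    """Return "split" or "init" for a communicator-creation request.

    With guard_p2p=False the function models the buggy behaviour: a leftover
    split_from value sends a P2P request down the split path, which every rank
    of the parent group would have to join.
    With guard_p2p=True it models the fix named in the commit title:
    P2P communicator creation never uses the split path.
    """
    wants_split = options.split_from is not None
    if wants_split and (op_type is not OpType.P2P or not guard_p2p):
        return "split"
    return "init"


# split_from was set for the earlier new-group creation and is still visible.
stale = Options(split_from="default_pg")

buggy = choose_creation_path(stale, OpType.P2P, guard_p2p=False)
fixed = choose_creation_path(stale, OpType.P2P, guard_p2p=True)
collective = choose_creation_path(stale, OpType.COLLECTIVE, guard_p2p=True)
print(buggy, fixed, collective)
```

In this model, only the two ranks taking part in the P2P operation would reach the "split" branch in the unguarded case, which is why the split call in the description never completes.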
Pull Request resolved: pytorch#139013
Approved by: https://github.com/shuqiangzhang
1 parent fb2c750, commit 74878ac
2 files changed (+28, -1) under:
- test/distributed
- torch/csrc/distributed/c10d
Diff 1 (under test/distributed): 22 lines added as new lines 985-1006, after original line 984. The diff content is not preserved.
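The recipe named in the commit message (eager init + new group + P2P in that new group) might look roughly like the sketch below. It is not the test added by this commit. It assumes at least two CUDA devices, an NCCL build, and a launcher such as `torchrun` that sets the rendezvous environment variables. `run_recipe` is a hypothetical helper name, and passing `device_id` to `init_process_group` is assumed here to be what triggers eager communicator initialization.

```python
def run_recipe(rank: int, world_size: int) -> None:
    """Eager init, then a new group, then P2P inside that new group."""
    # Imported lazily so that defining this function does not require torch.
    import torch
    import torch.distributed as dist

    device = torch.device("cuda", rank % torch.cuda.device_count())
    torch.cuda.set_device(device)

    # Step 1: eager init. Binding the process group to a device up front is
    # assumed to create the NCCL communicator immediately rather than lazily.
    dist.init_process_group(
        "nccl", rank=rank, world_size=world_size, device_id=device
    )

    # Step 2: new group. Every rank must call new_group, including ranks
    # that are not members of the new group.
    group = dist.new_group(ranks=[0, 1])

    # Step 3: P2P in the new group. Only ranks 0 and 1 take part, so any
    # creation path that needs all ranks of the parent group cannot complete.
    if rank == 0:
        payload = torch.arange(4, dtype=torch.float32, device=device)
        dist.send(payload, dst=1, group=group)
    elif rank == 1:
        received = torch.empty(4, dtype=torch.float32, device=device)
        dist.recv(received, src=0, group=group)
        torch.cuda.synchronize(device)

    dist.destroy_process_group()
```

To try it, one option is to call `run_recipe(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]))` from a script launched with `torchrun --nproc-per-node=2`. On a build without the fix, the expectation from the commit message is that step 3 may hang.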
Diff 2 (under torch/csrc/distributed/c10d): original line 2404 replaced by new lines 2404-2409. The diff content is not preserved.