Description
[rank1]:[W1229 22:06:38.480071392 ProcessGroupNCCL.cpp:1662] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank1]:[W1229 22:06:38.558655037 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=67, addr=[bogon]:36214, remote=[bogon]:43145): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x98 (0x758a06b785e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0x5ba8bfe (0x7589f02d5bfe in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5baa458 (0x7589f02d7458 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5babc3e (0x7589f02d8c3e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x298 (0x7589f02d2298 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7589b19d19f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdc2c3 (0x7589a17d82c3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: + 0x94b43 (0x758a07c63b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x758a07cf4bb4 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[W1229 22:06:38.572032284 ProcessGroupNCCL.cpp:1662] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank2]:[W1229 22:06:38.561582067 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=81, addr=[bogon]:48958, remote=[bogon]:37057): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x98 (0x7d06537785e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0x5ba8bfe (0x7d063cad5bfe in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5baa458 (0x7d063cad7458 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5babc3e (0x7d063cad8c3e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x298 (0x7d063cad2298 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7d05fe1d19f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdc2c3 (0x7d05edfd82c3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: + 0x94b43 (0x7d06544e9b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7d065457abb4 in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[W1229 22:06:38.573117272 ProcessGroupNCCL.cpp:1662] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank3]:[W1229 22:06:39.093650234 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=80, addr=[bogon]:48976, remote=[bogon]:37057): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x98 (0x7da9c03785e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0x5ba8bfe (0x7da9a9ad5bfe in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5baa458 (0x7da9a9ad7458 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5babc3e (0x7da9a9ad8c3e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x298 (0x7da9a9ad2298 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7da96b1d19f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdc2c3 (0x7da95afd82c3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: + 0x94b43 (0x7da9c14a4b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7da9c1535bb4 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[W1229 22:06:39.103877509 ProcessGroupNCCL.cpp:1662] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank1]:[W1229 22:06:39.480183011 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=81, addr=[bogon]:48968, remote=[bogon]:37057): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x98 (0x7ae393d785e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0x5ba8bfe (0x7ae37d4d5bfe in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5baa458 (0x7ae37d4d7458 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5babc3e (0x7ae37d4d8c3e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x298 (0x7ae37d4d2298 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7ae33ebd19f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdc2c3 (0x7ae32e9d82c3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: + 0x94b43 (0x7ae394e51b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7ae394ee2bb4 in /lib/x86_64-linux-gnu/libc.so.6)
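
The traces above are long but repetitive, so a quick way to read them is to pull out only the `sendBytes failed` lines and group them by rank. The following is a minimal sketch, not part of PyTorch: the helper name `summarize_send_failures` and the regular expression are assumptions written against the exact line format shown in this log, and may need adjusting for other PyTorch versions.

```python
import re
from collections import defaultdict

# Assumed pattern: matches the "sendBytes failed" warning lines exactly as
# they are formatted in the log above (not an official PyTorch log schema).
SEND_FAIL = re.compile(
    r"\[rank(?P<rank>\d+)\]:\[W(?P<date>\d{4}) (?P<time>[\d:.]+) TCPStore\.cpp:\d+\] "
    r"\[c10d\] sendBytes failed on SocketImpl\(fd=(?P<fd>\d+), "
    r"addr=\[[^\]]*\]:(?P<local_port>\d+), remote=\[[^\]]*\]:(?P<remote_port>\d+)\): "
    r"(?P<error>.+)$"
)


def summarize_send_failures(log_text):
    """Return {rank: [(time, remote_port, error), ...]} for sendBytes failures."""
    by_rank = defaultdict(list)
    for line in log_text.splitlines():
        match = SEND_FAIL.search(line.strip())
        if match:
            by_rank[int(match["rank"])].append(
                (match["time"], int(match["remote_port"]), match["error"])
            )
    return dict(by_rank)


# Three lines copied from the log above; the first is a ProcessGroupNCCL
# warning and is intentionally not matched by the pattern.
sample = """\
[rank1]:[W1229 22:06:38.480071392 ProcessGroupNCCL.cpp:1662] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank1]:[W1229 22:06:38.558655037 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=67, addr=[bogon]:36214, remote=[bogon]:43145): Broken pipe
[rank2]:[W1229 22:06:38.561582067 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=81, addr=[bogon]:48958, remote=[bogon]:37057): Broken pipe
"""

for rank, failures in sorted(summarize_send_failures(sample).items()):
    for time, remote_port, error in failures:
        print(f"rank {rank} at {time}: store port {remote_port}: {error}")
```

Run over the full log, this kind of summary makes it easier to see which ranks lost their connection to the store, at what time, and which remote port each one was talking to.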