Replies: 1 comment
It looks like an NCCL or memory error.
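Separately from the NCCL guess, the `Invalid CUDAPlace(...)` lines in the question's log suggest that worker processes were started for GPU ids 1 to 3 while only one GPU is visible inside the container. A minimal sketch for pulling that information out of a saved log is shown here. The helper names `summarize_invalid_places` and `suggested_visible_devices` are hypothetical, and it is an assumption that the run picks its GPUs through the standard CUDA environment variable `CUDA_VISIBLE_DEVICES`; check how run.sh actually selects devices before relying on this.

```python
import re

# Matches the Paddle message quoted in the log, e.g.
# "Invalid CUDAPlace(1), must inside [0, 1), because GPU number ..."
PLACE_RE = re.compile(r"Invalid CUDAPlace\((\d+)\), must inside \[0, (\d+)\)")


def summarize_invalid_places(log_text):
    """Return (sorted rejected device ids, reported GPU count or None)."""
    invalid_ids = set()
    gpu_count = None
    for match in PLACE_RE.finditer(log_text):
        invalid_ids.add(int(match.group(1)))
        # The upper bound of the half-open range is the number of visible GPUs.
        gpu_count = int(match.group(2))
    return sorted(invalid_ids), gpu_count


def suggested_visible_devices(gpu_count):
    """Build a comma-separated id list 0..gpu_count-1 (hypothetical helper)."""
    return ",".join(str(i) for i in range(gpu_count))


# Lines copied from the log in the question.
log = """\
E0819 07:01:47.576184 7372 pybind.cc:1584] Invalid CUDAPlace(1), must inside [0, 1), because GPU number on your machine is 1
E0819 07:01:47.598155 7373 pybind.cc:1584] Invalid CUDAPlace(2), must inside [0, 1), because GPU number on your machine is 1
E0819 07:01:47.598351 7374 pybind.cc:1584] Invalid CUDAPlace(3), must inside [0, 1), because GPU number on your machine is 1
"""

rejected, count = summarize_invalid_places(log)
print("rejected device ids:", rejected)
print("visible GPU count:", count)
if count:
    print("candidate CUDA_VISIBLE_DEVICES value:", suggested_visible_devices(count))
```

If the rejected ids are all at or above the reported GPU count, as they are here, restricting the run to the ids that exist is a reasonable first thing to try. The SIGTERM received by PID 7371 may then simply be the launcher stopping the remaining worker after the others failed, though the log alone does not confirm that.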
I am using Docker and experimenting with the AISHELL examples. I ran the script run.sh in "DeepSpeech/examples/aishell/s0".
I am getting the errors listed below and have not been able to debug them. It would be very helpful if proper steps and manuals were provided:
Errors:
E0819 07:01:47.576184 7372 pybind.cc:1584] Invalid CUDAPlace(1), must inside [0, 1), because GPU number on your machine is 1
W0819 07:01:47.595293 7371 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:45449 failed 1 times with reason: Connection refused retry after 0.5 seconds
E0819 07:01:47.598155 7373 pybind.cc:1584] Invalid CUDAPlace(2), must inside [0, 1), because GPU number on your machine is 1
E0819 07:01:47.598351 7374 pybind.cc:1584] Invalid CUDAPlace(3), must inside [0, 1), because GPU number on your machine is 1
C++ Traceback (most recent call last):
0 paddle::imperative::NCCLParallelContext::Init()
1 paddle::imperative::NCCLParallelContext::BcastNCCLId(std::vector<ncclUniqueId, std::allocator<ncclUniqueId> >&, int, int)
2 void paddle::platform::SendBroadCastCommID<ncclUniqueId>(std::vector<std::string, std::allocator<std::string > >, std::vector<ncclUniqueId, std::allocator<ncclUniqueId> >*)
3 paddle::framework::SignalHandle(char const*, int)
4 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
Error Message Summary:
FatalError: `Termination signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1629356507 (unix time) try "date -d @1629356507" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0x1CA5) received by PID 7371 (TID 0x7f80c4287740) from PID 7333 ***]
I am not able to trace the source of the problem.