forked from kimiyoung/transformer-xl
Open
Description
The long local run is hanging: all 8 processes show an identical stack trace, and nvidia-smi reports 100% GPU utilization.
#8 0x00007f5178571f92 in cuMemcpyDtoHAsync_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9 0x00007f517d0984bf in ?? ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#10 0x00007f517d075573 in ?? ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#11 0x00007f517d0aed86 in cudaMemcpyAsync ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#12 0x00007f518ba39566 in at::native::_local_scalar_dense_cuda(at::Tensor const&)::{lambda()#1}::operator()() const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#13 0x00007f518ba3bbb7 in at::native::_local_scalar_dense_cuda(at::Tensor const&) ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#14 0x00007f518aa70902 in at::CUDAType::_local_scalar_dense(at::Tensor const&) const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#15 0x00007f517d8e5685 in torch::autograd::VariableType::_local_scalar_dense(at::Tensor const&) const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#16 0x00007f517fb0392a in at::native::item(at::Tensor const&) ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#17 0x00007f517fe0de15 in at::TypeDefault::item(at::Tensor const&) const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#18 0x00007f517dadf418 in torch::autograd::VariableType::item(at::Tensor const&) const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#19 0x00007f51be448756 in torch::autograd::dispatch_to_CLong(at::Tensor const&) ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#20 0x00007f51be4499f0 in torch::autograd::THPVariable_item(_object*, _object*) ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#21 0x000055dd363e1bda in _PyCFunction_FastCallDict ()
There's only one place in training that calls .item():
train_loss += loss.float().item()
Figure out if that's connected, and maybe change the code to not use item() here.
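One possible direction, sketched under assumptions: `.item()` copies the value to the host and blocks until the queued GPU work has finished, so the trace above may only show where the host is waiting rather than what is actually stuck. Keeping the running loss as a tensor and reading it back only at logging time would reduce how often that wait happens. The names `DeferredLossMeter`, `CountingScalar` and `log_interval` below are hypothetical and not from this repo; `CountingScalar` is a stand-in for a 0-dim loss tensor so the sketch runs without a GPU.

```python
class DeferredLossMeter:
    """Accumulates per-step losses and reads the value back only on demand."""

    def __init__(self):
        self.total = None  # running sum, kept in whatever type the loss has
        self.steps = 0

    def update(self, loss):
        # For a real tensor: drop the autograd graph and accumulate in fp32,
        # mirroring loss.float() in the original line. No item() call here.
        value = loss.detach() if hasattr(loss, "detach") else loss
        value = value.float() if hasattr(value, "float") else value
        self.total = value if self.total is None else self.total + value
        self.steps += 1

    def pop_mean(self):
        """Return the mean loss since the last call and reset the meter."""
        if self.steps == 0:
            return 0.0
        total = self.total
        # The only place where the value is pulled back to the host.
        scalar = total.item() if hasattr(total, "item") else float(total)
        mean = scalar / self.steps
        self.total, self.steps = None, 0
        return mean


class CountingScalar:
    """Stand-in for a 0-dim tensor that counts item() calls (illustration only)."""

    item_calls = 0

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return CountingScalar(self.value + other.value)

    def detach(self):
        return self

    def item(self):
        CountingScalar.item_calls += 1
        return self.value


meter = DeferredLossMeter()
log_interval = 4  # hypothetical logging interval
means = []
for step in range(1, 9):
    loss = CountingScalar(float(step))  # in training this would be the loss tensor
    meter.update(loss)
    if step % log_interval == 0:
        means.append(meter.pop_mean())

print(means, CountingScalar.item_calls)
```

With eight steps and a logging interval of four, the stand-in records two item() calls instead of eight. If the run still hangs at the deferred readback, that would suggest item() is only where the wait surfaces and the cause is earlier in the step.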