Description
❓ Questions/Help/Support
I am new to the ignite library and am trying to run the multi-GPU (single-node) training configured in this repo; this is the specific entrypoint I am running, where ignite.distributed.Parallel is called.
The error I am facing seems quite generic, which suggests a setup issue:
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/user/.../lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/home/user/.../lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/.../BehindTheScenes/models/bts/trainer.py", line 115, in forward
to_base_pose = torch.inverse(poses[:, :1, :, :])
RuntimeError: lazy wrapper should be called at most once

I found this error and an explanation in the torch repo: pytorch/pytorch#90613. It suggests that newer CUDA versions handle this better, and it provides a hack: calling torch.inverse(...) at the entrypoint to force the lazy initialisation of torch.linalg to happen. This doesn't fix the problem for me :(
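For reference, the hack from that issue can be sketched as below. The function name `warmup_linalg` and the call site are my own illustration, not code from either repo; the idea is just to trigger the lazy initialisation once per device, eagerly, before any DataParallel replica threads can race on it.

```python
import torch

def warmup_linalg(devices):
    # Calling torch.inverse once per device forces torch.linalg's
    # lazy CUDA initialisation to happen here, in the main thread,
    # instead of inside the replica threads during forward().
    for dev in devices:
        torch.inverse(torch.eye(2, device=dev))

# Run once at the entrypoint, before the model is replicated.
if torch.cuda.is_available():
    warmup_linalg([f"cuda:{i}" for i in range(torch.cuda.device_count())])
else:
    warmup_linalg(["cpu"])
```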
I don't want to get into the details of the other repo with this question; I just wanted to ask first whether there is a known workaround when using the ignite library (apart from upgrading CUDA)?
Thanks in advance for any help!
Version info
The full environment given by the above repo: https://github.com/Brummi/BehindTheScenes/blob/main/environment.yml
- OS: Ubuntu 18.04.5 LTS
- CUDA Driver Version: 470.182.03
- Python env:
  cuda 11.6.1 0 nvidia
  cuda-toolkit 11.6.1 0 nvidia
  cuda-tools 11.6.1 0 nvidia
  python 3.10.12 h955ad1f_0
  pytorch 1.13.1 py3.10_cuda11.6_cudnn8.3.2_0 pytorch
  pytorch-cuda 11.6 h867d48c_1 pytorch
  pytorch-ignite 0.4.12 pypi_0 pypi