Workaround for torch.linalg lazy init for multi-GPU setup? #3004

@Nicholas-Autio-Mitchell

Description

❓ Questions/Help/Support

I am new to the ignite library and am trying to run the single-node, multi-GPU training that is configured in this repo. Here is the specific entrypoint I am running, where ignite.distributed.Parallel is called.

The error I am facing seems quite generic, which suggests a setup issue:

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/user/.../lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/user/.../lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/.../BehindTheScenes/models/bts/trainer.py", line 115, in forward
    to_base_pose = torch.inverse(poses[:, :1, :, :])
RuntimeError: lazy wrapper should be called at most once

I found this error and an explanation in the PyTorch repo: pytorch/pytorch#90613. It suggests that newer CUDA versions handle it better and provides a hack: calling torch.inverse(...) at the entrypoint to force the lazy initialisation of torch.linalg to happen. This doesn't fix the problem for me :(
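For reference, a minimal sketch of what that hack might look like. The helper name warm_up_linalg is hypothetical, and looping over every visible GPU (rather than only the default device) is my own assumption, not something the linked issue prescribes. The idea is to run one tiny torch.inverse in the main thread, before the model is wrapped for multi-GPU execution, so that the lazy linalg initialisation does not first happen inside the replica worker threads.

```python
import torch


def warm_up_linalg():
    """Run one tiny torch.inverse per device from the main thread.

    Hypothetical helper: it only illustrates the warm-up idea and is not
    part of torch or ignite.
    """
    if torch.cuda.is_available():
        devices = [torch.device("cuda", i) for i in range(torch.cuda.device_count())]
    else:
        # Assumption: fall back to the CPU so the sketch still runs without a GPU.
        devices = [torch.device("cpu")]

    for device in devices:
        # Inverting a 2x2 identity matrix is enough to trigger the lazy init.
        torch.inverse(torch.eye(2, device=device))
        if device.type == "cuda":
            # Wait for the kernel to finish before moving on.
            torch.cuda.synchronize(device)
    return devices


if __name__ == "__main__":
    print("warmed up:", warm_up_linalg())
```

This would be called once at the very top of the training entrypoint, before ignite.distributed.Parallel is entered.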

I don't want to get into the details of the other repo with this question; I first wanted to ask whether there is a known workaround when using the ignite library (apart from upgrading CUDA).

Thanks in advance for any help!


Version info

The full environment given by the above repo: https://github.com/Brummi/BehindTheScenes/blob/main/environment.yml

  • OS: Ubuntu 18.04.5 LTS
  • CUDA Driver Version: 470.182.03
  • Python env:
    cuda                      11.6.1                        0    nvidia
    cuda-toolkit              11.6.1                        0    nvidia
    cuda-tools                11.6.1                        0    nvidia
    python                    3.10.12              h955ad1f_0
    pytorch                   1.13.1          py3.10_cuda11.6_cudnn8.3.2_0    pytorch
    pytorch-cuda              11.6                 h867d48c_1    pytorch
    pytorch-ignite            0.4.12                   pypi_0    pypi
