Multi-GPU Distributed Training Process Getting Stuck #11

@kangx326

Hi, Yinyu! I am currently working with your ScenePriors repository and attempting to implement multi-GPU distributed training. However, I have encountered an issue where one of the training processes becomes stuck and does not progress along with the others, as shown in the following figure.
[Image: WeChat screenshot 20240108184011]
Have you experienced this issue before, or are you aware of any potential reasons why a process might hang during distributed training with multiple GPUs?
Thanks a lot.

Best,
Xin
