Multi-GPU Distributed Training Process Getting Stuck

Hi, Yinyu! I am currently working with your ScenePriors repository and attempting to implement multi-GPU distributed training. However, I have encountered an issue where one of the training processes becomes stuck and does not progress along with the others, as shown in the following figure.
![WeChat Image_20240108184011](https://github.com/yinyunie/ScenePriors/assets/35297996/ad33dca8-ac69-4f38-8a5b-8b10a8629526)
Have you experienced this issue before, or are you aware of any potential reasons that might cause a process to hang during distributed training with multiple GPUs?
Thanks a lot.

Best,
Xin

Multi-GPU Distributed Training Process Getting Stuck #11