Auto3DSeg: NCCL timeout error #7411
Unanswered
pwrightkcl
asked this question in
Q&A
Replies: 0 comments
For the auto3dseg experts:
I'm getting an error about NCCL timeouts, which some searching suggests is caused by loading a large dataset. I've found potential solutions online, but I'm hoping for some advice on which one is correct and how to implement it in Auto3DSeg.
Here's the error:
In this SO thread the main answer suggests setting the SHM or `export NCCL_P2P_LEVEL=NVL`, which I can test. The second answer talks about setting the timeout in the Python line that initialises a torch process group (which seems the likely fix, as I have large datasets). I have no idea how to drill down to that call in Auto3DSeg, so I could use some advice on how to do that, or on whether I should leave it alone.

Many thanks!
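For reference, the timeout change described in the second answer would look roughly like the sketch below in plain PyTorch. It assumes the job is launched with `torchrun` (so the rank and world-size environment variables are already set). The helper names `build_process_group_kwargs` and `init_distributed` and the 90-minute value are made up for illustration. Where Auto3DSeg's generated training scripts make this call, and whether they expose the timeout as a setting, is not something I have confirmed.

```python
from datetime import timedelta


def build_process_group_kwargs(timeout_minutes: float = 90.0, backend: str = "nccl") -> dict:
    """Return keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper: the 90-minute default is an arbitrary example,
    not a recommended or library-defined value.
    """
    return {"backend": backend, "timeout": timedelta(minutes=timeout_minutes)}


def init_distributed(timeout_minutes: float = 90.0) -> None:
    """Initialise the default process group with a longer collective timeout.

    Assumes RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are set in the
    environment, as they are when the script is launched with torchrun.
    """
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch.distributed as dist

    if not dist.is_initialized():
        dist.init_process_group(**build_process_group_kwargs(timeout_minutes))
```

The idea is that a rank which spends a long time loading or caching data before the first collective operation is not killed by the watchdog while the other ranks wait for it.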