Distributed Data Parallel with UNETR #3798
Unanswered
chrisfoulon asked this question in Q&A
Replies: 2 comments
-
Hi @ahatamiz, could you please take a look at this question? Thanks in advance.
-
Hi @chrisfoulon, we have provided UNETR with DDP support in the research repository. For distributed training, you would need to add … Thanks
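
For reference, below is a minimal single-node DDP sketch around MONAI's `UNETR`, assuming the usual `torch.distributed` pattern (one process per GPU, launched with `torchrun`). The script name, channel counts, and image size are illustrative assumptions, not the research repository's actual configuration or CLI.

```python
# Minimal single-node DDP sketch for MONAI's UNETR (illustrative only).
# Launch with, e.g.:
#   torchrun --nproc_per_node=3 train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

from monai.networks.nets import UNETR


def main():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = UNETR(
        in_channels=1,
        out_channels=14,          # illustrative number of classes
        img_size=(96, 96, 96),    # illustrative patch size
        feature_size=16,
    ).cuda(local_rank)

    # Wrap the model; each process owns exactly one GPU
    model = DDP(model, device_ids=[local_rank])

    # ... build a DataLoader with a DistributedSampler, an optimizer and a
    # loss, then run the usual training loop here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```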
-
Hi,
I have been trying to get DDP to work with UNETR, but it seems to hang forever after one epoch.
The weird thing is that when I run the same code with a UNet, it runs for several epochs without problems.
(If I don't use DDP, both models train normally.)
I am trying to make it work on a single Linux machine with 3 GPUs. With UNETR, the first epoch's loop finishes and then the 3 GPUs sit at close to 100% utilization forever (until I kill the process), without ever reaching the second epoch's code.
I am just wondering whether I am missing something that I should add to make it work with UNETR. I know it is a more complicated model than the UNet, so there might be something extra I need to do to enable DDP with UNETR.
Thank you in advance for your help!
Chris.
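
For comparison with the setup described above, here is a minimal sketch of a per-epoch DDP training loop on a single 3-GPU node; the dataset keys, loss, and hyperparameters are illustrative assumptions, and `model` is assumed to already be wrapped in DistributedDataParallel. It relies on a DistributedSampler so that every rank runs the same number of steps per epoch, since a rank that waits on a collective the other ranks never enter will hang.

```python
# Illustrative per-epoch DDP loop (single node, launched with
# torchrun --nproc_per_node=3); model is already wrapped in DDP.
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

from monai.losses import DiceCELoss


def train(model, dataset, local_rank, epochs=10):
    # Each rank draws its own shard; drop_last keeps step counts equal
    sampler = DistributedSampler(dataset, shuffle=True, drop_last=True)
    loader = DataLoader(dataset, batch_size=1, sampler=sampler, num_workers=2)
    loss_fn = DiceCELoss(to_onehot_y=True, softmax=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(epochs):
        # Reshuffle each epoch so ranks do not see identical shards
        sampler.set_epoch(epoch)
        model.train()
        for batch in loader:
            image = batch["image"].cuda(local_rank, non_blocking=True)
            label = batch["label"].cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = loss_fn(model(image), label)
            loss.backward()  # gradients are all-reduced across the 3 ranks here
            optimizer.step()
```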