-
Hi @ahxmeds, since the error happened while loading images, I guess it was due to OOM. I would suggest you set ... Hope it helps, thanks!
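One way to sanity-check the OOM guess is a back-of-the-envelope estimate of how much host RAM the loaded data would need. The sketch below is only that, not a measurement. The volume shape, the dtype size, the two arrays per case (image plus label), and the premise that each of the 4 processes holds its own full in-memory copy are all hypothetical placeholders; substitute the real values for your data.

```python
import math


def estimate_cache_gib(n_items, shape, bytes_per_voxel, arrays_per_item=2, n_processes=1):
    """Rough host-RAM estimate in GiB if every process keeps every item in memory.

    All arguments are assumptions supplied by the caller, not values read
    from the actual dataset.
    """
    voxels = math.prod(shape)
    bytes_per_item = voxels * bytes_per_voxel * arrays_per_item
    return n_items * bytes_per_item * n_processes / 2**30


# Hypothetical inputs: 419 training + 105 validation cases, 256^3 float32
# volumes, image + label per case, one full copy in each of 4 processes.
total_gib = estimate_cache_gib(
    n_items=419 + 105,
    shape=(256, 256, 256),
    bytes_per_voxel=4,
    arrays_per_item=2,
    n_processes=4,
)
print(f"Estimated host RAM needed: {total_gib:.1f} GiB")
```

If the estimate comes close to the RAM available on the VM, running `free -h` in a second terminal while the dataset loads should show the available memory dropping toward zero just before the processes die.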
-
I am trying to train a `UNet` model using `DistributedDataParallel` on a Microsoft Azure Standard NC24s v3 Virtual Machine with 24 vCPUs and 4 GPUs (16 GiB per GPU) on the dataset from this challenge. My code, given below, is roughly based on the tutorial here.

I run this code in a VS Code terminal (Linux) with `torchrun` using the following command:

The code runs and trains perfectly when I use a small dataset (like n=16), but gives the following error if I use the whole training (n=419) and validation (n=105) dataset.
I tried setting the `OMP_NUM_THREADS` variable to values from 1 to 6, but that didn't help. Please let me know what the issue could be, as the error log doesn't seem very explanatory to me.

Edit: For all 4 processes, the `Loading dataset` progress bar fails at either `171/419`, `172/419`, or `186/419`.
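Since each of the 4 processes shows a progress bar counting up to 419, every process appears to be loading the full training set. If host memory turns out to be the limit, one common workaround is to give each process only its own slice of the file list. The sketch below is a minimal illustration of that idea, not the code from this post: `shard_for_rank` and the `case_XXX` names are hypothetical, and it assumes the script is launched with `torchrun`, which sets the `RANK` and `WORLD_SIZE` environment variables.

```python
import os


def shard_for_rank(items, rank, world_size):
    """Return the strided slice of `items` assigned to this rank (hypothetical helper)."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    return items[rank::world_size]


# torchrun exports RANK and WORLD_SIZE; the defaults let this run standalone.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Placeholder identifiers standing in for the real list of training cases.
train_files = [f"case_{i:03d}" for i in range(419)]

local_files = shard_for_rank(train_files, rank, world_size)
print(f"rank {rank}/{world_size}: {len(local_files)} of {len(train_files)} cases")
```

Note that 419 does not divide evenly by 4, so the shards differ in length by one. With `DistributedDataParallel` every rank should run the same number of iterations per epoch, so the shards would need to be padded or truncated to equal length before building the data loaders.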