I asked this question in #6103 on the MONAI discussions board, and a similar question on the PyTorch forum.
Basically, I want to train a `SwinUNETR` network with an input patch size of (192, 192, 192) (and also perform validation/testing with an inference window size of (192, 192, 192)). It is crucial for me to use this exact patch size. I am working on a single node with 4 GPUs (16 GiB of memory each). I tried wrapping the `SwinUNETR` model with `torch.nn.DataParallel`, `torch.nn.parallel.DistributedDataParallel`, and the more advanced `torch.distributed.fsdp.FullyShardedDataParallel`, but in every case I ran out of CUDA GPU memory.
Is there an efficient implementation of a `SwinUNETR` pipeline with FSDP parallelization? Or is there another way to solve this problem?
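For context, this is roughly the kind of FSDP setup I have been experimenting with (a minimal sketch only; the channel counts, `feature_size`, and wrap policy below are placeholders rather than my exact configuration, and the `SwinUNETR` constructor arguments may differ across MONAI versions). MONAI's `SwinUNETR` exposes `use_checkpoint=True` for activation checkpointing inside the Swin blocks, which can be combined with FSDP's `MixedPrecision` policy to reduce per-GPU memory:

```python
import os
import functools

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
from monai.networks.nets import SwinUNETR

# launched via torchrun, one process per GPU
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = SwinUNETR(
    img_size=(192, 192, 192),  # required by some MONAI versions; deprecated in newer ones
    in_channels=1,             # placeholder: set to your modality count
    out_channels=2,            # placeholder: set to your label count
    feature_size=48,           # placeholder: larger values cost more memory
    use_checkpoint=True,       # activation checkpointing inside the Swin blocks
).cuda()

model = FSDP(
    model,
    # shard any submodule above ~1M parameters (placeholder threshold)
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy, min_num_params=1_000_000
    ),
    # fp16 compute and communication to cut activation/gradient memory
    mixed_precision=MixedPrecision(
        param_dtype=torch.float16,
        reduce_dtype=torch.float16,
        buffer_dtype=torch.float16,
    ),
    device_id=local_rank,
)
```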
Alternatively, does the input patch size during training/inference matter for the performance of `SwinUNETR`? I know it matters for `UNet`, as can be seen from some of my results shown in #6103 as well as the discussion in #2924. Please let me know.
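On the validation/testing side, the one memory-saving workaround I am aware of is MONAI's sliding-window inferer with CPU stitching: each window runs on the GPU while the stitched full-volume output is accumulated on the CPU. A minimal sketch (assuming `model` and `val_image` are a trained network and a preprocessed `(B, C, H, W, D)` tensor, both placeholders here):

```python
import torch
from monai.inferers import sliding_window_inference

model.eval()
with torch.no_grad():
    prediction = sliding_window_inference(
        inputs=val_image,                # e.g. a (1, 1, H, W, D) tensor
        roi_size=(192, 192, 192),        # the window size I want to keep
        sw_batch_size=1,                 # one window at a time on the GPU
        predictor=model,
        overlap=0.25,
        sw_device=torch.device("cuda"),  # each window is evaluated on the GPU
        device=torch.device("cpu"),      # stitched output lives in host memory
    )
```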