Script Killed with DDP + wandb? #11727
Unanswered
dmandair asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
Hi all, I'd appreciate any help with this! I've been running into an issue when using DDP in PyTorch Lightning together with wandb. As a preface: I tested my script on a development node with a small batch size on a single GPU and it ran just fine (only slowly). The problems start when I move to multiple GPUs (4) with DDP.

Unfortunately, I don't have much of an error to share. The script runs but dies quickly, and the log file reports only `Killed barlow.py`. I initially suspected a memory issue and have since decreased my batch size significantly, but based on the GPU memory usage report I don't think memory is the cause.

Is there anything that stands out as wrong with how I'm initializing/using wandb alongside PyTorch Lightning in these runs? Any thoughts on what exactly is going on? Also of note: when I log validation losses, I pass `sync_dist=True`. A simplified version of the relevant code is reproduced below; happy to share other parts of my script if helpful.
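This is a minimal sketch of the setup rather than the full barlow.py: `BarlowModule`, its tiny network, the `_loss` helper, and the random-tensor dataloader are all placeholders standing in for the real code.

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.loggers import WandbLogger


class BarlowModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(32, 1)  # stand-in for the real encoder

    def _loss(self, batch):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def training_step(self, batch, batch_idx):
        return self._loss(batch)

    def validation_step(self, batch, batch_idx):
        # sync_dist=True reduces the metric across all DDP ranks
        # before it is logged (only rank 0 talks to wandb)
        self.log("val_loss", self._loss(batch), sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def loader():
    data = TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
    return DataLoader(data, batch_size=32, num_workers=2)


if __name__ == "__main__":
    # WandbLogger initializes a real wandb run on rank 0 only; the
    # other DDP processes get a no-op experiment, so the script
    # itself does not call wandb.init() anywhere.
    wandb_logger = WandbLogger(project="barlow")
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,
        strategy="ddp",
        max_epochs=1,
        logger=wandb_logger,
    )
    trainer.fit(BarlowModule(), loader(), loader())
```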
Just so you have it, the debug.log file from wandb is here: