General distributed batching question #19212
Unanswered
haydn-jones asked this question in DDP / multi-GPU / multi-node
I've gotten myself a bit confused when it comes to iterable datasets and DDP / various sharding strategies. I was using a Hugging Face `IterableDataset` in my dataloader with DDP, and I expected that Lightning would handle splitting the batches across devices properly, but that doesn't seem to be the case (i.e. Rank 0 and Rank 1 get the same batch). I've added code along the following lines to my LightningDataModule, which seems to fix this:
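(A minimal sketch of that kind of per-rank split, assuming the `split_dataset_by_node` helper from `datasets.distributed` and the trainer's `global_rank` / `world_size` attributes; the dataset name and batch settings are placeholders.)

```python
import lightning as L
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader


class StreamingDataModule(L.LightningDataModule):
    def setup(self, stage: str) -> None:
        # Streaming (iterable) HF dataset; the dataset name is just a placeholder.
        self.train_ds = load_dataset("c4", "en", split="train", streaming=True)

    def train_dataloader(self) -> DataLoader:
        # Give each DDP rank its own shard of the iterable dataset so that
        # rank 0 and rank 1 no longer see identical batches.
        sharded = split_dataset_by_node(
            self.train_ds,
            rank=self.trainer.global_rank,
            world_size=self.trainer.world_size,
        )
        return DataLoader(sharded, batch_size=32, num_workers=4)
```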
Now, when I use FSDP / DeepSpeed stage 3, should the ranks continue to see different batches? I'm not experienced with sharding, so I'm not sure what to expect at all. Honestly, it seems to me like with sharding you would logically have one program driving all the ranks, which means you would want a single batch whose processing is sharded across the GPUs, though I have no idea how it actually works under the hood.
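(For context, a sketch of the Trainer side I have in mind, assuming Lightning's built-in `"fsdp"` / `"deepspeed_stage_3"` strategy aliases; the model class name is hypothetical. The question is whether the per-rank split above should stay in place with these strategies.)

```python
import lightning as L

trainer = L.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="fsdp",  # or "deepspeed_stage_3"
    precision="bf16-mixed",
)
# MyLightningModule is a placeholder for the actual model.
trainer.fit(MyLightningModule(), datamodule=StreamingDataModule())
```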
Edit: Also, the type hint on the `setup` hook is wrong: `stage` is an enum, not a string.