General distributed batching question #19212
Unanswered
haydn-jones asked this question in DDP / multi-GPU / multi-node
I've gotten myself a bit confused when it comes to iterable datasets and DDP / various sharding strategies. I was using a Hugging Face `IterableDataset` in my dataloader with DDP, and I expected that Lightning would handle splitting the batches across devices properly, but that doesn't seem to be the case (i.e. Rank 0 and Rank 1 get the same batch). I've added code along the following lines to my LightningDataModule, which seems to fix this:
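(A minimal sketch of that kind of per-rank split, assuming the `split_dataset_by_node` helper from `datasets.distributed` and the trainer's `global_rank` / `world_size` attributes; the dataset name and batch settings are placeholders.)

```python
import lightning as L
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader


class StreamingDataModule(L.LightningDataModule):
    def setup(self, stage: str) -> None:
        # Streaming (iterable) HF dataset; the dataset name is just a placeholder.
        self.train_ds = load_dataset("c4", "en", split="train", streaming=True)

    def train_dataloader(self) -> DataLoader:
        # Give each DDP rank its own shard of the iterable dataset so that
        # rank 0 and rank 1 no longer see identical batches.
        sharded = split_dataset_by_node(
            self.train_ds,
            rank=self.trainer.global_rank,
            world_size=self.trainer.world_size,
        )
        return DataLoader(sharded, batch_size=32, num_workers=4)
```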
Now, when I use FSDP / DeepSpeed stage 3, should the ranks continue to see different batches? I'm not experienced with sharding, so I'm not sure what to expect at all. Honestly, it seems to me like with sharding you would logically have one program driving all the ranks, which means you would want a single batch whose processing is sharded across the GPUs, though I have no idea how it actually works under the hood.
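(For context, a sketch of the Trainer side I have in mind, assuming Lightning's built-in `"fsdp"` / `"deepspeed_stage_3"` strategy aliases; the model class name is hypothetical. The question is whether the per-rank split above should stay in place with these strategies.)

```python
import lightning as L

trainer = L.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="fsdp",  # or "deepspeed_stage_3"
    precision="bf16-mixed",
)
# MyLightningModule is a placeholder for the actual model.
trainer.fit(MyLightningModule(), datamodule=StreamingDataModule())
```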
Edit: Also, the type hint on the `setup` hook is wrong: `stage` is an enum, not a string.