
What happens if the sampler of the training dataloader has a varying size for each epoch? #16652


Hi @juliendenize

Is it ok to have a varying number of batches at each epoch without using iterable datasets?

  • It is OK when training on a single device / single GPU.
  • It is NOT OK with DDP in general: your training loop will fall out of sync and eventually hang.
  • It is OK with DDP if you can guarantee that each process/GPU has the same epoch length. If the length changes from epoch N to N + 1, it has to change the same way in all processes (see the sketch after this list for one way to guarantee this).
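
To make the third case concrete, here is a minimal sketch (not Lightning's own code) of a sampler whose length varies per epoch but is derived deterministically from `(seed, epoch)`, so every DDP rank computes the same length. The class and parameter names (`VaryingLengthSampler`, `base_size`, `seed`) are illustrative assumptions.

```python
import torch
from torch.utils.data import Sampler


class VaryingLengthSampler(Sampler):
    """Sampler whose number of samples changes every epoch.

    The per-epoch size is a deterministic function of (seed, epoch), so every
    process that calls set_epoch() with the same value computes the same
    length, keeping DDP ranks in step.
    """

    def __init__(self, data_source, base_size=1000, seed=0):
        self.data_source = data_source
        self.base_size = base_size  # assumed lower bound on the epoch size
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Call at the start of every epoch (same pattern as DistributedSampler).
        self.epoch = epoch

    def _epoch_size(self):
        # Deterministic in (seed, epoch): identical on every rank.
        g = torch.Generator().manual_seed(self.seed + self.epoch)
        extra = int(torch.randint(0, 100, (1,), generator=g))
        return min(self.base_size + extra, len(self.data_source))

    def __len__(self):
        return self._epoch_size()

    def __iter__(self):
        g = torch.Generator().manual_seed(self.seed + self.epoch)
        perm = torch.randperm(len(self.data_source), generator=g)
        return iter(perm[: self._epoch_size()].tolist())
```

The key point is that nothing rank-specific (and nothing non-deterministic) feeds into `__len__`, so all processes agree on the number of batches in every epoch.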

If you aren't intending to train with DDP, you should be good. If you do, then since you have a custom sampler, you will have to make your sampler distributed (let me know if you need details on this).
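
As a starting point, here is one hedged sketch of what "making the sampler distributed" could look like: every rank builds the same epoch-dependent index list and then keeps only its own disjoint shard, so the per-epoch length stays identical across processes. The class name and the even-division logic are assumptions, not a Lightning built-in.

```python
import torch
import torch.distributed as dist
from torch.utils.data import Sampler


class DistributedVaryingLengthSampler(Sampler):
    """Distributed variant: the same epoch-dependent index list is built on
    every rank, and each rank then keeps only its own shard."""

    def __init__(self, data_source, base_size=1000, seed=0,
                 num_replicas=None, rank=None):
        if num_replicas is None:
            num_replicas = dist.get_world_size() if dist.is_initialized() else 1
        if rank is None:
            rank = dist.get_rank() if dist.is_initialized() else 0
        self.data_source = data_source
        self.base_size = base_size
        self.seed = seed
        self.num_replicas = num_replicas
        self.rank = rank
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch

    def _epoch_indices(self):
        # Built identically on every rank: depends only on (seed, epoch).
        g = torch.Generator().manual_seed(self.seed + self.epoch)
        size = min(self.base_size + int(torch.randint(0, 100, (1,), generator=g)),
                   len(self.data_source))
        indices = torch.randperm(len(self.data_source), generator=g)[:size].tolist()
        # Drop the tail so the list divides evenly across ranks.
        usable = (len(indices) // self.num_replicas) * self.num_replicas
        return indices[:usable]

    def __len__(self):
        return len(self._epoch_indices()) // self.num_replicas

    def __iter__(self):
        # Strided sharding: rank r takes indices r, r + num_replicas, ...
        return iter(self._epoch_indices()[self.rank :: self.num_replicas])
```

If your Lightning version does not call `set_epoch` on your custom sampler automatically, call it yourself at the start of each epoch (for example from `on_train_epoch_start`).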

What is initialized …
