What happens if the sampler of the training dataloader has a varying size for each epoch? #16652
-
Hi, I'm currently writing my own data sampler, and because of my requirements some epochs end up sampling fewer or more items than others. The code roughly looks like this:

from typing import Any, Iterator

import torch
from torch.utils.data import Sampler


class MySampler(Sampler):
    def __init__(self, data_source, max_num_samples, seed: int = 0) -> None:
        super().__init__(data_source)
        self.data_source = data_source
        self.max_num_samples = max_num_samples
        self.seed = seed
        self.epoch = 0
        self.num_samples = 0

    def set_epoch(self, epoch: int) -> None:
        # Called once per epoch so the per-epoch seed changes.
        self.epoch = epoch

    def __iter__(self) -> Iterator[Any]:
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)
        # Draw a per-epoch sample count in [0, max_num_samples).
        num_to_draw = int(torch.randint(self.max_num_samples, (1,), generator=g).item())
        indices = [None] * self.max_num_samples
        global_idx = 0
        for idx in range(num_to_draw):
            indices[global_idx] = self.data_source[idx]
            global_idx += 1
        indices = indices[:global_idx]
        self.num_samples = len(indices)
        return iter(indices)

    def __len__(self) -> int:
        # Note: this stays 0 until __iter__ has run for the current epoch.
        return self.num_samples

If I leave the sampler like this, what happens when its size varies from one epoch to the next?
Thanks in advance.
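For context, here is a minimal sketch of how a sampler like this would typically be wired into a plain PyTorch DataLoader; the toy dataset and the parameter values are made up for illustration:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset; any map-style dataset works the same way.
dataset = TensorDataset(torch.arange(100))

# Pass the dataset indices as data_source so the sampler yields indices.
sampler = MySampler(data_source=range(len(dataset)), max_num_samples=50)
loader = DataLoader(dataset, sampler=sampler, batch_size=8)

for epoch in range(3):
    sampler.set_epoch(epoch)   # reseed so each epoch can draw a different count
    for batch in loader:       # each pass re-runs MySampler.__iter__
        ...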
-
Hi @juliendenize

If you aren't intending to train with DDP, you should be good. If you do, then since you have a custom sampler, you will have to make your sampler distributed (let me know if you need details on this).
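For reference, one way to make a custom sampler like this distributed is to have every rank build the same per-epoch index list and then keep only its own shard. The following is just a sketch under that assumption (the class name and the sharding scheme are assumptions, not something prescribed by Lightning), and it requires torch.distributed to already be initialized:

import torch.distributed as dist


class MyDistributedSampler(MySampler):
    # Sketch: shard MySampler's per-epoch indices across DDP ranks.

    def __iter__(self):
        # All ranks must use the same seed and call set_epoch with the same
        # epoch, so this list is identical on every rank before sharding.
        indices = list(super().__iter__())
        rank = dist.get_rank()
        world_size = dist.get_world_size()
        # Truncate so every rank yields the same number of samples;
        # otherwise the ranks would finish the epoch at different times.
        per_rank = len(indices) // world_size
        shard = indices[rank * per_rank : (rank + 1) * per_rank]
        self.num_samples = len(shard)
        return iter(shard)

Depending on the Lightning version, you may also need to turn off the automatic sampler replacement (replace_sampler_ddp=False or use_distributed_sampler=False on the Trainer) so your own sampler is kept.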
I think all we do is call len() on the dataloader to determine the number of batches. Since you implement __len__ in the sampler as well, that should be correct. Every epoch, the loop will call it again, so the changing length should be picked up. Let me know if this helps.

Just a heads up: for future questions, consider posting them in our new Forum over at lightning.ai/forums. We are slowly beginning the migration away from GH discussions.
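As a quick sanity check outside of Lightning, you can verify that the dataloader's length follows the sampler's __len__. Note that with the implementation above, len(sampler) only reflects the current epoch after __iter__ has run; the dataset and parameter values below are made up for illustration:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100))
sampler = MySampler(data_source=range(len(dataset)), max_num_samples=50)
loader = DataLoader(dataset, sampler=sampler, batch_size=8)

for epoch in range(3):
    sampler.set_epoch(epoch)
    n_batches = sum(1 for _ in loader)  # iterating re-runs MySampler.__iter__
    # After iterating, len(loader) == ceil(len(sampler) / batch_size)
    # and both match the number of batches actually produced this epoch.
    print(epoch, n_batches, len(loader), len(sampler))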