Dataset with variable sized batches #9391
Unanswered · GaurangTandon asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
You can add a custom `collate_fn` and limit the batch size there. Return the items of a single sentence from the dataset, then accumulate and shuffle them in the collate function to generate the final input data:

```python
import random
import torch

def collate_fun(batch):
    # batch is a list of per-sentence lists of (context, target) pairs
    samples = [pair for sentence in batch for pair in sentence]  # flatten
    random.shuffle(samples)
    samples = samples[:actual_batch_size]  # actual_batch_size: the fixed size you want, defined elsewhere
    contexts, targets = zip(*samples)      # separate target from input data
    return torch.LongTensor(contexts), torch.LongTensor(targets)
```

This way you can keep the actual batch size constant by choosing a comparable `batch_size` in the DataLoader. Just some initial thoughts; there might be a better way.
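Wiring it into a `DataLoader` could then look roughly like this; `sentence_dataset`, `actual_batch_size`, and the `batch_size=16` value are placeholder choices for illustration, not something prescribed by the library:

```python
from torch.utils.data import DataLoader

actual_batch_size = 128  # fixed number of training pairs you want per step

# Each dataset item is one sentence's list of (context, target) pairs;
# batch_size here is the number of sentences pooled together before
# collate_fun trims the flattened pairs down to actual_batch_size.
loader = DataLoader(
    sentence_dataset,       # placeholder: Dataset yielding per-sentence pair lists
    batch_size=16,
    shuffle=True,
    collate_fn=collate_fun,
)
```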
Hi all, I'm training CBOW using PyTorch. I have a million sentences, and they generate more than ten million training points. Each sentence contributes a variable number of data points, and I cannot store all of them in memory.
Therefore, I wanted to implement something like the following: is there a way to modify `__getitem__` to return a variable-sized batch instead of a single item in the batch?
For example, my first batch may have size 10 (because the sentence length is 10), but the second batch may have size 30 (because the second sentence's length is 30), and so on. In TensorFlow Keras I was using data generators and they worked fine, but I can't figure out the alternative here.
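For concreteness, here is a minimal sketch of what I mean; the class name, the `window` size, and the assumption that sentences are already lists of token ids are placeholders for illustration:

```python
import torch
from torch.utils.data import Dataset

class SentenceCBOWDataset(Dataset):
    """One item per sentence; each item is a variable-sized batch of CBOW pairs."""

    def __init__(self, sentences, window=2):
        self.sentences = sentences          # list of lists of token ids
        self.window = window

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        tokens = self.sentences[idx]
        contexts, targets = [], []
        for i, target in enumerate(tokens):
            ctx = tokens[max(0, i - self.window):i] + tokens[i + 1:i + 1 + self.window]
            if len(ctx) == 2 * self.window:  # keep only full-width contexts
                contexts.append(ctx)
                targets.append(target)
        # The "batch" returned here has as many rows as the sentence allows,
        # so its size varies from sentence to sentence.
        return torch.LongTensor(contexts), torch.LongTensor(targets)
```

The question is how to feed items like these through a DataLoader without the default collation forcing them into equal-sized batches.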