Dataset with variable sized batches #9391
Unanswered · GaurangTandon asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
You can add a custom `collate_fn` and limit the batch size there. Return the items of a single sentence from the dataset, then accumulate and shuffle them in the collate function to generate the final input data:

```python
import random
import torch

def collate_fun(batch):
    # batch is a list of per-sentence lists of (context, target) pairs
    samples = [pair for sentence in batch for pair in sentence]  # flatten
    random.shuffle(samples)
    samples = samples[:actual_batch_size]  # actual_batch_size: the fixed size you want, defined elsewhere
    contexts, targets = zip(*samples)      # separate target from input data
    return torch.LongTensor(contexts), torch.LongTensor(targets)
```

This way you can keep the actual batch size constant by choosing a comparable `batch_size` in the DataLoader. Just some initial thoughts; there might be a better way.
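Wiring it into a `DataLoader` could then look roughly like this; `sentence_dataset`, `actual_batch_size`, and the `batch_size=16` value are placeholder choices for illustration, not something prescribed by the library:

```python
from torch.utils.data import DataLoader

actual_batch_size = 128  # fixed number of training pairs you want per step

# Each dataset item is one sentence's list of (context, target) pairs;
# batch_size here is the number of sentences pooled together before
# collate_fun trims the flattened pairs down to actual_batch_size.
loader = DataLoader(
    sentence_dataset,       # placeholder: Dataset yielding per-sentence pair lists
    batch_size=16,
    shuffle=True,
    collate_fn=collate_fun,
)
```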
Hi all, I'm training CBOW using PyTorch. I have a million sentences, and they generate more than ten million training points. Each sentence contributes a variable number of data points, and I cannot store all of them in memory.
Therefore, I wanted to implement something like the following: is there a way to modify `__getitem__` to return a variable-sized batch instead of a single item in the batch?
For example, my first batch may have size 10 (because the sentence length is 10), but the second batch may have size 30 (because the second sentence's length is 30), and so on. In TensorFlow Keras I was using data generators and they worked fine, but I can't figure out the alternative here.
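For concreteness, here is a minimal sketch of what I mean; the class name, the `window` size, and the assumption that sentences are already lists of token ids are placeholders for illustration:

```python
import torch
from torch.utils.data import Dataset

class SentenceCBOWDataset(Dataset):
    """One item per sentence; each item is a variable-sized batch of CBOW pairs."""

    def __init__(self, sentences, window=2):
        self.sentences = sentences          # list of lists of token ids
        self.window = window

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        tokens = self.sentences[idx]
        contexts, targets = [], []
        for i, target in enumerate(tokens):
            ctx = tokens[max(0, i - self.window):i] + tokens[i + 1:i + 1 + self.window]
            if len(ctx) == 2 * self.window:  # keep only full-width contexts
                contexts.append(ctx)
                targets.append(target)
        # The "batch" returned here has as many rows as the sentence allows,
        # so its size varies from sentence to sentence.
        return torch.LongTensor(contexts), torch.LongTensor(targets)
```

The question is how to feed items like these through a DataLoader without the default collation forcing them into equal-sized batches.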