Big dataset, what is the best way to proceed? #6779
Replies: 3 comments
-
Have you tried using an iterable dataset? https://pytorch.org/docs/stable/data.html#iterable-style-datasets
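For example, here is a minimal sketch of an iterable-style dataset that streams samples from disk instead of loading everything into RAM; the file name, CSV format, and parsing logic are assumptions for illustration and should be adapted to your own storage format:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info


class StreamingDataset(IterableDataset):
    """Streams one sample per line from a large file (hypothetical CSV layout)."""

    def __init__(self, path):
        self.path = path  # e.g. "data.csv" -- placeholder path

    def __iter__(self):
        worker = get_worker_info()
        with open(self.path) as f:
            for i, line in enumerate(f):
                # With num_workers > 0 each worker gets a copy of the dataset,
                # so shard the lines to yield each sample exactly once.
                if worker is not None and i % worker.num_workers != worker.id:
                    continue
                values = [float(x) for x in line.strip().split(",")]
                yield torch.tensor(values)


# Samples are read lazily; only one batch at a time needs to fit in memory.
loader = DataLoader(StreamingDataset("data.csv"), batch_size=32, num_workers=4)
```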
-
I think what you are asking for is exactly what the PyTorch DataLoader already implements. It loads chunks (batches) with multiple async workers in parallel. There is never a need to load the whole dataset into RAM. If you want one sample at a time, you set batch_size=1 (the default).
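A minimal sketch of that pattern, assuming one file per sample on disk (the directory layout and `.pt` format are hypothetical): a map-style Dataset whose `__getitem__` reads a single sample lazily, combined with a DataLoader that fetches batches in parallel workers, so the full 40 GB never sits in RAM at once.

```python
import os
import torch
from torch.utils.data import Dataset, DataLoader


class LazyDiskDataset(Dataset):
    """Reads one sample file from disk only when the DataLoader requests it."""

    def __init__(self, root):
        self.files = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".pt")
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Only this single sample is loaded into memory here.
        return torch.load(self.files[idx])


# Worker processes prefetch batches in parallel in the background.
loader = DataLoader(LazyDiskDataset("samples/"), batch_size=64,
                    shuffle=True, num_workers=4, pin_memory=True)
for batch in loader:
    ...  # training step
```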
-
Actually, I want to do the same thing. How did you finally fix this problem?
-
Hello,
I am trying to train a model with a big dataset (over 40 GB). The dataset is too big to be loaded into RAM at the start of training, so I was planning to load it in chunks. However, with the current DataLoader API, the only ways of working that are clear to me are loading the whole dataset up front or loading one sample at a time.
Ideally, I would need a hybrid solution where I load a chunk of data samples, process it entirely, and then move to the next chunk (with multiple workers working in parallel); a sketch of what I have in mind is below. Is this possible, and if so, how?
Best,
Tommaso
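A hedged sketch of the hybrid pattern described above, assuming the 40 GB dataset has already been pre-split into chunk files (the `chunk_*.pt` names and format are hypothetical): each worker takes a subset of the chunk files, loads one chunk fully into memory, yields every sample from it, then releases it and moves on to the next chunk.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info


class ChunkedDataset(IterableDataset):
    """Loads one pre-split chunk at a time and yields all of its samples."""

    def __init__(self, chunk_paths):
        self.chunk_paths = chunk_paths  # e.g. ["chunk_000.pt", "chunk_001.pt", ...]

    def __iter__(self):
        worker = get_worker_info()
        paths = self.chunk_paths
        if worker is not None:
            # Split whole chunks across workers so chunks are processed in parallel.
            paths = paths[worker.id::worker.num_workers]
        for path in paths:
            chunk = torch.load(path)   # load one chunk into RAM
            for sample in chunk:       # process it entirely
                yield sample
            del chunk                  # free it before moving to the next chunk


paths = [f"chunk_{i:03d}.pt" for i in range(100)]  # hypothetical chunk files
loader = DataLoader(ChunkedDataset(paths), batch_size=32, num_workers=4)
```

Note that shuffling here is limited to whatever ordering you apply to the chunk list and to the samples inside each chunk, since an IterableDataset cannot be globally shuffled by the DataLoader.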