Big dataset, what is the best way to proceed? #6779
Replies: 3 comments
-
Have you tried using an iterable dataset? https://pytorch.org/docs/stable/data.html#iterable-style-datasets
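For example, here is a minimal sketch of an iterable-style dataset that streams samples from disk instead of loading everything into RAM; the file name, CSV format, and parsing logic are assumptions for illustration and should be adapted to your own storage format:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info


class StreamingDataset(IterableDataset):
    """Streams one sample per line from a large file (hypothetical CSV layout)."""

    def __init__(self, path):
        self.path = path  # e.g. "data.csv" -- placeholder path

    def __iter__(self):
        worker = get_worker_info()
        with open(self.path) as f:
            for i, line in enumerate(f):
                # With num_workers > 0 each worker gets a copy of the dataset,
                # so shard the lines to yield each sample exactly once.
                if worker is not None and i % worker.num_workers != worker.id:
                    continue
                values = [float(x) for x in line.strip().split(",")]
                yield torch.tensor(values)


# Samples are read lazily; only one batch at a time needs to fit in memory.
loader = DataLoader(StreamingDataset("data.csv"), batch_size=32, num_workers=4)
```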
-
I think what you are asking for is exactly what the PyTorch DataLoader already implements. It loads chunks (batches) with multiple async workers in parallel. There is never a need to load the whole dataset into RAM. If you want one sample at a time, you set batch_size=1 (the default).
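A minimal sketch of that pattern, assuming one file per sample on disk (the directory layout and `.pt` format are hypothetical): a map-style Dataset whose `__getitem__` reads a single sample lazily, combined with a DataLoader that fetches batches in parallel workers, so the full 40 GB never sits in RAM at once.

```python
import os
import torch
from torch.utils.data import Dataset, DataLoader


class LazyDiskDataset(Dataset):
    """Reads one sample file from disk only when the DataLoader requests it."""

    def __init__(self, root):
        self.files = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".pt")
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Only this single sample is loaded into memory here.
        return torch.load(self.files[idx])


# Worker processes prefetch batches in parallel in the background.
loader = DataLoader(LazyDiskDataset("samples/"), batch_size=64,
                    shuffle=True, num_workers=4, pin_memory=True)
for batch in loader:
    ...  # training step
```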
-
Actually, I want to do the same thing. How did you finally fix this problem?
-
Hello,
I am trying to train a model with a big dataset (over 40 GB). The dataset is too big to be loaded into RAM at the start of training, so I was planning to load it in chunks. However, with the current DataLoader API, the only ways of working that are clear to me are loading the whole dataset up front or loading one sample at a time.
Ideally, I would need a hybrid solution where I load a chunk of data samples, process it entirely, and then move to the next chunk (with multiple workers working in parallel); a sketch of what I have in mind is below. Is this possible, and if so, how?
Best,
Tommaso
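A hedged sketch of the hybrid pattern described above, assuming the 40 GB dataset has already been pre-split into chunk files (the `chunk_*.pt` names and format are hypothetical): each worker takes a subset of the chunk files, loads one chunk fully into memory, yields every sample from it, then releases it and moves on to the next chunk.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info


class ChunkedDataset(IterableDataset):
    """Loads one pre-split chunk at a time and yields all of its samples."""

    def __init__(self, chunk_paths):
        self.chunk_paths = chunk_paths  # e.g. ["chunk_000.pt", "chunk_001.pt", ...]

    def __iter__(self):
        worker = get_worker_info()
        paths = self.chunk_paths
        if worker is not None:
            # Split whole chunks across workers so chunks are processed in parallel.
            paths = paths[worker.id::worker.num_workers]
        for path in paths:
            chunk = torch.load(path)   # load one chunk into RAM
            for sample in chunk:       # process it entirely
                yield sample
            del chunk                  # free it before moving to the next chunk


paths = [f"chunk_{i:03d}.pt" for i in range(100)]  # hypothetical chunk files
loader = DataLoader(ChunkedDataset(paths), batch_size=32, num_workers=4)
```

Note that shuffling here is limited to whatever ordering you apply to the chunk list and to the samples inside each chunk, since an IterableDataset cannot be globally shuffled by the DataLoader.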