H5Dataset with PyTorch DataLoader #4
Hello, I wanted to use the `H5Dataset` with PyTorch's `DataLoader` to load data with multiple workers.
Replies: 1 comment
Hello 👋

The advantage of `DataLoader` with `num_workers > 0` is that the data processing is concurrent with the main process. Therefore, if a loop iteration takes longer than it takes to fetch a batch and transfer it to the main process, the next iteration will not have to wait for data. However, if iterations are fast, the overhead added by the transfer of data between processes could outweigh the benefits.

Here is an example (`train.h5` contains 1M samples) where using a `DataLoader` is worthwhile. The effect is accentuated by the (very) large batch size.

```python
>>> import lampe
>>> import time
>>> import torch
>>> import torch.utils.data as data
>>> import tqdm
>>>
>>> dataset = lampe.data.H5Dataset('train.h5', batch_size=64 * 1024, chunk_size=1024, chunk_step=64)
>>> for theta, x in tqdm.tqdm(dataset, total=16):
...     time.sleep(1)
...
100%|██████████| 16/16 [00:18<00:00, 1.18s/it]
>>>
>>> dataloader = data.DataLoader(dataset, batch_size=None, num_workers=1)
>>> for theta, x in tqdm.tqdm(dataloader, total=16):
...     time.sleep(1)
...
100%|██████████| 16/16 [00:16<00:00, 1.04s/it]
```

You can see that the time per iteration is closer to 1s in the second case: since each iteration sleeps for 1 second, this means the next batch is fetched concurrently while the loop body runs.
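As a follow-up, here is a minimal sketch of how such a `DataLoader` could be plugged into a training loop. It reuses the `train.h5` dataset and parameters from the example above; the device handling and the training-step placeholder are assumptions on my side, and `pin_memory`/`persistent_workers` are standard `DataLoader` options rather than anything specific to lampe.

```python
import lampe
import torch
import torch.utils.data as data

# Same dataset as in the example above: H5Dataset yields (theta, x) batches
# on its own, so the DataLoader must not batch again (batch_size=None).
dataset = lampe.data.H5Dataset(
    'train.h5',
    batch_size=64 * 1024,
    chunk_size=1024,
    chunk_step=64,
)

loader = data.DataLoader(
    dataset,
    batch_size=None,          # batching is handled by H5Dataset
    num_workers=1,            # fetch batches in a background worker process
    pin_memory=True,          # page-locked memory speeds up host-to-GPU copies
    persistent_workers=True,  # keep the worker alive across epochs
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

for theta, x in loader:
    theta = theta.to(device, non_blocking=True)
    x = x.to(device, non_blocking=True)
    ...  # training step goes here (hypothetical placeholder)
```

With a single worker, batches are prepared while the previous training step runs, which is exactly the overlap measured above.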