Using the for loop to traverse the dataloader is slow #7257

yaoysyao · 2023-04-28T14:46:23Z

After building a DataLoader and iterating over it with a for loop during model training, I found that the loop itself is very slow: execution stalls while fetching the next batch and waits a long time before continuing. I tried different batch sizes (8, 16, 32, etc.) with the same result. An epoch sometimes takes an hour, and most of that time is spent waiting in the for loop. At first I suspected tqdm, but the behavior is the same after removing it. Here is my example code:
for batch in tqdm(data_loader, desc=' - training', file=sys.stdout):
    # It takes a long time to reach this point, i.e. before 'start' is printed
    print('start')

# Same behavior without tqdm:
for batch in data_loader:
    # It takes a long time to reach this point, i.e. before 'start' is printed
    print('start')

The screenshot shows the wait at the start of each epoch: I have to wait a long time before the progress bar starts moving, and most of the time is spent in the for loop waiting for the next batch.
Has anyone had the same problem?

LukeLIN-web · 2023-04-28T19:17:55Z

Could you provide a minimal reproducible example?

7 replies

rusty1s May 13, 2023
Maintainer

Thanks for sending the data over and sorry for answering so late. Here is how I loaded the data:

import time

import torch
import tqdm
from torch_geometric.loader import DataLoader

train_data_list, val_data_list, test_data_list = torch.load('2_train_data.pt')
print(len(train_data_list))
print(train_data_list[0])

loader = DataLoader(train_data_list, batch_size=32, shuffle=True)

t = time.perf_counter()
for batch in tqdm.tqdm(loader):
    pass
print(time.perf_counter() - t)

and it took 0.795 seconds for me in total, which doesn't seem that slow. How fast is it for you?

yaoysyao May 15, 2023
Author

Thank you for your answer and for taking the time to reproduce this. I think I found the problem: when I pass the num_workers parameter to the DataLoader, the iteration time increases significantly. Here are the results for different values of this parameter:

With num_workers=multiprocessing.cpu_count():
39933
Data(x=[41, 300], edge_index=[2, 122], edge_attr=[122, 1], y=[1])
100%|██████████| 1248/1248 [10:13<00:00, 2.03it/s]
613.8635810999999

With num_workers=8:
39933
Data(x=[41, 300], edge_index=[2, 122], edge_attr=[122, 1], y=[1])
100%|██████████| 1248/1248 [09:02<00:00, 2.30it/s]
542.7546514000001

With num_workers=4:
39933
Data(x=[41, 300], edge_index=[2, 122], edge_attr=[122, 1], y=[1])
100%|██████████| 1248/1248 [05:02<00:00, 4.12it/s]
302.8591625

With num_workers=2:
39933
Data(x=[41, 300], edge_index=[2, 122], edge_attr=[122, 1], y=[1])
100%|██████████| 1248/1248 [02:46<00:00, 7.48it/s]
166.8343352

Without num_workers (the default, num_workers=0):
39933
Data(x=[41, 300], edge_index=[2, 122], edge_attr=[122, 1], y=[1])
100%|██████████| 1248/1248 [00:03<00:00, 315.30it/s]
3.9775690999999966

Based on these tests, I think the problem is related to the num_workers parameter: the smaller the num_workers value, the smaller the slowdown.
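
As a quick check on the timings above, the following sketch uses only the numbers reported here to compute the slowdown of each run relative to the run without num_workers. It also confirms that 39933 graphs at batch_size=32 give the 1248 batches shown in the progress bars. The dictionary keys are labels chosen for illustration only.

```python
import math

# Wall-clock times in seconds, copied from the runs above.
timings = {
    "cpu_count()": 613.8635810999999,
    "8": 542.7546514000001,
    "4": 302.8591625,
    "2": 166.8343352,
    "0 (default)": 3.9775690999999966,
}

num_graphs = 39933
batch_size = 32
# The last, partially filled batch still counts as one iteration.
num_batches = math.ceil(num_graphs / batch_size)

baseline = timings["0 (default)"]
slowdowns = {label: t / baseline for label, t in timings.items()}

for label, t in timings.items():
    rate = num_batches / t  # batches per second over the whole pass
    print(f"num_workers={label}: {t:.1f} s, {rate:.2f} it/s, "
          f"{slowdowns[label]:.0f}x the baseline time")
```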

yaoysyao May 15, 2023
Author

Here is my test code:
import time

import torch
import tqdm
from torch_geometric.loader import DataLoader

train_data_list, val_data_list, test_data_list = torch.load('2_train_data.pt')
print(len(train_data_list))
print(train_data_list[0])

# Same setup as above, but with worker processes enabled
loader = DataLoader(train_data_list, batch_size=32, shuffle=True, num_workers=8)

t = time.perf_counter()
for batch in tqdm.tqdm(loader):
    pass
print(time.perf_counter() - t)
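
To repeat this comparison for several worker counts in one run, a small helper along the following lines could be used. This is only a sketch: `time_loader` and `sweep` are hypothetical names, and `make_loader` is an assumed callable that builds a loader for a given worker count, for example `lambda n: DataLoader(train_data_list, batch_size=32, shuffle=True, num_workers=n)`.

```python
import time


def time_loader(loader):
    """Iterate once over `loader` and return (num_batches, seconds)."""
    start = time.perf_counter()
    num_batches = 0
    for _ in loader:
        num_batches += 1
    return num_batches, time.perf_counter() - start


def sweep(make_loader, worker_counts):
    """Time one full pass per worker count; returns {count: (batches, seconds)}."""
    return {n: time_loader(make_loader(n)) for n in worker_counts}


# Stand-in loader so the sketch runs without any data;
# replace the lambda with one that builds a real DataLoader.
results = sweep(lambda n: range(1000), [0, 2, 4, 8])
for n, (batches, seconds) in results.items():
    print(f"num_workers={n}: {batches} batches in {seconds:.4f} s")
```

Timing the first pass includes worker start-up, which is part of what is being measured in this thread.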

rusty1s May 17, 2023
Maintainer

I see. Using a data_list with num_workers > 0 is not necessarily a good idea, since the loader has to move every element of this list to shared memory separately. One thing you can try is to convert your data_list into an InMemoryDataset:

from torch_geometric.data import InMemoryDataset
from torch_geometric.loader import DataLoader


class MyDataset(InMemoryDataset):

    def __init__(self, data_list):
        super().__init__(None)  # root=None
        # Collate all graphs into one large data object plus slice offsets
        self.data, self.slices = self.collate(data_list)


dataset = MyDataset(train_data_list)

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=2,
)
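
To illustrate why collating helps, here is a conceptual sketch in plain Python. It is not PyG's actual implementation: it only shows the idea of concatenating many small per-graph containers into one flat buffer plus a list of offsets, so that a small, fixed number of large objects is shared instead of one per graph. `collate` and `get` are hypothetical helper names used only for illustration.

```python
def collate(items):
    """Concatenate variable-length lists into one flat list plus slice offsets."""
    flat, slices = [], [0]
    for item in items:
        flat.extend(item)
        slices.append(len(flat))  # end offset of this item
    return flat, slices


def get(flat, slices, idx):
    """Recover the idx-th original item from the flat buffer."""
    return flat[slices[idx]:slices[idx + 1]]


# Three "graphs" of different sizes, stored as one buffer and four offsets.
graphs = [[1, 2, 3], [4], [5, 6]]
flat, slices = collate(graphs)
print(flat, slices)
print(get(flat, slices, 2))
```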

Answer selected by yaoysyao
