I have 6700 graphs of different sizes. The full set does not fit in memory, so I save each graph as a .pt file on disk and then create a Dataset class. The problem is that the data loads extremely slowly (~10 min per epoch) when using a data loader with 10 workers and a batch size of 150. Loading is quite fast with a subset of 1000 graphs (~15 s per epoch), but with the full dataset of 6700 .pt files it becomes very slow. I suspect some overhead, perhaps from storing the graphs in 6700 individual files. Does anyone know a more efficient way to store the data when it consists of Data objects holding graphs of different sizes? Or am I misunderstanding something? The dataset code can be seen below. Thanks a lot, and I appreciate all the great work done on PyTorch Geometric!
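A per-file Dataset of the kind described might look like the sketch below; the class name, the `graph_{idx}.pt` naming scheme, and the directory layout are illustrative assumptions, not the poster's actual code. Each access performs one disk read and one unpickle, which is the likely source of the per-epoch overhead:

```python
import os.path as osp

import torch
from torch.utils.data import Dataset
from torch_geometric.loader import DataLoader


class GraphFileDataset(Dataset):
    """Serves one pre-saved Data object from its own .pt file per access."""

    def __init__(self, root, num_graphs):
        self.root = root
        self.num_graphs = num_graphs

    def __len__(self):
        return self.num_graphs

    def __getitem__(self, idx):
        # One file open + unpickle per sample: with thousands of small
        # files, this per-item overhead can dominate epoch time.
        return torch.load(osp.join(self.root, f'graph_{idx}.pt'))


# PyG's DataLoader collates the returned Data objects into a single Batch.
loader = DataLoader(GraphFileDataset('data/graphs', 6700),
                    batch_size=150, num_workers=10, shuffle=True)
```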
Replies: 1 comment

I actually fixed it by modifying my data so that I can store it all in memory now; training is very fast. The question is no longer relevant, unless someone still wants to answer how to optimally store a lot of graphs :)
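For anyone who finds this later: the usual way to keep an entire graph collection in memory with PyTorch Geometric is the InMemoryDataset pattern, which collates all graphs into one large Data object plus slice indices and saves them as a single file. A minimal sketch, assuming the graphs already exist as a Python list of Data objects (the class name and file name are placeholders):

```python
import torch
from torch_geometric.data import InMemoryDataset


class AllGraphsInMemory(InMemoryDataset):
    """Keeps every graph in RAM, backed by one collated .pt file."""

    def __init__(self, root, data_list=None, transform=None):
        # data_list is only needed the first time, when process() runs.
        self.data_list = data_list
        super().__init__(root, transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return []  # graphs are built elsewhere; nothing to download

    @property
    def processed_file_names(self):
        return ['all_graphs.pt']

    def download(self):
        pass

    def process(self):
        # Collate the list of Data objects into one big set of tensors
        # plus slice indices, then persist everything in a single file.
        data, slices = self.collate(self.data_list)
        torch.save((data, slices), self.processed_paths[0])


# First run processes and caches; later runs just load the single file.
# dataset = AllGraphsInMemory('data', data_list=my_graphs)
```

Reading one collated file once per run avoids the per-sample file-system overhead of thousands of small .pt files.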