Optimum strategy to train on very large datasets #4277
diamondspark started this conversation in General
Replies: 1 comment
Depending on your graph sizes, you can also think about storing your data in batches (rather than individual graphs). This should speed up I/O by a huge amount.
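A minimal sketch of what batched storage could look like, assuming the graphs are already available as `torch_geometric.data.Data` objects. The chunk size and file naming below are illustrative choices, not something prescribed by PyG:

```python
# Hypothetical helpers: group precomputed Data objects into chunks and save each
# chunk with torch.save, so a single file read brings back many graphs at once.
import torch

def save_in_chunks(data_list, out_dir, chunk_size=1024):
    # One .pt file per chunk instead of one per graph cuts the number of
    # filesystem operations by roughly a factor of `chunk_size`.
    for i in range(0, len(data_list), chunk_size):
        chunk = data_list[i:i + chunk_size]
        torch.save(chunk, f"{out_dir}/chunk_{i // chunk_size:06d}.pt")

def load_chunk(path):
    # A single torch.load restores the whole chunk (a plain Python list of Data objects).
    return torch.load(path)
```

A dataset's `get()` could then load a chunk, cache it, and index into it, so consecutive accesses hit the cache rather than the disk.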
Hi all,
I am trying to train on roughly 2 million molecular graph data points. What is the best strategy to load the data in such a case? I am following the tutorial for large datasets here: https://pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html
In particular, I have the following question: creating a MyOwnDataset() object when process() is being skipped takes forever, even though all the graphs are precomputed and saved to memory. What happens under the hood on this call? Is there any way to make this faster? Kindly advise.
Thank you!
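For context, a minimal sketch of the tutorial-style on-disk Dataset is shown below, assuming one processed file per graph (the `data_{idx}.pt` naming and `num_graphs` parameter are illustrative). Before `process()` can be skipped, the base class checks that every file listed in `processed_file_names` already exists, so with ~2 million per-graph files that existence check alone can dominate the time spent constructing the object.

```python
import os.path as osp
import torch
from torch_geometric.data import Dataset

class MyOwnDataset(Dataset):
    def __init__(self, root, num_graphs, transform=None, pre_transform=None):
        self.num_graphs = num_graphs  # illustrative: total number of precomputed graphs
        super().__init__(root, transform, pre_transform)

    @property
    def processed_file_names(self):
        # The base class checks that each of these files exists before deciding
        # to skip process(); with millions of entries this check is expensive.
        return [f"data_{i}.pt" for i in range(self.num_graphs)]

    def process(self):
        # Skipped when all processed files already exist; the graphs are
        # assumed to have been precomputed and written out elsewhere.
        pass

    def len(self):
        return self.num_graphs

    def get(self, idx):
        # Loads one graph per call, i.e. one file read per sample.
        return torch.load(osp.join(self.processed_dir, f"data_{idx}.pt"))
```

Combining this with the chunked storage suggested in the reply above (fewer, larger processed files) would shorten both the existence check and the per-sample I/O.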