Strategy for creating and training on very large dataset #4391

jake-rbh · 2022-03-31T20:40:33Z


Hi, I am working on node classification on a large graph that has

two types of nodes: 40M A nodes and 600M B nodes
two types of edges: 700M A-to-A edges and 2B A-to-B edges

All of this data is stored on S3 in sharded Parquet files. Are there any best practices or pointers for loading and training on such a large graph? For example:

When reading the Parquet files into tensors to create HeteroData, is it possible to avoid loading everything into memory? (I believe torch_geometric.data.Dataset is irrelevant here since we are dealing with a single graph.)
How practical are GPU training and distributed training? Is there any overhead or special (production) consideration when using sampling or subgraph loaders such as NeighborLoader? Would training on GPU limit the modeling choices so that only small models like SGC can be used?

rusty1s · 2022-04-01T11:23:34Z

Maintainer

You can read your Parquet files in batches and convert them to the desired output format, but PyG currently expects both the node feature matrix x and the edge connectivity edge_index to fit into memory. I think how best to avoid storing everything in RAM is still an open research/engineering problem. For example, you can load your data into a graph database (and sample from there), or you could create mini-batches beforehand and store them on disk (the Pinterest approach). An alternative is to fit only edge_index into memory (for efficient sampling) and use memory-mapped I/O to query node features from disk.
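A minimal sketch of the memory-mapped option, assuming the features of one node type can be written to a single on-disk float32 matrix indexed by node ID. It uses NumPy's `.npy` memory mapping rather than anything PyG-specific. The helper names (`write_features_in_batches`, `fetch_features`) and the toy batch generator are hypothetical; with real data, each batch could come from reading a Parquet shard in chunks (for example with pyarrow's `ParquetFile.iter_batches`).

```python
import os
import tempfile

import numpy as np


def write_features_in_batches(path, batches, num_nodes, num_features):
    """Stream (start_row, block) pairs into an on-disk float32 matrix.

    Only one block is held in RAM at a time. The (start_row, block)
    batch layout is an assumption made for this sketch.
    """
    out = np.lib.format.open_memmap(
        path, mode="w+", dtype=np.float32, shape=(num_nodes, num_features)
    )
    for start, block in batches:
        out[start:start + len(block)] = block
    out.flush()
    del out  # drop the reference so the mapping can be closed


def fetch_features(path, node_ids):
    """Read only the rows for `node_ids` (e.g. one sampled mini-batch)."""
    feats = np.load(path, mmap_mode="r")  # maps the file, does not read it all
    # Visit rows in ascending order for more sequential disk access,
    # then restore the caller's original ordering.
    order = np.argsort(node_ids)
    rows = np.asarray(feats[node_ids[order]])
    inverse = np.empty_like(order)
    inverse[order] = np.arange(len(order))
    return rows[inverse]


# Toy stand-in for Parquet batches: row i is filled with the value i.
num_nodes, num_features = 1000, 8


def toy_batches(batch_size=256):
    for start in range(0, num_nodes, batch_size):
        stop = min(start + batch_size, num_nodes)
        ids = np.arange(start, stop, dtype=np.float32)
        yield start, np.repeat(ids[:, None], num_features, axis=1)


path = os.path.join(tempfile.mkdtemp(), "x.npy")
write_features_in_batches(path, toy_batches(), num_nodes, num_features)
batch_x = fetch_features(path, np.array([42, 7, 999]))
```

In a training loop, the node IDs returned by the sampler for each mini-batch would be passed to `fetch_features`, and the result converted to a tensor (for example with `torch.from_numpy`) before the forward pass.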
Distributed training is easily doable, for example via PyTorch Lightning. The hard part is distributing the data: you would either need to share the data across all machines, or let each machine operate only on a subgraph of your data.
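As a rough illustration of the "each machine operates on a subgraph" option, the sketch below gives every worker a contiguous range of node IDs and keeps only the edges whose endpoints both fall in that range. This is a deliberately naive scheme chosen for brevity: it drops all cross-partition edges, handles a single node and edge type only, and the function names are hypothetical. A real setup would more likely use a proper graph partitioner and treat each node and edge type of the heterogeneous graph separately.

```python
import numpy as np


def induced_subgraph(edge_index, node_mask):
    """Keep edges with both endpoints in the partition and relabel nodes.

    edge_index: int array of shape (2, num_edges) with global node IDs.
    node_mask:  bool array of shape (num_nodes,), True for owned nodes.
    Returns (global_ids, local_edge_index).
    """
    src, dst = edge_index
    keep = node_mask[src] & node_mask[dst]
    # Map global IDs to local IDs 0..k-1; -1 marks nodes outside the partition.
    local_id = np.full(node_mask.shape[0], -1, dtype=np.int64)
    global_ids = np.flatnonzero(node_mask)
    local_id[global_ids] = np.arange(global_ids.size)
    local_edges = np.stack([local_id[src[keep]], local_id[dst[keep]]])
    return global_ids, local_edges


def partition_by_id_range(num_nodes, edge_index, world_size, rank):
    """Assign worker `rank` a contiguous block of node IDs (naive split)."""
    bounds = np.linspace(0, num_nodes, world_size + 1).astype(np.int64)
    mask = np.zeros(num_nodes, dtype=bool)
    mask[bounds[rank]:bounds[rank + 1]] = True
    return induced_subgraph(edge_index, mask)


# Toy example: a directed ring over 6 nodes, split across 2 workers.
ring = np.array([[0, 1, 2, 3, 4, 5],
                 [1, 2, 3, 4, 5, 0]])
ids0, edges0 = partition_by_id_range(6, ring, world_size=2, rank=0)
ids1, edges1 = partition_by_id_range(6, ring, world_size=2, rank=1)
```

In the toy ring, the two edges that cross the partition boundary are lost. That is the main cost of this option: neighbors outside a worker's subgraph are invisible to its sampler unless the data is shared across machines instead.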
