Dataset returning subgraphs of larger graphs #9000

ppalasek · 2024-03-01T16:24:24Z

Hi, I am developing a dataset in which each sample is a subgraph of a much larger graph. The graphs are heterogeneous, with several edge types whose feature vectors have different lengths. The subgraphs overlap, and generating and storing all of them would take up too much space, even on disk.

I have a mapping from each sample index to the indices of all nodes in the larger graph that should be included in the subgraph. All edges between the selected nodes should also be kept. Currently I pass the node indices to the subgraph method of the HeteroData class, which creates the sample, i.e. the subgraph.

The dataset I am working on extends InMemoryDataset and overrides the get method to load the required larger graph (the same way it's done in the standard InMemoryDataset) and then create and return the requested subgraph on the fly using the .subgraph method of HeteroData. This gives me around 50 samples per second on a single CPU.
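To make the setup concrete, here is a minimal plain-Python sketch of the idea: a dataset that stores only the large graph plus the sample-to-nodes mapping and builds each induced subgraph on access. It deliberately does not use PyG. The names induced_subgraph and LazySubgraphDataset, the dictionary layout, and the toy graph are all hypothetical stand-ins for illustration, not the actual code or the HeteroData implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (source node type, relation, destination node type).
EdgeType = Tuple[str, str, str]
Edges = Dict[EdgeType, List[Tuple[int, int]]]
NodeSubset = Dict[str, List[int]]


def induced_subgraph(edges: Edges, keep: NodeSubset) -> Edges:
    """Keep edges whose endpoints are both selected; relabel ids to local positions."""
    # Per node type, map each kept global id to its index within the subgraph.
    relabel = {ntype: {g: i for i, g in enumerate(ids)} for ntype, ids in keep.items()}
    sub: Edges = {}
    for (src_t, rel, dst_t), pairs in edges.items():
        src_map = relabel.get(src_t, {})
        dst_map = relabel.get(dst_t, {})
        sub[(src_t, rel, dst_t)] = [
            (src_map[s], dst_map[d])
            for s, d in pairs
            if s in src_map and d in dst_map
        ]
    return sub


class LazySubgraphDataset:
    """Stores one large graph and a sample -> nodes mapping; builds samples on access."""

    def __init__(self, edges: Edges, sample_to_nodes: List[NodeSubset]) -> None:
        self.edges = edges
        self.sample_to_nodes = sample_to_nodes

    def __len__(self) -> int:
        return len(self.sample_to_nodes)

    def __getitem__(self, idx: int) -> Edges:
        # Nothing is precomputed: overlapping subgraphs cost no extra storage.
        return induced_subgraph(self.edges, self.sample_to_nodes[idx])


# Toy heterogeneous graph with two edge types (made-up data).
edges = {
    ("author", "writes", "paper"): [(0, 0), (0, 1), (1, 1), (2, 2)],
    ("paper", "cites", "paper"): [(0, 1), (1, 2)],
}
sample_to_nodes = [
    {"author": [0, 1], "paper": [1, 2]},
    {"author": [2], "paper": [0, 2]},
]
dataset = LazySubgraphDataset(edges, sample_to_nodes)
first = dataset[0]
second = dataset[1]
```

Only the large graph and the index mapping are held in memory here, so the cost per sample is one pass over the edge lists. In the real dataset, that per-sample work is what the .subgraph call on HeteroData does.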

Is this a valid approach, or is there a better way of achieving this? Is there an existing Sampler that I could use or modify for this use case that would be faster than what I described above?

I want to iterate through this dataset using a DataLoader with multiple workers to speed it up further. If I use the InMemoryDataset subclass described above, would each worker have to load its own copy of the dataset into memory, or can the workers share a single instance of the data?

Thank you for your advice!

Replies: 0 comments