Keeping #batches consistent across multiple graphs in link prediction when minibatching #3745

jasperhyp · 2021-12-22T04:21:06Z

Hi there, I am doing a link prediction task on multiple graphs. The graphs are of the same kind but have different numbers of nodes and edges, ranging from ~1,000 to ~30,000 edges (they are fed to the model as a dictionary). I do not need minibatching yet, but the graphs may later grow to ~100,000 edges, so I would rather start minibatching now. To train the model properly, it would probably be better to include samples from all graphs in each minibatch. My plan is to use GraphSAINTEdgeSampler to partition each graph into a fixed number of minibatches and then construct a list of dictionaries that I can iterate over.
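For illustration, the plan of "a fixed number of minibatches per graph, collected into a list of dictionaries" can be sketched without any sampler at all. This is only a minimal sketch in plain Python, not GraphSAINTEdgeSampler and not a PyG API: the function name `make_batches`, the graph names, and the toy edge lists are all hypothetical. It only partitions the supervision edges; which edges are used for message passing in each batch is a separate choice that it does not address.

```python
import random


def make_batches(graphs, num_batches, seed=0):
    """Split every graph's edges into `num_batches` chunks and zip them.

    `graphs` maps a graph name to a list of (src, dst) edges.
    Returns a list of dicts; batch i holds the i-th chunk of every graph.
    """
    rng = random.Random(seed)
    batches = [dict() for _ in range(num_batches)]
    for name, edges in graphs.items():
        shuffled = list(edges)
        rng.shuffle(shuffled)
        # Chunk sizes differ by at most one, so the per-graph batch size
        # scales with the size of the graph.
        base, extra = divmod(len(shuffled), num_batches)
        start = 0
        for i in range(num_batches):
            size = base + (1 if i < extra else 0)
            batches[i][name] = shuffled[start:start + size]
            start += size
    return batches


# Toy data: two graphs of different sizes (hypothetical names).
graphs = {
    "small": [(i, i + 1) for i in range(10)],
    "large": [(i, i + 1) for i in range(35)],
}
batches = make_batches(graphs, num_batches=4)
for i, batch in enumerate(batches):
    print(i, {name: len(edges) for name, edges in batch.items()})
```

Every graph contributes to every batch, and each edge appears exactly once per pass, so the number of batches is the same for all graphs while the batch size varies per graph.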

However, I found that GraphSAINTEdgeSampler seems to sample edges repeatedly: for one graph with ~10,000 edges, it samples ~20,000 edges in total with one arbitrary batch size and ~5,000 with a smaller batch size. There is also no explicit way to fix the number of batches (with a flexible batch size across graphs). Thanks in advance for your suggestions!
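One possible reading of these numbers, shown here as a toy model only: if a sampler draws a fixed number of batches per epoch and draws each batch independently, the total number of sampled edges is the number of steps times the batch size, regardless of how many edges the graph has. The sketch below is plain Python and is not PyG's implementation; `independent_edge_batches` and its parameters are hypothetical names chosen for this example.

```python
import random


def independent_edge_batches(num_edges, batch_size, num_steps, seed=0):
    """Draw `num_steps` batches of edge ids, each batch on its own."""
    rng = random.Random(seed)
    # No repeats inside a single batch, but nothing prevents an edge
    # from showing up again in a later batch.
    return [rng.sample(range(num_edges), batch_size) for _ in range(num_steps)]


drawn = independent_edge_batches(num_edges=10_000, batch_size=4_000, num_steps=5)
total_drawn = sum(len(batch) for batch in drawn)
distinct = len({edge for batch in drawn for edge in batch})
print(total_drawn, distinct)
```

Here `total_drawn` is 5 * 4,000 = 20,000 even though the graph has only 10,000 edges, and a smaller batch size with the same number of steps would give a smaller total. Whether this matches the behavior observed above depends on the sampler settings actually used.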

P.S. I did not plan to use DataLoader and RandomLinkSplit, the usual choice for link prediction with batching, because DataLoader does not seem able to handle my situation of multiple graphs.

Answered by rusty1s

rusty1s · 2021-12-22T09:44:25Z

Maintainer

GraphSAINTEdgeSampler should never sample edges twice. Do you have a small script to reproduce this?
You are right that GraphSAINTEdgeSampler does not fix the number of batches. This is due to how GraphSAINT is defined, i.e. nodes are potentially re-sampled across iterations.
I'm not entirely sure why you need GraphSAINTEdgeSampler in the first place. Processing 100k edges should be easily doable in full-batch mode, and you can even use DataLoader to stack multiple graphs together if GPU memory allows.
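As a rough illustration of what "stacking multiple graphs together" means: several graphs can be treated as one large disconnected graph by shifting the node indices of each graph by the number of nodes that came before it. The sketch below shows that idea in plain Python; it is not the DataLoader code itself, and `stack_graphs` and its input layout are hypothetical.

```python
def stack_graphs(graphs):
    """Combine graphs into one disconnected graph.

    `graphs` is a list of (num_nodes, edges) pairs with edges as (src, dst).
    Returns the shifted edge list and, for every node, the id of its graph.
    """
    edges, node_graph_id, offset = [], [], 0
    for graph_id, (num_nodes, graph_edges) in enumerate(graphs):
        # Shift node ids so that they do not collide with earlier graphs.
        edges.extend((u + offset, v + offset) for u, v in graph_edges)
        node_graph_id.extend([graph_id] * num_nodes)
        offset += num_nodes
    return edges, node_graph_id


stacked_edges, node_graph_id = stack_graphs([
    (3, [(0, 1), (1, 2)]),  # graph 0: three nodes
    (2, [(0, 1)]),          # graph 1: two nodes
])
print(stacked_edges)
print(node_graph_id)
```

Because no edge connects nodes of different graphs, message passing on the stacked graph stays within each original graph.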

12 replies

rusty1s Dec 27, 2021
Maintainer

Why do you want to re-process already processed node embeddings once again? Input and output dimensionalities are fixed, and so you have to keep the original node embeddings for mini-batching as well.

jasperhyp Dec 27, 2021
Author

> Why do you want to re-process already processed node embeddings once again? Input and output dimensionalities are fixed, and so you have to keep the original node embeddings for mini-batching as well.

We are learning the node embeddings, so they are updated (re-processed) in every batch. I currently also use the original node embeddings as input for each minibatch, and take the processed node embeddings after the final minibatch of the final epoch as the final node embeddings to extract. Hopefully the results will not differ much from full-batch training.

rusty1s Dec 28, 2021
Maintainer

So you are saying you apply the GNN+NeighborSampler strategy without any learning taking place? There is no need to do this IMO. NeighborSampler is well able to learn model parameters end-to-end.

jasperhyp Dec 28, 2021
Author

> So you are saying you apply the GNN+NeighborSampler strategy without any learning taking place? There is no need to do this IMO. NeighborSampler is well able to learn model parameters end-to-end.

Ahhh, not really though. We are learning the GNN parameters throughout, and the embeddings that the model computes from the original input in each minibatch are refined as training goes on. Link prediction is used as a task to get some loss. The model itself is end-to-end, and NeighborSampler is just one way to sample links for minibatching. Looking back again, I guess it looks like a reasonable strategy?

rusty1s Dec 29, 2021
Maintainer

Got it. As far as I understand, you are using link prediction as a self-supervision loss to learn node embeddings. This is indeed a reasonable strategy. Importantly, you need to keep the original input node features for training your link prediction model.
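To make the last point concrete, here is a deliberately tiny sketch in plain Python. A single scalar weight `w` stands in for the GNN, one scalar per node stands in for the input features, and a squared error on a product score stands in for the link prediction loss; all names and values are hypothetical. What it shows is only the data flow: every minibatch encodes from the same original features, only the parameter is updated, and the final embeddings come from one last pass over the original features.

```python
# Fixed input features, one scalar per node (toy values).
features = {0: 1.0, 1: 2.0, 2: 0.5, 3: 1.5}
positive_edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
input_snapshot = dict(features)

w = 0.1  # the only "model parameter" of this toy encoder


def encode(node, w):
    # Embeddings are always computed from the ORIGINAL features.
    return w * features[node]


lr = 0.05
for epoch in range(200):
    for start in range(0, len(positive_edges), 2):  # minibatches of 2 edges
        batch = positive_edges[start:start + 2]
        grad = 0.0
        for u, v in batch:
            score = encode(u, w) * encode(v, w)
            # d/dw of (score - 1)^2 with score = w^2 * x_u * x_v
            grad += 2 * (score - 1.0) * 2 * w * features[u] * features[v]
        w -= lr * grad / len(batch)  # only the parameter changes

# Final embeddings: one pass of the trained encoder over the original input.
final_embeddings = {node: encode(node, w) for node in features}
print(w, final_embeddings)
```

The embeddings produced during training are never written back into `features`, so the input dimensionality seen by the model stays the same from the first minibatch to the last.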
