DataLoader generates subgraph contains only labeled nodes? #4520

JiaruiWang · 2022-04-23T16:24:23Z

Thank you for this amazing framework and all your hard work on this!

I have questions on the training subgraph from data loaders.
My task is node classification on a graph with 60M nodes, of which 20M are labeled and 40M are unlabeled. From the 20M labeled nodes I created a 16M-node training mask, a 2M-node validation mask, and a 2M-node test mask.
If I feed the 16M-node training mask into NeighborLoader or GraphSAINTSampler, do they generate subgraphs that contain only labeled nodes? Or is each subgraph built from a labeled node and its neighbors, which may be labeled or unlabeled?
If the subgraphs contain only labeled nodes, then the node features and messages from the unlabeled nodes are missing during training. What is the best practice here?

Thank you very much

Padarn · 2022-04-24T07:29:04Z

Collaborator

NeighborLoader samples neighbors from the whole graph (the data you provide), so a batch is not restricted to the nodes in your mask. The first batch_size nodes in the returned batch are the seed nodes you asked to sample from.

For example if you do

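from torch_geometric.loader import NeighborLoader

loader = NeighborLoader(
    data=data,
    num_neighbors=[10, 10],  # required; these fan-out values are only an example
    input_nodes=mask_index,  # seed nodes, e.g. the training mask
    batch_size=5,
)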

Then the first 5 nodes of each batch are seed nodes drawn from the mask_index you provided; the remaining nodes are their sampled neighbors.
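To make the batch layout concrete, here is a library-free toy sketch of one-hop neighbor sampling. It is not PyG's implementation: the `sample_batch` helper, the tiny graph, and the choice of which nodes count as labeled are all made up for illustration. It only shows the ordering described above, with seed nodes first and their sampled neighbors (labeled or not) after them.

```python
import random


def sample_batch(adj, seeds, num_neighbors, rng):
    """Toy one-hop sampler (hypothetical helper, not the PyG algorithm).

    Returns the seed nodes first, followed by any newly sampled neighbors.
    """
    nodes = list(seeds)
    seen = set(seeds)
    for seed in seeds:
        neighbors = adj.get(seed, [])
        k = min(num_neighbors, len(neighbors))
        for n in rng.sample(neighbors, k):
            if n not in seen:
                seen.add(n)
                nodes.append(n)
    return nodes


# Toy graph: nodes 0-2 are labeled, nodes 3-5 are unlabeled.
adj = {0: [3, 4], 1: [4, 5], 2: [0, 5], 3: [0], 4: [0, 1], 5: [1, 2]}
labeled = {0, 1, 2}

seeds = [0, 1]  # plays the role of input_nodes, with batch_size=2
batch = sample_batch(adj, seeds, num_neighbors=2, rng=random.Random(0))

seed_part = batch[:len(seeds)]       # the nodes the loss would be computed on
unlabeled_in_batch = [n for n in batch if n not in labeled]
print(seed_part, unlabeled_in_batch)
```

The unlabeled neighbors still end up in the batch, so their features can take part in message passing even though only the leading seed nodes carry training labels.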

2 replies

JiaruiWang Apr 24, 2022
Author

Thank you very much for your answer.
I understand NeighborLoader now. How about GraphSAINTSampler? Unlike NeighborLoader, it does not have an input_nodes parameter for applying the training mask. I found one answer from rusty1s: #2499 (comment)

You can split your data into inductive training, validation, and test (sub)-graphs via torch_geometric.utils.subgraph directly in the prepare_data method and before initializing the GraphSAINTSampler. However, please note that GraphSAINT can only make use of sampling during training, in particular because nodes may be sampled more than once during a single epoch. For validation and testing, it's therefore best to operate on the complete graph.

Because GraphSAINTSampler samples a subgraph for each GCN layer within the sampler, should I pass the complete graph or the training subgraph to it?
Since the training subgraph might lose its connections to the rest of the complete graph, would training miss the messages from the complete graph if I extract the subgraph before applying GraphSAINTSampler?

Padarn Apr 25, 2022
Collaborator

I think the standard way would be to sample from the whole graph and apply the mask only when calculating the loss, similar to what is done in this example.
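As a rough illustration of "mask when calculating the loss", the sketch below computes a mean cross-entropy over masked nodes only, in plain Python. The `masked_cross_entropy` helper and the toy numbers are assumptions made for this example and are not taken from the linked example. In PyG the same idea is usually expressed by indexing the model output and the labels with the training mask before calling the loss function.

```python
import math


def masked_cross_entropy(logits, labels, mask):
    """Mean cross-entropy over nodes where mask is True (hypothetical helper)."""
    total, count = 0.0, 0
    for row, y, keep in zip(logits, labels, mask):
        if not keep:
            continue  # unlabeled or non-training node: no loss contribution
        m = max(row)
        # Numerically stable log-sum-exp for the softmax normalizer.
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[y]
        count += 1
    return total / count if count else 0.0


# Three nodes in a sampled subgraph; the last one is unlabeled (label -1).
logits = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]
labels = [0, 1, -1]
train_mask = [True, True, False]

loss = masked_cross_entropy(logits, labels, train_mask)

# Changing the unlabeled node's prediction does not change the loss.
other_logits = [[2.0, 0.0], [0.0, 2.0], [-3.0, 7.0]]
loss_other = masked_cross_entropy(other_logits, labels, train_mask)
print(loss, loss_other)
```

The unlabeled node is still part of the subgraph, so it can pass messages to its neighbors in the forward pass; it is simply skipped when the loss is accumulated.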
