Hello! First of all, thanks for all the great work you guys are doing here :). I'm having a tricky issue (similar to #2665) where the results and the embeddings produced for the same node differ drastically depending on the batch size. I'm doing link prediction on a heterogeneous graph with the new LinkNeighborLoader. The graph is actually bipartite, with two node types (type A and type B) and two edge types (plus the reverse edges), where each node type only interacts with the other type. The objective is simply to predict whether there is an edge between two nodes. I don't think this is related to the issue, but there is somewhat of a temporal component: nodes of type B change across snapshots, and during inference I add a bunch of new edges with new nodes of type B by appending them to the original graph and remapping the IDs, something similar to:
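(Sketch only; the helper name, relation names, and tensor layout below are illustrative rather than my exact code.)

```python
import torch
from torch_geometric.data import HeteroData

def append_new_b_nodes(data: HeteroData,
                       new_b_feats: torch.Tensor,
                       new_edges: torch.Tensor) -> HeteroData:
    # new_b_feats: [num_new_B, F_B] features of the new type-B nodes.
    # new_edges:   [2, E_new] pairs of (A index, local new-B index).
    offset = data['B'].num_nodes  # new B nodes go right after the existing ones
    data['B'].x = torch.cat([data['B'].x, new_b_feats], dim=0)

    src, dst = new_edges
    edge_index = torch.stack([src, dst + offset], dim=0)  # remap new-B IDs

    data['A', 'to', 'B'].edge_index = torch.cat(
        [data['A', 'to', 'B'].edge_index, edge_index], dim=1)
    # Keep the reverse relation in sync.
    data['B', 'rev_to', 'A'].edge_index = torch.cat(
        [data['B', 'rev_to', 'A'].edge_index, edge_index.flip(0)], dim=1)
    return data
```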
I then want to perform link prediction for each of these new nodes, on the other edge type. Let's assume the following scenario:
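Here `l2` holds two candidate edges for one of the new nodes. The first pair is the one from my results below; the second target ID and the relation names are placeholders, and `model` is my trained hetero GNN:

```python
from torch_geometric.loader import LinkNeighborLoader

l2 = torch.tensor([[  717607,   717607],
                   [18003125, 18003126]])

loader = LinkNeighborLoader(
    data,
    num_neighbors=[20, 10],  # larger than the actual degrees, so the
                             # sampled subgraph is the full 2-hop neighborhood
    edge_label_index=(('A', 'to', 'B'), l2),
    batch_size=2,            # changing this changes the predictions
    shuffle=False,
)

model.eval()
with torch.no_grad():
    for batch in loader:
        pred = model(batch.x_dict, batch.edge_index_dict,
                     batch['A', 'to', 'B'].edge_label_index)
```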
With this code we get these results: the output for edge 717607 -> 18003125 changes from 0.99 to 0.03 depending on the batch size. A few observations:

- The sampled subgraph in both scenarios is identical, since there are fewer connections than the number of specified neighbors.
- Both edges in l2 have the exact same features and subgraphs, yet they get different results.
- Using a neg_sampling_ratio > 0 also changes the results; in that case, depending on the negative edge sampled, the results for the positive edge fluctuate all over the place.
- The output embeddings of the very first conv layer are already different, so it seems to be an issue outside the model.

Any ideas as to what might be happening?
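For concreteness, this is the comparison I mean (a sketch reusing the loader setup above):

```python
def scores(batch_size: int) -> torch.Tensor:
    loader = LinkNeighborLoader(
        data, num_neighbors=[20, 10],
        edge_label_index=(('A', 'to', 'B'), l2),
        batch_size=batch_size, shuffle=False)
    with torch.no_grad():
        return torch.cat([
            model(batch.x_dict, batch.edge_index_dict,
                  batch['A', 'to', 'B'].edge_label_index)
            for batch in loader])

print(scores(2))  # both candidate edges scored in one batch
print(scores(1))  # one edge per batch -> the same edge gets a different score
```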
-
Hey @aaran2, thanks for the question.
Do you mind providing a small example with your model and a dataset to reproduce this? Unfortunately, I couldn't reproduce the bug with a very simple model.
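Something self-contained along these lines would be ideal (just a skeleton; FakeHeteroDataset stands in for your data, and `model` would be whatever minimal model still shows the effect):

```python
import torch
from torch_geometric.datasets import FakeHeteroDataset
from torch_geometric.loader import LinkNeighborLoader

data = FakeHeteroDataset(num_node_types=2, num_edge_types=2)[0]
edge_type = data.edge_types[0]
label_index = data[edge_type].edge_index[:, :4]  # a few edges to score

def run(batch_size):
    loader = LinkNeighborLoader(
        data, num_neighbors=[10, 10],
        edge_label_index=(edge_type, label_index),
        batch_size=batch_size, shuffle=False)
    with torch.no_grad():
        return torch.cat([
            model(b.x_dict, b.edge_index_dict,  # model: your minimal model
                  b[edge_type].edge_label_index)
            for b in loader])

# Should hold for a deterministic model in eval() mode:
assert torch.allclose(run(1), run(4))
```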