High AUC for link prediction at initialization depending on `num_neighbors` value of `LinkNeighborLoader` #8782

kieranrcampbell · 2024-01-17T02:19:36Z

I'm seeing some very strange behaviour on a link prediction problem and would really appreciate any guidance.

The input graph is undirected, with 4367 nodes and 392027 edges.
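As a rough gauge of how dense this graph is, the quoted counts can be turned into a density and an average degree. This is only a sketch: it assumes 392027 counts each undirected edge once and that there are no self-loops, neither of which is stated in the post.

```python
# Figures quoted in the post. Assumption: each undirected edge is counted once.
num_nodes = 4367
num_undirected_edges = 392027

# Number of edges in a complete simple graph on the same nodes.
max_edges = num_nodes * (num_nodes - 1) // 2

density = num_undirected_edges / max_edges
avg_degree = 2 * num_undirected_edges / num_nodes

print(f"density: {density:.4f}, average degree: {avg_degree:.1f}")
# prints "density: 0.0411, average degree: 179.5"
```

Under those assumptions a typical node has roughly 180 neighbors, which matters for how much a full-neighborhood aggregation averages away.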

I then perform splitting using RandomLinkSplit:

from torch_geometric.transforms import RandomLinkSplit

transform = RandomLinkSplit(is_undirected=True,
                            num_val=0.25,
                            num_test=0.25,
                            add_negative_train_samples=False,
                            neg_sampling_ratio=1)
train_data, val_data, test_data = transform(data_no_search)

To illustrate the behaviour, I'm going to create two validation data loaders that differ only in num_neighbors, one with [-1] and one with [-1, 20, 5]:

from torch_geometric.loader import LinkNeighborLoader

val_loader = LinkNeighborLoader(
    val_data,
    num_neighbors=[-1],
    batch_size=128,
    edge_label_index=val_data.edge_label_index, # This and below were commented out
    edge_label=val_data.edge_label,
    neg_sampling_ratio=0,
    shuffle=True
)

val_loader2 = LinkNeighborLoader(
    val_data,
    num_neighbors=[-1, 20, 5],
    batch_size=128,
    edge_label_index=val_data.edge_label_index, # This and below were commented out
    edge_label=val_data.edge_label,
    neg_sampling_ratio=0,
    shuffle=True
)
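Each entry of num_neighbors corresponds to one sampling hop, and -1 means that all neighbors are kept at that hop, so [-1] builds a one-hop subgraph around the seed nodes while [-1, 20, 5] builds a three-hop one. The sketch below gives a crude upper bound on how many nodes a single seed could reach at each hop. The helper `max_nodes_per_hop` and its `full_degree` parameter are hypothetical, and the value 180 is only a rough average degree for the full graph, not a measured degree of val_data.

```python
def max_nodes_per_hop(num_neighbors, full_degree, num_nodes):
    """Upper bound on nodes reached at each hop from ONE seed node.

    A fan-out of -1 ("all neighbors") is replaced by `full_degree`,
    an assumed typical degree. Bounds are capped at `num_nodes`.
    """
    bounds = []
    frontier = 1  # start from a single seed node
    for fanout in num_neighbors:
        per_node = full_degree if fanout == -1 else fanout
        frontier = min(frontier * per_node, num_nodes)
        bounds.append(frontier)
    return bounds

print(max_nodes_per_hop([-1], 180, 4367))         # prints [180]
print(max_nodes_per_hop([-1, 20, 5], 180, 4367))  # prints [180, 3600, 4367]
```

These are loose bounds (real neighborhoods overlap heavily), but they show that the two loaders hand the encoder structurally very different subgraphs.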

I'm using the validation loader for an example here but the behaviour described below occurs for train and test splits also.

I can then draw a single sample from each:

v1 = next(iter(val_loader))
v2 = next(iter(val_loader2))

I then instantiate a model I'll call sl. This is essentially a torch_geometric.nn.GAE whose encoder has two GCNConv hidden layers with ReLU activations (though I don't think the precise architecture matters). We can then write a basic forward function (again, with no training whatsoever):

import torch

def basicforward(d):
    with torch.no_grad():
        x, edge_index = d.x, d.edge_index # message passing edges
        edge_label_index = d.edge_label_index # training edges
        edge_label = d.edge_label

        pos_label_index = edge_label_index[:, edge_label == 1]
        neg_label_index = edge_label_index[:, edge_label == 0]

        z = sl.encode(x, edge_index)
        loss = sl.recon_loss(z, pos_label_index, neg_label_index)
        auc, ap = sl.test(z, pos_label_index, neg_label_index)
        print(f"auc: {auc}, ap: {ap}, loss: {loss}")

Calling this on v1 and v2 gives:

In other words, with num_neighbors = [-1] we get the expected AUC (~0.5), while with num_neighbors = [-1,20,5] we get significantly better AUC than we'd expect at random, with losses that reflect this.
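For reference, an AUC near 0.5 is what the pairwise reading of ROC AUC gives whenever positive and negative edges receive indistinguishable scores. A minimal pure-Python sketch of that reading follows; `pairwise_auc` is a hypothetical helper for illustration, not the function the model's test method uses.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5, matching the usual ROC AUC convention.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

print(pairwise_auc([0.5, 0.5, 0.5], [0.5, 0.5]))  # identical scores, prints 0.5
print(pairwise_auc([0.9, 0.8], [0.2, 0.1]))       # fully separated, prints 1.0
```

So an AUC well above 0.5 at initialization means the untrained scores already rank positive pairs above negative ones more often than chance.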

I know num_neighbors controls how neighbors are sampled around each link, but I can't see how it would produce this behaviour. Again, there's absolutely no training here, so it's not an overfitting issue, and GCNConv normalizes by the degree matrix, so I can't see how it would be able to predict links in already dense parts of the graph.

Any insights would be hugely appreciated, thanks.

rusty1s · 2024-01-17T20:11:52Z

Maintainer

Really interesting, let me try to come up with an explanation :) Your graph seems to be very dense, which gives rise to the assumption that an untrained GCN yields approximately equal embeddings for every node in your graph (which would explain the 0.5 AUC). However, if neighbor sampling is used, node embeddings become much more discriminative, and with the inductive bias of the untrained GCN, the model is already able to find similar pairs of nodes.
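The first half of this explanation can be sanity-checked on a toy example. The sketch below is a simplification under stated assumptions: it uses a small random dense graph rather than the graph from the post, a single scalar feature per node, and plain mean aggregation in place of GCNConv's symmetric degree normalization. It compares how much the aggregated values vary across nodes when all neighbors are used versus when only a few are sampled.

```python
import random
import statistics

random.seed(0)
n, p, k = 200, 0.5, 5  # toy sizes (assumed), not the graph from the post

# Random dense undirected graph stored as adjacency lists.
neighbors = [[] for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        if random.random() < p:
            neighbors[i].append(j)
            neighbors[j].append(i)

# One random scalar feature per node.
x = [random.gauss(0.0, 1.0) for _ in range(n)]

def mean_aggregate(sample_size=None):
    """Average neighbor features, optionally after subsampling neighbors."""
    out = []
    for i in range(n):
        nbrs = neighbors[i]
        if not nbrs:  # isolated node: nothing to aggregate
            out.append(0.0)
            continue
        if sample_size is not None and len(nbrs) > sample_size:
            nbrs = random.sample(nbrs, sample_size)
        out.append(sum(x[j] for j in nbrs) / len(nbrs))
    return out

# Standard deviation across nodes: small means "all nodes look alike".
spread_full = statistics.pstdev(mean_aggregate())
spread_sampled = statistics.pstdev(mean_aggregate(sample_size=k))
print(f"full: {spread_full:.3f}, sampled: {spread_sampled:.3f}")
```

Averaging over about a hundred neighbors washes out most node-to-node differences, while averaging over five leaves far more variation. This only illustrates why sampled embeddings are less uniform; it does not by itself show why they rank true links higher.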

kieranrcampbell · 2024-04-08T21:52:52Z

Author

Hi @rusty1s
I realize I forgot to reply to say thanks for your explanation! I will close the issue, but hopefully this is useful for anyone seeing similar behaviour.
