Understanding how edge_index and edge_label_index relate to message passing #6923

davidfstein · 2023-03-15T19:45:36Z

Hi Everyone,

I'm working on a link prediction project using GNNs. So far I have achieved some promising results, but I wanted to understand the relationship between the edge indices and the message passing process to have more confidence that my results are legitimate. My understanding is that messages are passed along the edges held in the "edge_index" attribute, but not along the edges in the "edge_label_index" attribute. I believe the "edge_label_index" edges are only used for supervision and/or evaluation. Could anyone confirm whether this is true, or correct me if I'm wrong?

Thanks!

Answered by wsad1

wsad1 · 2023-03-16T05:30:32Z

Maintainer

Yep, you are right.
One thing to add: some edges in edge_label_index might also be in edge_index, meaning some edges can be used for both supervision and message passing.
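
To make the two roles concrete, here is a minimal, library-free sketch. It is not PyG code: the toy graph, the scalar node features, and the helper names `propagate` and `score` are all assumptions made for illustration. Messages flow only along `edge_index`, while `edge_label_index` merely selects which node pairs get scored. The pair (1, 2) appears in both lists, which is the overlap case mentioned above.

```python
# Toy graph; every value here is made up for illustration.
x = {0: 1.0, 1: 2.0, 2: 4.0, 3: 8.0}       # one scalar feature per node
edge_index = [(0, 1), (1, 2), (2, 3)]      # (source, target) edges used for message passing
edge_label_index = [(0, 3), (1, 2)]        # node pairs to score (supervision / evaluation)


def propagate(x, edge_index):
    """One round of mean aggregation: each target averages its incoming sources."""
    incoming = {node: [] for node in x}
    for src, dst in edge_index:
        incoming[dst].append(x[src])
    # Nodes without incoming edges keep their own feature.
    return {n: (sum(m) / len(m) if m else x[n]) for n, m in incoming.items()}


def score(h, edge_label_index):
    """Score each candidate pair as the product of its two node embeddings."""
    return [h[u] * h[v] for u, v in edge_label_index]


h = propagate(x, edge_index)          # depends on edge_index only
scores = score(h, edge_label_index)   # edge_label_index only picks pairs to score
```

Changing `edge_label_index` changes which pairs are scored but leaves `h` untouched, because `propagate` never sees it.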

1 reply

CtrlMj Nov 13, 2025

@wsad1 It's been a while, but I'm just going to ask anyway. If some edges in edge_label_index also exist in edge_index, wouldn't that result in data leakage? We would be using that subset of edges in our message passing calculations and then making predictions on that very subset.

davidfstein · 2023-03-16T15:38:05Z

Author

Got it, thank you! I implemented my own split to ensure that edge_label_index and edge_index are completely disjoint in the validation and test splits. I just wanted to confirm my understanding to rule out data leakage issues.
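
A quick way to verify that kind of disjointness is to compare the two edge sets directly. The sketch below is a hedged illustration rather than the split actually used in this thread: `overlapping_edges` is a hypothetical helper, and it assumes both indices have already been converted to plain `[sources, targets]` lists (for example via `.tolist()` on a tensor).

```python
def overlapping_edges(edge_index, edge_label_index):
    """Return the supervision pairs that also appear among the message-passing edges."""
    message_edges = set(zip(*edge_index))
    return [pair for pair in zip(*edge_label_index) if pair in message_edges]


# Toy 2 x E lists in [sources, targets] layout; values are made up.
edge_index = [[0, 1, 2], [1, 2, 3]]
edge_label_index = [[0, 1], [3, 2]]

leaked = overlapping_edges(edge_index, edge_label_index)
print(leaked)  # pairs used for both message passing and supervision
```

An empty result means the two sets are disjoint. If the graph stores each undirected edge in both directions, the reversed pair would need to be checked as well.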

meredith-martz · 2023-06-24T00:51:54Z

I'm a bit confused still about edge_label_index and edge_index. I followed the setup of creating a training & validation set plus loaders for each using the Link Prediction on MovieLens tutorial. I'm now trying to look at my validation results and assess the incorrect predictions.

I've modified the very last code block from the tutorial to include a list where I keep track of the edges assessed during the validation so that later I can look at the original data using the indices:

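import numpy as np
import torch
from sklearn.metrics import roc_auc_score
from tqdm import tqdm

# model, val_loader, and device are defined earlier in the tutorial.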
preds = []
ground_truths = []
node_a, node_b = [], []
for sampled_data in tqdm(val_loader):
    with torch.no_grad():
        sampled_data.to(device)
        preds.append(model(sampled_data))
        ground_truths.append(sampled_data["user", "rates", "movie"].edge_label)
        node_a += list(np.array(sampled_data["user", "rates", "movie"].edge_label_index[0]))
        node_b += list(np.array(sampled_data["user", "rates", "movie"].edge_label_index[1]))
pred = torch.cat(preds, dim=0).cpu().numpy()
ground_truth = torch.cat(ground_truths, dim=0).cpu().numpy()
auc = roc_auc_score(ground_truth, pred)
print()
print(f"Validation AUC: {auc:.4f}")

Specifically I added:

node_a += list(np.array(sampled_data["user", "rates", "movie"].edge_label_index[0]))
node_b += list(np.array(sampled_data["user", "rates", "movie"].edge_label_index[1]))

which I would have thought would give me a list of the node indices that make up each edge assessed, but when I look at the indices in my lists node_a and node_b, the indices don't match back up to my original data.

For example if I take node_a[i] and node_b[i] where ground_truth[i] == 1, I find that that pair of indices does not exist in my original edges in ("user", "rates", "movie").

Is there some transformation I need to do in order to get the original indices?

3 replies

meredith-martz Jun 24, 2023

I see there are n_id and e_id attributes, but I'm still not entirely sure how to use these to map back the indices coming out of edge_label_index. The values stored in edge_label_index don't match the dimensions of sampled_data["user", "rates", "movie"].e_id.

meredith-martz Jun 24, 2023

Never mind, I was able to figure it out. Here's my solution for any future folks looking at this:

cur_node_a = np.array(sampled_data["user"].n_id)[sampled_data["user", "rates", "movie"].edge_label_index[0]]
cur_node_b = np.array(sampled_data["movie"].n_id)[sampled_data["user", "rates", "movie"].edge_label_index[1]]
node_a += list(cur_node_a)
node_b += list(cur_node_b)

rusty1s Jun 27, 2023
Maintainer

Yes, that is correct. The sampled versions of edge_label_index remap the indices, and you can use n_id to map them back to the original index space.
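
As a small worked example of that remapping, the sketch below uses plain Python lists instead of tensors. The `n_id` values and the helper name `to_global` are invented for illustration; with real mini-batches the lists would come from `sampled_data["user"].n_id`, `sampled_data["movie"].n_id`, and `edge_label_index`, converted with `.tolist()`.

```python
def to_global(local_index, n_id):
    """Map batch-local node positions back to original node IDs via n_id."""
    return [n_id[i] for i in local_index]


# Hypothetical mini-batch: position i in the batch holds original node n_id[i].
user_n_id = [17, 4, 92]
movie_n_id = [305, 12]

# Batch-local (user, movie) pairs, as a sampled edge_label_index would store them.
edge_label_index = [[0, 2, 1], [1, 0, 1]]

global_users = to_global(edge_label_index[0], user_n_id)
global_movies = to_global(edge_label_index[1], movie_n_id)
print(list(zip(global_users, global_movies)))
```

Each node type has its own `n_id`, so the source row is looked up in the user mapping and the target row in the movie mapping.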

ecoy1 · 2023-07-04T07:01:58Z

Hi. I'm facing a similar issue understanding how these work. I'm using a different dataset, in which the edges for user 0 go to movies 0 through 50. So when I print the edge_index of data, I see pairs such as (0, 0) ... (0, 50). Once I apply RandomLinkSplit, however, I see new edges that weren't in the original graph.

To give an example, I printed the edge_index and edge_label_index of the train/val/test data, showing only the movies on user 0's edges:

Train edge_index:
tensor([19,  9, 47, 31, 15, 46, 22, 11, 24, 21, 13, 18, 50,  8, 35, 37, 33, 41,
        38,  2, 44,  5, 34, 23, 31, 39,  0, 30, 45, 10,  6,  4, 27, 48, 29, 26,
        25, 42, 14, 43, 40,  3])
Train edge_label_index:
tensor([19,  9, 47, 31, 15, 46, 22, 11, 24, 21, 13, 18, 50,  8, 35, 37, 33, 41,
        38,  2, 44,  5, 34, 23, 31, 39,  0, 30, 45, 10,  6,  4, 27, 48, 29, 26,
        25, 42, 14, 43, 40,  3])
Val edge_index:
tensor([19,  9, 47, 31, 15, 46, 22, 11, 24, 21, 13, 18, 50,  8, 35, 37, 33, 41,
        38,  2, 44,  5, 34, 23, 31, 39,  0, 30, 45, 10,  6,  4, 27, 48, 29, 26,
        25, 42, 14, 43, 40,  3])
Val edge_label_index:
tensor([   28,    49,    20,    32,    12,    16,     7,    17,  2799, 24912,
        26827,  2900, 19263,  7033, 28625, 33115, 11532, 16189, 18671])
Test edge_index:
tensor([19,  9, 47, 31, 15, 46, 22, 11, 24, 21, 13, 18, 50,  8, 35, 37, 33, 41,
        38,  2, 44,  5, 34, 23, 31, 39,  0, 30, 45, 10,  6,  4, 27, 48, 29, 26,
        25, 42, 14, 43, 40,  3, 28, 49, 20, 32, 12, 16,  7, 17])
Test edge_label_index:
tensor([    1,    36,   711, 19846])

And this is how I call RandomLinkSplit:

import torch_geometric.transforms as T

transform = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    #disjoint_train_ratio=0.3,
    #neg_sampling_ratio=2.0,
    add_negative_train_samples=False,
    edge_types=("user", "contains", "movie"),
    rev_edge_types=("movie", "rev_contains", "user"),
)
train_data, val_data, test_data = transform(data)

It all makes sense up to the val_data edge_label_index. I don't understand where the values 2799, 24912, 26827, 2900, 19263, 7033, 28625, 33115, 11532, 16189, 18671 come from. These edges don't exist in the original data, and I'm also not doing negative sampling. The same goes for the test edge_label_index, with 711 and 19846.

Alongside this, I also wanted to ask whether it's possible to have RandomLinkSplit split the links per user, so that the val and test ratios apply to each user's links rather than to all the links in the graph.

Thanks a lot

1 reply

rusty1s Jul 9, 2023
Maintainer

Do you have a minimal example to reproduce? It does indeed look like negative sampling is being applied somewhere.
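
One way to test the negative-sampling hypothesis is to look at the labels stored alongside edge_label_index. The RandomLinkSplit documentation lists neg_sampling_ratio with a default of 1.0 and describes add_negative_train_samples as controlling negative training samples only, so it seems plausible that the validation and test splits still receive negatives here; treat that as an assumption to verify rather than a confirmed diagnosis. The sketch below is library-free: `split_by_label` is a hypothetical helper, the movie IDs are taken from the test output above, and the label values are invented for illustration.

```python
def split_by_label(edge_label_index, edge_label):
    """Separate labelled pairs into positives (label 1) and negatives (label 0)."""
    positives, negatives = [], []
    sources, targets = edge_label_index
    for src, dst, label in zip(sources, targets, edge_label):
        (positives if label == 1 else negatives).append((src, dst))
    return positives, negatives


# User 0's test pairs from the post above; the labels are ASSUMED, not from the thread.
edge_label_index = [[0, 0, 0, 0], [1, 36, 711, 19846]]
edge_label = [1.0, 1.0, 0.0, 0.0]

positives, negatives = split_by_label(edge_label_index, edge_label)
print("positives:", positives)
print("negatives:", negatives)
```

If the unfamiliar pairs all carry label 0, they are sampled negatives rather than edges that were somehow added to the graph.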
