What am I doing wrong when trying to perform Link Prediction in a Heterogeneous Graph? #7264

jameswpm · 2023-04-30T08:08:27Z

I posted a question on the CrossValidated SE site to get broader attention and better understand how to work with link prediction, but I guess I'll have more luck posting my questions here from now on (I was not aware of this Q&A section on the repository, but I'm glad to discover it 😁)

To understand GNNs better, I copied the code from this blog post in the PyG documentation. I ran it without modifications, and it works as described in the post. It performs link prediction on the MovieLens dataset.

After some tests, I noted that, indeed, the standard pipeline for Link prediction is to predict the missing links in the same graph/dataset, so I have the answer to my previous question.

I'm wondering why my predictions are not working.

Assuming I will make a prediction in the same graph, I did the following:

1. Get the ideal cut-off threshold based on the ROC curve.
2. Create a loop to iterate over all users, defining a link with every movie.
3. Pass this fully connected graph to my model to get predictions.
4. Get the probabilities and filter them based on the threshold.

Code:

# get threshold after training
import numpy as np
import torch
from sklearn.metrics import roc_auc_score, roc_curve
from tqdm import tqdm

auc = roc_auc_score(ground_truth, pred)  # score used in the blog post

fpr, tpr, roc_tr = roc_curve(ground_truth, pred)
# From https://stackoverflow.com/a/48218787/3943162
optimal = np.argmax(tpr - fpr)
threshold = roc_tr[optimal]

total_users = len(unique_user_id)
total_movies = len(movies_df)
predictions = []
for user_id in tqdm(range(0, total_users)):
    # Pair this user with every movie
    user_row = torch.tensor([user_id] * total_movies)
    all_movie_ids = torch.arange(total_movies)
    edge_label_index = torch.stack([user_row, all_movie_ids], dim=0)
    data["user", "rates", "movie"].edge_label_index = edge_label_index

    with torch.no_grad():
        pred = model(data)

    probabilities = torch.sigmoid(pred)
    pred_labels = (pred > threshold).long()

    for idx, elem in enumerate(pred_labels):
        if elem.item() == 1:
            movie_id = edge_label_index[1][idx]
            predictions.append(movie_id)
            print("The user {} will enjoy the movie {}".format(user_id, movie_id))

I got the idea for this loop from this blog post. The difference is that in the post, they are predicting the rating for the movie, and I want to predict just the existence of the link.

My problem is that pred_labels is a tensor full of zeros, which means I can't predict any new links (even the original ones are not correctly predicted). What am I doing wrong? Is this the right way to do link prediction?

Answered by devanshamin

devanshamin · 2023-04-30T14:06:00Z

Hi @jameswpm,

Solution:

You're not using the probabilities to compute pred_labels. Replace pred in pred_labels = (pred > threshold ).long() with probabilities.
I would recommend you look at your chosen threshold.

I added your code (with minor modifications) to the Google Colab notebook shared by the PyG team for this dataset and it worked fine.

from tqdm.auto import tqdm

model = model.cpu() 
model.eval() 

total_users = len(unique_user_id) 
total_movies = len(movies_df) 
threshold = 0.5
predictions = {} 

for user_id in tqdm(range(0, total_users)): 
    user_row = torch.tensor([user_id] * total_movies) 
    all_movie_ids = torch.arange(total_movies) 
    edge_label_index = torch.stack([user_row, all_movie_ids], dim=0) 
    data["user", "rates", "movie"].edge_label_index = edge_label_index
    
    with torch.no_grad(): 
        pred = model(data) 
    probabilities = torch.sigmoid(pred) 
    pred_labels = (probabilities > threshold).long() 
    
    recommended_movies = all_movie_ids[pred_labels == 1].tolist() 
    predictions[user_id] = recommended_movies
    print("The user {} will enjoy the following movies:\n{}".format(user_id, recommended_movies), end="\n\n")
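On the threshold point above: a cut-off taken from roc_curve lives in whatever score space was passed to roc_curve. If raw model outputs (logits) were passed, the cut-off is a logit, and comparing sigmoid probabilities against it (or the reverse) mixes two scales. The sketch below is only an illustration on made-up toy data; the helper youden_threshold and all values are hypothetical and not taken from the thread. It uses the same "positive when score >= cut-off" convention as scikit-learn's roc_curve, written in plain Python so it runs without any dependencies.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def youden_threshold(labels, scores):
    """Return the cut-off maximizing TPR - FPR, predicting 1 when score >= cut-off."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        j = tp / pos - fp / neg
        if j > best_j:  # strict comparison keeps the lowest cut-off on ties
            best_j, best_t = j, t
    return best_t


# Hypothetical toy data: binary ground truth and raw model outputs (logits).
labels = [0, 0, 1, 0, 1, 1, 0, 1]
logits = [-2.1, -0.7, -0.3, 0.2, 0.9, 1.5, -1.2, 2.4]
probs = [sigmoid(z) for z in logits]

thr_logit = youden_threshold(labels, logits)  # cut-off in logit space
thr_prob = youden_threshold(labels, probs)    # cut-off in probability space

# Consistent pairings give identical labels, because sigmoid is monotonic.
labels_from_logits = [int(z >= thr_logit) for z in logits]
labels_from_probs = [int(p >= thr_prob) for p in probs]

# Mixing spaces (probabilities against a logit cut-off) gives different labels.
mixed = [int(p >= thr_logit) for p in probs]

print(thr_logit, thr_prob)
print(labels_from_logits == labels_from_probs, mixed == labels_from_logits)
```

In practice this means either computing the ROC threshold on torch.sigmoid(pred) and comparing probabilities against it, or keeping both the ROC inputs and the comparison in logit space.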

jameswpm · 2023-04-30T16:36:49Z

Hi, thanks for the answer @devanshamin. You are right; I hadn't tried running my code in their Google Colab notebook. As you mentioned, using probabilities to get the recommendations did the trick.

My confusion is probably due to a change I made to the original code, which may cause the strange behavior: I changed the encoder to use the movie title instead of the genre.

I did it because I want to build more of a "predictor" than a "recommender": I want to check whether my model can predict the movies a user has already rated with very high precision (the title may not be the best feature, but it was worth a try).

With your confirmation, I know my prediction strategy is correct, and my model is probably wrong (or at least not well suited) for a predictor.

If you have any tips on how to adapt this example to make more precise predictions and not only recommendations, please let me know, but my original question is answered.
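One way to make the "predict the movies the user already rated with high precision" check concrete is to compare each user's predicted movie set with the set of movies that user actually rated. The sketch below is a minimal illustration under assumptions: the function name precision_recall, the dictionary layout, and the toy IDs are all hypothetical; in the thread's setup the predicted sets would come from the predictions dictionary built by the inference loop.

```python
def precision_recall(predicted, known):
    """Precision and recall of a predicted movie set against the known rated set."""
    predicted, known = set(predicted), set(known)
    hits = len(predicted & known)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(known) if known else 0.0
    return precision, recall


# Hypothetical toy data: user -> predicted movie IDs, user -> movies already rated.
predictions = {0: [1, 5, 7, 9], 1: [2, 3]}
rated = {0: {1, 7, 8}, 1: {4}}

scores = {
    user_id: precision_recall(movies, rated[user_id])
    for user_id, movies in predictions.items()
}

for user_id, (precision, recall) in scores.items():
    print("user {}: precision={:.2f} recall={:.2f}".format(user_id, precision, recall))
```

Low precision here does not necessarily mean the model is wrong: a link predictor is trained to score unseen user-movie pairs highly as well, so predicted movies outside the rated set count against precision even when they are reasonable recommendations.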

2 replies

devanshamin Apr 30, 2023

Below are a few things I would recommend:

Modify the inference script to recommend movies based on a predicted rating threshold:

import torch
from tqdm.auto import tqdm

model = model.cpu()
model.eval()
total_users = len(unique_user_id)
total_movies = len(movies_df)
prediction_threshold = 0.5
rating_threshold = 4
predictions = {}

# Assumes user IDs are mapped to consecutive 0-based indices, as in the first script
for user_id in tqdm(range(total_users)):
    user_row = torch.tensor([user_id] * total_movies)
    all_movie_ids = torch.arange(total_movies)
    edge_label_index = torch.stack([user_row, all_movie_ids], dim=0)
    data["user", "rates", "movie"].edge_label_index = edge_label_index

    with torch.no_grad():
        pred = model(data)

    # Treats the raw model output as a rating on a 0-5 scale
    pred_ratings = pred.clamp(min=0, max=5)
    probabilities = torch.sigmoid(pred)
    pred_labels = (probabilities > prediction_threshold).long()

    recommend_movies = all_movie_ids[pred_labels == 1]
    recommend_movie_ratings = pred_ratings[pred_labels == 1]
    # Group the candidate movies by their rounded predicted rating
    rating_to_movies = {}
    for movie, rating in zip(recommend_movies, recommend_movie_ratings):
        rating_to_movies.setdefault(round(rating.item(), 1), []).append(movie.item())

    # Recommend movies with a predicted rating greater than `rating_threshold`
    recommended_movies = recommend_movies[recommend_movie_ratings > rating_threshold].tolist()
    predictions[user_id] = recommended_movies
    print("The user {} will enjoy the following movies:\n{}".format(user_id, recommended_movies), end="\n\n")

Use a metric such as Mean Reciprocal Rank (MRR). MRR is the average, over users, of the reciprocal rank of the first correct prediction.
In the Google Colab notebook, the rating values (i.e., 5.0, 4.0, 3.5, etc.) are not used. You can use the rating values as edge_attr (i.e., data["movie", "rates", "user"].edge_attr = edge_attr) and modify the Model class to utilize edge_attr.
If you want your model to predict accurate ratings, you can frame the task as multi-class classification where each class is one of the rating values (i.e., 5.0, 4.0, 2.5, etc.). Note: For this dataset the ratings are categorical.
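As a rough illustration of the MRR suggestion, the sketch below computes it from per-user ranked lists. It is a minimal version under assumptions: the function name mean_reciprocal_rank, the dictionary layout, and the toy IDs are hypothetical, and users with no correct movie in their list contribute 0 (one common convention, not the only one).

```python
def mean_reciprocal_rank(ranked_by_user, relevant_by_user):
    """Average of 1 / rank of the first relevant movie in each user's ranked list."""
    if not ranked_by_user:
        return 0.0
    total = 0.0
    for user_id, ranked in ranked_by_user.items():
        relevant = relevant_by_user.get(user_id, set())
        for rank, movie_id in enumerate(ranked, start=1):
            if movie_id in relevant:
                total += 1.0 / rank
                break  # only the first hit counts for this user
    return total / len(ranked_by_user)


# Hypothetical toy data: movies sorted by descending predicted score per user.
ranked = {0: [10, 11, 12], 1: [20, 21, 22], 2: [30, 31, 32]}
# Held-out movies each user actually rated.
relevant = {0: {10}, 1: {22}, 2: {99}}

mrr = mean_reciprocal_rank(ranked, relevant)
print("MRR = {:.4f}".format(mrr))
```

With the inference loop above, a ranked list for one user could be obtained by sorting the movie indices by score, for example with torch.argsort(pred, descending=True).tolist().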

jameswpm May 1, 2023
Author

Thank you for the tips. They'll help me a lot.

I struggled with most of the papers and tutorials I found, since most of them simply do not show the final inference part of the code and stop at the validation/test scores.

I get that this is useful (and enough to show the use of GNNs), but seeing how new links are actually inferred also helps a lot in understanding the problem.

Your suggestion will help with it!
