What am I doing wrong when trying to perform link prediction in a heterogeneous graph? #7264
I posted a question on the CrossValidated SE site to get broader attention and to better understand how link prediction works, but I guess I'll have more luck posting my questions here from now on (I was not aware of this Q&A section in the repository, but I'm glad to have discovered it 😁).

Trying to understand GNNs better, I copied the code from this blog post in the PyG documentation without modifications, and it works as described in the post. It is link prediction on the MovieLens dataset. After some tests, I noticed that the standard pipeline for link prediction is indeed to predict the missing links in the same graph/dataset, so I have the answer to my previous question.

Now I'm wondering why my predictions are not working. Assuming I make the predictions on the same graph, I did the following:
Code:

# get threshold after training
auc = roc_auc_score(ground_truth, pred)  # score used in the blog post
from sklearn.metrics import roc_curve
fpr, tpr, roc_tr = roc_curve(ground_truth, pred)
# From https://stackoverflow.com/a/48218787/3943162
optimal = np.argmax(tpr - fpr)
threshold = roc_tr[optimal]

total_users = len(unique_user_id)
total_movies = len(movies_df)
predictions = []
for user_id in tqdm(range(0, total_users)):
    # score this user against every movie
    user_row = torch.tensor([user_id] * total_movies)
    all_movie_ids = torch.arange(total_movies)
    edge_label_index = torch.stack([user_row, all_movie_ids], dim=0)
    data["user", "rates", "movie"].edge_label_index = edge_label_index
    with torch.no_grad():
        pred = model(data)
    probabilities = torch.sigmoid(pred)
    pred_labels = (pred > threshold ).long()
    for idx, elem in enumerate(pred_labels):
        if elem.item() == 1:
            movie_id = edge_label_index[1][idx]
            predictions.append(movie_id)
            print("The user {} will enjoy the movie {}".format(user_id, movie_id))
I got the idea for this loop from this blog post. The difference is that the post predicts the rating for the movie, whereas I only want to predict the existence of the link. My problem is: My
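For readers unfamiliar with the threshold selection in the snippet above: taking the argmax of TPR minus FPR picks the cutoff that maximizes Youden's J statistic. The sketch below re-implements that idea with the standard library only, on made-up labels and scores, to show what the `roc_curve` step computes. `youden_threshold` is a hypothetical helper name, not part of scikit-learn or PyG, and the returned cutoff lives on the same scale as the scores passed in (raw scores in, raw-score cutoff out).

```python
def youden_threshold(labels, scores):
    """Return (cutoff, J) where the cutoff maximizes TPR - FPR (Youden's J).

    A sample counts as predicted positive when score >= cutoff, which is
    the convention scikit-learn's roc_curve uses for its thresholds.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_j, best_cutoff = float("-inf"), None
    # Every distinct score is a candidate cutoff, highest first.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_j, best_cutoff = j, cutoff
    return best_cutoff, best_j


# Made-up ground truth and raw model scores, for illustration only.
labels = [1, 1, 0, 1, 1, 0, 0, 0]
scores = [2.0, 1.2, 0.8, 0.5, 0.1, -0.4, -1.0, -2.0]

cutoff, j = youden_threshold(labels, scores)
print(cutoff, j)
```

Whatever scale the cutoff is computed on (raw scores or sigmoid probabilities), the later comparison has to use values on that same scale.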
Hi @jameswpm,

Solution: use `probabilities` to compute `pred_labels`. Replace `pred` in `pred_labels = (pred > threshold ).long()` with `probabilities`.

I added your code (with minor modifications) to the Google Colab notebook shared by the PyG team for this dataset and it worked fine.

from tqdm.auto import tqdm

model = model.cpu()
model.eval()

total_users = len(unique_user_id)
total_movies = len(movies_df)
threshold = 0.5
predictions = {}
for user_id in tqdm(range(0, total_users)):
    # score this user against every movie
    user_row = torch.tensor([user_id] * total_movies)
    all_movie_ids = torch.arange(total_movies)
    edge_label_index = torch.stack([user_row, all_movie_ids], dim=0)
    data["user", "rates", "movie"].edge_label_index = edge_label_index
    with torch.no_grad():
        pred = model(data)
    probabilities = torch.sigmoid(pred)
    # threshold the probabilities, not the raw model output
    pred_labels = (probabilities > threshold).long()
    recommended_movies = all_movie_ids[pred_labels == 1].tolist()
    predictions[user_id] = recommended_movies
    print("The user {} will enjoy the following movies:\n{}".format(user_id, recommended_movies), end="\n\n")
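Why the swap matters, assuming the model returns raw scores (logits), as the `torch.sigmoid` call in the thread suggests: a cutoff meant for probabilities does not mean the same thing when applied to logits. The sketch below uses only the standard library and made-up logits to show that a 0.5 cutoff on probabilities corresponds to a 0 cutoff on logits. `logit_threshold` is a hypothetical helper for illustration, not a PyG or PyTorch function.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit_threshold(p: float) -> float:
    """Hypothetical helper: the logit-scale cutoff equivalent to probability p."""
    return math.log(p / (1.0 - p))


# Made-up raw model outputs, for illustration only.
logits = [-2.0, -0.1, 0.2, 0.4, 0.6, 3.0]
threshold = 0.5

# Mismatch: raw logits compared against a probability cutoff.
labels_from_logits = [int(x > threshold) for x in logits]

# Fix used in the answer: compare probabilities against the probability cutoff.
labels_from_probs = [int(sigmoid(x) > threshold) for x in logits]

# Equivalent alternative: keep the logits and convert the cutoff instead.
labels_from_converted = [int(x > logit_threshold(threshold)) for x in logits]

print(labels_from_logits)
print(labels_from_probs)
print(labels_from_converted)
```

With these numbers the mismatched comparison drops the two weakly positive links (logits 0.2 and 0.4) that the probability-based comparison keeps.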
Hi, thanks for the answer, @devanshamin. You are right; I hadn't tried running the code in their Google Colab. As you mentioned, using the probabilities to get the recommendations did the trick.

My confusion probably comes from a change I made to the original code, which may cause the strange behavior: I changed the encoder to use the movie title instead of the genre. I did it because I want to build more of a "predictor" than a "recommender", so I want to check whether my predictor can predict, with very high precision, the movies the user has already rated (maybe the title is not the best feature, but it was a try). With your confirmation, I know my prediction strategy is correct and my model is probably wrong (or at least not well suited) for a predictor.

If you have any tips on how to adapt this example to make more precise predictions and not only recommendations, please let me know, but my original question is answered.
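One possible way to quantify the "predictor" goal described above is to compare each user's predicted movies with the movies that user is known to have rated, and report precision and recall per user. This is only a sketch on made-up data: `precision_recall` and the `rated` mapping are hypothetical names, and `predictions` is assumed to have the user-id-to-movie-ids shape built in the answer's loop.

```python
def precision_recall(predicted, known):
    """Precision and recall of predicted movie ids against known rated ids."""
    predicted, known = set(predicted), set(known)
    hits = len(predicted & known)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(known) if known else 0.0
    return precision, recall


# Made-up data: predicted links per user, and the movies each user rated.
predictions = {0: [1, 2, 3, 4], 1: [5], 2: []}
rated = {0: [2, 4, 9], 1: [5, 6], 2: [7]}

report = {user: precision_recall(predictions[user], rated[user])
          for user in predictions}

for user, (precision, recall) in report.items():
    print(f"user {user}: precision={precision:.2f} recall={recall:.2f}")
```

Measuring this on edges held out from training, rather than on edges the model has already seen, would probably give a more honest picture of predictive quality.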