Edge-level multilabel classification #7873

songsong0425 · 2023-08-11T09:10:25Z

songsong0425
Aug 11, 2023

Hello, and thank you as always for your great help.
I have a question about multi-label classification and link prediction.

I'm trying to perform multi-label edge classification, similar to multi-label image classification.
I wrote the code below, but some parts are unclear to me.
Let me describe the code step by step.

First, I split the data into training:val:test = 7:1:2. For simplicity, I only attach the results for the training data.

# Each edge has one or two of five labels, encoded as a multi-hot vector
train_data
# Data(edge_index=[2, 2159112], pos_edge_label=[3955], pos_edge_label_index=[2, 3955], edge_label=[3955, 5], edge_label_index=[2, 3955], x=[47822, 512])

train_data.edge_label
# tensor([[1, 0, 0, 0, 0],
#         [0, 0, 0, 0, 1],
#         [0, 0, 1, 0, 0],
#         ...,
#         [1, 0, 0, 0, 0],
#         [0, 0, 0, 0, 1],
#         [0, 0, 0, 0, 1]])

After defining the datasets, I used LinkNeighborLoader as the data loader because I'd like to run a link prediction task.

from torch_geometric.loader import LinkNeighborLoader

train_loader = LinkNeighborLoader(train_data, edge_label_index=train_data.edge_label_index, batch_size=16, shuffle=True, neg_sampling_ratio=1.0, num_neighbors=[3,3])

To check the data in the loader, I ran the code below and found that edge_label lost its five label columns: each batch now has edge_label=[2].

for i in test_loader:
    print(i)
# Data(edge_index=[2, 48], pos_edge_label=[1129], pos_edge_label_index=[2, 1129], edge_label=[2], edge_label_index=[2, 2], x=[43, 512], n_id=[43], e_id=[48], num_sampled_nodes=[3], num_sampled_edges=[2], input_id=[1])
# Data(edge_index=[2, 46], pos_edge_label=[1129], pos_edge_label_index=[2, 1129], edge_label=[2], edge_label_index=[2, 2], x=[45, 512], n_id=[45], e_id=[46], num_sampled_nodes=[3], num_sampled_edges=[2], input_id=[1])
# ...

Before I noticed this, I was glad to get good performance. But the model returns probabilities for only two labels, which does not match my goal.

model_opt_a = GAT().to(device)
model_opt_a.load_state_dict(torch.load('checkpoint.pt'))
model_opt_a.eval()

y_pred, y_pred_prob, y_true = [], [], []
for data in tqdm(test_loader):
    #data = data.to(device)
    y_true.append(data.edge_label)
    
    z, a1, a2 = model_opt_a(data.x.to(device), data.edge_index.to(device))
    out = ((z[data.edge_label_index[0]] * z[data.edge_label_index[1]]).sum(dim=-1)).view(-1)
    out_sig = ((z[data.edge_label_index[0]] * z[data.edge_label_index[1]]).sum(dim=-1)).view(-1).sigmoid()
    y_pred.append((out_sig>0.5).float().cpu())
    y_pred_prob.append((out_sig).float().cpu())
    
y, pred, pred_prob = torch.cat(y_true, dim=0).numpy(), torch.cat(y_pred, dim=0).numpy(), torch.cat(y_pred_prob, dim=0).detach().numpy()

y_true
# [tensor([1., 0.]),
# tensor([1., 0.]),
# tensor([1., 0.]),
# ...

out_sig
# tensor([0.8084, 0.5471], device='cuda:0', grad_fn=<SigmoidBackward0>)

Does anyone have an idea how to deal with this problem? Also, if you know how to do multi-label classification with link prediction, please let me know.
Thank you for reading!

wsad1 · 2023-08-11T13:32:30Z

wsad1
Aug 11, 2023
Maintainer

To fix your issue, pass your edge labels to the edge_label argument of LinkNeighborLoader. As described in the docs, if you don't pass edge_label, it is assumed to be a tensor of zeros.
Also, leave neg_sampling_ratio as None, since it's not clear what the label of a negative edge would be in this case.

1 reply

songsong0425 Aug 11, 2023
Author

Thank you for your reply! The problem was resolved by passing the edge_label argument in LinkNeighborLoader.

songsong0425 · 2023-08-12T08:47:15Z

songsong0425
Aug 12, 2023
Author

Sorry to bother you, but may I ask one more question?

When I trained the model with the five-dimensional labels, following the PPI example in the PyG code, I got the error below:
ValueError: Target size (torch.Size([16, 5])) must be the same as input size (torch.Size([16]))
Although the error message points to a shape mismatch between the tensors passed to the loss function, I couldn't find where to check at first.
I solved the error above by removing sum and view from the out calculation.

def train(num_epochs):
    for epoch in range(1, num_epochs+1):
        tr_losses = 0
        val_losses = 0
        
        model.train()
        for data in tqdm(train_loader):
            data = data.to(device)
            optimizer.zero_grad()
        
            z, a1, a2 = model(data.x, data.edge_index)
            
            out = ((z[data.edge_label_index[0]] * z[data.edge_label_index[1]]))
            tr_loss = criterion(out, data.edge_label)
            tr_losses += tr_loss.item()
            tr_loss.backward()
            optimizer.step()
        avg_tr_loss = tr_losses/len(train_loader.dataset)
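
Note that the element-wise product only matches a [16, 5] target when the embedding size happens to equal the number of labels. One possible alternative is a small linear head that maps the product to one logit per label. This is only a sketch: EdgeMultiLabelHead, the hidden size of 64, and the random tensors are assumptions for illustration and are not part of the original model.

```python
import torch
from torch import nn


class EdgeMultiLabelHead(nn.Module):
    """Hypothetical decoder: one logit per label for each candidate edge."""

    def __init__(self, hidden_dim, num_labels):
        super().__init__()
        self.lin = nn.Linear(hidden_dim, num_labels)

    def forward(self, z, edge_label_index):
        src, dst = edge_label_index
        # Element-wise product keeps the feature dimension; the linear
        # layer then projects it to [num_edges, num_labels].
        return self.lin(z[src] * z[dst])


torch.manual_seed(0)
z = torch.randn(43, 64)                            # stand-in node embeddings
edge_label_index = torch.randint(0, 43, (2, 16))   # 16 edges in the batch
edge_label = torch.randint(0, 2, (16, 5))          # multi-hot targets

head = EdgeMultiLabelHead(hidden_dim=64, num_labels=5)
criterion = nn.BCEWithLogitsLoss()

logits = head(z, edge_label_index)
loss = criterion(logits, edge_label.float())       # targets must be float
loss.backward()

# For comparison: the dot-product decoder yields a single score per edge,
# which is what triggers the size mismatch against a [16, 5] target.
dot = (z[edge_label_index[0]] * z[edge_label_index[1]]).sum(dim=-1)
print(logits.shape, dot.shape)
```

Because BCEWithLogitsLoss averages over the batch by default, dividing the summed per-batch losses by the number of batches (len(train_loader)) rather than by len(train_loader.dataset) may give a more interpretable epoch average.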

I'm also confused about how to calculate accuracy and other evaluation metrics for multiple labels. I currently compute the metrics as below, but the performance looks very poor. Are there any problems with this code?

        model.eval()
        with torch.no_grad():
            y_val_pred, y_val_pred_prob, y_val_true = [], [], []
            for data in tqdm(val_loader):
                #data = data.to(device)
                y_val_true.append(data.edge_label)
                
                z, a1, a2 = model(data.x.to(device), data.edge_index.to(device))
                out = ((z[data.edge_label_index[0]] * z[data.edge_label_index[1]]))
                out_sig = ((z[data.edge_label_index[0]] * z[data.edge_label_index[1]])).sigmoid()
                val_loss = criterion(out, data.edge_label.float().to(device))
                val_losses += val_loss.item()
                #y_val_pred.append((out>0).float().cpu())
                y_val_pred.append((out_sig>0.5).float().cpu())
                y_val_pred_prob.append((out_sig).float().cpu())
                
        avg_val_loss = val_losses/len(val_loader.dataset)
        y, pred, pred_prob = torch.cat(y_val_true, dim=0).numpy(), torch.cat(y_val_pred, dim=0).numpy(), torch.cat(y_val_pred_prob, dim=0).numpy()
        val_f1 = f1_score(y, pred, average='weighted')
        val_auc = roc_auc_score(y, pred_prob, average='weighted')
        val_aupr = average_precision_score(y, pred_prob, average='weighted')
        val_acc = accuracy_score(y, pred)
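
One thing that may partly explain low numbers: for multi-label indicator input, scikit-learn's accuracy_score computes subset accuracy, so a sample counts as correct only if all of its labels match. The sketch below compares it with a per-label accuracy (one minus the Hamming loss) and micro-averaged F1 on a tiny hand-made example. The function name multilabel_report and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss


def multilabel_report(y_true, y_prob, threshold=0.5):
    """Hypothetical helper: threshold probabilities and compare metrics."""
    y_pred = (y_prob > threshold).astype(int)
    return {
        # Exact-match ratio: every label of a sample must be right.
        "subset_accuracy": accuracy_score(y_true, y_pred),
        # Fraction of individual label decisions that are right.
        "per_label_accuracy": 1.0 - hamming_loss(y_true, y_pred),
        # F1 over all (sample, label) pairs pooled together.
        "micro_f1": f1_score(y_true, y_pred, average="micro", zero_division=0),
    }


# Four edges, five labels (toy values).
y_true = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
])
y_prob = np.array([
    [0.9, 0.1, 0.2, 0.1, 0.3],   # exact match
    [0.2, 0.1, 0.1, 0.6, 0.8],   # one extra label predicted
    [0.7, 0.2, 0.4, 0.1, 0.1],   # one label missed
    [0.1, 0.8, 0.3, 0.2, 0.1],   # exact match
])

report = multilabel_report(y_true, y_prob)
print(report)
```

In this toy case only two of four samples match exactly, even though 18 of the 20 individual label decisions are correct, so subset accuracy alone can understate how well the model is doing.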

0 replies

sadrahkm · 2024-04-04T14:01:59Z

sadrahkm
Apr 4, 2024

Could you please provide a little more detail about how you created the Data object and applied RandomLinkSplit for multilabel classification? I'm facing the same challenge and don't know exactly how to prepare the data for a multilabel problem.
Also, do you know of any resources discussing the multilabel problem in PyG?

1 reply

songsong0425 Apr 5, 2024
Author

I simply converted the labels to a tensor and passed them to the Data object.

label_tensor = torch.tensor(df.values) # Skip this if the labels are already a tensor.
ex = Data(..., edge_label=label_tensor, ...)
print(ex.edge_label)
# tensor([[0.0, 1.0, 0.0, 0.0],
#         ...,
#         [0.0, 0.0, 0.0, 1.0]], dtype=torch.float32)

transform = RandomLinkSplit(num_val=0.1, num_test=0.2, ...)
tr, val, ts = transform(ex)

Unfortunately, I don't know of any resources on multi-label classification with PyG.
