GNNs for imbalanced node classification #2886

bsaghafi · 2021-07-21T16:36:59Z

With very imbalanced data, my experience is that GNN methods like GraphSAGE and GCN perform poorly. Even though I weight the loss function by the class ratio, the classifier still predicts only the majority class. Is there any feature or method other than loss weighting that can be used here?

For better context, my problem is binary classification with a class ratio of 400:1. I use the ROC AUC metric on the validation set to choose the number of training epochs. I have also tried other metrics such as PR AUC and F1-score.
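For readers following along, here is a minimal sketch of what "weighting the loss by the class ratio" can look like. It is plain Python rather than the poster's actual code; the helper names and the toy labels are assumptions made for illustration. In PyTorch, weights like these would typically be passed through the `weight` argument of `torch.nn.CrossEntropyLoss`.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = N / (num_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_log_loss(labels, probs_pos, weights):
    """Weighted binary cross-entropy, normalised by the total weight."""
    eps = 1e-12
    total, norm = 0.0, 0.0
    for y, p in zip(labels, probs_pos):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        w = weights[y]
        total += -w * (math.log(p) if y == 1 else math.log(1.0 - p))
        norm += w
    return total / norm


# Hypothetical toy labels with the 400:1 ratio mentioned above.
labels = [0] * 400 + [1]
weights = inverse_frequency_weights(labels)

# A classifier that always answers "almost certainly negative".
always_negative = [0.01] * len(labels)

unweighted = weighted_log_loss(labels, always_negative, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, always_negative, weights)
print(weights)
print(unweighted, weighted)
```

Under these assumptions the majority-only predictor gets a low unweighted loss but a much higher weighted one, which is the effect the weighting is meant to have. It does not guarantee that the model actually learns to separate the classes.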

rusty1s · 2021-07-22T15:15:19Z

Maintainer

This also aligns with my impression that GNNs on imbalanced labels do perform poorly, although I feel this is a problem with NNs in general. As an alternative to loss re-weighting, you can also over- or under-sample the respective labels, although this requires you to apply GNNs in mini-batch mode, e.g., via NeighborSampler.
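A rough sketch of the under-sampling idea, assuming the resampling is applied only to the list of training seed nodes while the graph itself stays untouched. The function name and the `ratio` knob are hypothetical; the resulting index list is the kind of thing one would hand to a mini-batch loader as its seed nodes (for example via the `node_idx` argument of `NeighborSampler`), re-drawn each epoch if desired.

```python
import random


def undersample_train_nodes(train_idx, labels, ratio=1.0, seed=None):
    """Keep all nodes of the smallest class and cap every other class.

    `ratio` is the number of nodes kept per class, expressed as a multiple
    of the smallest class size (values below 1 are treated as 1).
    """
    rng = random.Random(seed)
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)

    minority_size = min(len(nodes) for nodes in by_class.values())
    keep = max(minority_size, int(ratio * minority_size))

    sampled = []
    for nodes in by_class.values():
        # Small classes are kept whole; large ones are randomly cut down.
        sampled.extend(nodes if len(nodes) <= keep else rng.sample(nodes, keep))
    rng.shuffle(sampled)
    return sampled


# Hypothetical toy labels with a 400:1 imbalance.
labels = [0] * 1600 + [1] * 4
train_idx = list(range(len(labels)))
seeds = undersample_train_nodes(train_idx, labels, ratio=3, seed=0)
print(len(seeds), sum(labels[i] for i in seeds))  # 16 seeds, 4 of them positive
```

Because only the seed nodes are resampled, neighbors of any class can still appear in the sampled subgraphs, so no new nodes or edges have to be invented.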

Sadly, I do not have any better advice for you at this point in time.

2 replies

bsaghafi Jul 22, 2021
Author

When you mention over-sampling: methods like SMOTE give us the node attributes, but what about the edges? How do we come up with edges for the nodes newly added to the graph?

rusty1s Jul 22, 2021
Maintainer

I was referring to something like this: https://github.com/ufoym/imbalanced-dataset-sampler
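As far as I can tell, the linked sampler draws training examples with replacement, with probability inversely proportional to the frequency of their class, so no synthetic nodes (and hence no new edges) are created. A minimal stdlib sketch of that idea applied to training node indices is below; the function name and the toy data are assumptions, not code from the linked repository.

```python
import random
from collections import Counter


def oversample_train_nodes(train_idx, labels, num_samples=None, seed=None):
    """Draw node indices with replacement, weighting each by 1 / class count."""
    rng = random.Random(seed)
    counts = Counter(labels[i] for i in train_idx)
    weights = [1.0 / counts[labels[i]] for i in train_idx]
    k = len(train_idx) if num_samples is None else num_samples
    return rng.choices(train_idx, weights=weights, k=k)


# Hypothetical toy labels with the 400:1 ratio from the original question.
labels = [0] * 400 + [1]
train_idx = list(range(len(labels)))

seeds = oversample_train_nodes(train_idx, labels, num_samples=10_000, seed=0)
positive_fraction = sum(labels[i] for i in seeds) / len(seeds)
print(positive_fraction)  # roughly balanced, despite the 400:1 labels
```

The same minority node is repeated many times here, so this trades imbalance for a risk of overfitting to the few positive examples.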

Romain-Nicolle · 2023-04-03T14:18:56Z

This is an old topic, but I recently had this exact problem and I think I found a solution that I hope will help someone. The transform torch_geometric.transforms.RandomNodeSplit has arguments that let you choose the number of training samples for each class (https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.transforms.RandomNodeSplit.html?highlight=RandomNodeSplit#torch_geometric.transforms.RandomNodeSplit). I don't know when this transform was updated, so maybe these parameters didn't exist when this question was first asked.

With split = 'random' or split = 'test_rest', you can additionally set the num_train_per_class parameter. Obviously it's not perfect: if you have a really unbalanced dataset, you end up using only a small fraction of your dataset for training.

from torch_geometric.transforms import RandomNodeSplit

# s, v and t are placeholders: training nodes per class, validation size, test size
data = RandomNodeSplit(split='random', num_train_per_class=s, num_val=v, num_test=t)(data)

Here s, v and t are chosen such that s * c + v + t = total_number_of_nodes, where c is the number of classes.
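To make the size constraint and the "small fraction" caveat concrete, here is a rough sketch of the arithmetic only. It does not call PyG; `split_sizes` and the class counts are hypothetical, picked to match a 400:1 ratio.

```python
def split_sizes(class_counts, num_train_per_class, num_val, num_test):
    """Sizes implied by a per-class training split (illustrative arithmetic)."""
    total = sum(class_counts.values())
    # A class with fewer nodes than requested can only contribute what it has.
    num_train = sum(min(n, num_train_per_class) for n in class_counts.values())
    unused = total - num_train - num_val - num_test
    if unused < 0:
        raise ValueError("num_val + num_test exceed the nodes left after training")
    return {
        "train": num_train,
        "val": num_val,
        "test": num_test,
        "unused": unused,
        "train_fraction": num_train / total,
    }


# Hypothetical dataset: 400,000 negatives and 1,000 positives (400:1).
class_counts = {0: 400_000, 1: 1_000}
s, v, t = 600, 199_900, 199_900  # s * c + v + t == 401,000 with c == 2

sizes = split_sizes(class_counts, num_train_per_class=s, num_val=v, num_test=t)
print(sizes)
```

With these made-up numbers the training set is balanced, but it covers well under 1% of the nodes, which is the trade-off described above.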

Hope this can help someone. If anyone has managed to do this more effectively with a specific dataloader, I'd be interested to know how.
