Nan for loss in training #4641

JiaruiWang · 2022-05-14T05:04:17Z

JiaruiWang
May 14, 2022

Hello,

I am training a GraphSAGE model on a directed graph. The node features are the in-direction, out-direction, and undirected distances to a set of anchor nodes. In the directed graph, some nodes cannot reach (or be reached from) certain anchor nodes, so the respective in or out distances are float('inf').
During training, the loss is nan from the first epoch.
If I remove the in and out distance features, the loss is no longer nan, but these directed distances are important for me.
Is there a workaround that avoids the nan loss while keeping the information about unreachable nodes in the features?

Thank you very much

Padarn · 2022-05-14T08:31:49Z

Padarn
May 14, 2022
Collaborator

I didn't quite understand how you set your data up, could you provide an example?

1 reply

JiaruiWang May 14, 2022
Author

Sure.
Here are the 156 features for one target node. They are the distances from the target node to 52 anchor nodes. For each anchor node, the target node has 3 distances: the in-direction distance, the out-direction distance, and the undirected distance (treating all edges as undirected), so 52 * 3 = 156 distance features. In the in and out directions, some anchor nodes are not reachable from the target node. These unreachable distances are inf, and these inf values cause the loss to be nan.
[4., 4., 3., 5., inf, 4., 4., inf, 3., 5., inf, 4., 4., inf, 3., 5., inf, 3.,
5., inf, 3., 5., inf, 3., 4., 3., 3., 5., inf, 4., 5., 5., 4., 3., 4., 3.,
5., inf, 4., 5., 3., 3., 5., inf, 3., 4., inf, 3., 5., 4., 3., 5., inf, 4.,
5., inf, 3., 4., 4., 3., 5., inf, 3., 5., 4., 3., 5., 3., 3., 4., 4., 3.,
5., 4., 3., 4., inf, 4., 4., 4., 3., 5., 5., 3., 3., 4., 2., 5., 5., 4.,
5., inf, 4., 5., 4., 3., 6., inf, 3., 5., inf, 4., 3., 4., 3., 3., 4., 3.,
4., 4., 3., 3., 2., 2., 5., inf, 4., 4., 3., 3., 4., 4., 3., 5., 3., 3.,
5., 4., 3., 5., 5., 3., 4., 4., 3., 3., 4., 3., 5., inf, 4., 5., 4., 3.,
4., 3., 2., 4., 4., 3., 4., 3., 3., 5., inf, 3.]

If I use only undirected distance as the feature, all the anchor nodes will be reachable from the target node. There will be no inf in the 52 features. The loss is not nan anymore.
[4., 4., 3., 5., 3., 5., 4., 3., 5., 3., 3., 4., 4., 3.,
5., 4., 3., 4., 4., 4., 4., 3., 5., 5., 3., 3., 4., 2., 5., 5., 4.,
5., 4., 5., 4., 3., 6., 3., 5., 4., 3., 4., 3., 3., 4., 3.,
4., 4., 3., 3., 2., 2.]

I want to keep the in direction distance and out direction distance. Is it possible?
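For readers wondering why a few inf entries are enough to break training, here is a minimal sketch in plain Python. It assumes the first layer computes a weighted sum of the features, as any linear layer does; the `linear` helper and the weight values are made up for illustration and are not taken from the linked script.

```python
import math


def linear(features, weights, bias=0.0):
    """Weighted sum of the features, the core operation of a linear layer."""
    return sum(f * w for f, w in zip(features, weights)) + bias


# Two unreachable anchors (inf) and one reachable anchor at distance 3.
features = [math.inf, 3.0, math.inf]
# Hypothetical weights; freshly initialised weights usually have mixed signs.
weights = [0.5, 0.1, -0.5]

out = linear(features, weights)
# inf * 0.5 is inf, inf * -0.5 is -inf, and inf + (-inf) is nan.
print(math.isnan(out))

# A weight of exactly zero does not help either: inf * 0.0 is also nan.
print(math.isnan(math.inf * 0.0))
```

Once one activation is nan, it propagates through every later layer and into the loss, which matches seeing nan from the first epoch.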

Padarn · 2022-05-14T09:06:44Z

Padarn
May 14, 2022
Collaborator

Could you model your data as a graph, with edge features giving the distance to the anchor nodes? That way you could quite naturally exclude the unreachable nodes by excluding those connections from your graph. If that's not realistic, then you probably need to share the model architecture.


2 replies

JiaruiWang May 14, 2022
Author

My task is a node classification problem. I don't understand how this would help.
Here is my code.
https://github.com/JiaruiWang/country_gnn/blob/master/us_pyg_node_classification_sage_sparseT.py

JiaruiWang May 14, 2022
Author

By the way, the model's accuracy does not get better than
Train: 0.0244, Val: 0.0532, Test: 0.0519
It stops improving at this point.
That's too bad. What did I do wrong?

Padarn · 2022-05-14T09:46:34Z

Padarn
May 14, 2022
Collaborator

Okay, I understand the problem you describe now that I see your model. If you really want to continue with this model, the easiest thing to do would be to use a very large number rather than inf. You could also filter out any inf values before you calculate the loss. What I was suggesting is that your anchor nodes can also be considered part of the graph: instead of building features in x, you can use the graph structure of your problem.
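A rough sketch of the "finite number instead of inf" idea in torch. The choice of replacement value is an assumption made for this example: it uses the largest finite distance plus one rather than an arbitrarily huge constant, since a huge constant could dominate the other features. The tensor values and the names `reachable` and `x_clean` are illustrative only.

```python
import torch

# Toy feature matrix: 2 nodes, 3 anchor distances each; inf marks "unreachable".
x = torch.tensor([[4.0, float("inf"), 3.0],
                  [float("inf"), 5.0, 2.0]])

reachable = torch.isfinite(x)        # True where a path to the anchor exists
max_finite = x[reachable].max()      # largest real distance in the data

# Replace every inf with a value just beyond the largest real distance.
x_clean = torch.where(reachable, x, max_finite + 1.0)

print(x_clean)
print(torch.isfinite(x_clean).all())
```

If the distinction between "far" and "unreachable" matters, the boolean `reachable` mask could be concatenated to the features as extra 0/1 columns, at the cost of a wider input.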


0 replies

JiaruiWang · 2022-05-14T09:52:01Z

JiaruiWang
May 14, 2022
Author

I don’t understand what you mean by using the graph structure of the problem. All the anchor nodes are in the graph.

0 replies

Padarn · 2022-05-14T10:00:14Z

Padarn
May 14, 2022
Collaborator

Sorry, I must have misunderstood what you're trying to model here. It would make more sense to me to use edge attributes or weight to include the distance information and use a model that uses these features in the message passing, but perhaps this doesn't make sense for you.

If you want to solve your initial problem, I'd suggest filtering out the inf values from your loss before calling backward(), otherwise your gradients will be huge.

5 replies

JiaruiWang May 14, 2022
Author

The task is U.S. Facebook page classification. In my dataset, nodes are U.S. Facebook pages and edges are "like" relationships between two pages: if page A likes page B, the edge is A ----> B. I also have geolocation information for each page, namely which state the page is in. The state is the label y. I want to train a classifier that predicts the state label of a page from the graph and the training labels.
What model do you suggest? I know little about GNNs.
P-GNN seems a reasonable choice for this clustering/classification problem, so I adopted its concept of distances to anchor nodes.

Padarn May 15, 2022
Collaborator

Sorry I'm not super familiar with that model - it does sound reasonable.

In the paper they suggest

Since we aim to map nodes that are close in the network to similar embeddings, we further transform the distance ... to a (0, 1) range

In torch, the transformation they use would be

x = 1/(d+1)

which would transform all your infinite distances to 0.
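A small runnable version of that transformation, as a sketch. The distance values are made up; only the formula x = 1/(d+1) comes from the thread.

```python
import torch

# Shortest-path distances from one node to four anchors.
# 0.0 means the node is itself the anchor; inf means the anchor is unreachable.
d = torch.tensor([4.0, 0.0, float("inf"), 2.0])

# Closer anchors get values near 1, far anchors get values near 0,
# and an unreachable anchor maps to exactly 0 because 1 / inf == 0.
x = 1.0 / (d + 1.0)

print(x)
print(torch.isfinite(x).all())
```

Because every output is finite and bounded, no extra clipping or masking is needed before the features go into the model.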

JiaruiWang May 16, 2022
Author

Thank you very much for the transformation suggestion.

It would make more sense to me to use edge attributes or weight to include the distance information and use a model that uses these features in the message passing.

I also want to explore your idea on this if you think it's more reasonable. Could you explain more about your idea and how to do it? Is there a similar architecture that I can learn from?

Thank you so much

Padarn May 16, 2022
Collaborator

After understanding your problem better, I don't think my suggestion makes much sense. I thought your anchor nodes were specific locations you had geographic information about, not nodes sampled from the input graph (as I now understand, having read the P-GNN paper).

I think you could still do as I suggested by just adding the distance between geo-locations as an edge feature instead of using the distance encoding, but this loses the actual positional encoding feature you are trying to build, so I would not suggest it.

JiaruiWang May 20, 2022
Author

Thank you very much. x = 1/(d+1) works!

JiaruiWang · 2022-05-20T04:37:56Z

JiaruiWang
May 20, 2022
Author

However, I have run into another problem. The label distribution of my data is very imbalanced. There are 52 label classes and 6,000,000 nodes in the dataset. Most classes account for less than 1.5% of the data each, and the largest class accounts for 15%.

If I randomly split the dataset into 90% train, 5% validation, and 5% test, the model classifies all the nodes into the largest class.
If I build the training set from 20,000 nodes per class, about 1,000,000 nodes in total, the model also classifies all the nodes into one class, but which class it picks is random.

Do you have any suggestions? Is this underfitting?

6 replies

JiaruiWang May 20, 2022
Author

Thank you for your reply. Yes, this paper is related to the first observation, that the model classifies all the nodes into the largest label class.
Is there a reasonable guess for the second observation? Why does the model classify all the nodes into a random class?

Padarn May 20, 2022
Collaborator

My first guess would be that your model cannot distinguish between the classes given the features/graph available - if there is no predictive relationship between features and label then any random label will be the best you can achieve.

JiaruiWang May 20, 2022
Author

Node2vec or random walk can generate node embeddings. Do you think these node embeddings are good features for the nodes?

Padarn May 20, 2022
Collaborator

Actually, I'm not sure, sorry - I guess your results would indicate no 😓

rusty1s May 21, 2022
Maintainer

You can also take a look at https://pytorch-geometric.readthedocs.io/en/latest/modules/loader.html#torch_geometric.loader.ImbalancedSampler.
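To illustrate the idea behind class-balanced sampling without depending on PyG, here is a sketch using `torch.utils.data.WeightedRandomSampler` from plain PyTorch. It is not the `ImbalancedSampler` API itself; the toy labels and the inverse-frequency weighting are assumptions chosen for this example.

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Toy labels: class 0 dominates, classes 1 and 2 are rare.
y = torch.tensor([0] * 90 + [1] * 7 + [2] * 3)

class_count = torch.bincount(y)                # number of nodes per class
sample_weight = 1.0 / class_count[y].float()   # rare classes get larger weights

# Draw node indices with replacement, proportionally to their weight.
sampler = WeightedRandomSampler(sample_weight, num_samples=3000, replacement=True)
drawn = y[torch.tensor(list(sampler))]

# Fraction of draws per class; each should be roughly one third.
freq = torch.bincount(drawn, minlength=3).float() / len(drawn)
print(freq)
```

With weights like these, each mini-batch sees the classes about equally often, so the model is no longer rewarded for always predicting the largest class. It does not help if the features carry no signal about the label, which is the other possibility raised above.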

Padarn · 2022-10-11T09:11:10Z

Padarn
Oct 11, 2022
Collaborator

I see, then why not use edge attribute features instead of using node features?


0 replies
