Advice on normalisation #3583
-
Hi! I am seeking advice from experts here on the best strategy for the following problem. I have a series of graphs, each with a real-valued property y, and I have built a model to predict y for each graph. The graphs can have an arbitrary number of nodes and edges, the nodes can belong to different types identified by their initial feature vectors, and each edge has a single feature that depends on the Euclidean distance between the linked nodes.

My plan is to apply a number of graph convolutional layers and then feed the resulting node embeddings into a fully-connected neural network to fit y for each graph in the database. I guess this is all fairly standard. Node and edge features are normalised within [-1, 1].

My question is: should one also normalise y so that its values are contained within some interval, as the node and edge features are? Apparently this is always desirable in conventional neural networks, but after a few trials, normalising y in my model above seems to bring no advantage in terms of ease of convergence or the model's ability to fit the data. Does anyone have advice on the best way to proceed in the case of graph NNs? Thanks in advance.
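For concreteness, here is a minimal sketch of what I mean by normalising y (the tensor and function names are just for illustration; in practice the targets come from my dataset):

```python
import torch

# Hypothetical per-graph targets; in practice these come from the dataset.
y = torch.tensor([12.3, -4.7, 98.1, 0.5])

# Min-max scale y into [-1, 1], mirroring the node/edge feature normalisation.
y_min, y_max = y.min(), y.max()
y_scaled = 2.0 * (y - y_min) / (y_max - y_min) - 1.0

def unscale(y_pred_scaled):
    # Inverse transform, so predictions can be reported in the original units.
    return (y_pred_scaled + 1.0) * (y_max - y_min) / 2.0 + y_min
```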
-
This is super hard to tell. From my experience, normalizing labels can improve performance, but it is never the deciding factor in whether your model is able to fit the data or not. However, normalizing the targets has the advantage that you can use a final non-linearity, e.g., `sigmoid`, to push model outputs into the desired interval. Since it looks like your model is not able to fit the data regardless of normalization, there might be other reasons for this. Feel free to post your architecture and your task so I can take a look :)
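For illustration, a minimal sketch of what such a readout head could look like (assuming a pooled graph embedding of size `hidden_dim`; the names are placeholders, not your actual architecture):

```python
import torch
from torch import nn

hidden_dim = 64  # assumed size of the pooled graph embedding

# Readout head ending in a bounded non-linearity.
head = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, 1),
    nn.Sigmoid(),  # outputs in (0, 1); use nn.Tanh() if y is scaled to [-1, 1]
)

graph_embedding = torch.randn(8, hidden_dim)  # dummy batch of 8 graph embeddings
y_pred = head(graph_embedding).squeeze(-1)    # predictions bounded to the target interval
```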