Hi, I am currently thinking about how backward propagation works when neighbor sampling is used. For a large-scale dataset, the training process typically looks like this example: (https://github.com/pyg-team/pytorch_geometric/blob/master/examples/ogbn_products_sage.py)
We compute the loss using only the target nodes.
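To make the question concrete, here is a minimal pure-Python sketch of that mini-batch pattern: each batch holds a set of seed (target) nodes plus their sampled neighbors, with the seed nodes first, and the loss is computed over the first `batch_size` outputs only. All names here are illustrative stand-ins, not PyG's real API.

```python
def train_step(model_forward, batch):
    # batch["x"]: features of seeds + sampled neighbors (seeds come first)
    out = model_forward(batch["x"])           # predictions for every sampled node
    seed_out = out[: batch["batch_size"]]     # keep seed-node predictions only
    seed_y = batch["y"][: batch["batch_size"]]
    # squared-error loss over the seed nodes only; neighbor labels are unused
    return sum((o - t) ** 2 for o, t in zip(seed_out, seed_y)) / batch["batch_size"]

if __name__ == "__main__":
    # one seed node (first entry) plus two sampled neighbors
    batch = {"x": [1.0, 2.0, 3.0], "y": [1.0, 0.0, 0.0], "batch_size": 1}
    identity = lambda xs: xs                  # stand-in for a GNN forward pass
    print(train_step(identity, batch))        # loss on the single seed node: 0.0
```

In the real example, the first `batch_size` nodes of a sampled subgraph are the seed nodes by convention, which is why slicing the output like this selects exactly the nodes whose labels enter the loss.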
Yes, because the model aggregates node features from neighbor nodes to predict the seed nodes' labels. However, the neighbor nodes' labels don't contribute to updating the model's parameters, since they are not used to compute the loss. This doc page may be a good reference: https://pytorch-geometric.readthedocs.io/en/latest/tutorial/neighbor_loader.html
Hey. While the loss is computed on the seed nodes only, all nodes in the graph that contribute to the final representations of the seed nodes will receive a gradient and are therefore used to update the model parameters.
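This point can be checked with a toy one-parameter "GNN" and analytic gradients: the seed node's prediction is `w * mean(h_seed, h_neighbor)`, and the squared-error loss is computed on the seed node only. Even so, the gradient of the loss with respect to `w` depends on the neighbor's feature, so sampled neighbors do influence the weight update. This is a hypothetical minimal model for illustration, not PyG's actual `SAGEConv`.

```python
def seed_prediction(w, h_seed, h_neighbor):
    # mean aggregation over the seed node and its single sampled neighbor
    return w * (h_seed + h_neighbor) / 2.0

def loss(w, h_seed, h_neighbor, y_seed):
    # squared-error loss computed on the seed node only
    return (seed_prediction(w, h_seed, h_neighbor) - y_seed) ** 2

def grad_w(w, h_seed, h_neighbor, y_seed):
    # analytic dL/dw = 2 * (pred - y_seed) * mean(h_seed, h_neighbor)
    agg = (h_seed + h_neighbor) / 2.0
    return 2.0 * (w * agg - y_seed) * agg

if __name__ == "__main__":
    w, h_seed, y_seed = 1.0, 2.0, 1.0
    # changing only the neighbor's feature changes the gradient on w:
    g_a = grad_w(w, h_seed, h_neighbor=0.0, y_seed=y_seed)
    g_b = grad_w(w, h_seed, h_neighbor=4.0, y_seed=y_seed)
    print(g_a, g_b)  # 0.0 vs 12.0: neighbors affect the parameter update
```

The neighbor's *label* never appears anywhere, matching the earlier reply, yet its *feature* shows up in `grad_w`, matching the correction here.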