Trying to understand how nodeNeighborhood loader works #4216

akul-goyal · 2022-03-07T20:49:33Z

Hi,

After reading the code and implementing the loader myself, I am still a little confused about how it works. If I want to use the loader with a 2-layer GAT that uses all of the neighbors, I would pass [-1, -1] as the num_neighbors parameter. My understanding is that, for each sampled node, this creates a two-hop subgraph that starts from the sampled node and moves up to its parents. As such, all the information needed to build each sampled node's representation is propagated down to it. Moreover, if we ran the optimizer step only once after a full pass through the loader, the update should be very similar to the one obtained by optimizing over the entire graph.

If this is how the loader works, then for GAT each batch from the loader should contain all the nodes in the overall graph, not just the nodes related to that batch. Looking through the code, it seems that each batch only considers the sampled nodes and their 2-hop neighborhood. Doesn't GAT need all the nodes in the overall graph to learn the correct weight matrix? When I compare a GAT trained with the node neighborhood loader against a GAT trained over the entire graph at once, the one that considers the entire graph performs much better. Even increasing the batch size does not improve the performance of the GAT trained with the node neighborhood loader. Most of the node neighborhood loader examples in the PyTorch Geometric GitHub repository use SAGEConv when building the network. Looking through the code for SAGEConv, there doesn't seem to be anything special that would need to be added to the GAT layer to make it work with the loader. However, given the negative results, I am beginning to think I don't understand how the loader is implemented.

rusty1s · 2022-03-08T09:30:13Z

Maintainer

You are right that the implementation does not change much when swapping out SAGEConv for GATConv. Here is an example of (the old) NeighborSampler utilizing GATConv.

The NeighborLoader takes in the complete graph (and its features), the number of neighbors to sample for each layer, and a set of input/seed nodes from which we want to start sampling subgraphs. If num_neighbors=[-1, -1], it will sample the complete two-hop subgraph around each node in the current mini-batch. The obtained embeddings fully match the ones obtained in full-batch mode (we actually test against this), so the performance should be identical as well (except for the variance introduced by stochastic gradient descent). If you have a simple example that reproduces the discrepancy between NeighborLoader and full-batch training, please let us know.
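As a rough illustration of why full-neighborhood sampling can reproduce the full-batch result, here is a minimal pure-Python sketch. It does not use NeighborLoader or any PyG code: `sample_full_neighborhood`, `mean_layer`, and `run_two_layers` are hypothetical helpers, the six-node graph is made up, and a toy mean aggregation stands in for a real GNN layer.

```python
def sample_full_neighborhood(in_nbrs, seeds, num_hops):
    """Collect all nodes and edges within num_hops incoming hops of the seeds."""
    nodes = set(seeds)
    frontier = set(seeds)
    edges = set()
    for _ in range(num_hops):
        next_frontier = set()
        for dst in frontier:
            for src in in_nbrs.get(dst, ()):
                edges.add((src, dst))
                if src not in nodes:
                    nodes.add(src)
                    next_frontier.add(src)
        frontier = next_frontier
    return nodes, edges


def mean_layer(h, nodes, edges):
    """Toy layer: each node averages its own value with its in-neighbors' values."""
    gathered = {v: [h[v]] for v in nodes}
    for src, dst in sorted(edges):
        gathered[dst].append(h[src])
    return {v: sum(vals) / len(vals) for v, vals in gathered.items()}


def run_two_layers(h, nodes, edges):
    """Apply the toy layer twice, like a 2-layer model."""
    return mean_layer(mean_layer(h, nodes, edges), nodes, edges)


# Made-up directed graph: in_nbrs[dst] lists the sources of edges into dst.
in_nbrs = {1: [0], 2: [1, 3], 3: [4], 4: [5], 5: [0]}
features = {v: float(v) for v in range(6)}

# "Full-batch": run both layers on the whole graph.
all_nodes = set(features)
all_edges = {(src, dst) for dst, srcs in in_nbrs.items() for src in srcs}
full = run_two_layers(features, all_nodes, all_edges)

# "Mini-batch": run both layers on the full 2-hop subgraph around seed node 2.
sub_nodes, sub_edges = sample_full_neighborhood(in_nbrs, seeds=[2], num_hops=2)
sub = run_two_layers(features, sub_nodes, sub_edges)

print(full[2], sub[2])  # the seed node's value agrees in both modes
```

In this sketch only the seed node is guaranteed to match. A non-seed node such as node 4 is missing part of its own neighborhood inside the subgraph, so its value differs from the full-graph one, which is why mini-batch code typically keeps only the seed rows of each batch.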

Note that GAT does not need all nodes to learn the correct weight matrix. Attention is computed across the neighborhood (and not all nodes), while parameters are shared for each node.
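To make this point concrete, here is a small pure-Python sketch of GAT-style attention on scalar features. It is not GATConv itself: `attention_weights` is a hypothetical helper, the parameter values `w`, `a_src`, and `a_dst` are arbitrary, and self-loops are left out. The sketch only shows that the attention coefficients of a node depend on its own neighborhood and on parameters shared by every node.

```python
import math


def leaky_relu(z, negative_slope=0.2):
    return z if z > 0 else negative_slope * z


def attention_weights(h, in_nbrs, i, w, a_src, a_dst):
    """Softmax attention of node i over its in-neighbors (scalar features)."""
    scores = {
        j: leaky_relu(a_src * w * h[j] + a_dst * w * h[i])
        for j in in_nbrs[i]
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - top) for j, s in scores.items()}
    norm = sum(exps.values())
    return {j: e / norm for j, e in exps.items()}


# Shared parameters (arbitrary values chosen for illustration).
w, a_src, a_dst = 0.5, 1.0, -0.3

# Attention for node 2 computed with the whole (made-up) graph available.
h_full = {v: float(v) for v in range(6)}
nbrs_full = {1: [0], 2: [1, 3], 3: [4], 4: [5], 5: [0]}
alpha_full = attention_weights(h_full, nbrs_full, 2, w, a_src, a_dst)

# Attention for node 2 computed with only its neighborhood available.
h_local = {1: 1.0, 2: 2.0, 3: 3.0}
nbrs_local = {2: [1, 3]}
alpha_local = attention_weights(h_local, nbrs_local, 2, w, a_src, a_dst)

print(alpha_full == alpha_local)
```

The same `w`, `a_src`, and `a_dst` would be used for every other node as well, so gradients from any mini-batch update the same parameters that full-batch training would update.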

2 replies

akul-goyal Mar 10, 2022
Author

Hi Rusty, I looked through the code you sent me, and I am a little confused about where I am making a mistake in my own implementation. I have included my model implementation below. It doesn't seem to perform as well as the model trained in full-batch mode. I took a lot of inspiration from this example when writing the forward and inference functions:


```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import GRUCell, Linear
from torch_geometric.nn import GATConv

# numHeads and device are assumed to be defined elsewhere in the script.


class Net(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, fi=None, proc=None):
        super().__init__()
        self.GAT = GATConv(hidden_channels, hidden_channels, add_self_loops=False, dropout=0.8, negative_slope=0.01, heads=numHeads)
        self.GAT2 = GATConv(hidden_channels, hidden_channels, add_self_loops=False, negative_slope=0.01, dropout=0.8, heads=numHeads)
        self.GAT3 = GATConv(hidden_channels, hidden_channels, add_self_loops=False, negative_slope=0.01, dropout=0.6, heads=1)
        self.GRU = GRUCell(hidden_channels * numHeads, hidden_channels)
        self.GRU3 = GRUCell(hidden_channels * numHeads, hidden_channels)
        self.GRU2 = GRUCell(hidden_channels, hidden_channels)
        self.lin1 = Linear(in_channels, hidden_channels)
        self.lin2 = Linear(hidden_channels, out_channels)
        self.allLin = Linear(hidden_channels, out_channels)

    def forward(self, x, edge_index, getWeights=False, getEmbeddings=False, training=False, classProb=False):
        temp_x = None
        # Pass flags by keyword so they do not land in the root_index slot.
        x, weights = self.firstHop(x, edge_index, training=training)
        x, weights = self.secondHop(x, edge_index, training=training, getWeights=getWeights)
        # final layer
        if getEmbeddings:
            temp_x = x.clone()
        if classProb:
            x = self.allLin(x)
        return x, weights, temp_x

    def firstHop(self, x, edge_index, root_index=None, training=False, getWeights=False):
        # 1st hop
        x = nn.ReLU()(self.lin1(x))
        h = F.leaky_relu(self.GAT(x, edge_index))
        x = nn.ReLU()(self.GRU(h, x))
        return x, None

    def secondHop(self, x, edge_index, root_index=None, training=False, getWeights=False):
        # 2nd hop
        weights = None
        if getWeights:
            h, weights = self.GAT(x, edge_index, return_attention_weights=True)
            h = F.leaky_relu(h)
        else:
            h = F.leaky_relu(self.GAT(x, edge_index))
        x = nn.ReLU()(self.GRU(h, x))
        return x, weights

    def iter(self, x_all, loader, rLoader, hop_func):
        # Apply one hop to every batch and keep only the seed-node rows.
        xs = []
        iter_loader = iter(loader)
        data = next(iter_loader, None)
        while data is not None:
            x = x_all[data.n_id.to(x_all.device)].to(device)
            x, _ = hop_func(x, data.edge_index.to(device))
            xs.append(x[:data.batch_size].cpu())
            data = next(iter_loader, None)
        return torch.cat(xs, dim=0)

    def inference(self, x_all, loader, rLoader):
        # Layer-wise inference: finish hop 1 for all nodes before starting hop 2.
        x_all = self.iter(x_all, loader, rLoader, self.firstHop)
        x_all = self.iter(x_all, loader, rLoader, self.secondHop)
        return x_all
```
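The `inference` method above follows a layer-wise pattern: it finishes one hop for every node, batch by batch, and only then starts the next hop on the stored outputs. A minimal pure-Python sketch of that pattern, under loud assumptions, is shown here. `one_hop_batches`, `mean_layer`, and `layerwise_inference` are hypothetical stand-ins (not PyG APIs), the graph is made up, and a toy mean aggregation replaces the GAT and GRU layers.

```python
def one_hop_batches(in_nbrs, num_nodes, batch_size):
    """Yield (n_id, edges, num_seeds): seeds come first in n_id, edges use local ids."""
    for start in range(0, num_nodes, batch_size):
        seeds = list(range(start, min(start + batch_size, num_nodes)))
        n_id = list(seeds)
        local = {v: i for i, v in enumerate(n_id)}
        edges = []
        for dst in seeds:
            for src in in_nbrs.get(dst, ()):
                if src not in local:
                    local[src] = len(n_id)
                    n_id.append(src)
                edges.append((local[src], local[dst]))
        yield n_id, edges, len(seeds)


def mean_layer(x, edges):
    """Toy layer: each node averages its own value with its in-neighbors' values."""
    total = list(x)
    count = [1] * len(x)
    for src, dst in edges:
        total[dst] += x[src]
        count[dst] += 1
    return [t / c for t, c in zip(total, count)]


def layerwise_inference(x_all, in_nbrs, num_layers, batch_size):
    """Compute one layer for all nodes in batches, then feed the result to the next."""
    for _ in range(num_layers):
        out = []
        for n_id, edges, num_seeds in one_hop_batches(in_nbrs, len(x_all), batch_size):
            x = [x_all[v] for v in n_id]               # gather inputs for this batch
            out.extend(mean_layer(x, edges)[:num_seeds])  # keep only the seed rows
        x_all = out
    return x_all


# Made-up directed graph: in_nbrs[dst] lists the sources of edges into dst.
in_nbrs = {1: [0], 2: [1, 3], 3: [4], 4: [5], 5: [0]}
features = [float(v) for v in range(6)]

# Full-batch reference: two layers over every edge at once.
all_edges = [(src, dst) for dst, srcs in in_nbrs.items() for src in srcs]
full = mean_layer(mean_layer(features, all_edges), all_edges)

batched = layerwise_inference(features, in_nbrs, num_layers=2, batch_size=2)
print(all(abs(a - b) < 1e-12 for a, b in zip(full, batched)))
```

Because each pass only needs one hop of neighbors, this pattern avoids building a multi-hop subgraph per batch at inference time. It relies on the loader visiting the nodes in order and without shuffling, so that the concatenated outputs line up with node ids.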

rusty1s Mar 10, 2022
Maintainer

Thank you. I think this mostly looks good (except that you are using GAT and GRU in secondHop rather than GAT2 and GRU2). Can you share the script that shows the discrepancy between full-batch and mini-batch training as well?
