GAT: Softmax should be on one edge type or all edge types? #8364
edward94587
started this conversation in
General
Replies: 1 comment
-
Eq. 2 of the paper suggests that the softmax is applied over all edge types. However, inspecting the implementation shows that the softmax is applied to only a single edge type at a time.
-
This is for a heterogeneous node classification problem. According to eq. 2 of the paper by Veličković et al., "Graph Attention Networks", attention coefficients are obtained by applying a softmax over all of node i's neighbors.
https://arxiv.org/pdf/1710.10903.pdf
Because the paper does not restrict eq. 2's softmax normalization to a single edge type, it is logical to infer that it covers all edge types. This makes sense, since the softmax normalizes the attention scores before they are used as weights to update node i's representation in eq. 4.
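As a minimal sketch of eq. 2's normalization (the raw scores and neighbor assignments are hypothetical, chosen only to illustrate a single softmax over all of node i's neighbors regardless of edge type):

```python
import math

def softmax(scores):
    """Normalize a list of raw attention scores, as in eq. 2 of the GAT paper."""
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores e_ij for node i's neighbors across BOTH edge types:
# neighbors j1 and j2 reach i via edge type e1; neighbor j3 via edge type e2.
scores_all = [0.5, 1.0, 2.0]
alpha_all = softmax(scores_all)           # ONE normalization over all neighbors
print(alpha_all)                          # coefficients sum to 1 across all of i's edges
```

Under this reading, every attention coefficient entering node i competes in a single softmax, so the coefficients are directly comparable across edge types.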
However, the code in gat_conv.py normalizes attention coefficients for each edge type individually, not across all edge types at once. This is implemented in line 274 of gat_conv.py:
# https://github.com/pyg-team/pytorch_geometric/blob/master/torch_geometric/nn/conv/gat_conv.py
where the input parameter "edge_index" in line 274 is the edge list for only one edge type.
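A small sketch of this per-edge-type normalization (the scores are illustrative; the helper mimics a softmax restricted to the edges passed in one call, not PyG's actual implementation):

```python
import math

def softmax(scores):
    """Normalize a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Because "edge_index" covers one relation at a time, each call normalizes
# only that relation's scores (illustrative values for node i):
scores_e1 = [0.5, 1.0]          # neighbors of i via edge type e1
scores_e2 = [2.0]               # the single neighbor of i via edge type e2

alpha_e1 = softmax(scores_e1)   # sums to 1 within e1
alpha_e2 = softmax(scores_e2)   # sums to 1 within e2: a lone edge gets alpha = 1.0

print(sum(alpha_e1) + sum(alpha_e2))   # total mass over i's incoming edges is 2, not 1
```

Note in particular that any relation with a single incoming edge contributes a coefficient of exactly 1.0, regardless of its raw score.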
What makes matters worse is that line 285 adds a bias for each edge type, as explained below.
Take, for example, a node i with 2 edge types (e1 and e2) sinking into it.
For edge type e1, line 274 (and line 285) above is executed because GATConv.forward() is called from HeteroConv.forward() in line 158 of file hetero_conv.py. Here, the "out" of line 277 above is node i's updated representation for edge type e1, BUT it uses attention coefficients 'alpha' from line 274, which were normalized using edge type e1 only.
For edge type e2, the same execution takes place, and another "out" tensor from line 277 is computed for node i's updated representation for edge type e2, again using attention coefficients normalized over edge type e2 only. A bias is also added, as in the case of e1.
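A minimal numeric sketch of these two per-relation passes (scalar features and hand-picked coefficients are illustrative, not PyG's actual tensors or variable names):

```python
x_j1, x_j2 = 1.0, 3.0   # neighbor features reaching i via edge type e1 (scalars for brevity)
x_j3 = 5.0              # neighbor feature reaching i via edge type e2
bias = 0.1              # the same bias added in each GATConv call

# Pass for e1: 'alpha' is normalized over e1's edges only, then bias is added.
alpha_e1 = [0.4, 0.6]                                    # sums to 1 within e1
out_e1 = alpha_e1[0] * x_j1 + alpha_e1[1] * x_j2 + bias

# Pass for e2: a lone edge gets alpha = 1.0 after its own softmax,
# and the bias is added a second time.
out_e2 = 1.0 * x_j3 + bias

print(out_e1, out_e2)   # two separately normalized outputs, each carrying a bias term
```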
Line 166 is where these 2 'out' tensors are accumulated. This raises the following issues:
A node representation obtained by accumulating 2 'out' tensors, each generated with its own separately normalized weights (the 2-edge-type case), is on a very different scale from a representation generated with a single set of normalized weights (the 1-edge-type case). How do you compare these two?
A node with more edge types sinking into it undergoes more parallel shifts of its representation vector, because the same bias is added once per edge type.
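Both issues can be made concrete with a small sketch comparing two nodes (all coefficients are illustrative; the bias shift is modeled as a scalar for brevity):

```python
bias = 0.1

# Node A: one incoming edge type with two edges. Its coefficients sum to 1,
# and the bias is added once (one GATConv pass).
alpha_A = [0.3, 0.7]
weight_mass_A = sum(alpha_A)
bias_shift_A = 1 * bias

# Node B: two incoming edge types. Each relation's coefficients sum to 1 on
# their own, so the total mass is 2, and the bias is added twice (two passes).
alpha_B_e1 = [0.5, 0.5]
alpha_B_e2 = [1.0]
weight_mass_B = sum(alpha_B_e1) + sum(alpha_B_e2)
bias_shift_B = 2 * bias

print(weight_mass_A, weight_mass_B)   # different total attention mass per node
print(bias_shift_A, bias_shift_B)     # bias shift grows with the number of edge types
```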