-
Hi, in the `MultiheadAttentionBlock` class, the position-wise feed-forward network consists of a single linear transformation with a ReLU activation:
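i.e. roughly this pattern (a minimal sketch of what I mean, not the verbatim PyG source; class and parameter names are just for illustration):

```python
import torch

# Sketch: the residual branch applies a single Linear followed by ReLU,
# with no hidden expansion and no second projection.
class SingleLinearFFN(torch.nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.lin = torch.nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.lin(x).relu()
```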
whereas in almost any other implementation it consists of two linear transformations with a ReLU activation in between (based on the paper Attention Is All You Need):
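i.e. something along these lines (a sketch of the FFN as described in the paper; the residual connection and LayerNorm sit around it in the encoder sublayer, and the names are just for illustration):

```python
import torch

# Sketch of the standard position-wise FFN:
# FFN(x) = max(0, x @ W1 + b1) @ W2 + b2, with an expanded hidden size d_ff.
class PositionwiseFFN(torch.nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.ReLU(),  # GELU is the more common choice nowadays
            torch.nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```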
(and nowadays, GELU is standard). What is the rationale behind this unusual implementation?
-
This is mostly taken from the official implementation; see https://github.com/davidbuterez/gnn-neural-readouts/blob/main/code/models/set_transformer_modules.py#L8.
-
I see.
-
Related to this topic: in the implementation you mentioned, we found one formulation of the PMA, but in PyG we have a different one. Why is that?
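For reference, the Set Transformer paper defines pooling by multihead attention as

$$\mathrm{PMA}_k(Z) = \mathrm{MAB}(S, \mathrm{rFF}(Z)), \qquad \mathrm{MAB}(X, Y) = \mathrm{LayerNorm}(H + \mathrm{rFF}(H)), \quad H = \mathrm{LayerNorm}(X + \mathrm{Multihead}(X, Y, Y)),$$

where $S$ denotes the $k$ learnable seed vectors and $\mathrm{rFF}$ is a row-wise feed-forward layer.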
I see.
This seems strange to me, but this readout is entirely based on "Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks" by Lee et al., whose original implementation also uses a single linear layer followed by a ReLU.
Thanks for the prompt answer.