-
Hi, in the `MultiheadAttentionBlock` class, the position-wise feed-forward network consists of a single linear transformation with a ReLU activation:
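i.e. roughly this pattern (a minimal sketch of what I mean, not the verbatim PyG source; class and parameter names are just for illustration):

```python
import torch

# Sketch: the residual branch applies a single Linear followed by ReLU,
# with no hidden expansion and no second projection.
class SingleLinearFFN(torch.nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.lin = torch.nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.lin(x).relu()
```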
whereas in almost any other implementation it consists of two linear transformations with a ReLU activation in between (based on the paper Attention Is All You Need):
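i.e. something along these lines (a sketch of the FFN as described in the paper; the residual connection and LayerNorm sit around it in the encoder sublayer, and the names are just for illustration):

```python
import torch

# Sketch of the standard position-wise FFN:
# FFN(x) = max(0, x @ W1 + b1) @ W2 + b2, with an expanded hidden size d_ff.
class PositionwiseFFN(torch.nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.ReLU(),  # GELU is the more common choice nowadays
            torch.nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```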
(and nowadays, GELU is standard). What is the rationale behind this unusual implementation?
-
This is mostly taken from the official implementation; see https://github.com/davidbuterez/gnn-neural-readouts/blob/main/code/models/set_transformer_modules.py#L8.
-
I see.
-
Related to this topic: in the implementation you mentioned, we found one formulation of the PMA, but in PyG we have a different one. Why is that?
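For reference, the Set Transformer paper defines pooling by multihead attention as

$$\mathrm{PMA}_k(Z) = \mathrm{MAB}(S, \mathrm{rFF}(Z)), \qquad \mathrm{MAB}(X, Y) = \mathrm{LayerNorm}(H + \mathrm{rFF}(H)), \quad H = \mathrm{LayerNorm}(X + \mathrm{Multihead}(X, Y, Y)),$$

where $S$ denotes the $k$ learnable seed vectors and $\mathrm{rFF}$ is a row-wise feed-forward layer.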
I see.
This seems strange to me, but this readout is entirely based on "Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks" by Lee et al., whose original implementation also uses a single linear layer followed by a ReLU.
Thanks for the prompt answer.