self attention for feature fusion over graphs #3432
Replies: 2 comments 1 reply
-
Interesting problem. I'm not aware of any work that tackles this problem, but I think your solution makes a lot of sense. Keep in mind that you will need some kind of permutation-invariant aggregation of sampled modalities (e.g., via Transformer, simple mean aggregation), and cannot use torch.cat since the modalities and their order might change across different samples. Your approach should work fine, but it is quite cumbersome to implement as you are dealing with modalities of potentially different dimensionality. I suggest utilizing a different linear layer for each modality and mapping them into a unified embedding space for all query, key, and value embeddings.
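A minimal sketch of that suggestion in PyTorch (the dimensions and variable names below are made up for illustration, not part of any PyG API):

```python
import torch
from torch import nn

# Hypothetical modality dimensionalities m_1, m_2, m_3 and a shared space.
modality_dims = [32, 64, 128]
hidden_dim = 64

# One linear layer per modality maps everything into the same space.
projections = nn.ModuleList([nn.Linear(m, hidden_dim) for m in modality_dims])

num_nodes = 10
features = [torch.randn(num_nodes, m) for m in modality_dims]

# Stack into [num_nodes, num_modalities, hidden_dim]; the query, key, and
# value projections of an attention layer can then act on a common space.
tokens = torch.stack([proj(x) for proj, x in zip(projections, features)], dim=1)
print(tokens.shape)  # torch.Size([10, 3, 64])
```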
-
Hi Matthias
Thank you for the insights. You are correct about the dimension aspect: passing each modality through a linear layer with the same output dimension seems like the easiest approach. How does this guarantee that they are in the same embedding space, however? This only guarantees that they are of the same dimension, correct?
In my case, each node is guaranteed to have all modalities, and I can ensure that they are shown to the network in the same order each time by construction.
I can also encode some sort of positional embedding using the first $k$ eigenfunctions of the graph Laplacian. I'm not sure if it would be better to learn the embeddings or use the precomputed eigenfunctions... I've only seen two papers on spectral graph transformers thus far.
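If I went the precomputed route, I imagine something like the following (a rough sketch using SciPy; the function name and defaults are my own, not from any library):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def laplacian_positional_encoding(adj: sp.spmatrix, k: int) -> np.ndarray:
    """First k non-trivial eigenvectors of the symmetric normalized Laplacian."""
    deg = np.asarray(adj.sum(axis=1)).flatten()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1e-12)))  # guard isolated nodes
    lap = sp.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt
    # Smallest eigenvalues first; drop the trivial (constant) eigenvector.
    vals, vecs = eigsh(lap, k=k + 1, which='SA')
    order = np.argsort(vals)
    return vecs[:, order[1 : k + 1]]  # shape [num_nodes, k]
```

One caveat I've seen in the spectral graph transformer papers: eigenvectors are only defined up to sign, so the precomputed embeddings are often sign-flipped at random during training.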
k
-
Hi
This is a repost from here, as I'm hoping someone in this forum might have some additional expertise in feature fusion.
Is it possible to perform feature fusion using a self-attention mechanism? I have data distributed over a graph, and multiple feature modalities sampled per node that I would like to "optimally" combine for a node classification problem. Let's say that I have a graph $G=(V,E)$ with $N$ nodes. For each node, we sample $M$ different modalities, such that each node $v$ is characterized by $M$ feature vectors $\in \mathbb{R}^{m_{1}}, \mathbb{R}^{m_{2}}, \ldots, \mathbb{R}^{m_{M}}$. For simplicity, assume $m_{1} = m_{2} = \ldots = m_{M} = D$. I'm interested in "fusing" these features in a more intelligent way than simply concatenating them and passing them through a linear layer or MLP.
Assuming each node $v$ is characterized by a feature matrix $x_{i} \in \mathbb{R}^{M \times D}$ (where $M$ is the number of modalities, and $D$ the input dimension), a transformer approach over modalities would yield something like this:
$$Q = x_{i}W_{q}, \quad K = x_{i}W_{k}, \quad V = x_{i}W_{v}$$
where $W_{q,k,v} \in \mathbb{R}^{D \times p}$. We then have that
$$Z = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{p}}\right)V$$
where $Z \in \mathbb{R}^{M \times p}$ and $Z_{i}$ is the fusion of modalities with respect to modality $i$ (I think -- correct me if I'm wrong, please). I could then sum over the rows of $Z$ to compute the final combination. Does this seem correct? I haven't found many papers on using self-attention for feature fusion, so any help is much appreciated.
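Concretely, I picture the formulation above looking something like this in PyTorch (a rough sketch; the module name and dimensions are just for illustration):

```python
import math
import torch
from torch import nn

class ModalityFusion(nn.Module):
    """Self-attention over the M modality tokens of each node."""
    def __init__(self, dim: int, proj_dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, proj_dim, bias=False)
        self.w_k = nn.Linear(dim, proj_dim, bias=False)
        self.w_v = nn.Linear(dim, proj_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_nodes, M, D]
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        z = attn @ v         # [num_nodes, M, p] -- one fused row per modality
        return z.sum(dim=1)  # sum over modalities -> [num_nodes, p]

fusion = ModalityFusion(dim=64, proj_dim=32)
x = torch.randn(10, 4, 64)  # 10 nodes, 4 modalities, D = 64
print(fusion(x).shape)      # torch.Size([10, 32])
```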
k