Repository for the implementation of the Halve Transformer
A self-attention layer in transformers produces an attention matrix that is used to weight the token sequence. If the attention matrix is halved along its rows, the generated sequence is halved as well. Of course, one wants to halve the attention matrix intelligently. Recall that the attention matrix is often mostly diagonal, at least in the early layers of the transformer, so why not leverage this fact and simply average its rows 2 by 2?
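As a toy illustration of what averaging rows 2 by 2 means (a sketch of the operation on a small tensor, not the repository's code), pairwise-averaging the rows of a 4x4 attention matrix produces a 2x4 matrix whose rows still sum to 1:

```python
import torch

# A toy 4x4 "attention matrix": one row of weights per query token.
attn = torch.tensor([[0.7, 0.1, 0.1, 0.1],
                     [0.2, 0.6, 0.1, 0.1],
                     [0.1, 0.1, 0.6, 0.2],
                     [0.1, 0.1, 0.2, 0.6]])

# Group consecutive rows in pairs, then average each pair:
# (4, 4) -> (2, 2, 4) -> mean over the pair dimension -> (2, 4).
halved = attn.view(2, 2, 4).mean(dim=1)
print(halved)
# tensor([[0.4500, 0.3500, 0.1000, 0.1000],
#         [0.1000, 0.1000, 0.4000, 0.4000]])
```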
The implementation of the Halve Transformer is based on halving the attention matrix by averaging its rows 2 by 2. The full trick can be implemented with one additional line of code in the multi-head attention layer:
```python
if self.halve:
    # (B, L, D) -> (B, L//2, 2, D) -> average consecutive token pairs -> (B, L//2, D)
    attention_output = attention_output.view(attention_output.size(0), attention_output.size(1) // 2, -1, attention_output.size(2)).mean(2)
```

Of course, you can also use other pooling operations instead of the mean.
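For context, here is a minimal sketch of how such a flag could sit inside a self-attention module, assuming a standard PyTorch `nn.MultiheadAttention` and an even sequence length; the class name `HalvingSelfAttention` and the constructor argument `halve` are illustrative, not necessarily the repository's API:

```python
import torch
import torch.nn as nn

class HalvingSelfAttention(nn.Module):
    # Illustrative wrapper: standard self-attention followed by the optional
    # pairwise averaging that halves the sequence length.
    def __init__(self, embed_dim: int, num_heads: int, halve: bool = True):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.halve = halve

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D); plain self-attention preserves the shape.
        attention_output, _ = self.attn(x, x, x)
        if self.halve:
            b, l, d = attention_output.shape  # assumes L is even
            # Average consecutive token pairs: (B, L, D) -> (B, L//2, D).
            attention_output = attention_output.view(b, l // 2, 2, d).mean(2)
        return attention_output

x = torch.randn(8, 128, 64)                  # 8 sequences of length 128, model dim 64
print(HalvingSelfAttention(64, 4)(x).shape)  # torch.Size([8, 64, 64])
```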
If you want to reproduce these experiments, it should suffice to run:
```sh
make -f makefile.mk all
```

This simple trick can be used to reduce the number of tokens in the sequence by a factor of 2 (or more, if the halving is applied in several layers).
As the results show, the HalveTransformer performs on par with the Vanilla Transformer while reducing memory usage considerably. Because the attention matrix grows quadratically with sequence length, you should expect even better savings for longer sequences. On the other hand, the additional reshape-and-average operation in self-attention adds a slight runtime overhead.
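A back-of-the-envelope calculation (illustrative arithmetic only, not measured numbers from these experiments) shows why the absolute savings grow with sequence length: each head's attention score matrix has L x L entries, so every layer that runs on a halved sequence works with a matrix four times smaller.

```python
# Illustrative arithmetic only, not measured memory from the experiments:
# entries in one head's L x L attention score matrix before and after halving.
for L in (128, 512, 2048):
    full, halved = L * L, (L // 2) * (L // 2)
    print(f"L={L:5d}  full={full:8d}  halved={halved:8d}  saved={full - halved:8d}")
```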
