Replies: 1 comment 1 reply
---
Thanks for the feedback! If I understand correctly, when removing the output projection layer, the results from using 1 large head (via the `CausalAttention` class) and multiple smaller heads (via `MultiHeadAttention`) come out similar. I think this could be because we use a very simple dataset and short training here. In practice, on a larger scale, I don't think that this will be true. I.e., I think the AI answer is incorrect. That's because a single head with the same overall dimensionality is not equivalent to multi-head: the single head has just one Q/K/V projection over the full dimension, while multi-head effectively has a separate Q/K/V projection per head, each with its own attention (softmax) computation.
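To make this concrete, here is a minimal sketch (my own simplified code, not the book's `CausalAttention`/`MultiHeadAttention` classes; the causal mask and dropout are omitted, and all names are illustrative). Both setups share the same Q/K/V weights and use no `out_proj` at all; the per-head split with a separate softmax per head is already enough to produce different context vectors than one large head of the same total dimensionality:

```python
import torch

torch.manual_seed(123)

b, num_tokens, d_in, d_out, num_heads = 1, 6, 8, 8, 2
x = torch.randn(b, num_tokens, d_in)

# One shared set of Q/K/V projections used by both setups below
W_q = torch.nn.Linear(d_in, d_out, bias=False)
W_k = torch.nn.Linear(d_in, d_out, bias=False)
W_v = torch.nn.Linear(d_in, d_out, bias=False)
q, k, v = W_q(x), W_k(x), W_v(x)

# (a) One large head: a single softmax over scores from all d_out dimensions
scores_single = q @ k.transpose(1, 2) / d_out**0.5
context_single = torch.softmax(scores_single, dim=-1) @ v

# (b) Multi-head: the same q/k/v split into per-head slices,
#     with a separate softmax for each head
head_dim = d_out // num_heads
qh = q.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
kh = k.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
vh = v.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
scores_multi = qh @ kh.transpose(2, 3) / head_dim**0.5
context_multi = torch.softmax(scores_multi, dim=-1) @ vh
context_multi = context_multi.transpose(1, 2).reshape(b, num_tokens, d_out)

# The two context tensors generally differ, even though no out_proj is involved
print(torch.allclose(context_single, context_multi, atol=1e-5))  # expected: False
```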
---
In the book I'm reading, I see the following words on page:

> Additionally, we added an output projection layer (`self.out_proj`) to `MultiHeadAttention` after combining the heads, which is not present in the `CausalAttention` class. This output projection layer is not strictly necessary (see appendix B for more details), but it is commonly used in many LLM architectures, which is why I added it here for completeness.
I tried using the same `d_out`, setting dropout to 0.0, and commenting out `self.out_proj`, and it turns out that `MultiHeadAttention` and `CausalAttention` generate similar results. I asked an AI, which told me:

> Multi-head attention ≈ a way to allow the model to learn multiple types of relationships in parallel and recombine them. Without the out_proj, multi-head is not much more expressive than a single head with the same dimensionality, which is why your outputs look nearly identical.
If what the AI says is right, then `self.out_proj` should be stressed differently, i.e., as more than an optional layer added just for completeness.
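For reference, here is a minimal sketch of what `self.out_proj` adds structurally (my own illustrative code, not the book's implementation; the random tensor stands in for the per-head context vectors that the attention step would normally produce). Without the projection, each slice of the concatenated output comes from exactly one head; the `d_out x d_out` linear layer makes every output dimension a learned mix of all heads:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

b, num_tokens, num_heads, head_dim = 1, 4, 2, 3
d_out = num_heads * head_dim

# Stand-in for the per-head context vectors computed by multi-head attention
head_contexts = torch.randn(b, num_tokens, num_heads, head_dim)

# Concatenating the heads: dims 0..2 come only from head 0, dims 3..5 only from head 1
concatenated = head_contexts.reshape(b, num_tokens, d_out)

# out_proj mixes information across heads: every output dimension becomes a
# learned linear combination of all d_out input dimensions (i.e., both heads)
out_proj = nn.Linear(d_out, d_out)
mixed = out_proj(concatenated)

print(concatenated.shape, mixed.shape)  # both torch.Size([1, 4, 6])
```

On a tiny dataset with short training, this extra mixing may not visibly change the outputs, but that does not necessarily mean the layer (or multi-head itself) adds nothing at scale.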