Replies: 1 comment 1 reply
---
Thanks for the feedback! If I understand correctly, when removing the output projection layer, the results from using 1 large head (via the `CausalAttention` class) and multiple smaller heads (via `MultiHeadAttention`) come out similar. I think this could be because we use a very simple dataset and short training here. In practice, on a larger scale, I don't think that this will be true. I.e., I think the AI answer is incorrect. That's because a single head with the same overall dimensionality is not equivalent to multi-head: the single head has just one Q/K/V projection over the full dimension, while multi-head effectively has a separate Q/K/V projection per head, each with its own attention (softmax) computation.
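To make this concrete, here is a minimal sketch (my own simplified code, not the book's `CausalAttention`/`MultiHeadAttention` classes; the causal mask and dropout are omitted, and all names are illustrative). Both setups share the same Q/K/V weights and use no `out_proj` at all; the per-head split with a separate softmax per head is already enough to produce different context vectors than one large head of the same total dimensionality:

```python
import torch

torch.manual_seed(123)

b, num_tokens, d_in, d_out, num_heads = 1, 6, 8, 8, 2
x = torch.randn(b, num_tokens, d_in)

# One shared set of Q/K/V projections used by both setups below
W_q = torch.nn.Linear(d_in, d_out, bias=False)
W_k = torch.nn.Linear(d_in, d_out, bias=False)
W_v = torch.nn.Linear(d_in, d_out, bias=False)
q, k, v = W_q(x), W_k(x), W_v(x)

# (a) One large head: a single softmax over scores from all d_out dimensions
scores_single = q @ k.transpose(1, 2) / d_out**0.5
context_single = torch.softmax(scores_single, dim=-1) @ v

# (b) Multi-head: the same q/k/v split into per-head slices,
#     with a separate softmax for each head
head_dim = d_out // num_heads
qh = q.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
kh = k.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
vh = v.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
scores_multi = qh @ kh.transpose(2, 3) / head_dim**0.5
context_multi = torch.softmax(scores_multi, dim=-1) @ vh
context_multi = context_multi.transpose(1, 2).reshape(b, num_tokens, d_out)

# The two context tensors generally differ, even though no out_proj is involved
print(torch.allclose(context_single, context_multi, atol=1e-5))  # expected: False
```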
---
In the book I'm reading, I see the following words on page:

> Additionally, we added an output projection layer (`self.out_proj`) to `MultiHeadAttention` after combining the heads, which is not present in the `CausalAttention` class. This output projection layer is not strictly necessary (see appendix B for more details), but it is commonly used in many LLM architectures, which is why I added it here for completeness.
I tried using the same `d_out`, setting dropout to 0.0, and commenting out `self.out_proj`, and it turns out that `MultiHeadAttention` and `CausalAttention` generate similar results. I asked an AI, which told me:

> Multi-head attention ≈ a way to allow the model to learn multiple types of relationships in parallel and recombine them. Without the out_proj, multi-head is not much more expressive than a single head with the same dimensionality, which is why your outputs look nearly identical.
If what the AI says is right, then `self.out_proj` should be stressed differently, i.e., as more than an optional layer added just for completeness.
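For reference, here is a minimal sketch of what `self.out_proj` adds structurally (my own illustrative code, not the book's implementation; the random tensor stands in for the per-head context vectors that the attention step would normally produce). Without the projection, each slice of the concatenated output comes from exactly one head; the `d_out x d_out` linear layer makes every output dimension a learned mix of all heads:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

b, num_tokens, num_heads, head_dim = 1, 4, 2, 3
d_out = num_heads * head_dim

# Stand-in for the per-head context vectors computed by multi-head attention
head_contexts = torch.randn(b, num_tokens, num_heads, head_dim)

# Concatenating the heads: dims 0..2 come only from head 0, dims 3..5 only from head 1
concatenated = head_contexts.reshape(b, num_tokens, d_out)

# out_proj mixes information across heads: every output dimension becomes a
# learned linear combination of all d_out input dimensions (i.e., both heads)
out_proj = nn.Linear(d_out, d_out)
mixed = out_proj(concatenated)

print(concatenated.shape, mixed.shape)  # both torch.Size([1, 4, 6])
```

On a tiny dataset with short training, this extra mixing may not visibly change the outputs, but that does not necessarily mean the layer (or multi-head itself) adds nothing at scale.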