While reading the source code of Whisper, I noticed that models of different sizes all have a set of attention heads specifically designed for alignment. I visualized the weight distributions of these attention heads during decoding with matplotlib, and found that they all exhibit good monotonic alignment properties. I'm wondering how researchers at OpenAI achieved this property for a specific set of attention heads.
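For context, here is a minimal sketch of the kind of visualization described above, assuming the openai-whisper package and an illustrative input file `audio.wav`. In that codebase, each decoder block's `cross_attn` module returns `(output, qk)`, so a forward hook can capture the raw attention logits (the same pattern `whisper/timing.py` uses); the layer/head indices below are arbitrary.

```python
import matplotlib.pyplot as plt
import torch
import whisper

try:
    # Newer releases route attention through SDPA and return qk=None
    # unless SDPA is disabled for the forward pass.
    from whisper.model import disable_sdpa
except ImportError:
    from contextlib import nullcontext as disable_sdpa

model = whisper.load_model("base")

# Encode 30 seconds of audio and decode it once to get a token sequence.
audio = whisper.pad_or_trim(whisper.load_audio("audio.wav"))  # illustrative file
mel = whisper.log_mel_spectrogram(audio).to(model.device)
result = whisper.decode(model, mel, whisper.DecodingOptions(without_timestamps=True))

# Re-run the decoder over the full predicted sequence, capturing the
# pre-softmax QK matrix from every cross-attention layer via hooks.
qks = [None] * len(model.decoder.blocks)
hooks = [
    block.cross_attn.register_forward_hook(
        lambda _, ins, outs, i=i: qks.__setitem__(i, outs[-1])
    )
    for i, block in enumerate(model.decoder.blocks)
]
tokens = torch.tensor(result.tokens, device=model.device).unsqueeze(0)
with torch.no_grad(), disable_sdpa():
    model.decoder(tokens, model.encoder(mel.unsqueeze(0)))
for hook in hooks:
    hook.remove()

# A monotonically aligned head shows a roughly diagonal band of attention
# mass over the audio frames.
layer, head = 3, 2  # illustrative indices
weights = torch.softmax(qks[layer][0, head], dim=-1)
plt.imshow(weights.cpu(), aspect="auto", origin="lower", interpolation="none")
plt.xlabel("audio frames")
plt.ylabel("decoded tokens")
plt.show()
```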
+1 this would be super cool to know!
Hi! The heads were not specifically designed or constrained to be monotonically aligned, but some heads in the cross-attention layers naturally learned to have attention weights matching the time alignment. The …
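The released code ships a hand-annotated set of such heads per model size (the `_ALIGNMENT_HEADS` masks loaded in `whisper/__init__.py`), and word-level timestamps run dynamic time warping over exactly those heads' attention weights (see `whisper/timing.py`). A minimal sketch of exercising that path, assuming a recent openai-whisper release and an illustrative `audio.wav`:

```python
import whisper

model = whisper.load_model("base")

# word_timestamps=True aligns tokens to audio frames using the annotated
# alignment heads' cross-attention weights plus dynamic time warping.
result = model.transcribe("audio.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['start']:6.2f}s -> {word['end']:6.2f}s  {word['word']}")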