I'm not super academic about ML, so take this with a grain of salt, but broadly speaking, the Whisper model has basically the same structure as an image-to-text model that looks at a picture and comes up with a description of it. Instead of looking at pictures, it looks at a 30s audio spectrogram.
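A minimal sketch of what that "30s of spectrogram" looks like in practice, assuming the openai-whisper package and a hypothetical local file "audio.wav" (not from the original post): the 30-second window is fixed at feature-extraction time, before the model ever sees the audio.

```python
import whisper

model = whisper.load_model("base")

audio = whisper.load_audio("audio.wav")   # 16 kHz mono waveform
audio = whisper.pad_or_trim(audio)        # pad or cut to exactly 30 s
mel = whisper.log_mel_spectrogram(audio).to(model.device)
print(mel.shape)                          # (80, 3000): 80 mel bins x 30 s of frames
```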

The encoder layers turn the spectrogram into representations of the semantically important content, and the decoder then generates tokens by cross-attending to those representations, keeping track of prior context (the prompt) and the parts that were already "transcribed/translated" (the prefix). This is why we are limited to 30s of audio. Too short, and you'd lack surrounding context and cut sentences more often. A lot of sentenc…
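As a rough sketch of how the prompt/prefix distinction surfaces in the decoding API (the example strings are made up, and this assumes the `mel` tensor from the snippet above):

```python
options = whisper.DecodingOptions(
    language="en",
    prompt="text decoded from the previous 30 s window",  # prior context; conditions but is not re-emitted
    prefix="Partial transcript of this window",           # already-decoded text for this window; kept in the output
)
result = whisper.decode(model, mel, options)
print(result.text)
```

The prompt is how context carries across consecutive 30s windows even though each window is encoded independently.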
