Even though Whisper transcribes in 30 s chunks, are the vector embeddings and attention available to later chunks? #2325
I don't fully understand this concept, so I'm asking for clarification.
Whisper transcribes audio in 30 s chunks. Are the vector embeddings and attention from earlier chunks still available when it processes later chunks?
Take an example:
Chunk 1: "The bank manager told me to sign the papers at the branch. Later, when I returned..."
Chunk 2: "...to the branch, I noticed that the teller was gone."
Chunk 1 clearly sets the context: the embedding for "branch" is shaped by the earlier word "bank".
Chunk 2 may not know whether "branch" refers to a tree, a bank, or a river unless attention over the earlier chunk is still active.
The reason I ask: will transcription quality differ between (a) splitting the audio into 30 s chunks externally (say, for a stream) and transcribing each chunk separately, and (b) passing the full audio and letting Whisper chunk it into 30 s windows? As I understand it, the first case resets embeddings and attention for every chunk. Even if I pass chunks with some overlap, say 5 s, only the overlapping part carries over, not context from a chunk 5 minutes earlier.
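To make the first case concrete, here is a minimal sketch of the external chunking I have in mind. The function name `split_with_overlap` and its parameters are hypothetical, made up for illustration and not part of Whisper. It only slices raw samples into 30 s windows with a 5 s overlap and does not call any model.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000  # Whisper works on 16 kHz audio


def split_with_overlap(samples: Sequence[float],
                       window_s: float = 30.0,
                       overlap_s: float = 5.0,
                       sample_rate: int = SAMPLE_RATE) -> List[Sequence[float]]:
    """Hypothetical helper: cut audio into fixed windows that overlap.

    Each window starts (window_s - overlap_s) seconds after the previous one,
    so consecutive windows share only the last `overlap_s` seconds.
    """
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")

    chunks = []
    start = 0
    while start < len(samples):
        chunks.append(samples[start:start + window])
        if start + window >= len(samples):
            break  # this window already reached the end of the audio
        start += step
    return chunks
```

With this kind of splitting, each chunk's input contains the last 5 s of the previous chunk and nothing older, which is why I suspect context such as "bank" from several chunks back would be lost if every chunk is transcribed independently.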