How does Whisper avoid splitting a word in the middle when it chunks audio into 30-second segments?
The term "segment" is unfortunately ambiguous in the source code, so I'll use "window" to refer to the 30-second sliding window, and "segment" to refer to a chunk of the transcript bounded by timestamps. From memory, this is how I recall it working. Let's say you have something like this:
Whisper can see that by the end of the first window, the 3rd segment is incomplete because we can't see its end timestamp. So it can rewind to the end timestamp of the 2nd segment, thereby shortening the first window, and then start the second window from that exact point:
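A rough sketch of that rewind step, not Whisper's actual code. The `Segment` class, `next_window_start`, and the example timestamps are all made up for illustration, and it assumes an incomplete segment is represented by a missing end timestamp:

```python
from dataclasses import dataclass
from typing import List, Optional

WINDOW_SECONDS = 30.0  # length of the sliding window


@dataclass
class Segment:
    """Hypothetical transcript segment; times are relative to the window start."""
    start: float
    end: Optional[float]  # None when the end timestamp fell outside the window
    text: str


def next_window_start(window_start: float, segments: List[Segment]) -> float:
    """Return the absolute time (seconds) where the next window should begin."""
    complete = [s for s in segments if s.end is not None]
    if not complete:
        # Not even the first segment finished inside the window,
        # so there is nothing to rewind to: advance by a full window.
        return window_start + WINDOW_SECONDS
    # Rewind to the end of the last complete segment, which shortens this window.
    return window_start + complete[-1].end


# Illustrative timestamps only: the 3rd segment has no end timestamp.
segments = [
    Segment(0.0, 8.0, "first segment"),
    Segment(8.0, 19.5, "second segment"),
    Segment(19.5, None, "third segment, cut off by the window"),
]
print(next_window_start(0.0, segments))  # 19.5
```

The second window then covers audio from 19.5 s onward, so the third segment is decoded again from its beginning rather than from wherever the first window happened to end.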
This avoids cutting off a word in the middle of a segment. Enabling word timestamps can make this process more accurate. It's not perfect, though: it's still possible that even the first segment doesn't fit within the first window, so Whisper will have to cut it off, perhaps mid-word.

In that situation, the output of the previous window is used as the prompt for the next window, and the language model will usually make a reasonable prediction that smooths the transition. If a word was split down the middle and both windows picked up enough of it to transcribe it, the language model is unlikely to repeat the word when the repetition would "sound" unnatural. However, it's also possible that, because the word landed on a boundary, neither window picked it up clearly and the word gets mistranscribed.
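To make the prompting part concrete, here is a minimal sketch of a loop that carries the transcript so far into the next window. `decode_window` is a hypothetical stand-in for the model, and passing the whole transcript as the prompt is a simplifying assumption, so treat this as an illustration of the idea rather than how Whisper implements it:

```python
from typing import Callable, List, Tuple

# Hypothetical decoder: takes (window_start_seconds, prompt_text) and
# returns (transcribed_text, seconds_of_audio_consumed).
DecodeFn = Callable[[float, str], Tuple[str, float]]


def transcribe_with_prompt(duration: float, decode_window: DecodeFn) -> List[str]:
    """Decode window by window, feeding earlier output back in as the prompt."""
    pieces: List[str] = []
    start = 0.0
    while start < duration:
        prompt = " ".join(pieces)  # text produced by the previous windows
        text, consumed = decode_window(start, prompt)
        pieces.append(text)
        # Advance by what was consumed; fall back to a full 30 s step
        # so the loop always makes progress.
        start += consumed if consumed > 0 else 30.0
    return pieces
```

Because each call sees what was already transcribed, a decoder that behaves like a language model has the context it needs to avoid repeating a word that straddled the boundary.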
Hello, I've read the discussion and found it very useful, but how can I apply it when creating my own dataset for fine-tuning Whisper?