Whisper chunking #1977

Zeiny96 · 2024-01-23T21:30:58Z

Zeiny96
Jan 23, 2024

How does Whisper avoid splitting a word in the middle when it chunks audio into 30-second segments?


ryanheise · 2024-01-24T08:20:19Z

ryanheise
Jan 24, 2024

The terminology of "segment" is unfortunately ambiguous in the source code, so I'll use "window" to refer to the 30 second sliding window, and "segment" to refer to a chunk of the transcript bounded by timestamps.

From memory, this is how I recall it working. Let's say you have something like this:

|           window           |           window           |
|segment|-----segment---|--segment--|

Whisper can see that by the end of the first window, the 3rd segment is incomplete because we can't see its end timestamp. So it can rewind to the end timestamp of the 2nd segment, thereby shortening the first window, and then start the second window from that exact point:

|           window      |           window           |
|segment|-----segment---|--segment--|--segment--|

This avoids cutting off a word in the middle of a segment. Enabling word timestamps can help this process to be more accurate.

But it's still possible that even the first segment doesn't fit within the first window, so Whisper will have to cut it off, perhaps mid-word. It's not perfect. However, in this situation, the output of the previous window is used as the prompt for the next window, and the language model will make a reasonable prediction to usually smooth the transition. So if the previous window happened to pick up enough of the word to output it in the transcript, but then the next window also happened to pick up enough of the word because it was split down the middle, then it's unlikely that the language model would repeat the same word if it "sounds" unnatural. However, it's also possible that because the word landed on a boundary, that neither window picked it up clearly and the word gets mistranscribed.

11 replies

ryanheise Jan 25, 2024

@Jain-Archit understood, sorry for misunderstanding you there. It often becomes confusing when a question or answer is given based on a separate implementation outside of this repository without being explicit about it (see, for example, #1969 (comment)).

@zeiny-96 The answer is, it depends. If Whisper can see that there is definitely nothing after the end timestamp within this 30 second window, then it will start the next window at 30. So consider the following example:

|           window           |           window           |
|segment|-----segment---|       |--segment--|

In this case, Whisper will not truncate the first window because it can see that there is nothing after the second segment within that first 30 second window. So the second window will still start at 30. (To be clear, this is how the Whisper in "this" repository behaves.)
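As a rough illustration of the two diagrams so far, here is a minimal Python sketch that works purely on segment boundaries in seconds. It is not the code from transcribe.py, which works on timestamp tokens in the model output rather than on a known list of segments. The function name `window_starts` and the `(start, end)` tuples are made up for this example.

```python
def window_starts(segments, audio_seconds, window=30.0):
    """Toy model: return where each window would start.

    segments: list of (start, end) times in seconds, sorted by start.
    This is an illustrative simplification, not Whisper's actual code.
    """
    starts = []
    seek = 0.0
    while seek < audio_seconds:
        starts.append(seek)
        window_end = seek + window
        # Segments that begin at or after `seek` and end inside this window.
        complete = [(s, e) for s, e in segments if s >= seek and e <= window_end]
        # Is there a segment that begins in this window but ends after it?
        unfinished = any(s >= seek and s < window_end < e for s, e in segments)
        if unfinished and complete:
            # Rewind: the next window starts at the end of the last complete segment.
            seek = complete[-1][1]
        else:
            # Nothing is cut off (or nothing fits at all): advance a full window.
            seek = window_end
    return starts


# Third segment straddles the 30 s boundary, so the first window is shortened.
print(window_starts([(0.0, 8.0), (8.0, 24.0), (24.0, 36.0), (36.0, 48.0)], 60.0))
# Silence around the 30 s boundary, so the second window still starts at 30.
print(window_starts([(0.0, 8.0), (8.0, 24.0), (33.0, 45.0)], 60.0))
```

The first call corresponds to the rewinding diagram and the second to the diagram with a gap before the last segment.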

Zeiny96 Jan 25, 2024
Author

@ryanheise

Got it. So, to be clear, Whisper can decide whether or not to truncate the window and start the next chunk a bit earlier than the default 30 seconds. It does not truncate when the remaining period has no new segments, or when the existing segment ends exactly at the 30 second boundary.
Thanks for your detailed answer and effort. This part was really confusing me: when I checked the source code, as I mentioned, I didn't find any specific algorithm for it, so I couldn't understand how it was handled. Now I get that it was done in the training of the model itself, not as a separate algorithm.

ryanheise Jan 25, 2024

Right, if you take a look through the source file transcribe.py and just read through the comments, you will see comments explaining the logic, such as:

                if single_timestamp_ending:
                    # single timestamp at the end means no speech after the last timestamp.
                    seek += segment_size
                else:
                    # otherwise, ignore the unfinished segment and seek to the last timestamp
                    last_timestamp_pos = (
                        tokens[last_slice - 1].item() - tokenizer.timestamp_begin
                    )
                    seek += last_timestamp_pos * input_stride

That's just for one small part of it. But by this stage in the pipeline, we already have the output of the model which is a sequence of tokens of the transcript, with some timestamp tokens also appearing within that sequence. If I'm being precise, you might visualise my last example more like this:

|           window           |           window           |
|segment||-----segment---|       |--segment--|

All I did was add an extra vertical bar after the first segment. So every segment has one vertical bar before it and one vertical bar after it, representing a timestamp token. When you see two consecutive timestamp tokens, that means one segment has ended and another segment begins right after it. So the logic in the code is saying that since the last segment in the window ends with a single timestamp token, that tells you that there's no additional segment starting right after it and that's how Whisper knows it can skip over it and start the next window at 30. Compare that to the following scenario:

|           window           |           window           |
|segment||-----segment---||--segment--|

Now the last timestamp visible within the first window is a double timestamp, i.e. two consecutive timestamp tokens. This is how Whisper decides that there's an unfinished segment, so it rewinds back to the end of the previous segment.
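To make the single versus double timestamp rule concrete, here is a small self-contained sketch that uses plain Python lists instead of tensors. It only mirrors the branch quoted above and is not a copy of transcribe.py. The token id `TIMESTAMP_BEGIN = 1000`, the helper `ts`, and the function `seconds_to_advance` are hypothetical names chosen for this example, and the 0.02 second timestamp step is an assumption stated in the code.

```python
TIMESTAMP_BEGIN = 1000   # hypothetical id of the first timestamp token (0.00 s)
TIME_PRECISION = 0.02    # assumed seconds per timestamp step
WINDOW_SECONDS = 30.0


def ts(seconds):
    """Toy helper: encode a time in seconds as a timestamp token id."""
    return TIMESTAMP_BEGIN + round(seconds / TIME_PRECISION)


def seconds_to_advance(tokens):
    """Return how far to move the window start, given one window's tokens."""
    flags = [t >= TIMESTAMP_BEGIN for t in tokens]
    # A text token followed by exactly one timestamp at the very end.
    single_timestamp_ending = flags[-2:] == [False, True]
    # Index just after the first token of every consecutive timestamp pair.
    consecutive = [i + 1 for i in range(len(tokens) - 1) if flags[i] and flags[i + 1]]
    if not consecutive or single_timestamp_ending:
        # No speech after the last timestamp: skip the whole window.
        return WINDOW_SECONDS
    # Otherwise ignore the unfinished segment and seek to the last timestamp pair.
    last_slice = consecutive[-1]
    return (tokens[last_slice - 1] - TIMESTAMP_BEGIN) * TIME_PRECISION


# Ids below TIMESTAMP_BEGIN stand in for ordinary text tokens.
finished = [ts(0.0), 1, 2, ts(5.0), ts(5.0), 3, 4, ts(20.0)]
unfinished = [ts(0.0), 1, 2, ts(5.0), ts(5.0), 3, 4, ts(20.0), ts(20.0), 5]
print(seconds_to_advance(finished))    # full window
print(seconds_to_advance(unfinished))  # rewinds to the last timestamp pair
```

In the first list the output ends with a single timestamp, so the next window starts a full 30 seconds later. In the second list a new segment has started after the double timestamp at 20 seconds but never ends, so the next window starts from that point instead.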

krypton08rises Aug 26, 2024

"But it's still possible that even the first segment doesn't fit within the first window, so Whisper will have to cut it off, perhaps mid-word. It's not perfect. However, in this situation, the output of the previous window is used as the prompt for the next window, and the language model will make a reasonable prediction to usually smooth the transition. "
Correct me if I'm wrong, but my understanding is that Whisper only operates on 30 second windows with no other input to help prediction. Can you cite a source where it says Whisper uses previous speech to predict future speech (in the case of a shorter chunk) instead of simply padding that chunk?

yoadsn Nov 22, 2024

Look here at how the prompt for the decode gets part of the tokens from the previously decoded segments:
https://github.com/openai/whisper/blob/main/whisper/transcribe.py#L286-L291

And here is how the index is reset to ignore previous tokens, either when the temperature used to infer the previous segment is too high (meaning an unreliable outcome, because the fallback was triggered too many times to arrive at the result) or when the option to use the previous prompt is disabled:
https://github.com/openai/whisper/blob/main/whisper/transcribe.py#L501-L503

hth
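For readers who don't want to follow the links, the prompt bookkeeping described above can be sketched roughly like this. It is a simplified illustration and not the code from transcribe.py. The function `prompts_per_window`, its input format, and the `reset_temperature=0.5` default are assumptions for this example; `condition_on_previous_text` is named after the option of the same name in `transcribe()`.

```python
def prompts_per_window(window_results, condition_on_previous_text=True,
                       reset_temperature=0.5):
    """Toy model of how earlier output becomes the prompt for later windows.

    window_results: list of (tokens, temperature) pairs, one per window,
    standing in for what the decoder would have returned.
    Returns the prompt tokens that each window would have been given.
    """
    all_tokens = []          # everything transcribed so far
    prompt_reset_since = 0   # earlier tokens than this index are never reused
    prompts = []
    for tokens, temperature in window_results:
        # The prompt is whatever was transcribed since the last reset.
        prompts.append(list(all_tokens[prompt_reset_since:]))
        all_tokens.extend(tokens)
        if not condition_on_previous_text or temperature > reset_temperature:
            # Unreliable (or disabled) context: don't feed it to later windows.
            prompt_reset_since = len(all_tokens)
    return prompts


results = [([1, 2], 0.0), ([3], 0.8), ([4, 5], 0.0), ([6], 0.0)]
print(prompts_per_window(results))
print(prompts_per_window(results, condition_on_previous_text=False))
```

In this toy run the second window was decoded at a high temperature, so the third window gets an empty prompt, and the fourth window only sees the tokens produced after the reset.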

omarabb315 · 2024-02-17T15:07:25Z

omarabb315
Feb 17, 2024

Hello, I've read the discussion and it is very useful, but how can I benefit from it when creating my own dataset for fine-tuning Whisper?
I mean, while preparing the 30-second chunks of data, when should I include a partial final segment and when should I truncate the audio without it?

0 replies
