Reason for 30s audio length #1118
-
Hello, I understand that Whisper can only access 30s of audio content at a time. What is the reason behind that? Is it because anything longer than 30s is harder to train on? I would have assumed a window larger than 30 seconds is feasible, since GPUs can speed up the process. Assuming one were to retrain the model from scratch, what would be the benefits and drawbacks of using a smaller or a larger window?
-
I'm not super academic about ML, so take this with a grain of salt, but speaking broadly, the Whisper model has basically the same structure as an image-to-text model that looks at a picture and comes up with a description of it. Instead of looking at pictures, it looks at a 30-second audio spectrogram.
The encoder layers turn that spectrogram into representations describing aspects of semantic import, and the decoder generates tokens by cross-attending to them, keeping track of prior context (prompt) and the parts that were already "transcribed/translated" (prefix). This is why we are limited to 30s of audio. Too short, and you'd lack surrounding context: you'd cut sentences more often, and a lot of sentences would cease to make sense. Too long, and you'd need larger and larger models to contain the complexity of the meaning you want the model to keep track of.
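You can see the fixed 30-second window directly in the Python API. Here's a minimal sketch, assuming the openai-whisper package is installed and using a hypothetical example.wav; shorter clips get padded with silence and longer ones get cut before the spectrogram is computed:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")

# Load audio and resample to 16 kHz mono.
audio = whisper.load_audio("example.wav")  # hypothetical file

# The encoder expects exactly 30 seconds of audio:
# 30 s * 16000 Hz = 480000 samples, padded or trimmed as needed.
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram the encoder actually sees.
mel = whisper.log_mel_spectrogram(audio).to(model.device)
print(mel.shape)  # torch.Size([80, 3000]) -- 80 mel bins x 3000 frames over 30 s

# Decode that single 30-second window.
result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
print(result.text)
```

For longer recordings, `model.transcribe()` simply slides this 30-second window across the audio and stitches the segments together.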
-
I also have a similar question. I am wondering about the influence of audio length on fine-tuning performance. To be specific, are there any requirements on audio length for fine-tuning in order to get a good WER, such as a certain proportion of the data falling within a given range of lengths (say 15s~25s)? What if the audio clips used for fine-tuning are fairly short?