Whisper-v3 works better on long audio than several short audios #1913

junyongyou · 2023-12-20T12:11:45Z

junyongyou
Dec 20, 2023

Dear all, I am a newbie to ASR and have only tried Whisper-v3 a couple of times. I have found that it performs well on long speech, but the results become much worse if I split the long speech into multiple short segments. Is that a common issue? Thanks a lot.

Answered by Purfview

Dec 21, 2023

But the results become much worse if I split the long speech into multiple short segments. Is that a common issue?

That's expected, because with splitting you lose context of the previous segments.

phineas-pta · 2023-12-21T11:11:46Z

phineas-pta
Dec 21, 2023

It may also depend on the language.

Purfview · 2023-12-21T12:45:24Z

Purfview
Dec 21, 2023

But the results become much worse if I split the long speech into multiple short segments. Is that a common issue?

That's expected, because with splitting you lose context of the previous segments.

junyongyou · 2023-12-22T18:14:21Z

junyongyou
Dec 22, 2023
Author

Has anyone else encountered the same problem?

gaspardpetit · 2024-01-08T02:18:44Z

gaspardpetit
Jan 8, 2024

I have experienced this, and it can be greatly improved by providing the previous text as a prompt. When calling model.transcribe, pass the text transcribed so far in initial_prompt. If your segments are short, you can also concatenate the text of several previous segments into a single prompt. This guides the transcription toward more consistent output.
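As a rough illustration of the prompt idea, here is a minimal sketch. It assumes `model` is an object whose `transcribe` method accepts `initial_prompt` and returns a dict with a `"text"` key, as openai-whisper models do. The helper name `transcribe_with_context` and the `max_prompt_chars` cut-off are made up for this example; Whisper only uses a limited amount of prompt anyway, so the sketch keeps just the tail of the running transcript.

```python
def transcribe_with_context(model, segments, max_prompt_chars=800):
    """Transcribe segments in order, feeding earlier text back as the prompt.

    `segments` can be anything `model.transcribe` accepts (file paths, arrays).
    `max_prompt_chars` is an arbitrary illustrative limit, not a Whisper setting.
    """
    texts = []
    for audio in segments:
        # Use the tail of everything transcribed so far as context;
        # the first segment has no context, so pass None.
        prompt = " ".join(texts)[-max_prompt_chars:] or None
        result = model.transcribe(audio, initial_prompt=prompt)
        texts.append(result["text"].strip())
    return texts
```

With a real model this might be called as `transcribe_with_context(whisper.load_model("large-v3"), ["part1.wav", "part2.wav"])`, where the file names are placeholders.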

In addition to using the prompt, I can offer two other pieces of advice regarding short audio segments:

Avoid splitting audio in the middle of words: use voice activity detection or diarization to find pauses and ideal places to split your audio;
Avoid extremely short segments (less than 1 second). If you provide 2 ms of audio, Whisper will still make something out of it. Nine times out of ten it will generate something absurd, such as "Don't forget to subscribe" or "Subtitles by Some Random Guy". I am guessing this comes from training sets that had this kind of text at the end of audio-video clips without any associated speech, and now it pops up when transcribing silence or short nonsense audio.
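The two points above can be sketched together. This is only a crude stand-in for real voice activity detection: it treats low-energy frames as pauses, cuts in the middle of each pause, and skips any cut that would leave a segment shorter than one second. The function names and all thresholds (`silence_rms`, `min_pause_ms`, and so on) are assumptions chosen for illustration; a dedicated VAD or diarization tool would likely be more robust on noisy audio.

```python
import math


def find_split_points(samples, sample_rate, frame_ms=20,
                      silence_rms=0.01, min_pause_ms=200):
    """Return sample indices at the midpoint of each sufficiently long pause."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_frames = max(1, min_pause_ms // frame_ms)
    n_frames = len(samples) // frame_len
    points, run_start = [], None
    for i in range(n_frames + 1):
        if i < n_frames:
            frame = samples[i * frame_len:(i + 1) * frame_len]
            rms = math.sqrt(sum(x * x for x in frame) / len(frame))
            silent = rms < silence_rms
        else:
            silent = False  # sentinel frame that closes a trailing silent run
        if silent and run_start is None:
            run_start = i
        elif not silent and run_start is not None:
            if i - run_start >= min_frames:
                # Cut in the middle of the pause, not at its edges.
                points.append((run_start + i) * frame_len // 2)
            run_start = None
    return points


def split_at_pauses(samples, sample_rate, min_segment_s=1.0, **vad_options):
    """Split at pauses, skipping cuts that would create a too-short segment."""
    min_len = int(min_segment_s * sample_rate)
    segments, start = [], 0
    for point in find_split_points(samples, sample_rate, **vad_options):
        if point - start >= min_len and len(samples) - point >= min_len:
            segments.append(samples[start:point])
            start = point
    segments.append(samples[start:])
    return segments
```

Note that if the whole input is shorter than `min_segment_s`, it still comes back as a single short segment, so the caller may want to skip transcribing it.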

3 replies

junyongyou Jan 8, 2024
Author

Great, thanks a lot for the relevant and valuable answers. What I have done is aggregate the audio input from the beginning to make a longer speech. This sacrifices transcription quality at the beginning, but the results get better and better afterwards.
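One way to read the aggregation approach described here is a buffer that grows with every incoming chunk and is re-transcribed as a whole each time. The sketch below is a guess at that idea, not the author's actual code. The class name `GrowingBufferTranscriber` and the `join` callable are hypothetical; `model` is assumed to expose `transcribe(audio)` returning a dict with a `"text"` key, as openai-whisper models do.

```python
class GrowingBufferTranscriber:
    """Re-transcribe all audio received so far each time a new chunk arrives."""

    def __init__(self, model, join):
        self.model = model
        # `join` merges a list of chunks into one audio object, for example
        # numpy.concatenate when the chunks are NumPy arrays.
        self.join = join
        self.chunks = []

    def push(self, chunk):
        """Add a chunk and return the transcript of the whole buffer."""
        self.chunks.append(chunk)
        audio = self.join(self.chunks)
        return self.model.transcribe(audio)["text"].strip()
```

The trade-off is cost: every call transcribes the full buffer again, so the work per chunk grows as the recording gets longer.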

stri8ed Jun 18, 2024

@junyongyou What duration do you consider "long" audio vs short?

yagudaev Feb 3, 2025

The original paper notes that Whisper internally splits audio into 30-second chunks (see: https://openai.com/index/whisper/).

Therefore, 30 seconds should be a good starting point.

I'm working on a feature that does live transcription and later stores the text for a longer summary. For live use, 2- to 5-second chunks work better for showing a few words of text on the screen.
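To make the chunk sizes concrete, here is a small sketch of the arithmetic, assuming 16 kHz audio (the rate Whisper operates on). The helper `chunk_ranges` is a made-up name for illustration; it only computes index ranges and does no audio processing or pause detection.

```python
SAMPLE_RATE = 16_000  # Whisper operates on 16 kHz audio


def chunk_ranges(n_samples, chunk_seconds=30.0, sample_rate=SAMPLE_RATE):
    """Return (start, end) sample index pairs covering n_samples in fixed chunks."""
    step = int(chunk_seconds * sample_rate)
    if step < 1:
        raise ValueError("chunk_seconds is too small for this sample rate")
    return [(start, min(start + step, n_samples))
            for start in range(0, n_samples, step)]
```

For example, `chunk_ranges(70 * SAMPLE_RATE)` covers 70 seconds of audio with two full 30-second chunks and one 10-second remainder, while `chunk_seconds=4` gives the kind of short chunk suited to live display.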
