Does anyone know if faster/slower audio changes WERs? #2117

SIlver-- · 2024-04-03T20:28:54Z

I imagine that past a certain point, speeding the audio up or slowing it down will hurt accuracy, and that the tolerable speed will differ from voice to voice. But thinking out loud: even a 10% speed increase would be a noticeable boost to transcription throughput, and it would reduce overall cost as well.

Thoughts?
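
For a rough sense of scale, here is a back-of-the-envelope sketch of what a 10% speed-up buys. It assumes transcription time and cost grow linearly with the length of the audio, which is a simplification; `stretched_duration` is a hypothetical helper name, not part of Whisper.

```python
def stretched_duration(duration_s: float, speed: float) -> float:
    """Length of a clip after playing it back `speed` times faster."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return duration_s / speed


hour = 3600.0
faster = stretched_duration(hour, 1.10)  # 10% faster playback
saving = 1.0 - faster / hour             # fraction of audio time removed
print(f"{faster:.0f} s of audio, {saving:.1%} less to transcribe")
```

Note that a 10% speed-up removes about 9.1% of the audio, not 10%, because the duration is divided by 1.1 rather than multiplied by 0.9.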

glangford · 2024-04-03T21:17:15Z

It's worth experimenting to see for yourself. I would think the effect varies with speaker clarity, audio quality/noise, the amount of speed-up, and language. Note that the original Whisper announcement (2022) included a demo transcript of very fast speech; see "Speed talking" under the examples here.

https://openai.com/research/whisper
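
If you do run that experiment, you need a way to score each transcript against a reference. Below is a minimal word error rate (WER) sketch based on word-level edit distance. It only lowercases and splits on whitespace, which is an assumption made for brevity; a real evaluation would normally also normalize punctuation and numbers. The function name `wer` is illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Transcribing the same clip at several speeds and comparing `wer(reference, transcript)` for each would show where accuracy starts to fall off for your particular audio.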

EtienneAb3d · 2024-04-04T06:48:04Z

I did some experiments with this, using several time-stretching tools (see the WhisperHallu code), and didn't get any improvement.
Whisper is quite insensitive to speaking speed.
The only difference I saw is that the damage stretching does to the sound quality results in lower recognition quality.
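
For anyone who wants to reproduce this kind of test, one common way to change tempo without changing pitch is ffmpeg's `atempo` audio filter. The sketch below only builds the command line; it is not necessarily what WhisperHallu does, and `atempo_command` and the file names are hypothetical. The range of tempo values accepted depends on the ffmpeg version.

```python
def atempo_command(src: str, dst: str, speed: float) -> list[str]:
    """Build an ffmpeg command that writes `src` to `dst` at `speed` times the tempo."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [
        "ffmpeg",
        "-y",                               # overwrite dst if it already exists
        "-i", src,
        "-filter:a", f"atempo={speed}",     # tempo change, pitch preserved
        dst,
    ]


cmd = atempo_command("talk.wav", "talk_fast.wav", 1.1)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` if ffmpeg is installed.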

misutoneko · 2024-04-04T12:34:35Z

Yeah I had the same idea here a while back.
I didn't notice much difference in the transcript, but slowing down the audio did improve the timing of the subtitles slightly.

Also, at some point whisper.cpp had a switch that sped up the audio but I guess it's been dropped since.
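
One practical detail when transcribing slowed-down or sped-up audio for subtitles: the timestamps come back on the stretched timeline and have to be mapped back to the original one. A minimal sketch, assuming segments are plain `(start, end, text)` tuples in seconds (an illustrative format, not a specific Whisper output structure):

```python
def to_original_timeline(segments, speed):
    """Map timestamps measured on audio played at `speed` back to the original clip.

    A moment at t seconds in the original lands at t / speed in the stretched
    audio, so multiplying by `speed` undoes the stretch.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [(start * speed, end * speed, text) for start, end, text in segments]


# Audio slowed to half speed: every timestamp is twice as late as in the original.
slowed = [(2.0, 4.0, "hello"), (4.0, 9.0, "world")]
print(to_original_timeline(slowed, 0.5))
```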

itaipee · 2024-04-07T11:31:51Z

In most Kaldi recipes, speed perturbation was a default augmentation method. It usually performs well and gives you a "free" 1%-2% reduction in WER.
However, Kaldi acoustic models have 10M-40M parameters and are usually trained on 50-1000 hours of audio. The Whisper large model (about 1.5B parameters) is one to two orders of magnitude larger than a Kaldi model and was trained on ~1000x more audio. I assume the audio is so much more diverse that speed augmentation becomes redundant.

It might help if you want to fine-tune on a very specific domain with a smaller training set.
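
For readers unfamiliar with the technique, speed perturbation in Kaldi-style recipes typically adds copies of each utterance at factors such as 0.9 and 1.1, produced by resampling, so pitch shifts along with tempo. The toy sketch below does this with linear interpolation on a plain list of samples. It only illustrates the idea; a real pipeline would use a proper resampler, and `speed_perturb` is a hypothetical name.

```python
def speed_perturb(samples, factor):
    """Resample `samples` so they play `factor` times faster at the same sample rate."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(len(samples) / factor)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor                 # fractional index into the input
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        # Linear interpolation between the two neighbouring input samples.
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


utterance = [float(i) for i in range(100)]
augmented = {f: speed_perturb(utterance, f) for f in (0.9, 1.0, 1.1)}
print({f: len(x) for f, x in augmented.items()})
```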
