Long decoding time and letter repetition for held out phones. #1410

CJai-K · 2023-05-30T22:16:19Z

CJai-K
May 30, 2023

Apologies if this has been brought up before, but I wasn't able to find any discussion about it.

I've noticed whisper has the ability to transcribe longer, held out phones. For example, if I say "Umm" but extend the word for a few seconds I'll get a decoding result like 'ummmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm'. This happens with several words like "yeah" giving "Yeahhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh"

The result is not deterministic. For the long "umm", I sometimes get just "umm", and in those cases the audio file decodes faster than real time (the audio file is 3 seconds, decoding time is <1 second). When I get the long result, the utterance consistently takes longer to decode than the length of the audio file (the audio file is 3 seconds, decoding takes closer to 4 seconds).

Is there a way to suppress the repeated letters and let whisper know it should output just the shorter word (i.e., always decode "umm" and "yeah" regardless of the length)?
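To illustrate the output I'm after, here is a minimal post-processing sketch. It is not a Whisper feature: `collapse_runs` and its `keep` parameter are hypothetical names made up for this example. It only cleans up the text after decoding, so it would not help with the longer decoding time.

```python
import re

# Matches any single letter (no digits or underscores) repeated 3 or more times in a row.
_LETTER_RUN = re.compile(r"([^\W\d_])\1{2,}")


def collapse_runs(text: str, keep: int = 2) -> str:
    """Collapse runs of 3+ identical letters down to `keep` copies.

    Hypothetical helper for illustration only; runs of 1 or 2 letters
    (e.g. the "ll" in "hello") are left untouched.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    return _LETTER_RUN.sub(lambda m: m.group(1) * keep, text)


if __name__ == "__main__":
    print(collapse_runs("ummmmmmmmmmmm"))          # umm
    print(collapse_runs("Yeahhhhhhhhh", keep=1))   # Yeah
```

The right value of `keep` depends on the word ("umm" wants 2, "yeah" wants 1), so this is only a rough cleanup and not a general fix.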

Replies: 0 comments
