Whisper's WER and CER is huge on audios less than 2 seconds. #1447

BelalElhossany · 2023-06-13T10:42:11Z

I'm evaluating Whisper on an Arabic dataset called SADA. I noticed that the WER and CER for audio clips shorter than 2 seconds are much higher than for longer clips. Does Whisper perform poorly on short audio?
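To make the comparison concrete, here is a minimal sketch of how WER and CER could be pooled separately for clips below and above a duration threshold. It is not the evaluation code used in this thread: the function names, the `(duration, reference, hypothesis)` tuple layout, and the 2-second default are assumptions for illustration, and no text normalization is applied.

```python
from collections import defaultdict

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rates_by_duration(samples, threshold_s=2.0):
    """samples: iterable of (duration_seconds, reference, hypothesis).

    Returns {"short" | "long": (wer, cer)}, pooled over each bucket.
    CER is computed on the raw strings, spaces included.
    """
    # Per bucket: [word errors, reference words, char errors, reference chars]
    totals = defaultdict(lambda: [0, 0, 0, 0])
    for duration, ref, hyp in samples:
        bucket = "short" if duration < threshold_s else "long"
        t = totals[bucket]
        ref_words, hyp_words = ref.split(), hyp.split()
        t[0] += edit_distance(ref_words, hyp_words)
        t[1] += len(ref_words)
        t[2] += edit_distance(ref, hyp)
        t[3] += len(ref)
    return {b: (t[0] / max(t[1], 1), t[2] / max(t[3], 1))
            for b, t in totals.items()}
```

Short references make these metrics coarse: with a one-word reference, a single substitution already gives a WER of 100%, so the short bucket can look much worse than the long one even when the absolute number of errors is small.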

BelalElhossany · 2023-06-13T11:35:20Z

Author

Every short clip is transcribed as the same single word: 'شكرا' ("thank you").
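One quick way to quantify this kind of collapse is to measure how often the most frequent transcription occurs among the short clips. The helper below is a hypothetical sketch (the name `dominant_hypothesis` is made up for illustration) and assumes a non-empty list of hypothesis strings.

```python
from collections import Counter

def dominant_hypothesis(hypotheses):
    """Return (text, share) for the most frequent transcription.

    Assumes `hypotheses` is a non-empty list of strings.
    """
    counts = Counter(h.strip() for h in hypotheses)
    text, n = counts.most_common(1)[0]
    return text, n / len(hypotheses)
```

A share close to 1.0 on the short clips would suggest the model is emitting a stock phrase rather than transcribing the audio.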

4 replies

ryanheise Jun 13, 2023

I think it's fair to say that more context (from either previous input or an initial prompt) increases accuracy, so a corollary would be that less context may decrease accuracy.

But the data set you mentioned also contains a variety of dialects, and that may be why Whisper performs worse on it. In case it helps, there are other discussions here about how you can fine-tune Whisper on your own data set to make it work better for your use case.

BelalElhossany Jun 13, 2023
Author

@ryanheise Hi Ryan, thanks for your reply.
When I removed the clips shorter than 2 seconds, the results improved considerably, so I don't think dialect is the cause. You are right that Modern Standard Arabic has a lower error rate than the dialects, but short clips are still a problem.
I will try fine-tuning. But if each clip is self-contained and independent, can I still use the initial prompt in some way? Or can I plug in an external language model?
Thanks a lot again.

ryanheise Jun 13, 2023

The idea of the prompt is to think of some words that could plausibly come before the words in the audio recording. So if you know what kind of words to expect in the short audio, because you have a particular use case in mind, then you might be able to think of some prompt that makes sense in that kind of use case.
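As a rough sketch of that idea with the `openai-whisper` Python package: `load_model` and `transcribe(..., initial_prompt=...)` are existing calls, while the helper names, the `"small"` model choice, and the way the prompt is assembled from a word list are assumptions for illustration. Whether a prompt helps on clips this short has to be checked empirically.

```python
def build_prompt(expected_words):
    """Join words you expect in the domain into one short prompt string."""
    return ", ".join(w.strip() for w in expected_words if w.strip())

def transcribe_short_clip(path, expected_words, model_name="small", language="ar"):
    """Transcribe one clip, conditioning the decoder on a domain prompt."""
    import whisper  # openai-whisper, imported lazily so build_prompt works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(
        path,
        language=language,  # skip language detection on very short audio
        initial_prompt=build_prompt(expected_words),
    )
    return result["text"].strip()
```

For example, for a yes/no voice command use case one might call `transcribe_short_clip("clip.wav", ["نعم", "لا"])`, where `clip.wav` is a placeholder path.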

As for external models, Whisper expects the model to fit a certain shape, so any model you plug into Whisper should be much like Whisper's own models: trained on 30-second inputs and using the same token format.
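To illustrate what the fixed 30-second input means for short clips, here is a pure-Python sketch of the padding step. The 16 kHz sample rate and 30-second window match Whisper's published input format; the functions themselves are simplified stand-ins written for this example, not the library's own code.

```python
SAMPLE_RATE = 16_000   # Hz, the input rate Whisper expects
CHUNK_SECONDS = 30     # fixed window length the models were trained on
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def pad_or_trim(samples, length=N_SAMPLES):
    """Zero-pad or truncate a list of samples to exactly `length` items."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))

def speech_fraction(duration_s):
    """Fraction of the 30-second window occupied by the original clip."""
    return min(duration_s, CHUNK_SECONDS) / CHUNK_SECONDS
```

Under these assumptions a 2-second clip fills only about 7% of the window and the rest is padding. That does not by itself explain the errors on short clips, but it shows how little signal the model has to work with.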

BelalElhossany Jun 13, 2023
Author

@ryanheise Thanks brother.
