Benchmarks for T4 & V100 GPUs, comparison with human captioning, and deep dive on non-deterministic output #395
-
Hi @kalevleetaru,
One strategy might be to use https://github.com/m-bain/whisperX .
-
Thanks for this analysis! I too am evaluating Whisper and have been impressed with its English performance. I took five different lectures for which I had human captioning, ran them through Whisper, and compared the outputs using jiwer, which gives a nice visualization of insertions, deletions, and substitutions. I further "improved" Whisper's WER by fixing the human "ADA-compliant" captions, which contained errors themselves! Whisper compares favorably to AWS Transcribe (besting it in some circumstances), even with hallucinations included in the mix.
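For anyone wanting to reproduce this kind of comparison, a minimal sketch using jiwer might look like the following. The file names are placeholders, and it assumes a recent jiwer release (3.x) with `process_words` and `visualize_alignment`:

```python
# Minimal sketch: word-level comparison of a Whisper transcript against
# human captions using jiwer (pip install jiwer). File names are
# placeholders; assumes jiwer 3.x.
import jiwer

# Normalize both texts the same way so pure formatting differences
# (case, punctuation, extra whitespace) are not scored as errors.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

with open("human_captions.txt") as f:
    reference = normalize(f.read().replace("\n", " "))
with open("whisper_transcript.txt") as f:
    hypothesis = normalize(f.read().replace("\n", " "))

out = jiwer.process_words(reference, hypothesis)
print(f"WER: {out.wer:.3f}")
print(f"insertions={out.insertions} deletions={out.deletions} "
      f"substitutions={out.substitutions}")

# Aligned view marking every insertion/deletion/substitution.
print(jiwer.visualize_alignment(out))
```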
-
For those interested in the resource requirements of running larger audio files in the cloud, we've produced a series of detailed benchmarks running 30-, 60-, and 150-minute television news broadcasts in Russian, English, and French through Whisper, across the Tiny, Small, Medium, and Large models for both transcription and translation tasks, including host and GPU resource consumption on T4 and V100 GPUs. We ran each model+task combination multiple times and diffed the outputs to show how they change across runs.
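A minimal sketch of that repeated-run diffing, assuming the openai-whisper Python package rather than the CLI used for the benchmarks ("broadcast.mp3" and the model size are placeholders):

```python
# Minimal sketch: run the same broadcast through the same model several
# times and diff the outputs to surface run-to-run variation.
# Assumes the openai-whisper package (pip install openai-whisper);
# "broadcast.mp3" is a placeholder file name.
import difflib

import whisper

model = whisper.load_model("medium")

runs = []
for _ in range(3):
    result = model.transcribe("broadcast.mp3")
    runs.append(result["text"].split())

# Word-level unified diff of run 1 against each later run.
for i, words in enumerate(runs[1:], start=2):
    diff = difflib.unified_diff(
        runs[0], words, fromfile="run 1", tofile=f"run {i}", lineterm=""
    )
    print("\n".join(diff))
```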
For the English-language broadcast, we include a diff comparison against the broadcaster-provided realtime human captioning, which shows Whisper's output to be far more fluent and significantly more complete than the original human transcription.
For all three broadcasts, however, Whisper's non-deterministic output and its tendency toward repetition, dropouts, and hallucination when run via the CLI create significant challenges. The degree to which these artifacts change between runs may be of interest.
It would be interesting to hear about mitigation strategies that can reduce these artifacts, and about ways of flagging their density in a given run so that the run can be aborted and retried (a rough sketch of one such check follows the links below).
Russian 2.5-hour broadcast:
https://blog.gdeltproject.org/a-deep-dive-exploration-applying-openais-whisper-asr-to-a-russian-television-news-broadcast/
English 1-hour broadcast (with comparison against realtime human captioning):
https://blog.gdeltproject.org/a-deep-dive-exploration-applying-openais-whisper-asr-to-a-pbs-newshour-broadcast/
30-minute French broadcast:
https://blog.gdeltproject.org/a-deep-dive-exploration-applying-openais-whisper-asr-to-a-french-language-tele-congo-tv-news-broadcast/
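As a rough starting point for the mitigation question above, here is a sketch assuming the openai-whisper Python API: it decodes greedily on the first attempt (temperature=0.0, removing sampling noise), disables conditioning on previous text (a commonly suggested way to limit runaway repetition), and retries with a little temperature if a crude repetition-density check fires. The 0.2 density threshold, retry count, and file name are illustrative placeholders, not tuned values:

```python
# Sketch: greedy-first decoding plus a crude repetition-density check
# that aborts and retries a run. Assumes the openai-whisper Python API;
# the 0.2 threshold, retry count, and file name are illustrative.
from collections import Counter

import whisper

model = whisper.load_model("medium")

def repetition_density(segments):
    """Fraction of segments whose text exactly duplicates another segment."""
    texts = [seg["text"].strip() for seg in segments]
    counts = Counter(texts)
    repeated = sum(1 for t in texts if counts[t] > 1)
    return repeated / max(len(texts), 1)

def transcribe_with_retry(path, max_attempts=3, max_density=0.2):
    result = None
    for attempt in range(max_attempts):
        result = model.transcribe(
            path,
            # Greedy on the first attempt (deterministic); add a little
            # sampling noise on retries to escape repetition loops.
            temperature=0.2 * attempt,
            # Don't condition on previous text: a common suggestion for
            # limiting hallucinated repetition carrying across windows.
            condition_on_previous_text=False,
        )
        if repetition_density(result["segments"]) < max_density:
            return result
        print(f"attempt {attempt + 1}: repetition density too high, retrying")
    return result  # hand back the last attempt even if still noisy

result = transcribe_with_retry("broadcast.mp3")
print(result["text"][:500])
```

Note that `transcribe()` already performs an internal temperature fallback per segment when its compression-ratio or log-probability thresholds are exceeded; the check above just adds a whole-run pass on top of that.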