Benchmarks for T4 & V100 GPUs, comparison with human captioning, and deep dive on non-deterministic output #395
-
Hi @kalevleetaru,
One strategy might be to use https://github.com/m-bain/whisperX .
-
Thanks for this analysis! I too am evaluating Whisper and have been impressed with its English performance. I took five different lectures for which I had human captioning, ran them through Whisper, and compared the outputs using jiwer, which gives a nice visualization of insertions, deletions, and substitutions. I further "improved" Whisper's WER by fixing the human "ADA-compliant" captions, which contained errors themselves! Whisper compares favorably to AWS Transcribe (besting it in some circumstances), even with hallucinations included in the mix.
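For anyone wanting to reproduce this kind of comparison, a minimal sketch using jiwer might look like the following. The file names are placeholders, and it assumes a recent jiwer release (3.x) with `process_words` and `visualize_alignment`:

```python
# Minimal sketch: word-level comparison of a Whisper transcript against
# human captions using jiwer (pip install jiwer). File names are
# placeholders; assumes jiwer 3.x.
import jiwer

# Normalize both texts the same way so pure formatting differences
# (case, punctuation, extra whitespace) are not scored as errors.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

with open("human_captions.txt") as f:
    reference = normalize(f.read().replace("\n", " "))
with open("whisper_transcript.txt") as f:
    hypothesis = normalize(f.read().replace("\n", " "))

out = jiwer.process_words(reference, hypothesis)
print(f"WER: {out.wer:.3f}")
print(f"insertions={out.insertions} deletions={out.deletions} "
      f"substitutions={out.substitutions}")

# Aligned view marking every insertion/deletion/substitution.
print(jiwer.visualize_alignment(out))
```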
-
For those interested in the resource requirements of running larger audio files in the cloud, we've produced a series of detailed benchmarks running 30-, 60-, and 150-minute television news broadcasts in Russian, English, and French through Whisper, across the Tiny, Small, Medium, and Large models for both transcription and translation tasks, including host and GPU resource consumption on T4 and V100 GPUs. We ran each model+task combination multiple times and diffed the outputs to show how they change across runs.
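A minimal sketch of that repeated-run diffing, assuming the openai-whisper Python package rather than the CLI used for the benchmarks ("broadcast.mp3" and the model size are placeholders):

```python
# Minimal sketch: run the same broadcast through the same model several
# times and diff the outputs to surface run-to-run variation.
# Assumes the openai-whisper package (pip install openai-whisper);
# "broadcast.mp3" is a placeholder file name.
import difflib

import whisper

model = whisper.load_model("medium")

runs = []
for _ in range(3):
    result = model.transcribe("broadcast.mp3")
    runs.append(result["text"].split())

# Word-level unified diff of run 1 against each later run.
for i, words in enumerate(runs[1:], start=2):
    diff = difflib.unified_diff(
        runs[0], words, fromfile="run 1", tofile=f"run {i}", lineterm=""
    )
    print("\n".join(diff))
```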
For the English-language broadcast, we include a diff comparison against the broadcaster-provided realtime human captioning, which shows Whisper's output to be far more fluent and significantly more complete than the original human transcription.
For all three broadcasts, however, Whisper's non-deterministic output and its tendency toward repetition, dropouts, and hallucination when run via the CLI create significant challenges. The degree to which these artifacts change between runs may be of interest.
It would be interesting to hear about mitigation strategies that can reduce these artifacts, and about ways of flagging their density in a given run so that the run can be aborted and retried (a rough sketch of one such check follows the links below).
Russian 2.5-hour broadcast:
https://blog.gdeltproject.org/a-deep-dive-exploration-applying-openais-whisper-asr-to-a-russian-television-news-broadcast/
English 1-hour broadcast (with comparison against realtime human captioning):
https://blog.gdeltproject.org/a-deep-dive-exploration-applying-openais-whisper-asr-to-a-pbs-newshour-broadcast/
30-minute French broadcast:
https://blog.gdeltproject.org/a-deep-dive-exploration-applying-openais-whisper-asr-to-a-french-language-tele-congo-tv-news-broadcast/
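As a rough starting point for the mitigation question above, here is a sketch assuming the openai-whisper Python API: it decodes greedily on the first attempt (temperature=0.0, removing sampling noise), disables conditioning on previous text (a commonly suggested way to limit runaway repetition), and retries with a little temperature if a crude repetition-density check fires. The 0.2 density threshold, retry count, and file name are illustrative placeholders, not tuned values:

```python
# Sketch: greedy-first decoding plus a crude repetition-density check
# that aborts and retries a run. Assumes the openai-whisper Python API;
# the 0.2 threshold, retry count, and file name are illustrative.
from collections import Counter

import whisper

model = whisper.load_model("medium")

def repetition_density(segments):
    """Fraction of segments whose text exactly duplicates another segment."""
    texts = [seg["text"].strip() for seg in segments]
    counts = Counter(texts)
    repeated = sum(1 for t in texts if counts[t] > 1)
    return repeated / max(len(texts), 1)

def transcribe_with_retry(path, max_attempts=3, max_density=0.2):
    result = None
    for attempt in range(max_attempts):
        result = model.transcribe(
            path,
            # Greedy on the first attempt (deterministic); add a little
            # sampling noise on retries to escape repetition loops.
            temperature=0.2 * attempt,
            # Don't condition on previous text: a common suggestion for
            # limiting hallucinated repetition carrying across windows.
            condition_on_previous_text=False,
        )
        if repetition_density(result["segments"]) < max_density:
            return result
        print(f"attempt {attempt + 1}: repetition density too high, retrying")
    return result  # hand back the last attempt even if still noisy

result = transcribe_with_retry("broadcast.mp3")
print(result["text"][:500])
```

Note that `transcribe()` already performs an internal temperature fallback per segment when its compression-ratio or log-probability thresholds are exceeded; the check above just adds a whole-run pass on top of that.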