Replies: 11 comments
-
Interesting results indeed, thanks for sharing, but AFAIK WhisperS2T is just an interface for multiple backends, so which one are you using here?
-
Oh yeah, sorry, I'm using the ctranslate2 backend. At any rate, out of respect for the hard work behind all the repositories I'm benchmarking, it's important to note that different libraries have different benefits/drawbacks...my benchmarks only measure speed.
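For context, selecting that backend looks roughly like this (a minimal sketch based on WhisperS2T's README; the file path and language code are placeholders):

```python
import whisper_s2t

# WhisperS2T wraps several backends behind one interface;
# here we explicitly request the CTranslate2 one.
model = whisper_s2t.load_model(model_identifier="large-v2", backend="CTranslate2")

files = ["audio.wav"]  # placeholder path
out = model.transcribe_with_vad(
    files,
    lang_codes=["en"],
    tasks=["transcribe"],
    initial_prompts=[None],
    batch_size=16,
)
print(out[0][0]["text"])
```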
-
Recently https://github.com/ictnlp/StreamSpeech was released and I'm curious how it stacks up, although it currently doesn't support many languages unless you train it yourself, and it's more real-time focused. Any chance you could benchmark it alongside whisperx if possible? Thanks
-
Interesting...thanks for the link. I briefly checked it out, and the model names imply that they only handle translation. I didn't see a model that handles straight transcription from one language to the same language. With that being said, if you find out otherwise and provide me with a basic script that can perform inference, I'll adapt it to get VRAM measurements and timing and process the same audio file that my other benchmarks used.
-
It looks like |
-
I guess the whisperx paper showed that using the previous segment's transcription in the prompt isn't useful.
-
Indeed, I recall reading that. Anecdotally, that does not seem to be the case for me, but I'm interested to hear if anyone else has more data on that. Intuitively, I would expect additional context to be useful, given that the model was trained to condition the result on the prompt/context.
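If anyone wants to gather their own data on this, faster-whisper exposes the toggle directly, so a side-by-side comparison is just a few lines (a sketch; the audio path is a placeholder):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Run the same file with and without the previous segment's text
# being fed back in as the prompt for the next window.
for condition in (True, False):
    segments, _ = model.transcribe(
        "audio.wav",  # placeholder path
        beam_size=5,
        condition_on_previous_text=condition,
    )
    text = " ".join(s.text.strip() for s in segments)
    print(f"condition_on_previous_text={condition}:\n{text}\n")
```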
-
If you go here you can see that the WER is actually better...lol. Still trying to figure that out, but the guy seems solid in his testing so far:
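For anyone who wants to sanity-check WER numbers on their own transcripts, the jiwer package makes it a one-liner (a sketch; the strings are placeholders):

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"   # ground-truth transcript
hypothesis = "the quick brown fox jumped over a lazy dog"   # model output

# WER = (substitutions + deletions + insertions) / words in reference
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```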
-
Generally, very long context (>30 sec) is not needed for ASR (unlike paralinguistic tasks). By not passing in the previous context, we prevent some repetitions/hallucinations from propagating to the next segment, as we see in batched faster_whisper, and in turn get better WER.
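To make that concrete: in batched faster_whisper each VAD-derived chunk is decoded independently, so nothing carries over between segments (a sketch assuming a recent faster-whisper release that ships BatchedInferencePipeline; the path is a placeholder):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

# Chunks are transcribed independently, so a hallucination in one
# segment cannot leak into the prompt of the next one.
segments, info = batched.transcribe("audio.wav", batch_size=16)  # placeholder path
for s in segments:
    print(f"[{s.start:.1f}-{s.end:.1f}] {s.text}")
```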
-
Have there been any comparisons with faster-whisper, non-batched, with VAD, on long-form transcription? Looking at the benchmarks you linked to, it seems the only sequential implementation that was tested is the OpenAI one, which does not implement VAD preprocessing. It's well known that VAD results in improvements.
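For reference, the sequential + VAD configuration is just a couple of flags in faster-whisper (a sketch; the VAD parameters shown are illustrative, not tuned):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Sequential decoding, but with silence stripped up front by the
# built-in Silero VAD filter.
segments, _ = model.transcribe(
    "long_audio.wav",  # placeholder path
    beam_size=5,
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},
)
for s in segments:
    print(s.text)
```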
-
Is that the case with long-form audio? The tests in those benchmarks look tiny, and even so, I would like to see more benchmarks on long-form audio (multi-hour), since that is where I would expect to see the most gains/losses from batching vs. sequential.
-
Hey all, after a nice conversation with @MahmoudAshraf97 on a different repo, I wanted to share some of my benchmark data. This was created using an RTX 4090 on Windows, no flash attention, with 5 beams. I'd love to include data for whisper.cpp as well as Hugging Face's implementation, but unfortunately, when the HF implementation uses any beam size above 1 the VRAM usage skyrockets...and I'm not aware of any Python bindings for whisper.cpp that can use CUDA acceleration. Hope y'all find it as interesting as it was for me to test!
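For anyone who wants to replicate the setup, here's roughly how timing and VRAM can be captured around a transcription call (a sketch of my own, not the exact script behind these numbers; it uses pynvml, so it reports whole-device memory and assumes nothing else is running on the GPU):

```python
import time
import pynvml  # pip install nvidia-ml-py

def measure(fn):
    """Time a callable and report GPU memory in use after it finishes.
    Whole-device numbers, so run one benchmark process at a time."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
    pynvml.nvmlShutdown()
    print(f"elapsed: {elapsed:.2f}s, GPU memory in use: {used_mib:.0f} MiB")
    return result

# e.g. measure(lambda: model.transcribe_with_vad(files, lang_codes=["en"],
#              tasks=["transcribe"], initial_prompts=[None], batch_size=16))
```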