Replies: 11 comments
-
Interesting results indeed, thanks for sharing, but AFAIK WhisperS2T is just an interface for multiple backends, so which one are you using here?
-
Oh yeah, sorry, I'm using the ctranslate2 backend. At any rate, out of respect for the hard work behind all the repositories I'm benchmarking, it's important to note that different libraries have different benefits/drawbacks...my benchmarks only measure speed.
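For context, selecting that backend looks roughly like this (a minimal sketch based on WhisperS2T's README; the file path and language code are placeholders):

```python
import whisper_s2t

# WhisperS2T wraps several backends behind one interface;
# here we explicitly request the CTranslate2 one.
model = whisper_s2t.load_model(model_identifier="large-v2", backend="CTranslate2")

files = ["audio.wav"]  # placeholder path
out = model.transcribe_with_vad(
    files,
    lang_codes=["en"],
    tasks=["transcribe"],
    initial_prompts=[None],
    batch_size=16,
)
print(out[0][0]["text"])
```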
-
Recently https://github.com/ictnlp/StreamSpeech was released and I'm curious how it stacks up, although it currently doesn't support many languages unless you train it yourself, and it's more real-time focused. Any chance you could benchmark it alongside whisperx if possible? Thanks
-
Interesting...thanks for the link. I briefly checked it out, and the model names imply that they only handle translation. I didn't see a model that handles straight transcription from one language to the same language. With that being said, if you find out otherwise and provide me with a basic script that can perform inference, I'll adapt it to get VRAM measurements and timing and process the same audio file that my other benchmarks used.
-
It looks like |
-
I guess the whisperx paper showed that using the previous segment's transcription in the prompt isn't useful.
-
Indeed, I recall reading that. Anecdotally, that does not seem to be the case for me, but I'm interested to hear if anyone else has more data on that. Intuitively, I would expect additional context to be useful, given that the model was trained to condition the result on the prompt/context.
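If anyone wants to gather their own data on this, faster-whisper exposes the toggle directly, so a side-by-side comparison is just a few lines (a sketch; the audio path is a placeholder):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Run the same file with and without the previous segment's text
# being fed back in as the prompt for the next window.
for condition in (True, False):
    segments, _ = model.transcribe(
        "audio.wav",  # placeholder path
        beam_size=5,
        condition_on_previous_text=condition,
    )
    text = " ".join(s.text.strip() for s in segments)
    print(f"condition_on_previous_text={condition}:\n{text}\n")
```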
-
If you go here you can see that the WER is actually better...lol. Still trying to figure that out, but the guy seems solid in his testing so far:
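For anyone who wants to sanity-check WER numbers on their own transcripts, the jiwer package makes it a one-liner (a sketch; the strings are placeholders):

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"   # ground-truth transcript
hypothesis = "the quick brown fox jumped over a lazy dog"   # model output

# WER = (substitutions + deletions + insertions) / words in reference
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```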
-
Generally, very long context (>30 sec) is not needed for ASR (unlike paralinguistic tasks). By not passing in the previous context, we prevent some repetitions/hallucinations from propagating to the next segment, as we see in batched faster_whisper, and in turn get better WER.
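To make that concrete: in batched faster_whisper each VAD-derived chunk is decoded independently, so nothing carries over between segments (a sketch assuming a recent faster-whisper release that ships BatchedInferencePipeline; the path is a placeholder):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

# Chunks are transcribed independently, so a hallucination in one
# segment cannot leak into the prompt of the next one.
segments, info = batched.transcribe("audio.wav", batch_size=16)  # placeholder path
for s in segments:
    print(f"[{s.start:.1f}-{s.end:.1f}] {s.text}")
```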
-
Have there been any comparisons with faster-whisper, non-batched, with VAD, on long-form transcription? Looking at the benchmarks you linked to, it seems the only sequential implementation that was tested is the OpenAI one, which does not implement VAD preprocessing. It's well known that VAD results in improvements.
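For reference, the sequential + VAD configuration is just a couple of flags in faster-whisper (a sketch; the VAD parameters shown are illustrative, not tuned):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Sequential decoding, but with silence stripped up front by the
# built-in Silero VAD filter.
segments, _ = model.transcribe(
    "long_audio.wav",  # placeholder path
    beam_size=5,
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},
)
for s in segments:
    print(s.text)
```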
-
Is that the case with long-form audio? The tests in those benchmarks look tiny, and even so, I would like to see more benchmarks on long-form audio (multi-hour), since that is where I would expect to see the most gains/losses from batching vs. sequential.
-
Hey all, after a nice conversation with @MahmoudAshraf97 on a different repo, I wanted to share some of my benchmark data. This was created using an RTX 4090 on Windows, no flash attention, with 5 beams. I'd love to include data for whisper.cpp as well as Hugging Face's implementation, but unfortunately, when the HF implementation uses any beam size above 1 the VRAM usage skyrockets...and I'm not aware of any Python bindings for whisper.cpp that can use CUDA acceleration. Hope y'all find it as interesting as it was for me to test!
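For anyone who wants to replicate the setup, here's roughly how timing and VRAM can be captured around a transcription call (a sketch of my own, not the exact script behind these numbers; it uses pynvml, so it reports whole-device memory and assumes nothing else is running on the GPU):

```python
import time
import pynvml  # pip install nvidia-ml-py

def measure(fn):
    """Time a callable and report GPU memory in use after it finishes.
    Whole-device numbers, so run one benchmark process at a time."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
    pynvml.nvmlShutdown()
    print(f"elapsed: {elapsed:.2f}s, GPU memory in use: {used_mib:.0f} MiB")
    return result

# e.g. measure(lambda: model.transcribe_with_vad(files, lang_codes=["en"],
#              tasks=["transcribe"], initial_prompts=[None], batch_size=16))
```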