Reference text as a template for better recognition result #2022
Replies: 2 comments 1 reply
FYI, see WhisperTimeSync
Hmm, @glangford, I just checked the project you mentioned (I was on vacation). Unfortunately, it's not what I really need. For a karaoke implementation, I need a timestamp for every word, not a time range for a whole sentence (I'm working with the JSON output). It is also an additional step that increases the overall processing time 😞, whereas I want to reduce it for a better user experience. I believe (and hope) that adding this option to Whisper itself could dramatically decrease processing time and improve accuracy. Once again, it's a perfect and "magical" tool; I really like it and appreciate your work!
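To make the "timestamp for every word" requirement concrete, here is a minimal sketch that flattens a Whisper-style JSON result into one list of words. It assumes the file written with --word_timestamps True has a top-level "segments" list whose entries each carry a "words" list with "word", "start" and "end" keys. The function name flatten_words and the sample data are hypothetical and hand-written, not real Whisper output.

```python
import json


def flatten_words(result):
    """Return every word entry of a Whisper-style result dict, in order.

    Assumed layout: result["segments"][i]["words"][j] has the keys
    "word", "start" and "end". Segments without "words" are skipped.
    """
    words = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            words.append({
                "word": word["word"].strip(),  # drop the leading space on tokens
                "start": word["start"],
                "end": word["end"],
            })
    return words


# Hypothetical sample in the assumed layout (not real Whisper output).
sample = json.loads("""
{"segments": [
  {"words": [{"word": " Hello", "start": 0.0, "end": 0.4},
             {"word": " darkness", "start": 0.4, "end": 1.0}]}
]}
""")

for entry in flatten_words(sample):
    print(entry["word"], entry["start"], entry["end"])
```

For a real file, the same function could be applied to the dict returned by json.load on the JSON that Whisper wrote.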
I hope the developers check in here from time to time and might be able to help me.
I'm working on a karaoke program for Windows that automatically converts any song into a karaoke track. Vocal-remover handles the separation of vocals and instruments excellently (it runs very fast on the GPU). To extract the song lyrics with timestamps, I use Whisper: "-m whisper song_Vocals.mp3 --model small --word_timestamps True --output_format json".
Unfortunately, the results are only somewhat acceptable for English. The base and small models sometimes give poor results, and although the medium and large models work better, they are too slow. (I couldn't get the program to use the GPU; as I understand it, this is a known issue: "Failed to launch Triton kernels, likely due to missing CUDA toolkit; falling back to a slower median kernel implementation...".) For songs in Russian or Ukrainian, the results are sometimes simply discouraging: Whisper invents new, nonexistent words or badly distorts what was sung.
At the same time, I can obtain the lyrics of practically any song from lyrics databases and targeted Google searches. The only downside is that those lyrics come without timestamps for the words.
My question is: is it possible to give Whisper a "reference" text as a template that its output should match, and still receive the result as a JSON file with word timestamps?
Perhaps someone can suggest another solution capable of adding timestamps to audio based on existing text?
Thank you in advance!
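One possible workaround for the question above, sketched here rather than offered as a Whisper feature, is to post-process the recognized words: match them against the known lyrics and copy the timestamps onto the reference words. The sketch below uses difflib.SequenceMatcher from the Python standard library. The names normalize and align_lyrics are hypothetical, and the input is assumed to be a flat list of dicts with "word", "start" and "end" keys taken from Whisper's word-level output.

```python
import difflib
import re


def normalize(token):
    """Lowercase a token and strip punctuation and whitespace (Unicode-aware)."""
    return re.sub(r"[^\w]", "", token.lower())


def align_lyrics(reference_text, recognized_words):
    """Copy timestamps from recognized words onto the words of a reference text.

    recognized_words: list of {"word": str, "start": float, "end": float}
    (assumed shape). Returns one dict per reference word; "start" and "end"
    stay None where no recognized word matched.
    """
    ref_tokens = reference_text.split()
    ref_keys = [normalize(t) for t in ref_tokens]
    rec_keys = [normalize(w["word"]) for w in recognized_words]

    aligned = [{"word": t, "start": None, "end": None} for t in ref_tokens]
    matcher = difflib.SequenceMatcher(a=ref_keys, b=rec_keys, autojunk=False)
    # Each matching block is a run of identical normalized words in both lists.
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            rec = recognized_words[block.b + offset]
            aligned[block.a + offset]["start"] = rec["start"]
            aligned[block.a + offset]["end"] = rec["end"]
    return aligned


# Hypothetical example: the recognizer heard "mine" where the lyrics say "my".
recognized = [
    {"word": " Hello", "start": 0.0, "end": 0.4},
    {"word": " darkness", "start": 0.4, "end": 1.0},
    {"word": " mine", "start": 1.0, "end": 1.3},
    {"word": " old", "start": 1.3, "end": 1.6},
    {"word": " friend", "start": 1.6, "end": 2.1},
]
for entry in align_lyrics("Hello darkness, my old friend", recognized):
    print(entry)
```

Words that the recognizer got wrong keep None for both timestamps; a karaoke front end could fill those gaps by interpolating between the neighbouring matched words. This only helps when enough words are recognized correctly to anchor the match, so it does not replace a true forced-alignment step on badly recognized audio.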