Reference text as a template for better recognition result #2022

sensboston · 2024-02-15T17:50:31Z

I hope the developers check in here from time to time and might be able to help me.

I'm working on a karaoke program for Windows that automatically converts any song into a karaoke track. Vocal-remover handles the vocal and instrument separation excellently (it runs very fast on the GPU). To extract song lyrics with timestamps, I use Whisper: "-m whisper song_Vocals.mp3 --model small --word_timestamps True --output_format json".
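
For context, here is a minimal sketch of how the per-word timestamps can be read back from that JSON. It assumes the usual Whisper output layout (a top-level "segments" list whose entries carry a "words" list with "word", "start" and "end" fields); the helper name `load_words` and the sample data are made up for illustration.

```python
import json

def load_words(json_text):
    """Flatten Whisper-style JSON into a list of (word, start, end) tuples.

    Assumes each segment has a "words" list, which is only present when
    word timestamps were requested.
    """
    result = json.loads(json_text)
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            # Whisper words usually carry a leading space; strip it.
            words.append((w["word"].strip(), w["start"], w["end"]))
    return words

# Hypothetical sample in the assumed layout, not real recognizer output.
sample = json.dumps({
    "text": " hello world",
    "segments": [{
        "start": 0.0, "end": 1.2, "text": " hello world",
        "words": [
            {"word": " hello", "start": 0.0, "end": 0.5},
            {"word": " world", "start": 0.6, "end": 1.2},
        ],
    }],
})

for word, start, end in load_words(sample):
    print(f"{start:5.2f} {end:5.2f} {word}")
```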

Unfortunately, the results are only somewhat acceptable for English. The base and small models sometimes give poor results, and although the medium and large models work better, they are too slow (I couldn't get the program to use the GPU; as I understand it, this is a known issue: "Failed to launch Triton kernels, likely due to missing CUDA toolkit; falling back to a slower median kernel implementation..."). For songs in Russian or Ukrainian, the results are sometimes simply discouraging: Whisper invents nonexistent words or badly distorts what was sung.

At the same time, I can obtain the lyrics for practically any song from lyrics databases and targeted Google searches. The only downside is that those lyrics come without timestamps for the words.

My question is: is it possible to provide Whisper with a "reference" text as a template that it should match, but receive the result in the form of a JSON file with timestamps?

Perhaps someone can suggest another solution capable of adding timestamps to audio based on existing text?

Thank you in advance!

glangford · 2024-02-15T21:52:24Z

Perhaps someone can suggest another solution capable of adding timestamps to audio based on existing text?

FYI, see WhisperTimeSync

Synchronize Whisper's timestamps over an existing accurate transcription
Input 1: SRT with good timestamps and bad-quality text
Input 2: good text-only, or SRT with good text and bad timestamps
Output: SRT with good text and good timestamps

https://github.com/EtienneAb3d/WhisperTimeSync

1 reply

sensboston Feb 16, 2024
Author

Thanks, definitely will try!

sensboston · 2024-02-22T17:39:54Z

Hmm, @glangford, I just checked the project you mentioned (I was on vacation). Unfortunately, it's not what I really need. For a karaoke implementation, I need a timestamp for every word, not a time range for a sentence (I'm working with the JSON output). It is also an additional step that increases the overall processing time 😞, whereas I want to reduce it for a better user experience. I believe (and hope) that if this option could be added to Whisper itself (once more, it's a perfect and "magical" tool; I really like it and appreciate your work!), it could dramatically decrease processing time and improve accuracy!
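
To make the request concrete, here is a rough sketch of a per-word version of the same synchronization idea, done as post-processing with only the Python standard library. This is not a Whisper feature and not how WhisperTimeSync works internally; it simply matches the reference lyrics against the recognized words with `difflib.SequenceMatcher` and copies timestamps onto the reference words that match. The names `normalize` and `align_reference` and the sample data are hypothetical.

```python
import difflib
import re

def normalize(word):
    # Lowercase and drop punctuation so "darkness," matches "darkness".
    # \w is Unicode-aware in Python 3, so Cyrillic text is handled too.
    return re.sub(r"[^\w']", "", word.lower())

def align_reference(reference_text, recognized):
    """Copy timestamps from recognized words onto reference words.

    recognized: list of (word, start, end) tuples from the recognizer.
    Returns a list of (reference_word, start, end); start and end are
    None for reference words that had no matching recognized word.
    """
    ref_words = reference_text.split()
    ref_norm = [normalize(w) for w in ref_words]
    rec_norm = [normalize(w) for w, _, _ in recognized]

    timed = [[w, None, None] for w in ref_words]
    matcher = difflib.SequenceMatcher(a=ref_norm, b=rec_norm, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = recognized[block.b + k]
            timed[block.a + k][1] = start
            timed[block.a + k][2] = end
    return [tuple(t) for t in timed]

# Hypothetical data: the recognizer misheard "my" as "may".
reference = "Hello darkness, my old friend"
recognized = [
    ("hello", 0.0, 0.4), ("darkness", 0.5, 1.1), ("may", 1.2, 1.4),
    ("old", 1.5, 1.8), ("friend", 1.9, 2.4),
]
for word, start, end in align_reference(reference, recognized):
    print(word, start, end)
```

Words left without timestamps (here "my") would still need to be filled in, for example by interpolating between the neighbouring matched words. The matching itself is cheap compared with transcription, but it only helps when enough of the recognized words are correct to serve as anchors.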
