Using whisper for matching an exact text vs. recognition #390

enigmatichiccup · 2022-10-21T20:41:57Z

Hopefully this makes sense. I'm looking for ideas on how one would use Whisper to match speech against an expected piece of text, rather than just getting the recognition of what was said. For this use case, it doesn't really matter what they said (recognition), only whether it matches a given phrase or sentence. (My scenario would be real-time.)

Is this something that can be fine-tuned (for example, training the model one extra time on a new voice repeating the exact text)?

I know that when using initial_prompt it sometimes thinks the phrase was said when there is only silence, so a lower no_speech_threshold should help, but is there anything else to try?

import whisper
model = whisper.load_model("base")  # any model size works here
result = model.transcribe("audio.mp3", initial_prompt="exact phrase here")
options = whisper.DecodingOptions(fp16=False, prompt="exact phrase here")
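
One way to guard against the phrase being "heard" in silence is to look at the per-segment no_speech_prob that transcribe returns, instead of relying only on the threshold. A minimal sketch, assuming the result dict carries the usual `segments` list with a `no_speech_prob` field; the helper names and the 0.5 cutoff are made up for illustration and would need tuning:

```python
def speech_segments(result, max_no_speech_prob=0.5):
    """Keep only segments Whisper considers likely to contain speech."""
    return [
        seg for seg in result.get("segments", [])
        if seg.get("no_speech_prob", 0.0) <= max_no_speech_prob
    ]


def spoken_text(result, max_no_speech_prob=0.5):
    """Join the text of the segments that pass the silence check."""
    kept = speech_segments(result, max_no_speech_prob)
    return "".join(seg["text"] for seg in kept).strip()


# Hand-written stand-in for the dict returned by model.transcribe(...)
fake_result = {
    "text": " exact phrase here",
    "segments": [{"text": " exact phrase here", "no_speech_prob": 0.93}],
}
print(repr(spoken_text(fake_result)))  # the only segment is dropped as silence
```

With a real result, `spoken_text(result)` could be compared against the expected phrase in place of `result["text"]`.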

So it could be a simple pass/fail: fail if the phrase is wrong. For the phrase "The Grass Is Always Greener", if they said "The Bass Is Always Greener" or "The Grass Is Always Redder", it would fail.

Maybe I just need the confidence level of each word, as in #284, and check that each one is high enough? But then I wonder whether initial_prompt would bias the logits too much (if it even does that), even when someone said the wrong thing.
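
The per-word check could look roughly like the sketch below. It assumes a list of word entries shaped like `{"word": ..., "probability": ...}`, which is what newer openai-whisper releases attach to each segment when `transcribe` is called with `word_timestamps=True`; the function names and the 0.6 threshold are hypothetical. It does not settle whether initial_prompt inflates the probabilities, so the threshold would have to be tested with and without a prompt.

```python
import string


def normalize(word):
    """Lowercase a word and drop surrounding whitespace and punctuation."""
    return word.strip().strip(string.punctuation).lower()


def phrase_matches(words, expected, min_prob=0.6):
    """Pass only if every word matches the phrase and is confident enough.

    words: list of {"word": str, "probability": float} entries.
    """
    expected_words = [normalize(w) for w in expected.split()]
    if len(words) != len(expected_words):
        return False
    for got, want in zip(words, expected_words):
        if normalize(got["word"]) != want:
            return False  # a different word was recognised
        if got["probability"] < min_prob:
            return False  # right word, but the model was unsure
    return True


# Hand-written example data, not real model output.
heard = [
    {"word": " The", "probability": 0.98},
    {"word": " Bass", "probability": 0.91},
    {"word": " is", "probability": 0.99},
    {"word": " Always", "probability": 0.97},
    {"word": " Greener.", "probability": 0.95},
]
print(phrase_matches(heard, "The Grass Is Always Greener"))  # False
```

With a real transcription, the word list would come from something like `[w for seg in result["segments"] for w in seg["words"]]`.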

richardburleigh · 2022-10-24T06:35:12Z

If I understand correctly, you could simply check the resulting transcription in Python rather than trying to make Whisper do it. There are also plenty of diff options to identify exactly what is different between the expected text and the result.

Something like:

phrase = 'exact phrase here'
# Whisper's text often starts with a space, so strip it before comparing
result = model.transcribe("audio.mp3")['text'].strip()
if result == phrase:
    print("Success")
else:
    print("Fail")
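
For the diff idea, the standard library's difflib is one option. The sketch below compares the two texts word by word after lowercasing and removing punctuation, so that casing or a trailing period does not cause a failure; the helper names are invented for this example and the normalisation is deliberately crude.

```python
import difflib
import string


def words_of(text):
    """Lowercase, remove ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_diff(expected, heard):
    """Return the non-matching word spans as (tag, expected, heard) tuples."""
    a, b = words_of(expected), words_of(heard)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


diff = word_diff("The Grass Is Always Greener", " The bass is always greener.")
print(diff)  # [('replace', ['grass'], ['bass'])]
print("Success" if not diff else "Fail")
```

An empty list means the phrase matched; anything else shows which words were replaced, inserted, or dropped.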

Alternatively, assuming you are working on something related to language education, there are a few models (e.g. GPT-3) that can repair broken English if instructed to.

2 replies

enigmatichiccup Oct 24, 2022
Author

Oh yeah, doing result == phrase is definitely the idea. I guess my question is more about whether there's a way to modify Whisper so that that kind of check/diff can be made more robust (I'm happy to modify the code even though I'm not that familiar with the codebase or Python). I think background noise, accents, non-native speakers, etc. would cause issues. However, it's only checking for the same phrase.

I think if it's just a single speaker, one could record them saying the phrase even just once and maybe fine-tune the model on that? I'm not familiar with training or fine-tuning, but I assume that would help? I almost feel like over-fitting would be "good" in this particular case, since I just want to know they are saying the exact same thing again, and not anything else. This is also why I suggested per-word confidence levels, since using == feels like it would result in a lot of false negatives.

richardburleigh Oct 25, 2022

It doesn't directly answer your question, but it would be interesting to fine-tune Whisper on some non-native speaker datasets. There's some fine-tuning code here.

You could also feed Whisper results into an English grammar checker (just a random repo I found, there's probably something better).

benferns · 2023-12-11T06:06:52Z

I appreciate that the need for this is probably now in the distant past, but I had a similar goal and found this: https://github.com/linto-ai/whisper-timestamped where it seems the confidence scores could be used to allow a bit more human wiggle room than a direct string comparison.

0 replies
