Using whisper for matching an exact text vs. recognition #390
Replies: 2 comments 2 replies
-
If I understand correctly, you could simply check the resulting transcription in Python rather than trying to make Whisper do it. There are also plenty of diff options to identify exactly what is different between the expected text and the result. Something like:
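A minimal sketch of that kind of check, assuming the transcription is already available as a plain string (for example the `"text"` field of the result Whisper returns). The helper names `normalize`, `phrase_matches` and `word_diff` are made up for illustration; only the standard library is used.

```python
import difflib
import string


def normalize(text):
    """Lowercase, drop punctuation, and split into a list of words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def phrase_matches(expected, transcribed):
    """Simple pass/fail: True only if the word sequences are identical."""
    return normalize(expected) == normalize(transcribed)


def word_diff(expected, transcribed):
    """List the word-level differences as (operation, expected, heard)."""
    exp, got = normalize(expected), normalize(transcribed)
    matcher = difflib.SequenceMatcher(a=exp, b=got, autojunk=False)
    return [
        (tag, exp[i1:i2], got[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


print(phrase_matches("The Grass Is Always Greener", "the grass is always greener."))
print(word_diff("The Grass Is Always Greener", "The Bass is Always Greener"))
```

Normalizing first means differences in casing or punctuation in the transcription do not count as a mismatch, while `word_diff` reports which words were substituted, dropped or added.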
Alternatively, assuming you are working on something related to language education, there are a few models (e.g. GPT-3) that can repair broken English if instructed to.
-
I appreciate the need for this is now probably in the long-distant past, but I had a similar goal and found https://github.com/linto-ai/whisper-timestamped, where it seems the confidence scores could be used to allow a bit more human wiggle room than a direct string comparison.
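One way the confidence idea could look, as a rough sketch. It assumes each recognized word arrives as a dict with `"text"` and `"confidence"` keys, which is how I understand whisper-timestamped reports words; the function name `judge`, the three verdicts and the `0.6` threshold are all invented for this example and would need tuning.

```python
import string


def _norm(word):
    """Lowercase a word and strip surrounding whitespace and punctuation."""
    return word.lower().strip().strip(string.punctuation)


def judge(expected, words, min_confidence=0.6):
    """Return "pass", "fail" or "uncertain" for a list of recognized words.

    words: list of {"text": str, "confidence": float} (assumed shape).
    """
    target = [_norm(w) for w in expected.split()]
    heard = [(_norm(w["text"]), w["confidence"]) for w in words]
    if len(heard) != len(target):
        return "fail"  # a word was dropped or added
    verdict = "pass"
    for want, (got, conf) in zip(target, heard):
        if got != want and conf >= min_confidence:
            return "fail"  # confidently heard a different word
        if got != want or conf < min_confidence:
            verdict = "uncertain"  # low confidence: worth a retry or human check
    return verdict
```

The wiggle room comes from the middle verdict: a mismatch on a word the model was unsure about does not fail outright, and a match on a low-confidence word is not trusted blindly either.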
-
Hopefully this makes sense? Looking for some ideas/approach for how one would use whisper to do matching against an expected piece of text vs. just getting the recognition of what is said. In some sense, for this use case, it doesn't matter what they said (recognition) but rather just that it matches a phrase/sentence. (my scenario would be real-time)
Is this something that can be fine-tuned (e.g. training the model a bit further on recordings of new voices repeating the exact text)?
I know when using the `initial_prompt` it sometimes thinks the phrase is said when there is only silence, so a lower `no_speech_threshold` should fix it, but is there anything else to try?
So it could just be a simple pass/fail if the phrase is wrong. For the phrase "The Grass Is Always Greener", if they said "The Bass is Always Greener" or "The Grass is Always Redder", it would fail.
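On the silence problem specifically, one extra thing to try is filtering segments after transcription. As far as I can tell, Whisper's built-in skip only triggers when a segment's `no_speech_prob` exceeds `no_speech_threshold` and its `avg_logprob` is also below `logprob_threshold`, so a prompted phrase that the model is confident about can slip through. The sketch below drops segments on `no_speech_prob` alone; the helper names, the `0.5` cut-off and the `clip.wav` file name are assumptions for illustration.

```python
def speech_segments(segments, max_no_speech_prob=0.5):
    """Keep only segments that are not flagged as probable silence."""
    return [s for s in segments if s["no_speech_prob"] <= max_no_speech_prob]


def transcript_text(segments):
    """Join the text of the remaining segments into one string."""
    return " ".join(s["text"].strip() for s in segments)


# Typical use with openai-whisper (not run here; "clip.wav" is a placeholder):
#   import whisper
#   model = whisper.load_model("base")
#   result = model.transcribe("clip.wav", initial_prompt="The grass is always greener")
#   heard = transcript_text(speech_segments(result["segments"]))
```

If `heard` comes back empty, that can be treated as "nothing was said" rather than as a match.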
Maybe I just need the confidence level of each word like in #284, and just check each one is high enough? But then I wonder if the `initial_prompt` would bias the logits (if it even does that) too much even if someone said the wrong thing?