How to deal with wrong homonym vocabulary in ASR results #1595
Unanswered
LeeHaha314 asked this question in Q&A
Replies: 1 comment
@LeeHaha314, have you considered WhisperBiasing as a possible solution?
Hi all! I'm currently using Whisper for Chinese ASR, and I find that the results contain many homonym errors, i.e. words transcribed with the right pronunciation but the wrong characters.
At first, I tried the official suggestion of using initial_prompt to deal with this. I collected the special nouns and collocations in my dataset as a vocabulary list, joined them into a string, and passed it via the initial_prompt parameter, but it didn't work as expected. Every time I changed the prompt after an update to my vocabulary list, the result for the same audio changed significantly. It also turned out that transcriptions with such an initial prompt were more likely to contain hallucinated text and fall into repetition loops... I struggled with this for a while and still have no idea how to fix it.
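For reference, a minimal sketch of this setup with the openai-whisper API (the vocabulary and audio path are placeholders, and `condition_on_previous_text=False` is only a knob that sometimes reduces repetition, not something I've verified):

```python
import whisper

# Hypothetical domain vocabulary joined into a single prompt string.
domain_vocab = ["公式", "供应链", "区块链"]
prompt = "，".join(domain_vocab)

model = whisper.load_model("medium")
result = model.transcribe(
    "audio.wav",                       # placeholder audio path
    language="zh",
    initial_prompt=prompt,
    condition_on_previous_text=False,  # sometimes reduces repetition loops, at the cost of context
)
print(result["text"])
```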
Then I planned to do some post-processing. For Chinese, it is unreasonable to apply methods like Levenshtein distance on characters, since words with similar pronunciation can consist of completely different characters (see the pronunciation example below). Phonetic matching might be a solution, but figuring out how to combine it with a local personal vocabulary also takes time.
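A quick illustration with pypinyin: two words that share exactly the same pinyin can have no characters in common, so character-level edit distance says nothing useful about homonym errors:

```python
from pypinyin import lazy_pinyin

# 公式 and 攻势 share no characters, but their pinyin is identical (gong shi),
# so a character-level Levenshtein distance of 2 is misleading here.
print(lazy_pinyin("公式"))  # ['gong', 'shi']
print(lazy_pinyin("攻势"))  # ['gong', 'shi']
```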
Then I found some phonetic algorithms for indexing Chinese characters by sound, and I checked Whisper's results with word_timestamps to analyze the word segments, thinking they might serve as the base input for further processing. However, most of the word segments for Chinese are really just character segments (e.g. 公司 is split into two separate segments, 公 and 司), which means I can't match words from my personal vocabulary list against the word segments directly. I still need an additional matching strategy for correction.
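For illustration, a minimal sketch of what the word-timestamp output looks like and how the characters could be re-joined and re-segmented (assuming jieba as the tokenizer; the audio path is a placeholder):

```python
import jieba
import whisper

model = whisper.load_model("medium")
result = model.transcribe("audio.wav", language="zh", word_timestamps=True)

# For Chinese, the "words" entries are mostly single characters (e.g. 公 / 司),
# so join them back together and re-segment with a Chinese word tokenizer instead.
chars = [w["word"] for seg in result["segments"] for w in seg["words"]]
text = "".join(chars).strip()
words = list(jieba.cut(text))
print(words)
```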
Are there any good ideas or established practices for something like this?
UPDATE: I now use another tokenizer with better performance on Chinese text, plus a tool that computes the phonetic distance between two words based on pinyin, to match the source text against my vocabulary list. I wrote the post-processing logic myself, and to some extent it works, with some limitations. I also have to maintain a whitelist to manually skip words that are merely similar to entries in my vocabulary list.
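Roughly, the post-processing looks like this (a simplified sketch, with jieba and pypinyin as stand-ins for the tokenizer and pinyin tool; the vocabulary, whitelist, and threshold are placeholders):

```python
import jieba
from pypinyin import lazy_pinyin

VOCAB = ["公式", "供应链", "区块链"]   # hypothetical personal vocabulary
WHITELIST = {"攻势"}                    # similar-sounding words that must NOT be rewritten
MAX_PINYIN_DIST = 1                     # placeholder threshold

def pinyin_dist(a: str, b: str) -> int:
    """Levenshtein distance between the pinyin syllable sequences of two words."""
    pa, pb = lazy_pinyin(a), lazy_pinyin(b)
    dp = [[0] * (len(pb) + 1) for _ in range(len(pa) + 1)]
    for i in range(len(pa) + 1):
        dp[i][0] = i
    for j in range(len(pb) + 1):
        dp[0][j] = j
    for i in range(1, len(pa) + 1):
        for j in range(1, len(pb) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (pa[i - 1] != pb[j - 1]))
    return dp[len(pa)][len(pb)]

def correct(text: str) -> str:
    out = []
    for word in jieba.cut(text):
        if word in WHITELIST or word in VOCAB:
            out.append(word)
            continue
        # Rewrite the word to the closest vocabulary entry if the pinyin distance is small enough.
        best = min(VOCAB, key=lambda v: pinyin_dist(word, v))
        out.append(best if pinyin_dist(word, best) <= MAX_PINYIN_DIST else word)
    return "".join(out)

print(correct("这个攻势很重要"))  # 攻势 is whitelisted, so it is not rewritten to 公式
```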
Looking forward to other inspired ideas lol