A hallucination issue with the large-v3 model #2280

yxzzhang · 2024-07-28T08:35:27Z

Hi guys, I'm having a strange hallucination problem while using Whisper-large-v3 for English speech recognition. Extra person names appear at the beginning of the transcript, even though they are not spoken at the beginning of the audio. The name is, however, mentioned later in the audio.
The audio causing the issue is:
audio.zip
Its label is: "And thank you very much good evening everybody and a warm welcome to our next presentation. My name is Katharina Morlang and together with my colleagues hiker hoods and patrick young, please give me your hands."
When I invoke the model as follows, the output starts with "Katharina Moorlach". If you look at the timestamps, the hallucinated name appears between 0 and 0.04 seconds. But if you listen carefully, the audio doesn't begin with that name. And when you look at the whole transcript, the hallucinated name is one that is mentioned later in the audio.
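For anyone who wants to check their own output for the same symptom, here is a minimal sketch of a timestamp sanity check. It is not the invocation referred to above. It assumes word-level timestamps are available as dicts with "word", "start" and "end" keys; the helper name `flag_short_words`, the 0.02 seconds-per-character threshold, and the per-word split of the 0 to 0.04 second span are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def flag_short_words(words: List[Dict], min_seconds_per_char: float = 0.02) -> List[Dict]:
    """Return words whose duration is implausibly short for their length.

    The threshold is a rough, hypothetical heuristic, not a calibrated value.
    """
    flagged = []
    for w in words:
        text = w["word"].strip()
        duration = w["end"] - w["start"]
        # A long word squeezed into a few hundredths of a second is suspicious.
        if text and duration < min_seconds_per_char * len(text):
            flagged.append(w)
    return flagged


# Hypothetical word timestamps modelled on the report: a name placed in the
# first 0.04 seconds, followed by the words that are actually spoken.
example_words = [
    {"word": " Katharina", "start": 0.0, "end": 0.02},
    {"word": " Moorlach", "start": 0.02, "end": 0.04},
    {"word": " And", "start": 0.04, "end": 0.4},
    {"word": " thank", "start": 0.4, "end": 0.7},
]

suspicious = flag_short_words(example_words)
print([w["word"].strip() for w in suspicious])
```

A check like this only flags candidates for manual listening. It does not explain why the name is produced in the first place.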

I also followed the Hugging Face tutorial and invoked the model using AutoModelForSpeechSeq2Seq, AutoProcessor, and pipeline from transformers. But the output still has the hallucinated words "Katharina Morlan and Hülse Tuchel-" at the beginning of the transcript.
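A minimal sketch of that Hugging Face route, assuming the standard pipeline setup for the openai/whisper-large-v3 checkpoint. It is not necessarily the exact code used in this post: the helper names `build_pipeline` and `transcribe`, the file name audio.wav, and the generation options are assumptions.

```python
MODEL_ID = "openai/whisper-large-v3"

# Assumed generation options: force English transcription.
GENERATE_KWARGS = {"language": "english", "task": "transcribe"}


def build_pipeline():
    """Build an automatic-speech-recognition pipeline for MODEL_ID."""
    # Imported lazily so the module can be loaded without torch installed.
    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

    use_cuda = torch.cuda.is_available()
    device = "cuda:0" if use_cuda else "cpu"
    torch_dtype = torch.float16 if use_cuda else torch.float32

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        MODEL_ID,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=True,
        use_safetensors=True,
    )
    model.to(device)
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    return pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        torch_dtype=torch_dtype,
        device=device,
    )


def transcribe(audio_path: str) -> dict:
    """Transcribe one file and return the text plus timestamped chunks."""
    asr = build_pipeline()
    return asr(audio_path, return_timestamps=True, generate_kwargs=GENERATE_KWARGS)
```

Calling `transcribe("audio.wav")` should return a dict with a "text" entry and a "chunks" list, where each chunk carries a "timestamp" pair and its "text". This makes it possible to see where in the audio the leading name is placed.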

When I added "suppress_tokens": "", the output changed but still contained the hallucination "Katharina Morlan:".

I also used the large-v3 model for speech recognition on other English audio, but so far I have only found the problem with this audio. I also tried the large-v2 model, and it didn't have this issue on this audio.

I have carefully checked the GitHub Discussions for other threads that mention Whisper-large-v3 hallucination issues. But it seems that those hallucination problems usually occur in the non-speech parts of the audio. In my test audio, the beginning of the audio is clear: you can clearly hear the word "and", and there is no non-speech part containing only background sound.
There are several interesting phenomena in this issue that I would like to ask about:

Why does the large-v3 model recognize the person's name at the beginning of the audio? It's worth noting that the name appears later in the audio. Is this problem a hallucination of the large-v3 model, or is there a problem with my parameter settings?
Why does the hallucination of the model change after setting "suppress_tokens": ""?

This seems to be a very specific and rare hallucination problem. Can anyone share some thoughts on it, and on how to solve it? Thank you so much for your help!

danghieuan · 2024-08-05T12:53:17Z

I'm facing the same issue with my custom dataset. I tried fine-tuning for more epochs with my custom dataset, but the 'ghost transcript' still hasn't improved.
