Medium and Large models hallucination #678

DavraYoung · 2022-12-13T16:17:07Z

DavraYoung
Dec 13, 2022

Hi, I am trying to train Whisper Medium model on my custom dataset. (Using this guide)

I did training on small models and it achieves very good results with WER 10 on my custom dataset.
However using this dataset during medium model training seems to enter failure mode.

By hallucination I mean increasing WER with repeating words. For example my Medium model after training on 500 hours of audio gives me:
Truth: faqat basseyn butun qolgan
Prediction: faqat basseyni butun qolgani vaqtlari vaqtlari vaqtlari vaqtlari vaqtlari vaqtlari vaqtlar

word vaqtlar is not present in audio, nor in the original text, the model just lagged on it somehow.

I checked audio itself, and its fine, there are some background noises, but the voice is very recognizable, and in fact my finetuned small model recognizes it correctly.

The model behaves better with smaller generation_max_length setting, but cannot predict long audios.

The dataset itself contains small audios within 10 seconds range.

I tried:

reducing generation_max_length to the smallest minimum possible
reducing learning rate and increasing batch size
setting generation_num_beams to 3 or 5
removing all punctuation and other characters that are non word related

still no good.

How to solve such issue?

Answered by DavraYoung

Dec 17, 2022

For any one facing such issue in transformers.

Check your dataset, if you have too many incorrect clips (with silence or incorrect audio format, sample rate should be exactly 16000)
Check your clips length, it should be cut if you have more than 30 seconds, otherwise you may face issues with padding and hallucination
Use latest transformers version
Reduce learning rate and batch size (in my case big batch size and learning rate(1e-4 and 80batch size with accumulation 2) made generalization worse and sometimes even caused issues with hallucination, not sure why). In my case the model did very well with 1e-5/2 learning rate and batch size 32 on medium model.

In my case, the dataset was fi…

View full answer

KaiserChr · 2022-12-13T17:03:06Z

KaiserChr
Dec 13, 2022

I just posted about this a minute ago, I had the same problem.
Maybe try my (possible) fix described in #679 and let me know if it helps!

0 replies

DavraYoung · 2022-12-17T10:59:44Z

DavraYoung
Dec 17, 2022
Author

For any one facing such issue in transformers.

Check your dataset, if you have too many incorrect clips (with silence or incorrect audio format, sample rate should be exactly 16000)
Check your clips length, it should be cut if you have more than 30 seconds, otherwise you may face issues with padding and hallucination
Use latest transformers version
Reduce learning rate and batch size (in my case big batch size and learning rate(1e-4 and 80batch size with accumulation 2) made generalization worse and sometimes even caused issues with hallucination, not sure why). In my case the model did very well with 1e-5/2 learning rate and batch size 32 on medium model.

In my case, the dataset was fine, and I managed to overcome the problem by simply updating transformers version to 4.25 and finetuning parameters

Previously I used 4.24, which was causing this, after updating it went well. I suppose the issue was in tokenizer, but I am not sure why exactly

P.S. dont forget to clear your dataset cache and rerun feature_extraction and tokenization again. In my case using newest version with old cache did not work either

11 replies

a-nahar Dec 23, 2023

Thanks @DavraYoung , got your point. Will look at it.

a-nahar Jan 5, 2024

Hello @DavraYoung,

Additionally, I experimented with fine-tuning the large-v2 model and found it surprisingly free from hallucinations post-fine-tuning. However, when comparing the results of the fine-tuned model with the pretrained large-v2, it's not conclusively better. Roughly 50% of the testing samples, the fine-tuned model outperforms the pretrained large-v2, while the remaining 50% it underperforms. Through a macro analysis of the underperforming examples, I noticed a pattern – these instances tend to have shorter transcriptions( it could be because the <|endoftext|>token is getting predicted in the middle of the audio).

Any suggestions from your end?

Regarding the large-v2 model, I found the most optimal checkpoint at 4000 steps, utilizing a total of 250+ audio data points. Surprisingly, after this point, the evaluation BLEU (higher scores are better) began to decrease, could be bcoz of the overfitting.

may be hallucination is a tendency of the small and medium model (I may be wrong here).

DavraYoung Jan 5, 2024
Author

@a-nahar if the model predicts endoftext too early, most probably your dataset contains mostly short audios and the model is overfitted to predict no more than certain number of tokens.

I would suggest:

Try to mix your samples with longer samples (e.g. increase the number of long text audios), or mix it with long audios dataset
Try to experiment with logits processors. There is a way to change the probability during token generation, theoretically, you can decrease the probability of endoftext token so that it happens less often.

Stanwang1210 Jan 24, 2024

Hi @DavraYoung
I'm also working on a project to do low-resource fine-tuning whisper with lora.
I also face some hallucination problem recently.
I found that when I fine-tune whisper-medium to German (A seen languages for whisper) with one-hour of data, the ASR performance of German on test set degrades (CER : 5.4% -> 21.7%), which seems weird.
During the log file, I did find some hallucination issues (some of the samples). I'm wondering whether there is bug in my code.
The transformer version I used is 4.31.0, batch size 8 with accumaltion of 4, lr = 5e-5 with 1500 steps of warmup. It seems that maybe lower lr could be tried first?
Also, do you think the problem be the limited data?

luvwinnie Jul 23, 2024

I'm having same problem with small Japanese dataset, what parameters should we be careful? Anyone solved the hallucination?

toanhuynhnguyen · 2024-10-19T16:20:27Z

toanhuynhnguyen
Oct 19, 2024

Based on this guide https://huggingface.co/blog/fine-tune-whisper, I tried to fine-tune "small" and "large-v3" models.

The fine-tuned "small" model works normally.
But the fine-tuned "large-v3" model works poorly on non-English audio files such as Chinese audio files, it auto-translates Chinese to English though I specified transcribing in Chinese, not translating. Have you faced this issue, and can you give me advice? Thank you so much.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Medium and Large models hallucination #678

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments 11 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Medium and Large models hallucination #678

Uh oh!

DavraYoung Dec 13, 2022

Replies: 3 comments · 11 replies

Uh oh!

KaiserChr Dec 13, 2022

Uh oh!

Uh oh!

DavraYoung Dec 17, 2022 Author

Uh oh!

a-nahar Dec 23, 2023

Uh oh!

a-nahar Jan 5, 2024

Uh oh!

DavraYoung Jan 5, 2024 Author

Uh oh!

Stanwang1210 Jan 24, 2024

Uh oh!

luvwinnie Jul 23, 2024

Uh oh!

toanhuynhnguyen Oct 19, 2024

DavraYoung
Dec 13, 2022

Replies: 3 comments 11 replies

KaiserChr
Dec 13, 2022

DavraYoung
Dec 17, 2022
Author

DavraYoung Jan 5, 2024
Author

toanhuynhnguyen
Oct 19, 2024