-
I am trying to use the Chinese model to transcribe a 10-minute audio file with the medium model. When I run the same command three times in a row, the results differ each time, and the overall transcription quality gets worse with each run; the ASR of the first minute in particular degrades a lot. Has anyone else encountered this? I tried fixing the random seed in Python, but the transcription results still change.
-
This happens when the model is unsure about the output (according to the compression_ratio_threshold and logprob_threshold settings). The most common failure mode is that it falls into a repeat loop, which likely triggers the compression_ratio_threshold. The default setting tries temperatures 0, 0.2, 0.4, 0.6, 0.8, 1.0 until it gives up, at which point it is less likely to be in a repeat loop but is also less likely to be correct. You can try adding --temperature_increment_on_fallback None to prevent this behavior. In general, Whisper's performance on Chinese is not very good and would probably need fine-tuning or training from scratch to be usable.
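For reference, here is a minimal Python sketch of the same idea, assuming the standard openai-whisper package: passing a single temperature to transcribe() is the Python-API analogue of --temperature_increment_on_fallback None, and the per-segment fields let you inspect where the fallback would otherwise have kicked in. The model size and file name below are placeholders, not anything from this thread.

```python
import whisper

# Placeholder model size and audio path; adjust to your setup.
model = whisper.load_model("medium")

# A single temperature (instead of the default tuple 0.0 .. 1.0) disables the
# fallback retries, which is what --temperature_increment_on_fallback None
# does on the command line.
result = model.transcribe("audio.wav", language="zh", temperature=0.0)

# Each segment records the decoding temperature actually used plus the stats
# that the compression_ratio_threshold / logprob_threshold checks look at,
# so you can see which parts of the audio the model was unsure about.
for seg in result["segments"]:
    print(
        f"[{seg['start']:7.2f} - {seg['end']:7.2f}] "
        f"temp={seg['temperature']:.1f} "
        f"compression_ratio={seg['compression_ratio']:.2f} "
        f"avg_logprob={seg['avg_logprob']:.2f}  {seg['text']}"
    )
```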
-
That's an interesting observation.
-
I have also noticed this issue with English transcription. Two observations: (1) if I load the model once in the script and call transcribe twice on the same audio input, I get two different transcripts; (2) if I load the model once in the script and rerun the script on the same audio, it also gives different outputs. In both cases, the accuracy of the outputs tends to drop the more times I rerun the script or call transcribe within the same script.
-
I have the same problem. Does anyone have a good solution?
-
I also see the same phenomenon, and I think it relates to how "clean" the audio is: on some files I get the same result every time, and on others I get different results (running the same model on the same audio). Oh, and I use audio files that are way longer than 30 s, and it transcribes them fine without any "add-ons". I use the "small.en" model. I guess I will also try large-v2 to see if it becomes more "deterministic". BTW, when it gives different transcriptions of the same audio, some of them are really good but most are bad. :(
-
I was becoming quite frustrated by slightly different transcription results each time I ran the same code with the same model and the same input. It became deterministic when I set the following parameter: I'm assuming this feature is intended to allow for improved results on subsequent executions. Perhaps it should default to false.
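The specific parameter isn't shown in the reply above, so the sketch below is only a hedged illustration of settings that commonly make the Python API repeatable, not a statement of what this poster actually changed: a single temperature of 0 makes decoding greedy, and turning off conditioning on previous text stops earlier (possibly degraded) output from being fed into later windows. Model size and file name are placeholders.

```python
import whisper

model = whisper.load_model("small.en")  # placeholder model size

# Greedy decoding at a single temperature removes sampling entirely, and
# condition_on_previous_text=False keeps each 30-second window from being
# prompted with the previous window's (possibly degraded) output.
result = model.transcribe(
    "audio.wav",                      # placeholder path
    temperature=0.0,
    condition_on_previous_text=False,
)

print(result["text"])
```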