-
I am trying to use the Chinese model to transcribe a 10-minute audio file with the medium model. When I run the same command three times in a row, the results differ each time, and the overall transcription quality gets worse with each run; the ASR of the first minute in particular degrades a lot. Has anyone else encountered this? I tried fixing the random seed in Python, but the transcription results still change.
-
This happens when the model is unsure about the output (according to the compression_ratio_threshold and logprob_threshold settings). The most common failure mode is that it falls into a repeat loop, which likely triggers the compression_ratio_threshold. The default setting tries temperatures 0, 0.2, 0.4, 0.6, 0.8, 1.0 until it gives up, at which point it is less likely to be in a repeat loop but is also less likely to be correct. You can try adding --temperature_increment_on_fallback None to prevent this behavior. In general, Whisper's performance on Chinese is not very good and would probably need fine-tuning or training from scratch to be usable.
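For reference, here is a minimal Python sketch of the same idea, assuming the standard openai-whisper package: passing a single temperature to transcribe() is the Python-API analogue of --temperature_increment_on_fallback None, and the per-segment fields let you inspect where the fallback would otherwise have kicked in. The model size and file name below are placeholders, not anything from this thread.

```python
import whisper

# Placeholder model size and audio path; adjust to your setup.
model = whisper.load_model("medium")

# A single temperature (instead of the default tuple 0.0 .. 1.0) disables the
# fallback retries, which is what --temperature_increment_on_fallback None
# does on the command line.
result = model.transcribe("audio.wav", language="zh", temperature=0.0)

# Each segment records the decoding temperature actually used plus the stats
# that the compression_ratio_threshold / logprob_threshold checks look at,
# so you can see which parts of the audio the model was unsure about.
for seg in result["segments"]:
    print(
        f"[{seg['start']:7.2f} - {seg['end']:7.2f}] "
        f"temp={seg['temperature']:.1f} "
        f"compression_ratio={seg['compression_ratio']:.2f} "
        f"avg_logprob={seg['avg_logprob']:.2f}  {seg['text']}"
    )
```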
-
That's an interesting observation.
-
I have also noticed this issue with English transcription. Two observations: (1) if I load the model once in the script and call transcribe twice on the same audio input, I get two different transcripts; (2) if I load the model once in the script and rerun the script on the same audio, it also gives different outputs. In both cases, the accuracy of the outputs tends to drop the more times I rerun the script or call transcribe within the same script.
-
I have the same problem. Does anyone have a good solution?
-
I also see the same phenomenon, and I think it relates to how "clean" the audio is: on some files I get the same result every time, and on others I get different results (running the same model on the same audio). Oh, and I use audio files that are way longer than 30 s, and it transcribes them fine without any "add-ons". I use the "small.en" model. I guess I will also try large-v2 to see if it becomes more "deterministic". BTW, when it gives different transcriptions of the same audio, some of them are really good but most are bad. :(
-
I was becoming quite frustrated by slightly different transcription results each time I ran the same code with the same model and the same input. It became deterministic when I set the following parameter: I'm assuming this feature is intended to allow for improved results on subsequent executions. Perhaps it should default to false.
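The specific parameter isn't shown in the reply above, so the sketch below is only a hedged illustration of settings that commonly make the Python API repeatable, not a statement of what this poster actually changed: a single temperature of 0 makes decoding greedy, and turning off conditioning on previous text stops earlier (possibly degraded) output from being fed into later windows. Model size and file name are placeholders.

```python
import whisper

model = whisper.load_model("small.en")  # placeholder model size

# Greedy decoding at a single temperature removes sampling entirely, and
# condition_on_previous_text=False keeps each 30-second window from being
# prompted with the previous window's (possibly degraded) output.
result = model.transcribe(
    "audio.wav",                      # placeholder path
    temperature=0.0,
    condition_on_previous_text=False,
)

print(result["text"])
```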