why does word_timestamps=True change the transcription output? #2535
Unanswered · whispy-woods asked this question in Q&A
Replies: 0 comments
Hi, this is more of a theoretical question. If I run Whisper several times with the same parameters on the same long audio files, and the audio does not trigger temperature fallbacks (which would introduce randomness), the results are usually quite deterministic for me; at most the differences are incredibly tiny, probably due to float16 rounding errors (?).
However, running it with word_timestamps on and off on the same audio gives quite different text results, even when disabling condition_on_previous_text so that small changes cannot add up to bigger changes over time via the "previous prompt".
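For context, this is roughly the comparison I mean. A minimal sketch assuming the openai-whisper Python API (whisper.load_model, model.transcribe); the model size and audio path are placeholders:

```python
def make_kwargs(word_timestamps: bool) -> dict:
    """Decoding options that differ only in word_timestamps."""
    return {
        "temperature": 0.0,                   # greedy decoding, no fallback randomness
        "condition_on_previous_text": False,  # no carry-over via the previous prompt
        "word_timestamps": word_timestamps,
    }

# With whisper installed, the two runs would look like:
#   import whisper
#   model = whisper.load_model("small")
#   text_on  = model.transcribe("audio.wav", **make_kwargs(True))["text"]
#   text_off = model.transcribe("audio.wav", **make_kwargs(False))["text"]
# text_on and text_off should, as far as I understand, be identical -- but they are not.
```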
I am curious as to why. As far as I can tell, the word_timestamps option is not even passed to the model.decode() function, nor does it appear in there. I also cannot find any code that alters the slicing of the audio chunks / the seek position based on word_timestamps, as long as hallucination_silence_threshold stays disabled.
Does retrieving the attention weights in find_alignment() alter some internal state of the model that causes it to give different results? But at that point the output segments have already been formed. I also see the text differences if I manually slice a long audio file into 30-second files with ffmpeg and feed them, in separate Python calls, to a freshly loaded Whisper instance for a single transcription with word_timestamps either on or off.
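For reference, the manual slicing I mentioned uses ffmpeg's segment muxer; a sketch of the command I run (filenames are placeholders):

```python
import subprocess

def slice_cmd(src: str, seconds: int = 30) -> list[str]:
    """Build an ffmpeg command that cuts src into fixed-length chunks."""
    return [
        "ffmpeg", "-i", src,
        "-f", "segment", "-segment_time", str(seconds),
        "-c", "copy",                 # stream copy, no re-encoding
        "chunk_%03d.wav",
    ]

# subprocess.run(slice_cmd("long.wav"), check=True)  # needs ffmpeg on PATH
```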
Am I overlooking something? I just want to understand how the outputs can differ. If I had to guess about the nature of the differences, I would say transcripts with word_timestamps=True "overlook" more content overall, but it is hard to tell.
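To make "overlooks more content" measurable rather than a gut feeling, the two transcripts could be compared word by word. A small sketch using Python's difflib; the sample strings are made up, not real Whisper output:

```python
import difflib

def transcript_diff(text_a: str, text_b: str):
    """Word-level similarity ratio and the words present only in text_a."""
    a, b = text_a.split(), text_b.split()
    sm = difflib.SequenceMatcher(a=a, b=b)
    missing = [w for tag, i1, i2, _, _ in sm.get_opcodes()
               if tag in ("delete", "replace") for w in a[i1:i2]]
    return sm.ratio(), missing

ratio, missing = transcript_diff(
    "the quick brown fox jumps over the lazy dog",  # word_timestamps=False (made up)
    "the quick fox jumps over the dog",             # word_timestamps=True (made up)
)
# → (0.875, ['brown', 'lazy'])
```

If the word_timestamps=True runs really do drop content, the missing list should be systematically longer in that direction than the other way around.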
Cheers!