why does word_timestamps=True change the transcription output? #2535
Unanswered · whispy-woods asked this question in Q&A
Replies: 0 comments
Hi, this is more of a theoretical question. If I run Whisper several times with the same parameters on the same long audio files, and the audio does not trigger temperature fallbacks (which would introduce randomness), the results are usually quite deterministic for me; at most the differences are incredibly tiny, probably due to float16 rounding errors (?).
However, running it with word_timestamps on and off on the same audio gives quite different text results, even when disabling condition_on_previous_text so that small changes cannot add up to bigger changes over time via the "previous prompt".
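For context, this is roughly the comparison I mean. A minimal sketch assuming the openai-whisper Python API (whisper.load_model, model.transcribe); the model size and audio path are placeholders:

```python
def make_kwargs(word_timestamps: bool) -> dict:
    """Decoding options that differ only in word_timestamps."""
    return {
        "temperature": 0.0,                   # greedy decoding, no fallback randomness
        "condition_on_previous_text": False,  # no carry-over via the previous prompt
        "word_timestamps": word_timestamps,
    }

# With whisper installed, the two runs would look like:
#   import whisper
#   model = whisper.load_model("small")
#   text_on  = model.transcribe("audio.wav", **make_kwargs(True))["text"]
#   text_off = model.transcribe("audio.wav", **make_kwargs(False))["text"]
# text_on and text_off should, as far as I understand, be identical -- but they are not.
```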
I am curious as to why. As far as I can tell, the word_timestamps option is not even passed to the model.decode() function, nor does it appear in there. I also cannot find any code that alters the slicing of the audio chunks / the seek position based on word_timestamps, as long as hallucination_silence_threshold stays disabled.
Does retrieving the attention weights in find_alignment() alter some internal state of the model that causes it to give different results? But at that point the output segments have already been formed. I also see the text differences if I manually slice a long audio file into 30-second files with ffmpeg and feed them, in separate Python calls, to a freshly loaded Whisper instance for a single transcription with word_timestamps either on or off.
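For reference, the manual slicing I mentioned uses ffmpeg's segment muxer; a sketch of the command I run (filenames are placeholders):

```python
import subprocess

def slice_cmd(src: str, seconds: int = 30) -> list[str]:
    """Build an ffmpeg command that cuts src into fixed-length chunks."""
    return [
        "ffmpeg", "-i", src,
        "-f", "segment", "-segment_time", str(seconds),
        "-c", "copy",                 # stream copy, no re-encoding
        "chunk_%03d.wav",
    ]

# subprocess.run(slice_cmd("long.wav"), check=True)  # needs ffmpeg on PATH
```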
Am I overlooking something? I just want to understand how the outputs can differ. If I had to guess about the nature of the differences, I would say transcripts with word_timestamps=True "overlook" more content overall, but it is hard to tell.
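To make "overlooks more content" measurable rather than a gut feeling, the two transcripts could be compared word by word. A small sketch using Python's difflib; the sample strings are made up, not real Whisper output:

```python
import difflib

def transcript_diff(text_a: str, text_b: str):
    """Word-level similarity ratio and the words present only in text_a."""
    a, b = text_a.split(), text_b.split()
    sm = difflib.SequenceMatcher(a=a, b=b)
    missing = [w for tag, i1, i2, _, _ in sm.get_opcodes()
               if tag in ("delete", "replace") for w in a[i1:i2]]
    return sm.ratio(), missing

ratio, missing = transcript_diff(
    "the quick brown fox jumps over the lazy dog",  # word_timestamps=False (made up)
    "the quick fox jumps over the dog",             # word_timestamps=True (made up)
)
# → (0.875, ['brown', 'lazy'])
```

If the word_timestamps=True runs really do drop content, the missing list should be systematically longer in that direction than the other way around.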
Cheers!