Several issues introduced in version 20230306 on audio with silences (repeated text, segment "id" not unique/increasing) #1058

Jeronymous · 2023-03-08T11:37:02Z

This issue looks more general than #1046 and it might be related to #730 (however this was after v20230306).

Take this audio: bonjour_vous_allez_bien.mp3
where "Bonjour, est-ce que vous allez bien?" (French) is said twice, with a long delay in between.

For this command:

whisper --model tiny --language fr \
    --beam_size None --temperature_increment_on_fallback None --best_of None \
    bonjour_vous_allez_bien.mp3

Below are the differences between the outputs of the previous version 20230124 (left), the new version 20230306 (middle), and 20230306 with --word_timestamps True (right). We can see two issues:

20230306 produces a lot of hallucinated repetitions, and this is much worse with --word_timestamps True.
The "id" of each segment is not necessarily unique in 20230306. The same value can appear several times in a row (it gets stuck at "0" in the example below).

This became obvious when I used greedy decoding, but a similar thing can be observed with fewer options (with beam search).

whisper --model tiny --language fr bonjour_vous_allez_bien.mp3

On this command, the differences between outputs of previous version 20230124 (left), new version 20230306 (middle) and 20230306 with --word_timestamps True (right) are:

I also noticed that the following command takes particularly long to run:

whisper --model tiny --language fr --word_timestamps True /home/jlouradour/src/whisper-timestamped/tests/data/bonjour_vous_allez_bien.mp3

It takes 1 minute with 4 CPUs, whereas it takes 6 sec without --word_timestamps True, or 16 sec if adding the options --beam_size None --temperature_increment_on_fallback None --best_of None.

Answered by jongwook

Mar 8, 2023

This was an issue where the new transcribe() was mishandling the all_tokens variable. That made the prompts more prone to repetitions, and it also caused the discrepancy between the top-level "text" field and the segment-level "text" fields in the JSON response.

I have a fix merged in #1060, so hopefully this resolves your issue! Please let me know if it continues.

Jeronymous · 2023-03-08T12:02:57Z

Author

I've just tested 20230307 and the two problems are still here:

the "id" is not unique (and not necessarily increasing)
the "text" field of the transcription includes hallucinated repetitions, but those are actually not present in the segments (so if we join the "text" of all the segments, we get a result that is different from the main "text")
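
Both symptoms can be checked mechanically on the dictionary returned by transcribe() (or loaded from the .json output). The following is only a minimal sketch: the helper names find_id_problems and text_matches_segments are hypothetical, and the result dictionary is a hand-written stand-in that assumes only the "text", "segments", "id" fields mentioned in this thread, not real Whisper output.

```python
from typing import Any, Dict, List


def find_id_problems(segments: List[Dict[str, Any]]) -> List[str]:
    """Report segment ids that repeat or fail to increase (hypothetical helper)."""
    problems = []
    seen = set()
    previous = None
    for position, segment in enumerate(segments):
        seg_id = segment["id"]
        if seg_id in seen:
            problems.append(f"position {position}: id {seg_id} is repeated")
        elif previous is not None and seg_id <= previous:
            problems.append(f"position {position}: id {seg_id} does not increase")
        seen.add(seg_id)
        previous = seg_id
    return problems


def text_matches_segments(result: Dict[str, Any]) -> bool:
    """Compare the top-level "text" with the concatenated segment texts."""
    joined = "".join(segment["text"] for segment in result["segments"])
    return result["text"].strip() == joined.strip()


# Hand-written stand-in for a transcription result (not real Whisper output).
result = {
    "text": " Bonjour, est-ce que vous allez bien?" * 3,
    "segments": [
        {"id": 0, "text": " Bonjour, est-ce que vous allez bien?"},
        {"id": 0, "text": " Bonjour, est-ce que vous allez bien?"},
    ],
}

print(find_id_problems(result["segments"]))
print(text_matches_segments(result))
```

On this stand-in, the first check flags the second segment and the second check reports a mismatch, which mirrors the two problems listed above.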

4 replies

glangford Mar 8, 2023

This looks like the repetition is unique to the .json output, the others (.txt, .srt, etc) look correct. Do you agree?

glangford Mar 8, 2023

I can reproduce this problem only with the tiny model, and only for the .json output. The small and medium models, and the other output formats, seem to be OK. word_timestamps does not have to be True.
whisper --model tiny --language fr bonjour_vous_allez_bien.mp3

If I don't supply --language, the tiny model detects the audio as Russian.

SalimovAlbert Mar 8, 2023

Had the same issue with JSON, used "".join(map(lambda x: x['text'], result['segments'])) to get text output
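
A slightly longer variant of that workaround, as a sketch only: it rebuilds the top-level "text" from the segments and also renumbers the segment "id" values so they are unique and increasing. The function name repair_result is hypothetical, and the sample dictionary is hand-written rather than real Whisper output.

```python
import copy
from typing import Any, Dict


def repair_result(result: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy whose "text" is rebuilt from segments and whose ids are 0, 1, 2, ..."""
    repaired = copy.deepcopy(result)  # leave the caller's dictionary untouched
    for new_id, segment in enumerate(repaired["segments"]):
        segment["id"] = new_id
    repaired["text"] = "".join(segment["text"] for segment in repaired["segments"])
    return repaired


# Hand-written stand-in for an affected result (not real Whisper output).
broken = {
    "text": " Bonjour. Bonjour. Bonjour. Bonjour.",
    "segments": [
        {"id": 0, "text": " Bonjour."},
        {"id": 0, "text": " Bonjour."},
    ],
}

fixed = repair_result(broken)
print(fixed["text"])
print([segment["id"] for segment in fixed["segments"]])
```

This only papers over the symptoms in the returned data; it does not change how the model is prompted during decoding.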

Jeronymous Mar 8, 2023
Author

> This looks like the repetition is unique to the .json output, the others (.txt, .srt, etc) look correct. Do you agree?

Yes, correct.
It only affects the JSON output (or the output of the transcribe function when using Python and not the CLI).

jongwook · 2023-03-08T23:06:34Z

Maintainer

Hi! Thanks for reporting this. The non-unique "id" is my total oversight and should be an easy fix; wasn't paying attention to it much.

About the discrepancy between ["text"] and segment["text"], I think it has something to do with these lines:

whisper/whisper/transcribe.py

Lines 345 to 356 in aac47c9

    
    # if a segment is instantaneous or does not contain text, clear it
    for i, segment in enumerate(current_segments):
        if segment["start"] == segment["end"] or segment["text"].strip() == "":
            segment["text"] = ""
            segment["tokens"] = []
            segment["words"] = []
            current_tokens[i] = []

    all_segments.extend(current_segments)
    all_tokens.extend(
        [token for segment in current_tokens for token in segment]
    )

EDIT: the culprit was actually this line:

whisper/whisper/transcribe.py

Line 290 in aac47c9

all_tokens.extend(tokens[: last_slice + 1].tolist())

I will push a fix soon.

0 replies

jongwook · 2023-03-08T23:36:23Z

Maintainer

This was an issue where the new transcribe() was mishandling the all_tokens variable. That made the prompts more prone to repetitions, and it also caused the discrepancy between the top-level "text" field and the segment-level "text" fields in the JSON response.

I have a fix merged in #1060, so hopefully this resolves your issue! Please let me know if it continues.

1 reply

Jeronymous Mar 9, 2023
Author

Great. Thank you @jongwook

timothyaveni · 2023-04-20T02:12:49Z

I think the condition_on_previous_text flag was recently broken, and my hunch is that the cause is some changes in #1060; in particular, I think all_tokens is off by one window at the point where the prompt is reset, causing the prior window's tokens to be added to the prompt anyway. @jongwook would you mind having a look? Thanks!
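
To make the suspected off-by-one concrete, here is a toy simulation. It is not the Whisper source: simulate_prompts and its reset_before_extend switch are hypothetical, and the only assumption borrowed from the description above is that the prompt for a window is the part of all_tokens after a reset marker. With conditioning on previous text disabled, every prompt should be empty.

```python
from typing import List


def simulate_prompts(windows: List[List[int]], reset_before_extend: bool) -> List[List[int]]:
    """Toy model of per-window prompts when conditioning on previous text is off."""
    all_tokens: List[int] = []
    prompt_reset_since = 0  # the prompt is everything in all_tokens after this index
    prompts = []
    for window_tokens in windows:
        # Prompt that would be handed to the decoder for this window.
        prompts.append(all_tokens[prompt_reset_since:])
        if reset_before_extend:
            # Suspected ordering: the marker is set before this window's tokens
            # are appended, so they leak into the next window's prompt.
            prompt_reset_since = len(all_tokens)
            all_tokens.extend(window_tokens)
        else:
            # Expected ordering: append first, then move the marker past them.
            all_tokens.extend(window_tokens)
            prompt_reset_since = len(all_tokens)
    return prompts


windows = [[1, 2], [3, 4], [5, 6]]
print(simulate_prompts(windows, reset_before_extend=False))
print(simulate_prompts(windows, reset_before_extend=True))
```

In the second call each window after the first receives the previous window's tokens as its prompt, which is the behaviour described in the hunch above.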

2 replies

guillaumekln Apr 20, 2023

A PR was opened to fix this issue: #1224.

timothyaveni Apr 20, 2023

oop, didn't think to look there. thanks for the pointer!
