Difference between transcribing whole audio and transcribing the same audio split into segments #322
-
Hi, this happens because whisper uses context information from previously recognized audio chunks, and that context is lost when each segment is transcribed on its own. If you want to pass context information to whisper, you can use this CLI argument:

parser.add_argument("--initial_prompt", type=str, default=None, help="optional text to provide as a prompt for the first window.")

or set the corresponding decoding options directly in Python:

prompt: Optional[Union[str, List[int]]] = None  # text or tokens for the previous context
prefix: Optional[Union[str, List[int]]] = None  # text or tokens to prefix the current context

You can also check discussion #117.
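For example, here is a minimal sketch of doing this from Python; the segment file names are hypothetical, while `initial_prompt` and `transcribe` are whisper's own API. Each segment's text is fed to the next call, roughly restoring the context that transcribing the whole file would provide:

```python
import whisper

model = whisper.load_model("base")

# Hypothetical pre-cropped segment files, in playback order.
segment_files = ["segment_0.wav", "segment_1.wav", "segment_2.wav"]
context = None  # no prior context for the first segment
texts = []

for path in segment_files:
    # initial_prompt seeds the decoder with the previous segment's text.
    result = model.transcribe(path, language="it", initial_prompt=context)
    texts.append(result["text"])
    context = result["text"]  # carry this segment's text into the next call

print(" ".join(texts))
```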
-
@AntonioBuccola BTW, a possible solution could be to put a silence block between the diarized segments before transcribing; more info in https://github.com/Majdoddin/nlp
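A minimal sketch of that idea, assuming the diarizer yields (start, end) pairs in milliseconds; pydub is an assumption here, any library that can concatenate waveforms works:

```python
from pydub import AudioSegment

SPACER_MS = 2000  # hypothetical gap length between speaker turns

audio = AudioSegment.from_wav("audio.wav")
# (start_ms, end_ms) pairs assumed to come from your diarizer.
turns = [(0, 5_300), (5_300, 9_800), (9_800, 15_000)]

# Join the diarized segments with silence blocks, so whisper can
# transcribe them as one file and keep its context across turns.
spacer = AudioSegment.silent(duration=SPACER_MS)
combined = AudioSegment.empty()
for start_ms, end_ms in turns:
    combined += audio[start_ms:end_ms] + spacer

combined.export("combined.wav", format="wav")
# Transcribe "combined.wav" with whisper, then map each transcribed
# segment back to its speaker by subtracting the accumulated spacer time.
```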
-
Hi @AntonioBuccola,
-
Hello, I am using whisper to transcribe audio in Italian and I noticed a huge difference between these two cases.

In the first one, the transcription is obtained by passing the whole audio to the `model.transcribe` method, and it works fine; on the other hand, when I crop the audio into segments and then transcribe each of them separately, the results are awful.

I need to perform the cropping because I am facing a speaker diarization problem: although whisper already performs voice activity detection, the results of speaker identification using the segments provided by whisper are not satisfactory (a diarization error rate of about 20%, against a rate of about 8% I had already obtained).

For completeness, I am using the `spectralcluster` module to perform speaker identification after audio embedding with `pyannote`.

Is there a reason why this happens?
Thanks in advance.
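To make the two cases concrete, here is a minimal sketch; file names and segment count are hypothetical:

```python
import whisper

model = whisper.load_model("base")

# Case 1: whole audio in a single call. whisper carries decoding context
# from one 30-second window to the next, and the result is fine.
whole = model.transcribe("audio.wav", language="it")

# Case 2: the same audio pre-cropped into segments, each transcribed on
# its own. Every call starts with no context, and the results are awful.
parts = [
    model.transcribe(f"crop_{i}.wav", language="it") for i in range(3)
]

print(whole["text"])
print(" ".join(p["text"] for p in parts))
```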