Missing the first 21 seconds in small.en and large-v2 #1937

MrEdwards007 · 2024-01-04T00:22:33Z

MrEdwards007
Jan 4, 2024

I have a LibriVox mp3 recording of Don Quixote and I've transcribed the file using all models, from tiny.en to large-v3. What I've encountered is that with small.en and large-v2, the first 21 seconds of speech are not transcribed; the transcription simply starts at the 21st second. All other models transcribe the file from the beginning. The only thing that jumps out at me is the coincidence that the word at the beginning of the file and the word at the 21st second are the same: "Dedication".

My concern is that there are other files where the beginning is being skipped, and I don't know the trigger or a workaround.
Below is the SRT output from base.en and large-v2.

This is transcribed using base.en
This is the beginning

1
00:00:00,000 --> 00:00:09,640
Dedication, Preface, Dermatic Personae, and Act I of Don Quixote in England by Henry Fielding.
2
00:00:10,440 --> 00:00:11,660
This is a LibraVox recording.
3
00:00:12,640 --> 00:00:14,800
All LibraVox recordings are in the public domain.
4
00:00:15,800 --> 00:00:20,720
For more information or to volunteer, please visit LibraVox.org.
5
00:00:21,480 --> 00:00:27,720
Dedication to the right honorable Philip, Earl of Chesterfield, Knight of the Most Noble

This is transcribed using large-v2
This is not the beginning of the file; the transcription actually starts at 00:00:21,000

1
00:00:00,000 --> 00:00:22,420
DEDICATION
2
00:00:22,740 --> 00:00:28,980
To the Right Honorable Philip, Earl of Chesterfield, Knight of the Most Noble Order of the Garter,
3
00:00:29,660 --> 00:00:30,240
My Lord.
4
00:00:31,000 --> 00:00:37,260
However unworthy these scenes may be of your Lordship's protection, the design with which
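
One way to catch this kind of silent skip automatically might be to look for cues whose duration is implausibly long for the amount of text, like the 22-second "DEDICATION" cue above. This is only a minimal sketch: it assumes one text line per cue, the threshold of 3 seconds per word is an arbitrary guess, and `parse_cues` and `suspicious_cues` are hypothetical helper names, not part of Whisper.

```python
import re

# Matches an SRT timing line such as "00:00:00,000 --> 00:00:22,420".
TIMING = re.compile(r"(\d+):(\d+):(\d+),(\d+) --> (\d+):(\d+):(\d+),(\d+)")


def parse_cues(srt_text):
    """Return a list of (start_seconds, end_seconds, text) tuples.

    Simplification: only the first text line after each timing line is kept.
    """
    lines = srt_text.splitlines()
    cues = []
    for i, line in enumerate(lines):
        match = TIMING.match(line.strip())
        if not match:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        text = lines[i + 1].strip() if i + 1 < len(lines) else ""
        cues.append((start, end, text))
    return cues


def suspicious_cues(cues, max_seconds_per_word=3.0):
    """Flag cues that last much longer than their word count suggests."""
    flagged = []
    for start, end, text in cues:
        words = max(1, len(text.split()))
        if (end - start) / words > max_seconds_per_word:
            flagged.append((start, end, text))
    return flagged
```

Running `suspicious_cues(parse_cues(srt_text))` on each model's output and reviewing anything it flags would be a cheap sanity check, although it cannot tell what was actually said during the flagged span.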

misutoneko · 2024-01-04T18:21:03Z

misutoneko
Jan 4, 2024

Hi! I did some testing & can reproduce this problem with whisper.cpp and whisper-timestamped as well (tested with small.en).
Adding --initial_prompt "" to the command line seems to partially fix it (but I don't know why).
My usual workaround is to use VAD preprocessing, but yeah it would be nice to not need that.
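
For anyone using the Python package rather than the command line, the same empty-prompt experiment could be sketched roughly as below. It assumes the openai-whisper package is installed; `transcribe_with_empty_prompt` and `opening_text` are hypothetical helper names, and the 25-second window is an arbitrary choice for eyeballing the intro.

```python
def transcribe_with_empty_prompt(audio_path, model_name="small.en"):
    """Transcribe with initial_prompt set to an empty string."""
    # Imported lazily so opening_text() can be used without the package.
    import whisper

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, initial_prompt="")


def opening_text(result, until=25.0):
    """Join the text of all segments that start before `until` seconds."""
    return " ".join(
        seg["text"].strip()
        for seg in result.get("segments", [])
        if seg["start"] < until
    )
```

Comparing `opening_text(...)` with and without the prompt on the same file should show quickly whether the LibriVox intro made it into the transcript.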

2 replies

MrEdwards007 Jan 5, 2024
Author

I gave it a try, but there was no change on small.en. I did not try this on large-v2.

After reading your very informative response, I started looking at whether a more specific prompt would resolve the issue, but I was unsuccessful. I looked into https://cookbook.openai.com/examples/whisper_prompting_guide. It felt odd, but I tried different prompts, as you would with an LLM, such as "You are excellent at transcribing audio, including preambles", "You transcribe LibraVox recording in their entirety", "Dedication, Preface, Dramatic Personae" and even "The beginning of this recording is 'Dedication, Preface, Dramatic Personae' so begin with this". None of it had an effect, but it was worth the effort, and I'm glad that you provided the information and that I gave it a try.

Thank you.

misutoneko Jan 5, 2024

Hmm, I don't think prompting works in Whisper the same way as it does with LLMs (AFAIK it can't follow any instructions you might want to give), but it definitely has an effect on the prediction. The effect isn't always what you want, though; sometimes prompting can cause some part of the text to go missing, for example.
I've usually only used it to get the spelling right and it works well for that.

Purfview · 2024-01-04T18:51:05Z

Purfview
Jan 4, 2024

Problematic part is:

[00:10.560 --> 00:12.760]  This is a LibraVox recording.
[00:12.760 --> 00:15.840]  All LibraVox recordings are in the public domain.
[00:15.840 --> 00:21.840]  For more information or to volunteer, please visit LibraVox.org.

I encountered the same behavior with different content; it looks like some models just refuse to output anything on such ads.
It must have something to do with the training.

1 reply

MrEdwards007 Jan 5, 2024
Author

I appreciate your help. I don't want to, but it looks like I might need to accept that it just might not work or, more specifically, that the effort and research, along with the help that I've received, haven't been enough to overcome this issue.

Thank you.

ryanheise · 2024-01-05T01:04:02Z

ryanheise
Jan 5, 2024

Can you please share a link to the hosted MP3 file?

5 replies

MrEdwards007 Jan 5, 2024
Author

I have uploaded a file with which the issue can be reproduced 100% of the time on both small.en and large-v2.

Thank you for asking. I didn't think that I could upload the file.

ryanheise Jan 5, 2024

I assumed there would be a hosted MP3 file since it is from LibriVox.

misutoneko Jan 5, 2024

https://librivox.org/don-quixote-in-england-by-henry-fielding/

Direct link to Act 1:
https://www.archive.org/download/donquixoteinengland_2209_librivox/donquixoteinengland_1_fielding_128kb.mp3

ryanheise Jan 5, 2024

There are 3 links to mp3 files on that page, can you provide the specific mp3 link to test?

MrEdwards007 Jan 7, 2024
Author

I performed the transcription from Act 1
https://www.archive.org/download/donquixoteinengland_2209_librivox/donquixoteinengland_1_fielding_128kb.mp3

MrEdwards007 · 2024-01-05T01:29:36Z

MrEdwards007
Jan 5, 2024
Author

DonQuixote_GitHub_15Minutes.zip
I am unable to upload the file as a whole, due to the length of a little over an hour and upload limits.
However, I cut this down to roughly 15 minutes and tested the file, and the outcome is the same with both small.en and large-v2.
Please see the attached zip file containing the mp3 that is roughly 15 minutes (895.472 seconds - 00:14:55 H:M:S)

1 reply

misutoneko Jan 5, 2024

You could actually cut it down to under 30s or so and still get the same effect. But what's interesting is that if you cut the file to 18s, suddenly Whisper starts to recognize the LibriVox disclaimer bit. Well, I guess that explains why VAD can be effective...
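
To illustrate the idea of cutting at pauses, here is a very rough energy-threshold splitter. To be clear, this is a swapped-in toy technique and not the VAD used by any of the tools mentioned in this thread; real VAD models are considerably more robust. The function name `speech_regions` and all default values (frame size, threshold, minimum silence) are assumptions chosen for illustration, and the samples are assumed to be mono floats in the range -1.0 to 1.0.

```python
import math


def speech_regions(samples, sample_rate, frame_ms=30, threshold=0.01,
                   min_silence_ms=300):
    """Return (start_s, end_s) regions whose frame RMS reaches `threshold`.

    Silent gaps shorter than `min_silence_ms` are bridged so that brief
    pauses inside a sentence do not split it.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    max_gap = min_silence_ms / 1000  # seconds
    regions = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < threshold:
            continue  # treat this frame as silence
        start = i / sample_rate
        end = (i + len(frame)) / sample_rate
        if regions and start - regions[-1][1] < max_gap:
            # Close enough to the previous region: extend it.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions
```

Each returned region could then be cut out and transcribed separately, with the region's start time added back to the resulting timestamps.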

misutoneko · 2024-01-05T11:17:41Z

misutoneko
Jan 5, 2024

Here's yet another workaround idea that seems to work:
linto-ai/whisper-timestamped#128

Tested by slowing down the audio to 0.6 of the original speed. Do note that I only used the first 30s for this test.
The downside of this method is that it hurts performance (more data to process).
Maybe there's a better way to accomplish the same thing.

1 reply

MrEdwards007 Jan 7, 2024
Author

I'll check that out. One of the things that I am also doing is diarizing, while including the timestamps.
Slowing the audio down would change the timestamps while keeping the fidelity of the transcription. Something that I'm finding is that while my diarizer is consistent from model to model, the timestamps vary by 0.2 to 0.8 seconds. That would be another adjustment to account for. It may be worth it, since I have a known rate of slowdown that I can factor in.
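
Factoring in a known slowdown rate could look something like the sketch below. It assumes the audio was slowed uniformly, so a timestamp in the slowed file maps back to the original by multiplying by the tempo (0.6 here). `rescale_srt` is a hypothetical helper name, and only lines containing `-->` are touched so that numbers in the subtitle text are left alone.

```python
import re

# SRT timestamp, e.g. "00:00:35,000".
STAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def _scale(match, tempo):
    """Convert one timestamp to milliseconds, scale it, and format it back."""
    h, m, s, ms = map(int, match.groups())
    total_ms = round((((h * 60 + m) * 60 + s) * 1000 + ms) * tempo)
    s, ms = divmod(total_ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def rescale_srt(srt_text, tempo):
    """Map timestamps from tempo-changed audio back to the original timeline."""
    out = []
    for line in srt_text.splitlines():
        if "-->" in line:
            line = STAMP.sub(lambda mt: _scale(mt, tempo), line)
        out.append(line)
    return "\n".join(out)
```

For example, a cue at 00:00:35,000 in audio slowed to 0.6 would map back to 00:00:21,000 in the original. The diarizer's timestamps, if produced from the original audio, would not need this correction.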

Thank you.
