Doesn't seem to work with songs? Keeps duplicating transcripts at several timestamps #1833

AIhasArrived · 2023-11-22T10:45:36Z


Hello,
I tried Whisper with a song and got this:

I tried large-v3 with the "ro" language setting.
The line "Kimenikin ajtoumeleibe" only comes later in the song, and several words before it were not transcribed.
Some words and sentences are indeed repeated in this song, but not like this.

Any way to improve this?

ryanheise · 2023-11-22T11:04:23Z


Whisper tends to get confused when there is no voice, such as during stretches of music without any vocals. You can use Demucs to filter out the backing music first, so that you're left with just the vocals, and then use VAD (e.g. Silero VAD) to filter out the silent parts, so that Whisper only analyses the parts containing actual speech. stable-ts is one tool that can do both of these for you, but there are many other tools that at least do VAD.
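To make the VAD step concrete, here is a minimal sketch of what a VAD hands back: a list of (start, end) spans, in seconds, that contain voice. It is a toy RMS-energy threshold in pure Python standing in for Silero VAD (which is a trained neural model, not a threshold). The function names and the `threshold`/`min_gap_ms` defaults are illustrative assumptions. The input is assumed to be mono float samples in [-1, 1], for example the decoded vocals stem from Demucs.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy of each non-overlapping frame (a trailing partial frame is dropped)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return energies


def speech_segments(samples, sample_rate, frame_ms=30, threshold=0.02, min_gap_ms=300):
    """Return (start_sec, end_sec) spans whose frame energy reaches `threshold`.

    Toy energy-based detector (hypothetical stand-in for Silero VAD).
    Voiced frames separated by a gap of at most `min_gap_ms` are merged into one span.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    segments = []
    for i, energy in enumerate(frame_rms(samples, frame_len)):
        if energy < threshold:
            continue  # treat this frame as silence
        start = i * frame_len / sample_rate
        end = (i + 1) * frame_len / sample_rate
        if segments and start - segments[-1][1] <= min_gap_ms / 1000:
            segments[-1] = (segments[-1][0], end)  # short pause: extend the previous span
        else:
            segments.append((start, end))
    return segments
```

A plain energy threshold like this would also fire on loud backing music, which is why separating the vocals with Demucs first matters so much for songs.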


AIhasArrived (Author) · 2023-11-22T11:34:20Z

You mean this VAD thing (https://github.com/snakers4/silero-vad)?
My questions are:

But once the silences are removed, wouldn't Whisper produce the wrong timestamps (relative to the original video)? Is there a solution for that? For example, saving which silences were removed and at what time, so you can compute the correct timestamp relative to the original video by adding back the durations of the removed silences?
Wow, I did not know there were other tools that build on Whisper, such as stable-ts. Do you know of others as well? I would like to explore them :)
Thanks


ryanheise Nov 22, 2023

Tools such as stable-ts will also do post-processing to re-insert the silent gaps into the final timestamps, so the output timestamps should align with the original audio.
See @EtienneAb3d's comment below for another suggestion, as well as others you can find by searching this discussion board for "hallucinations".
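The re-insertion step amounts to the bookkeeping proposed in the question: remember the kept spans and shift each timestamp by the total silence removed before it. Below is a minimal sketch, assuming the VAD output is a list of (start, end) seconds in the original audio and that the kept spans were concatenated in order with no padding. `build_offset_map` and `to_original_time` are hypothetical helpers, not stable-ts's actual code.

```python
from bisect import bisect_right


def build_offset_map(kept_spans):
    """Record where each kept (start, end) span begins in the trimmed and in the original audio."""
    trimmed_starts, original_starts = [], []
    position = 0.0  # running length of the speech-only audio
    for start, end in kept_spans:
        trimmed_starts.append(position)
        original_starts.append(start)
        position += end - start
    return trimmed_starts, original_starts


def to_original_time(t, offset_map):
    """Shift a timestamp from the speech-only audio back onto the original timeline.

    A timestamp exactly on the boundary between two spans resolves to the start of the later span.
    """
    trimmed_starts, original_starts = offset_map
    i = max(bisect_right(trimmed_starts, t) - 1, 0)  # index of the span containing t
    return original_starts[i] + (t - trimmed_starts[i])
```

In openai-whisper, the segments returned by `transcribe()` carry `start` and `end` in seconds, so each of those values could be passed through `to_original_time`.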

While there are several tools that can do this, for music I think you may find Demucs particularly helpful, since it is specifically designed to separate the vocals from the music, so that you can feed clean vocals into Whisper.

AIhasArrived Nov 22, 2023
Author

Yeah, I actually tried Demucs as well and did not get what I wanted. I might have set the wrong language, so I removed the language setting.
Can I send you the music sample by email so you can try it yourself and see what's not working?

EtienneAb3d · 2023-11-22T11:39:39Z


Try large-v2!

I'm also currently preparing a new version of WhisperHallu with a bunch of new features specifically dedicated to music/song processing.
;-)


AIhasArrived Nov 22, 2023
Author

Very cool stuff. I am a bit... overwhelmed, to be honest; there seem to be so many good things to try.

EtienneAb3d Nov 22, 2023

To process songs, I think the best procedure is the following:

1. Process the file with WhisperHallu to get 2 files: a proper text (vocal extraction + noise/silence removal, etc.) without timestamps, and accurate timestamps (possibly with hallucinations).
2. Process these two files with WhisperTimeSync to transfer the accurate timestamps onto the proper text at the right places.
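The second step, moving the accurate timestamps onto the cleaner text, is essentially a word-alignment problem. Here is a minimal sketch, assuming both passes are available as plain words and using Python's `difflib.SequenceMatcher` as a stand-in for whatever alignment WhisperTimeSync actually performs. `transfer_timestamps` is a hypothetical helper, not part of WhisperHallu or WhisperTimeSync.

```python
import difflib

PUNCTUATION = ".,!?;:\"'"


def _norm(word):
    """Lower-case and strip surrounding punctuation so 'Hello,' matches 'hello'."""
    return word.lower().strip(PUNCTUATION)


def transfer_timestamps(timed_words, clean_text):
    """Copy timestamps from `timed_words` onto matching words of `clean_text`.

    timed_words: list of (word, start_sec, end_sec) from the timestamped (possibly hallucinating) pass.
    clean_text: transcript without timestamps from the cleaner pass.
    Returns (word, start_sec, end_sec) for every clean word; unmatched words get (None, None).
    """
    clean_words = clean_text.split()
    matcher = difflib.SequenceMatcher(
        a=[_norm(w) for w, _, _ in timed_words],
        b=[_norm(w) for w in clean_words],
        autojunk=False,  # keep frequent words like "la" or "the" eligible for matching
    )
    result = [(w, None, None) for w in clean_words]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = timed_words[block.a + k]
            result[block.b + k] = (clean_words[block.b + k], start, end)
    return result
```

Words that find no match keep `None` and could, for example, be interpolated from their timed neighbours. Hallucinated words that appear only in the timestamped pass are simply never copied over.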

Note that a new, enriched version of WhisperHallu is in preparation...
;-)

AIhasArrived Nov 22, 2023
Author

I will make sure to come back to you and ask plenty of questions if I am unable to understand something, deal? ;)
I will be using it either soon or in 2 weeks (depending on some personal projects). I hope I will find you there then.

dgoryeo Nov 29, 2023

@EtienneAb3d great to hear that a new version of WhisperHallu is in the works. I saw this "100% avoid Whisper's hallucinations" discussion the other day and thought you might find it relevant or interesting, if you haven't already seen it.

EtienneAb3d Nov 30, 2023

@dgoryeo
As far as I understand it, it filters out silence and noise while recording. This is part of what WhisperHallu does on already-recorded files, using various tools.
