Is it possible to silence non-verbal parts of an audio? #2115

orionflame · 2024-04-01T05:41:12Z

orionflame
Apr 1, 2024

Hi,

Basically, anytime I am not talking there is either silence, some other noise, or some throat clearing. My audio is noise-free so it's quite clear, and I want to keep only the verbal parts, without changing the audio length as it's synced to video.

Is there any tool or API that can do this? I tried a few splitter tools online but they failed to remove throat clearing from verbal parts.

I thought maybe I could use the Whisper API to detect the timestamps where there is speech and silence all the other parts. Is that feasible?

It's about 80 hours of audio (~200 files).

Thanks a lot in advance.

EtienneAb3d · 2024-04-01T07:32:57Z

EtienneAb3d
Apr 1, 2024

@orionflame
Remove noise by voice extraction using Facebook Demucs or Deezer Spleeter.
See code sample (and other processing tools) here:
https://github.com/EtienneAb3d/WhisperHallu

0 replies

EtienneAb3d · 2024-04-01T07:38:32Z

EtienneAb3d
Apr 1, 2024

@orionflame
As you suggested, it's possible to identify word positions with timestamps, but you may get some badly recognized words and/or badly positioned timestamps, in which case the noise removal would also damage the proper voice parts.
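For readers who want to try the timestamp idea, here is a minimal sketch assuming the `openai-whisper` package. `whisper.load_model` and `model.transcribe(..., word_timestamps=True)` are that package's calls; the helper names `word_intervals` and `transcribe_words` are made up for this illustration, and the caveat above about imprecise timestamps still applies to whatever comes out.

```python
def word_intervals(result):
    """Collect (start, end, word) tuples from a Whisper-style result dict.

    Assumes the layout returned by transcribe(..., word_timestamps=True):
    result["segments"][i]["words"][j] has "word", "start" and "end" keys.
    """
    intervals = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            intervals.append((float(w["start"]), float(w["end"]), w["word"].strip()))
    return intervals


def transcribe_words(path, model_name="base", language="en"):
    """Hypothetical convenience wrapper: transcribe a file and return word intervals."""
    # Imported lazily so word_intervals() can be used without Whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(path, language=language, word_timestamps=True)
    return word_intervals(result)
```

Calling `transcribe_words("voice.mp3")` would give a list of word-level time ranges; everything outside those ranges is a candidate for muting.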

0 replies

orionflame · 2024-04-01T07:43:03Z

orionflame
Apr 1, 2024
Author

Thanks a lot, I will try the APIs you mentioned. I didn't have success with other splitters before; they just kept the throat clearing parts with the vocals.

0 replies

orionflame · 2024-04-01T07:54:44Z

orionflame
Apr 1, 2024
Author

I just tried Facebook Demucs GUI:
https://www.fosshub.com/Demucs-GUI.html

The throat clearing was still there.

I made a very small sample here:
https://www.dropbox.com/scl/fi/2ambh75jdhfix96wizrcl/voice.mp3?rlkey=1oynf39zlefyyfiuxp45fykth&dl=1

In the middle, between the 2 statements, there is throat clearing. I have a feeling that these models don't treat throat clearing or coughing as non-vocal, because they just leave it as is. So I am not sure if they are the right tool for the job.

0 replies

orionflame · 2024-04-01T18:39:49Z

orionflame
Apr 1, 2024
Author

I tried the Whisper API, which worked for this audio, but on another one it didn't remove the throat clearing at all. So I have to see why.

0 replies

orionflame · 2024-04-02T04:58:51Z

orionflame
Apr 2, 2024
Author

I tried this code, which uses the original Whisper API, but it didn't silence the throat clearing part in this new audio:
https://paste.ofcode.org/Gc9MUy83K9UHATUHPDVyZ4

https://www.dropbox.com/scl/fi/fi8g0redyz7uzi0fdiil2/voice2.mp3?rlkey=xrmnhrlnaqrzzunwoika0fvj2&dl=1

It worked for the first sample though so I was hopeful.

0 replies

EtienneAb3d · 2024-04-02T07:40:07Z

EtienneAb3d
Apr 2, 2024

@orionflame
As far as I understand it, the Whisper segmentation is much more a matter of subtitle arrangement than a question of voice part identification.

In my own fast test with your example, here is what should work better for you...
With WhisperHallu, call transcribeOpts with these options:

opts = dict(language='en',word_timestamps=True)
isMusic=True
onlySRT=True
remixFactor=0

You should get this:

1
00:00:00,000 --> 00:00:00,340
and

2
00:00:00,340 --> 00:00:00,600
for

3
00:00:00,600 --> 00:00:00,960
curves

4
00:00:00,960 --> 00:00:01,500
they

5
00:00:01,500 --> 00:00:01,860
tell

6
00:00:01,860 --> 00:00:02,040
you

7
00:00:02,040 --> 00:00:02,340
how

8
00:00:02,340 --> 00:00:02,780
quickly

9
00:00:02,780 --> 00:00:03,080
and

10
00:00:03,080 --> 00:00:03,280
in

11
00:00:03,280 --> 00:00:03,460
what

12
00:00:03,460 --> 00:00:03,780
manner

13
00:00:03,780 --> 00:00:04,100
the

14
00:00:04,100 --> 00:00:04,320
curve

15
00:00:04,320 --> 00:00:04,480
is

16
00:00:04,480 --> 00:00:04,880
changing

17
00:00:04,880 --> 00:00:05,140
its

18
00:00:05,140 --> 00:00:05,600
direction.

19
00:00:13,240 --> 00:00:13,880
The

20
00:00:13,880 --> 00:00:14,180
second

21
00:00:14,180 --> 00:00:14,500
order

22
00:00:14,500 --> 00:00:14,920
partial

23
00:00:14,920 --> 00:00:15,340
derivative

24
00:00:15,340 --> 00:00:15,760
is

25
00:00:15,760 --> 00:00:16,060
about

26
00:00:16,060 --> 00:00:16,280
the

27
00:00:16,280 --> 00:00:16,620
curvature

28
00:00:16,620 --> 00:00:17,120
or

29
00:00:17,120 --> 00:00:17,260
the

30
00:00:17,260 --> 00:00:17,480
rate

31
00:00:17,480 --> 00:00:17,840
at

Then, remove the parts where the end of one word does not match the beginning of the next word, like:

--> 00:00:05,600
direction.

19
00:00:13,240 -->

This way, you should remove the non-word elements.
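The gap search described above can be sketched as follows. This is only an illustration under assumptions: it expects word-level SRT text like the listing above, and the function names `parse_srt_ranges` and `find_gaps` as well as the `min_gap` parameter are made up here, not part of WhisperHallu.

```python
import re

# Matches an SRT timing line such as "00:00:05,140 --> 00:00:05,600".
TIME_RE = re.compile(
    r"(\d+):(\d+):(\d+)[,.](\d{3})\s*-->\s*(\d+):(\d+):(\d+)[,.](\d{3})"
)


def _seconds(h, m, s, ms):
    # Work in integer milliseconds and divide once to limit float noise.
    return (int(h) * 3600000 + int(m) * 60000 + int(s) * 1000 + int(ms)) / 1000


def parse_srt_ranges(srt_text):
    """Return the (start, end) time range in seconds of every SRT entry."""
    ranges = []
    for match in TIME_RE.finditer(srt_text):
        g = match.groups()
        ranges.append((_seconds(*g[:4]), _seconds(*g[4:])))
    return ranges


def find_gaps(ranges, min_gap=0.0):
    """Return (prev_end, next_start) wherever a word does not start
    right where the previous one ended (gap longer than min_gap seconds)."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:]):
        if next_start - prev_end > min_gap:
            gaps.append((prev_end, next_start))
    return gaps
```

On the listing above, the only gap this would report is the one between "direction." (ending at 5.6 s) and "The" (starting at 13.24 s). Any leading or trailing silence is not covered by this sketch.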

0 replies

orionflame · 2024-04-02T07:41:54Z

orionflame
Apr 2, 2024
Author

Thanks a lot, let me look into this. I will have to adapt my code so that the parts between words are muted, basically. Hopefully this will work. Appreciate your help!
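A rough sketch of the muting step, assuming the audio has already been decoded into a flat list of samples (mono, or handled one channel at a time). `mute_ranges` is a hypothetical helper name; reading and writing the actual audio files is left out. Samples are zeroed rather than cut, so the length stays the same and the audio remains synced to the video.

```python
def mute_ranges(samples, sample_rate, ranges):
    """Return a copy of `samples` with every (start, end) range, given in
    seconds, replaced by zeros. The number of samples never changes."""
    out = list(samples)
    for start, end in ranges:
        # Convert seconds to sample indices and clamp them to the valid range.
        first = max(0, int(round(start * sample_rate)))
        last = min(len(out), int(round(end * sample_rate)))
        for i in range(first, last):
            out[i] = 0
    return out
```

For example, `mute_ranges(samples, 44100, [(5.6, 13.24)])` would silence one gap between two words while leaving the total duration untouched.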

0 replies

orionflame · 2024-04-02T07:48:22Z

orionflame
Apr 2, 2024
Author

Ok I am just looking into it but is it possible to install it with just pip? It shows installing other modules with pip but not the API in question:
https://github.com/EtienneAb3d/WhisperHallu

Can I just use:
pip install WhisperHallu

2 replies

EtienneAb3d Apr 2, 2024

@orionflame
As stated in the ReadMe, WhisperHallu is experimental code.
There are only 3 Python files!

EtienneAb3d Apr 2, 2024

Be careful, the way it works depends on the set of modules you install.

orionflame · 2024-04-02T07:55:07Z

orionflame
Apr 2, 2024
Author

Oh I see, but where can I put those 3 files? I normally use pip, so I don't know where these libraries are stored.

4 replies

EtienneAb3d Apr 2, 2024

@orionflame
Put them at the place your import will find them, basically at the same place as your main code. See Code sample in the ReadMe.

orionflame Apr 2, 2024
Author

Thanks I just tried this:

import transcribeHallu

opts = dict(language='en',word_timestamps=True)
isMusic=True
onlySRT=True
remixFactor=0

model.transcribeOpts(opts)

It complained about:

File "C:\Users\orionflame\Downloads\WhisperHallu-main\voiceonly.py", line 5, in
import transcribeHallu
File "C:\Users\orionflame\Downloads\WhisperHallu-main\transcribeHallu.py", line 43, in
from demucsWrapper import load_demucs_model
File "C:\Users\orionflame\Downloads\WhisperHallu-main\demucsWrapper.py", line 3, in
import demucs
ModuleNotFoundError: No module named 'demucs'

So do I need these modules too? The help file says to install them only if you need them, and I didn't think transcribeOpts calls them.

Also, I didn't find any info about transcribeOpts. I guess I am calling it correctly like this?

EtienneAb3d Apr 2, 2024

As said, WhisperHallu is experimental code.

You will need some minimal developer skills to understand it and to use it.

For this error case, I need to add a "try..except" to automatically set the "useDemucs" boolean.
In the meantime, you can change it manually in the code (line 41) to avoid the use of "demucs" if it's not installed.
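The "try..except" idea mentioned here could look roughly like the sketch below. It is an assumption about how such a guard might be written, not the actual WhisperHallu code; only the `useDemucs` flag name comes from the post.

```python
# Detect whether the optional "demucs" package can be imported, and set the
# flag accordingly instead of failing with ModuleNotFoundError at import time.
try:
    import demucs  # noqa: F401  (imported only to check that it is installed)
    useDemucs = True
except ImportError:
    useDemucs = False

print("demucs available:", useDemucs)
```

Code that depends on demucs would then be imported or called only when `useDemucs` is true.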

Your code above should be:

model.transcribeOpts(path="PathToYourFile"
  ,opts = dict(language='en',word_timestamps=True)
  ,isMusic=True,onlySRT=True,remixFactor=0)

orionflame Apr 2, 2024
Author

Thanks I will try this.

orionflame · 2024-04-02T11:54:02Z

orionflame
Apr 2, 2024
Author

Now it's trying to find:
Faster installation found, but whisper-medium-ct2/ model not found

I will try to find this model.

3 replies

EtienneAb3d Apr 2, 2024

If you properly installed standard Whisper, it should not use Faster Whisper, nor search for its model.

orionflame Apr 2, 2024
Author

I just checked your code, and there might be a bug in how it decides which Whisper to use. I get printouts for both:

Using standard Whisper
Using Faster Whisper
Faster installation found, but whisper-medium-ct2/ model not found

Normally it should not continue trying to use Faster Whisper if standard Whisper is found, right?

EtienneAb3d Apr 2, 2024

Answer in the other post (the exchanges would be easier if you would use only one Q/A sequence):
#2115 (reply in thread)

orionflame · 2024-04-02T15:02:27Z

orionflame
Apr 2, 2024
Author

That's strange. I installed whisper properly I think because I was running the code I posted above that used whisper and it was able to transcribe it.

I will see why that happens. But in your code you are checking if something is present in whisper, and if not you are trying to use faster whisper, no?

I only did pip install openai-whisper, nothing else. I downloaded the base.pt model manually and placed it at where it was trying to find it.

18 replies

orionflame Apr 3, 2024
Author

Thanks a lot for all your help. I wrote the code that basically mutes between verbal parts and it did work in the sample voice files I posted here. But in the first 5 min audio file I tested, I noticed it's removing actual words or parts of them in some places.

I made a short sample here that has the problem areas:
https://www.dropbox.com/scl/fi/kotmse874x4rsi86kr8f8/voice3.mp3?rlkey=l5m56g5axort1ru70goo3rvch&dl=1

I assume the approach you suggested relies on the transcriber perfectly understanding what's being said? Because there are some names that I imagine the transcriber will have problems matching. For example in the name I spoke "Keenan" it cropped it partially in one of the instances.

Now that the solution is fully implemented, I am trying to see if it can indeed be used perfectly.

EtienneAb3d Apr 3, 2024

Try to use "large-v2" model in place of "medium" one.

EtienneAb3d Apr 3, 2024

I just heard your example. The word "Keenan" is partially repeated 3 times.
It's a known behaviour from Whisper models to remove such (supposed bad) repetitions.

In my own quick test with large-v2, there is no real in-between word interval (thus, with the algo discussed above, no text removed), but there is an empty text for the first "Keen" (= a segment without word highlighting in the original SRT produced by the Whisper code).

orionflame Apr 3, 2024
Author

Thanks a lot that makes sense. I will try the large model. Basically this is the audio stream for a tutorial. So first I am trying to remove the non verbal parts then my goal was to remove the retakes. Can whisper be used to do this?

Most of the time it could be partial sentences repeated multiple times until a full sentence is formed. But it's always the last instance that should be kept. That's a much bigger problem which I don't know how to solve yet.

Sorry in your last sentence are you saying it successfully transcribes the first "Keen" part?

Also if you check the first part when I say "Pretty" that was cropped in the medium model also. The p is so short, in the original it was normal.

Or do you think I should look into other VAD models like Pyannote? Because I was reading some people saying whisper sometimes misses some words.

EtienneAb3d Apr 4, 2024

@orionflame
Certainly, a lot of things can be tried that will give you a large part of the solution, working reasonably well in many cases. But, as far as I know the current state of the art, you won't succeed in building a solution that guarantees it never damages something that should be kept in the audio. While current timestamp tools are precise enough to produce a colour highlight on subtitles that is acceptable to a human viewer, they are far too imprecise for such a sound-filtering task.

To remove bad repetitions, you may try to ask ChatGPT to remove them from the SRT (I would use segment-level rather than word-level timestamps). For this, you may use ChatMate. It should be easy to adapt the provided translation example.
https://github.com/EtienneAb3d/ChatMate
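As a much cruder alternative to the ChatGPT route suggested above, retakes where a partial sentence is repeated until the full sentence is formed could be flagged with a plain prefix check on segment-level text. This is a different technique, sketched under assumptions: `drop_retakes` is a hypothetical name, segments are `(start, end, text)` tuples, and it only catches a take whose words are an exact prefix of the next segment. Since Whisper may already drop some repetitions from its output, the heuristic can miss retakes, and it can wrongly drop a legitimate short sentence that the next one happens to start with.

```python
import re


def _words(text):
    # Lowercase and strip punctuation so "The second," matches "the second".
    return re.findall(r"[a-z0-9']+", text.lower())


def drop_retakes(segments):
    """Keep only segments that are not an abandoned start of the next one.

    A segment counts as a retake when all of its words are the first words
    of the segment that follows it, so the last instance is always kept.
    """
    kept = []
    for i, seg in enumerate(segments):
        current = _words(seg[2])
        following = _words(segments[i + 1][2]) if i + 1 < len(segments) else []
        is_retake = bool(current) and following[:len(current)] == current
        if not is_retake:
            kept.append(seg)
    return kept
```

The time ranges of the dropped segments would then be the parts to cut or mute.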

EtienneAb3d · 2024-04-05T03:31:17Z

EtienneAb3d
Apr 5, 2024

@orionflame

You may experiment with whisper-at.

Combining the text time ranges (to know where text is recognized) with tags describing what each part is supposed to be, using a small time resolution like 1.6s (not too small, so that each part contains something and the result quality improves), you could possibly obtain something very interesting.

Here is the result using tiny.en model on your example 2:

You see that, from 6.0s to 13.6s there is no text recognized, while this part is tagged with some "Cough, Throat clearing, Animal" tags that confirm the fact there is something not interesting in this part.

It would remove this part of your audio:

On your example 1, it is less evident what should be removed, because there isn't a good match between text ranges and unwanted tags. But you could perhaps get a much better result after applying noise removal to the sound file.
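The combination described in this post (text ranges plus audio tags per time window) can be sketched as below. It is an illustration under assumptions: it does not call whisper-at itself, it only works on data shaped like its output, and `windows_to_mute`, `overlaps` and the `UNWANTED` set are hypothetical names. A window is selected only when no recognized text overlaps it and it carries at least one unwanted tag.

```python
# Tag names taken from the post above; extend the set as needed.
UNWANTED = {"Cough", "Throat clearing"}


def overlaps(a_start, a_end, b_start, b_end):
    """True when the two time ranges share more than a boundary point."""
    return a_start < b_end and b_start < a_end


def windows_to_mute(text_ranges, tag_windows, unwanted=UNWANTED):
    """text_ranges: [(start, end)] where text was recognized.
    tag_windows: [(start, end, [tags])], e.g. one entry per 1.6 s window.
    Returns the windows with no text and at least one unwanted tag."""
    muted = []
    for start, end, tags in tag_windows:
        has_text = any(overlaps(start, end, t0, t1) for t0, t1 in text_ranges)
        if not has_text and unwanted.intersection(tags):
            muted.append((start, end))
    return muted
```

Windows that contain speech are left alone even if they are also tagged "Cough", which is the conservative choice when the goal is not to damage the voice parts.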

5 replies

orionflame Apr 5, 2024
Author

Thanks a lot, I will experiment with this tool today. Although I already applied noise removal to this audio file using Audacity. Normally it has a bit of background noise.

orionflame Apr 5, 2024
Author

Also I am getting this text for the last audio:

0.0s-6.9s: pretty much everything you could want that occur around the normal vector not
6.9s-13.3s: along it. Keenan Crane is one of the leading
13.3s-17.2s: researchers in computational geometry.

Is it doing de-duplication of the text? I thought maybe I could use that to remove the retakes, but it only gives a large range like 6.9s-13.3s.

I am surprised it was able to transcribe "Keenan Crane" perfectly, though, and also to tag the throat clearing in the audio you tested.

In any case, the last audio didn't get a throat clearing tag. I got these:

0.0s-1.6s: Speech, Narration, monologue, Speech synthesizer, Clicking, Male speech, man speaking
1.6s-3.2s: Speech, Narration, monologue, Speech synthesizer, Clicking, Male speech, man speaking
3.2s-4.8s: Speech, Inside, small room, Clicking, Speech synthesizer, Narration, monologue
4.8s-6.4s: Speech, Narration, monologue, Speech synthesizer, Male speech, man speaking
6.4s-8.0s: Speech, Narration, monologue, Clicking, Speech synthesizer, Inside, small room
8.0s-9.6s: Speech, Clicking, Inside, small room
9.6s-11.2s: Speech, Clicking, Inside, small room, Narration, monologue, Male speech, man speaking
11.2s-12.8s: Speech, Speech synthesizer
12.8s-14.4s: Sine wave
14.4s-16.0s: Sine wave, Hum, Chime, White noise, Boiling

In any case, when using tags to mute the non-speech parts, what would you recommend? Because even throat clearing is tagged with speech in the same line, no? I thought maybe there would be a speech tag that's used only when someone is speaking.

EtienneAb3d Apr 5, 2024

You have to discuss this with the tool authors.

orionflame Apr 6, 2024
Author

Also, how long does it take you to transcribe my audio sample file? On my system I have to wait more than 30 mins. Is this not too slow? It's only 12s of audio. I was using whisper before, and it was able to create subtitles so much faster, like orders of magnitude, for 20 hours of audio. I was using stable whisper because the regular one had incorrect timings, I think:
#435

I know you also support fast whisper which I want to try today but that would only give me like 4x speed up right?

Should I just switch to gpu or use another whisper that can be drop in replacement?

Because I want to add some padding to the muting so that the cropping issue is gone. There is always some gap between coughs and speech.
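The padding idea can be sketched as follows: grow every detected speech range by a small margin, merge ranges that now touch, and mute only what is left over. The function names and the `padding` and `total_duration` parameters are made up for this illustration; a suitable padding value would have to be found by listening.

```python
def pad_and_merge(speech_ranges, padding, total_duration):
    """Grow each (start, end) speech range by `padding` seconds on both
    sides, clamp to the file duration, and merge overlapping ranges."""
    padded = sorted(
        (max(0.0, s - padding), min(total_duration, e + padding))
        for s, e in speech_ranges
    )
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def complement(ranges, total_duration):
    """Return the time ranges NOT covered by sorted, merged `ranges`.
    These are the parts that would be muted."""
    gaps, cursor = [], 0.0
    for start, end in ranges:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total_duration:
        gaps.append((cursor, total_duration))
    return gaps
```

A larger padding lowers the risk of cropping word onsets such as a short "p", at the cost of leaving in any noise that sits within that margin of a word.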

I will still try whisper at today but will have to resolve this first:
YuanGongND/whisper-at#26

EtienneAb3d Apr 6, 2024

Of course, you can do some basic tests using CPU.
But, such models (especially the larger ones) need GPU to be efficient.

Replies: 13 comments · 32 replies

Uh oh!

Uh oh!

Uh oh!

orionflame Apr 1, 2024 Author

Uh oh!

orionflame Apr 1, 2024 Author

Uh oh!

orionflame Apr 1, 2024 Author

Uh oh!

orionflame Apr 2, 2024 Author

Uh oh!

Uh oh!

orionflame Apr 2, 2024 Author

Uh oh!

orionflame Apr 2, 2024 Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

orionflame Apr 2, 2024 Author

Uh oh!

Uh oh!

orionflame Apr 2, 2024 Author

Uh oh!

Uh oh!

orionflame Apr 2, 2024 Author

Uh oh!

orionflame Apr 2, 2024 Author

Uh oh!

Uh oh!

orionflame Apr 2, 2024 Author

Uh oh!

Uh oh!

orionflame Apr 2, 2024 Author

Uh oh!

orionflame Apr 3, 2024 Author

Uh oh!

Uh oh!

Uh oh!

orionflame Apr 3, 2024 Author

Uh oh!

Uh oh!

Uh oh!

orionflame Apr 5, 2024 Author

Uh oh!

Uh oh!

orionflame Apr 5, 2024 Author

Uh oh!

Uh oh!

orionflame Apr 6, 2024 Author

Uh oh!

Replies: 13 comments 32 replies

orionflame
Apr 1, 2024
Author

orionflame
Apr 1, 2024
Author

orionflame
Apr 1, 2024
Author

orionflame
Apr 2, 2024
Author

orionflame
Apr 2, 2024
Author

orionflame
Apr 2, 2024
Author

orionflame
Apr 2, 2024
Author

orionflame Apr 2, 2024
Author

orionflame Apr 2, 2024
Author

orionflame
Apr 2, 2024
Author

orionflame Apr 2, 2024
Author

orionflame
Apr 2, 2024
Author

orionflame Apr 3, 2024
Author

orionflame Apr 3, 2024
Author

orionflame Apr 5, 2024
Author

orionflame Apr 5, 2024
Author

orionflame Apr 6, 2024
Author