Is it possible to silence non-verbal parts of an audio? #2115
Replies: 13 comments 32 replies
-
Thanks a lot, I will try the APIs you mentioned. I didn't have success with other splitters before; they just kept the throat-clearing parts with the vocals.
-
I just tried the Facebook Demucs GUI: the throat clearing was still there. I made a very small sample; in the middle, between two statements, there is throat clearing. I have a feeling that throat clearing or coughing is not treated as non-vocal by these models, because they just leave it as is. So I am not sure they are the right tool for the job.
-
I tried the Whisper API, which worked for this audio, but on another file it didn't remove the throat clearing at all. I have to find out why.
-
I tried this code, which uses the original Whisper API, but it didn't silence the throat-clearing part on this new audio: https://www.dropbox.com/scl/fi/fi8g0redyz7uzi0fdiil2/voice2.mp3?rlkey=xrmnhrlnaqrzzunwoika0fvj2&dl=1 It worked for the first sample, though, so I was hopeful.
-
@orionflame In my own fast test with your example, here is what should work better for you...
You should get this:
Then remove the parts where a word's end time does not match the next word's start time, like: --> 00:00:05,600 19
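As a rough illustration of that step, here is a minimal sketch that lists the ranges between one word's end and the next word's start. The word dictionaries, the `find_gaps` name and the `min_gap` threshold are assumptions made for this example, not part of any Whisper API; the timings are made up.

```python
from typing import Dict, List, Tuple

# Hypothetical word shape for this sketch: {"start": seconds, "end": seconds}
Word = Dict[str, float]


def find_gaps(words: List[Word], total_duration: float,
              min_gap: float = 0.3) -> List[Tuple[float, float]]:
    """Return (start, end) ranges in which no word is spoken."""
    gaps = []
    cursor = 0.0  # end time of the latest word seen so far
    for w in sorted(words, key=lambda w: w["start"]):
        # A gap exists when the next word starts well after the previous end.
        if w["start"] - cursor >= min_gap:
            gaps.append((cursor, w["start"]))
        cursor = max(cursor, w["end"])
    # Trailing gap between the last word and the end of the file.
    if total_duration - cursor >= min_gap:
        gaps.append((cursor, total_duration))
    return gaps


# Made-up timings: two adjacent words, a long pause, then one more word.
words = [
    {"start": 0.5, "end": 1.0},
    {"start": 1.0, "end": 1.6},
    {"start": 5.6, "end": 6.1},
]
print(find_gaps(words, total_duration=8.0))
# [(0.0, 0.5), (1.6, 5.6), (6.1, 8.0)]
```

The `min_gap` threshold keeps the short natural pauses between words from being muted; a suitable value probably depends on the speaker and would need tuning.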
-
Thanks a lot, let me look into this. I will have to adapt my code so that the parts between words are muted. Hopefully this will work. I appreciate your help!
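For the muting itself, one possible approach is to zero the samples in each unwanted range, which keeps the audio length unchanged so it stays in sync with the video. This is only a sketch assuming mono 16-bit PCM samples already loaded into an `array`; `mute_ranges` is a hypothetical helper name, and decoding and re-encoding the MP3 is not covered here.

```python
from array import array
from typing import Iterable, Tuple


def mute_ranges(samples: array, sample_rate: int,
                ranges: Iterable[Tuple[float, float]]) -> array:
    """Return a copy of mono PCM samples with the given time ranges zeroed."""
    out = array(samples.typecode, samples)
    for start, end in ranges:
        # Convert seconds to sample indices, clamped to the buffer bounds.
        first = max(0, int(round(start * sample_rate)))
        last = min(len(out), int(round(end * sample_rate)))
        if last > first:
            # Same-length slice assignment, so the total length never changes.
            out[first:last] = array(out.typecode, [0]) * (last - first)
    return out


# Toy data: 1 second of constant signal at 10 samples per second.
audio = array("h", [100] * 10)
muted = mute_ranges(audio, sample_rate=10, ranges=[(0.2, 0.5)])
print(list(muted))
# [100, 100, 0, 0, 0, 100, 100, 100, 100, 100]
```

A hard cut to zero can produce audible clicks at the range boundaries, so a short fade in and out around each range may sound better.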
-
OK, I am just looking into it, but is it possible to install it with just pip? It shows how to install other modules with pip, but not the API in question. Can I just use:
-
Oh I see, but where can I put those 3 files? I normally use pip, so I don't know where these libraries are stored.
-
Now it's trying to find: I will try to find this model.
-
That's strange. I think I installed Whisper properly, because the code I posted above uses Whisper and it was able to transcribe the audio. I will see why that happens. But in your code, you check whether something is present in whisper and, if not, you try to use faster-whisper, no? I only ran pip install openai-whisper, nothing else. I downloaded the base.pt model manually and placed it where it was trying to find it.
-
You may experiment with whisper-at. Combine the text time ranges, which tell you where text is recognized, with the audio tags, which describe what each part is supposed to be. With a small time resolution such as 1.6 s (not too small, so that each part contains something and the result quality improves), you could possibly obtain something very interesting.
In the result, you can see that from 6.0 s to 13.6 s no text is recognized, while this part is tagged "Cough, Throat clearing, Animal", which confirms that there is nothing of interest in this part. It would remove this part of your audio.
On your example 1, it is less evident what should be removed, because there isn't a good match between the text ranges and the unwanted tags. But you could perhaps get a much better result after applying noise removal to the sound file.
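To make the combination concrete, here is a small sketch of that decision rule: a tagged window is marked for removal only if it carries an unwanted tag and overlaps no recognized text. The input shapes (plain tuples and sets), the `UNWANTED` set and the helper names are assumptions for illustration and do not mirror whisper-at's actual output format; the data is made up.

```python
from typing import List, Set, Tuple

Range = Tuple[float, float]

# Assumed label set, taken from the tags quoted in the post above.
UNWANTED = {"Cough", "Throat clearing", "Animal"}


def overlaps(a: Range, b: Range) -> bool:
    """True when two (start, end) ranges share some time span."""
    return a[0] < b[1] and b[0] < a[1]


def ranges_to_remove(text_ranges: List[Range],
                     tagged_windows: List[Tuple[Range, Set[str]]]) -> List[Range]:
    """Windows with an unwanted tag and no recognized text, merged if adjacent."""
    merged: List[Range] = []
    for window, tags in sorted(tagged_windows, key=lambda t: t[0]):
        if not (tags & UNWANTED):
            continue  # nothing unwanted detected in this window
        if any(overlaps(window, t) for t in text_ranges):
            continue  # speech was recognized here, so keep the window
        if merged and abs(merged[-1][1] - window[0]) < 1e-6:
            merged[-1] = (merged[-1][0], window[1])  # extend previous range
        else:
            merged.append(window)
    return merged


# Made-up example with 1.6 s windows.
text_ranges = [(0.0, 1.5), (5.0, 6.2)]
windows = [
    ((0.0, 1.6), {"Speech"}),
    ((1.6, 3.2), {"Throat clearing"}),
    ((3.2, 4.8), {"Cough", "Speech"}),
    ((4.8, 6.4), {"Speech"}),
]
print(ranges_to_remove(text_ranges, windows))
# [(1.6, 4.8)]
```

Because the decision is made per window, the removed ranges are only as precise as the window size; word-level timestamps could be used afterwards to tighten the boundaries.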
-
Hi,
Basically, any time I am not talking, there is either silence, some other noise, or throat clearing. My audio is noise-free, so it's quite clear, and I want to keep only the verbal parts without changing the audio length, as it's synced to video.
Is there any tool or API that can do this? I tried a few splitter tools online but they failed to remove throat clearing from verbal parts.
I thought maybe I could use the Whisper API to detect the timestamps where there is speech and silence all other parts. Is that feasible?
It's about 80 hours of audio (~200 files).
Thanks a lot in advance.
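For reference, a minimal sketch of how speech timestamps could be obtained with the openai-whisper package, using its `word_timestamps=True` option. The helper names `transcribe_with_words` and `collect_words` are made up for this example, and the sample result dictionary below is hand-written rather than real model output. Whisper can still attach text to non-speech sounds, so the timestamps are not guaranteed to exclude throat clearing.

```python
from typing import Any, Dict, List


def transcribe_with_words(path: str, model_name: str = "base") -> Dict[str, Any]:
    """Run openai-whisper with word-level timestamps enabled."""
    import whisper  # pip install openai-whisper; imported lazily on purpose

    model = whisper.load_model(model_name)
    return model.transcribe(path, word_timestamps=True)


def collect_words(result: Dict[str, Any]) -> List[Dict[str, float]]:
    """Flatten the per-segment word lists into one list of start/end times."""
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append({"start": float(w["start"]), "end": float(w["end"])})
    return words


# Hand-written stand-in for a transcription result (not real model output).
fake_result = {
    "segments": [
        {"words": [{"word": " Hi", "start": 0.5, "end": 0.9, "probability": 0.9}]},
        {"words": []},
    ]
}
print(collect_words(fake_result))
# [{'start': 0.5, 'end': 0.9}]
```

For a batch of around 200 files, loading the model once and reusing it across files would avoid repeating the load cost that `transcribe_with_words` pays on every call.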