Best model and settings for (not clean) audiodramas? #1928

Pocokk · 2023-12-30T21:19:35Z

Hey there!

As my hearing is not so great, the voices usually get lost amid various sound effects (gunshots, street noise, background distortion, etc.), and I can't follow the context of certain audiodramas; it's just a big mess for me sometimes.

That's why I'm asking for the community's help here: which model and settings should I use for this kind of transcription, when the source is not clean, includes a lot of extra sound effects, and can be hard to make out even for a native English speaker?

Has anyone figured out the optimal usage in that case?

I appreciate all the help here and thank you so much in advance!

glangford · 2023-12-31T01:40:40Z

If you haven't had success transcribing noisy audio with the large-v2 model, using a tool such as demucs to separate the spoken portion prior to running whisper might be a good first step.

You can use a command similar to this to separate vocals and 'everything else':
python -m demucs --two-stems vocals --mp3 -o demucs noisy.mp4
(The --two-stems=vocals option allows separating vocals from the rest of the accompaniment (i.e., karaoke mode))
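
As a rough sketch of the two-step workflow (separate the vocals first, then transcribe), the Python below wraps the same demucs command and hands the vocals stem to whisper. The helper names are made up for illustration, and the output location `demucs/htdemucs/noisy/vocals.mp3` is an assumption about the default demucs model name and folder layout; check the folder demucs actually creates on your machine.

```python
import subprocess
import sys
from pathlib import Path


def build_demucs_command(source, out_dir="demucs"):
    """Return the demucs command shown above as an argument list."""
    return [sys.executable, "-m", "demucs",
            "--two-stems", "vocals", "--mp3", "-o", out_dir, source]


def expected_vocals_path(source, out_dir="demucs", model_name="htdemucs"):
    """Guess where demucs writes the vocals stem.

    ASSUMPTION: the layout <out_dir>/<model_name>/<track name>/vocals.mp3
    and the model name "htdemucs"; verify against the real output folder.
    """
    return Path(out_dir) / model_name / Path(source).stem / "vocals.mp3"


def separate_and_transcribe(source, whisper_model="large-v2"):
    """Run demucs on `source`, then transcribe the vocals stem with whisper."""
    import whisper  # imported lazily so the helpers above work without it

    subprocess.run(build_demucs_command(source), check=True)
    model = whisper.load_model(whisper_model)
    result = model.transcribe(str(expected_vocals_path(source)))
    return result["text"]
```

Calling `separate_and_transcribe("noisy.mp4")` would run both steps in sequence; whether the separated stem transcribes better than the original is something to compare by ear and by reading the output.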

Spleeter is another option (I haven't used that).

demucs is headlined as providing "music source separation"; I don't know how well it would perform on your particular audio with FX noise, so it's worth experimenting.

Pocokk Dec 31, 2023
Author

Well, the first try was a disaster: I set it up and it ran until 100% progress, then suddenly the whole system froze with maximum CPU/HDD load; here is the photo I took 20 minutes later. As there wasn't anything in the (Windows) Event Viewer logs, I can't provide any evidence of what went wrong, and now I'm a bit reluctant to try again. I've also looked it up and couldn't find a related issue from the past, which is a bit disturbing.

phineas-pta Dec 31, 2023

demucs has a tendency to use a lot of RAM (not VRAM), which can freeze a PC that doesn't have enough RAM

anyway, demucs and others like spleeter are made for dealing with songs, so they aren't suitable for the sound effects in your case; there's no automatic solution for now, only manual editing

Pocokk Dec 31, 2023
Author

That's sad to read, thank you.

glangford Dec 31, 2023

fyi

device=cpu, Too high memory usage when processing long audio facebookresearch/demucs#498
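
Since the linked issue concerns memory usage on long audio, one possible workaround (untested here, and not confirmed by the demucs authors) is to cut the recording into shorter pieces with ffmpeg before separating each piece. The sketch below only builds and runs the ffmpeg call; the function names, the 600-second chunk length, and the `chunk_%03d.wav` file pattern are arbitrary assumptions for illustration.

```python
import subprocess


def build_split_command(source, chunk_seconds=600, pattern="chunk_%03d.wav"):
    """Build an ffmpeg command that cuts the audio track into fixed-length chunks.

    ASSUMPTION: 600 seconds and the output pattern are example values only;
    pick a length that your RAM can handle.
    """
    return ["ffmpeg", "-i", source,
            "-vn",                    # drop any video stream, keep audio only
            "-f", "segment",          # ffmpeg's segment muxer
            "-segment_time", str(chunk_seconds),
            pattern]


def split_audio(source, chunk_seconds=600):
    """Run the split; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_split_command(source, chunk_seconds), check=True)
```

Each resulting chunk can then be passed to demucs on its own. Cutting at fixed times may split a word in half at a boundary, so a few words near each cut could be transcribed poorly.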
