Audio transcription with speaker labelling using speaker samples (Whisper & Pyannote) #2609
-
How much time will it take to process a 5-minute audio clip?
-
@alunkingusw Scriberr looks promising. rishikanthc is working on v1.0.0. It uses Whisper and pyannote as well. It can …
There are no subtitle features, but if there are skilled devs reading, there's no reason they couldn't be added. Give it a look.
-
https://drive.google.com/file/d/13WRf4UUCUBQ0NSzdtOfMjy90WTiWve1Z/view?usp=drivesdk
-
Hello, can I give you a test subject (a YouTube video) so we can see if it really works well? It's a video with two speakers who sometimes talk over each other: one interrupts the other to say something, then the other continues talking.
-
Follow-up questions:
-
Quick additional question:
-
OK, additional feedback:
-
File "/root/miniconda3/envs/whisper/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 547, in from_pretrained |
-
Hey @alunkingusw, this is great! A review UI to fix speaker attribution would be really useful. We'd love to pilot this with you or any contributors: just send a couple of sample recordings and we'll run the pipeline, give you clean exports, and show how reviewers can clean up errors fast. Let me know if you'd be open to a short run!
-
I've written some code that I am using in a project and I thought I would share it with you.
The code transcribes an audio file using Whisper, then diarises it (sorry for the lack of a z, I'm British) using Pyannote.
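For anyone who wants the shape of it before opening the gist, here's a minimal sketch of those two stages (simplified, not the full code). It assumes openai-whisper and pyannote.audio are installed and that your Hugging Face token has been accepted for the pyannote/speaker-diarization-3.1 model; the file name, model size, and token are placeholders.

```python
# Minimal sketch: transcription + diarisation (see the gist for the full version)
import whisper
from pyannote.audio import Pipeline

AUDIO_FILE = "interview.wav"  # placeholder input file

# Stage 1: Whisper transcription (returns timestamped segments)
asr_model = whisper.load_model("small")
transcription = asr_model.transcribe(AUDIO_FILE)

# Stage 2: pyannote diarisation (who spoke when, with anonymous labels)
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_your_token_here",  # placeholder Hugging Face token
)
diarization = diarization_pipeline(AUDIO_FILE)

# Each diarised turn has a start/end time and a label like SPEAKER_00
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```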
Once the two processes are complete, the result can be output, but my code then takes known speakers from sample clips, generates embeddings, and compares them to the speakers identified in the audio.
If a speaker is recognised, they are labelled in the output. If a speaker is not recognised, their segments are removed and the final transcription labels them as [None]. This removal is optional; you can comment it out in the code if you want. I found it useful for filtering out things like musical interludes in a podcast and focusing on the known speakers.
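In outline, the matching step looks something like this; again a simplified sketch rather than the exact gist code, and the embedding model choice, sample file names, and 0.5 similarity threshold here are illustrative.

```python
# Simplified sketch of the speaker-matching step (full version in the gist).
import numpy as np
from pyannote.audio import Inference, Model

embedding_model = Model.from_pretrained(
    "pyannote/embedding", use_auth_token="hf_your_token_here"
)
inference = Inference(embedding_model, window="whole")

# One reference clip per known speaker -> one embedding each (placeholder files)
known_speakers = {
    "Alice": inference("samples/alice.wav"),
    "Bob": inference("samples/bob.wav"),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(segment_embedding, threshold=0.5):
    """Return the best-matching known speaker, or None if nothing clears
    the threshold (None segments are then dropped or labelled [None])."""
    best_name, best_score = None, threshold
    for name, reference in known_speakers.items():
        score = cosine_similarity(segment_embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```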
Here's the code, which I ran in Colab:
https://gist.github.com/alunkingusw/2eb29682a98f94a714d10080ed0f4896