Audio transcription with speaker labelling using speaker samples (Whisper & Pyannote) #2609
-
How much time will it take to process a 5-minute audio clip?
-
@alunkingusw Scriberr looks promising. rishikanthc is working on v1.0.0. It uses Whisper and pyannote as well. It can …
There are no subtitle features, but if there are skilled devs reading, there's no reason they couldn't be added. Give it a look.
-
https://drive.google.com/file/d/13WRf4UUCUBQ0NSzdtOfMjy90WTiWve1Z/view?usp=drivesdk
-
Hello, can I give you a test subject (a YouTube video) so we can see if it really works well? It's a video with two speakers who sometimes talk over each other: one interrupts the other to say something, then the other continues talking.
-
Follow-up questions:
-
Quick additional question:
-
OK, additional feedback:
-
File "/root/miniconda3/envs/whisper/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 547, in from_pretrained |
-
Hey @alunkingusw, this is great! A review UI to fix speaker attribution would be really useful. We'd love to pilot this with you or any contributors: just send a couple of sample recordings and we'll run the pipeline, give you clean exports, and show how reviewers can clean up errors fast. Let me know if you'd be open to a short run!
-
I've written some code that I am using in a project and I thought I would share it with you.
The code transcribes an audio file using Whisper, then diarises it (sorry for the lack of a z, I'm British) using Pyannote.
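For anyone who wants the shape of it before opening the gist, here's a minimal sketch of those two stages (simplified, not the full code). It assumes openai-whisper and pyannote.audio are installed and that your Hugging Face token has been accepted for the pyannote/speaker-diarization-3.1 model; the file name, model size, and token are placeholders.

```python
# Minimal sketch: transcription + diarisation (see the gist for the full version)
import whisper
from pyannote.audio import Pipeline

AUDIO_FILE = "interview.wav"  # placeholder input file

# Stage 1: Whisper transcription (returns timestamped segments)
asr_model = whisper.load_model("small")
transcription = asr_model.transcribe(AUDIO_FILE)

# Stage 2: pyannote diarisation (who spoke when, with anonymous labels)
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_your_token_here",  # placeholder Hugging Face token
)
diarization = diarization_pipeline(AUDIO_FILE)

# Each diarised turn has a start/end time and a label like SPEAKER_00
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```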
Once the two processes are complete, the result can be output, but my code then takes known speakers from sample clips, generates embeddings, and compares them to the speakers identified in the audio.
If a speaker is recognised, they are labelled in the output. If a speaker is not recognised, their segments are removed and the final transcription labels them as [None]. This removal is optional; you can comment it out in the code if you want. I found it useful for filtering out things like musical interludes in a podcast and focusing on the known speakers.
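In outline, the matching step looks something like this; again a simplified sketch rather than the exact gist code, and the embedding model choice, sample file names, and 0.5 similarity threshold here are illustrative.

```python
# Simplified sketch of the speaker-matching step (full version in the gist).
import numpy as np
from pyannote.audio import Inference, Model

embedding_model = Model.from_pretrained(
    "pyannote/embedding", use_auth_token="hf_your_token_here"
)
inference = Inference(embedding_model, window="whole")

# One reference clip per known speaker -> one embedding each (placeholder files)
known_speakers = {
    "Alice": inference("samples/alice.wav"),
    "Bob": inference("samples/bob.wav"),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(segment_embedding, threshold=0.5):
    """Return the best-matching known speaker, or None if nothing clears
    the threshold (None segments are then dropped or labelled [None])."""
    best_name, best_score = None, threshold
    for name, reference in known_speakers.items():
        score = cosine_similarity(segment_embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```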
Here's the code, which I ran in Colab:
https://gist.github.com/alunkingusw/2eb29682a98f94a714d10080ed0f4896