This is a diarization + transcription project that uses Pyannote for VAD/segmentation, NeMo (ECAPA-TDNN + MSDD) for speaker embedding/clustering, and Faster-Whisper for ASR.
It supports two processing modes:
- Diarization + Transcription (`mode=full`)
- Transcription Only (`mode=asr`)
This project is my summer internship work, which I developed as the sole contributor.
Below are some shorts/videos about the project and my research:
| Topic | Watch |
|---|---|
| Full Project Overview | Watch on YouTube |
| Pipeline Sequence Diagram Explained | Watch on YouTube |
| Whisper Model Comparison & Faster-Whisper Insights | Watch on YouTube |
Below is the description for running the diarization pipeline. This work would not have been possible without the whisper-diarization project: https://github.com/MahmoudAshraf97/whisper-diarization
Please star the project on GitHub (see the top-right corner) if you appreciate my contribution to the community!
This repository combines Faster-Whisper ASR capabilities with Pyannote's segmentation model for Voice Activity Detection (VAD) and initial segmentation, and NeMo components (ECAPA-TDNN and MSDD) for speaker embeddings and clustering, to identify the speaker for each sentence in the transcription generated by Faster-Whisper.

First, the vocals are extracted from the audio to increase speaker embedding accuracy. The transcription is then generated using Faster-Whisper, and the timestamps are corrected and aligned using ctc-forced-aligner to minimize diarization error due to time shift. The audio is passed into Pyannote's segmentation model for VAD and segmentation to exclude silences, and NeMo's ECAPA-TDNN and MSDD components are used to extract speaker embeddings and cluster speakers, identifying the speaker for each segment. Finally, the speaker segments are associated with the word-level timestamps from ctc-forced-aligner to assign a speaker to each word, and the result is realigned using punctuation models to compensate for minor time shifts.
The output is displayed to the user in the frontend, which is built with React + Vite and communicates with the Flask backend over HTTP with CORS enabled.
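For orientation, the backend wiring boils down to something like the sketch below. This is a minimal illustration, not the repo's actual app: the port, the function body, and the response shape are assumptions.

```python
# Minimal sketch of the Flask + CORS backend (illustrative only).
from flask import Flask, request, jsonify
from flask_cors import CORS

app = Flask(__name__)
CORS(app)  # let the React/Vite frontend (e.g. http://localhost:5173) call this API

@app.route("/api/diarize", methods=["POST"])
def diarize():
    audio = request.files["audio"]           # uploaded audio file
    mode = request.form.get("mode", "full")  # "full" or "asr"
    # ... run the diarization/transcription pipeline on `audio` here ...
    return jsonify({"transcript": "", "diarization_json": None})

if __name__ == "__main__":
    app.run(port=5000)  # port is an assumption; match your frontend configuration
```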
`POST /api/diarize`
- Request: `multipart/form-data` with `audio` (file) and `mode` (`full` or `asr`)
- Returns: `transcript` (text) and `diarization_json` (when available)
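As an example, the endpoint can be exercised from Python as follows. The backend address, and the assumption that the response is a JSON body with these two keys, are mine; adjust to your deployment.

```python
import requests

BACKEND_URL = "http://localhost:5000/api/diarize"  # assumed local backend address

with open("ayushandamber.mp3", "rb") as audio_file:
    response = requests.post(
        BACKEND_URL,
        files={"audio": audio_file},
        data={"mode": "full"},  # or "asr" for transcription only
    )

response.raise_for_status()
result = response.json()  # assumed JSON body with "transcript" and "diarization_json"
print(result["transcript"])
print(result.get("diarization_json"))
```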
This project now supports known speaker enrollment and identification on top of standard diarization.
The workflow is split into two phases:
A speaker can be enrolled once using a clean reference audio sample. During enrollment, the system:
- Runs VAD to remove silence
- Extracts NeMo speaker embeddings (192-dim, TitaNet-based)
- Stores the averaged embedding in a persistent JSON database
Example:
```
python3 diarize.py \
  --enroll-speaker \
  --speaker-audio "Speaker Audios/Ayush.mp3" \
  --speaker-label "Ayush" \
  --speakers-db "Speaker Audios/speakers_db.json"
```

This creates or updates `speakers_db.json` with the enrolled speaker's embedding.
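To sanity-check an enrollment, you can open the database file directly. The snippet below assumes a simple name-to-vector layout; the script may store additional metadata alongside each embedding.

```python
import json

# Assumed schema: {"Ayush": [<192 floats>], ...} — the actual file may hold more fields.
with open("Speaker Audios/speakers_db.json") as f:
    speakers_db = json.load(f)

for name, embedding in speakers_db.items():
    print(f"{name}: {len(embedding)}-dimensional embedding")
```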
Once speakers are enrolled, diarization can identify who is speaking by comparing cluster embeddings against the enrolled database.
Example:
```
python3 diarize.py \
  -a "ayushandamber.mp3" \
  --mode full \
  --identify-known \
  --candidate-labels "Ayush,Amber"
```

Internally, the pipeline:
- Runs full diarization (Pyannote + NeMo MSDD)
- Aggregates audio per diarized cluster
- Computes NeMo speaker embeddings per cluster
- Matches clusters to enrolled speakers using cosine similarity
- Assigns speaker identities when similarity crosses a threshold
If `--candidate-labels` is provided, matching is restricted to those speakers only.
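A minimal sketch of that matching step is shown below, assuming embeddings are stored as plain float lists in `speakers_db.json`; the 0.6 threshold is only a placeholder, not the pipeline's actual setting.

```python
import json
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_clusters(cluster_embeddings, db_path, candidate_labels=None, threshold=0.6):
    """Map diarized cluster labels (e.g. SPEAKER_0) to enrolled speaker names."""
    with open(db_path) as f:
        db = json.load(f)  # assumed schema: {"Ayush": [<192 floats>], ...}
    if candidate_labels:
        db = {name: emb for name, emb in db.items() if name in candidate_labels}
    assignments = {}
    for label, emb in cluster_embeddings.items():
        scores = {name: cosine(emb, ref) for name, ref in db.items()}
        best = max(scores, key=scores.get) if scores else None
        assignments[label] = best if best and scores[best] >= threshold else "UNKNOWN"
    return assignments

# Example (placeholder vectors): restrict matching to the CLI's candidate labels.
# clusters = {"SPEAKER_0": emb0, "SPEAKER_1": emb1}   # one 192-dim vector per cluster
# identify_clusters(clusters, "Speaker Audios/speakers_db.json", ["Ayush", "Amber"])
```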
The diagram below illustrates how a new speaker is enrolled into the system and how a persistent speaker embedding is generated for future identification.
```mermaid
sequenceDiagram
participant User
participant CLI as diarize.py
participant Audio as Audio Loader
participant Seg as Pyannote Segmentation Model
participant VAD as VAD Post-Processing
participant NeMo as NeMo Speaker Model (TitaNet)
participant DB as speakers_db.json
User->>CLI: Run --enroll-speaker
CLI->>Audio: Load & normalize audio (mono, 16kHz)
Audio->>Seg: Frame-level speech segmentation
Seg->>VAD: Speech probabilities per frame
VAD->>CLI: Final speech segments (timestamps)
CLI->>CLI: Concatenate speech segments (10–15s)
CLI->>NeMo: Extract speaker embedding
NeMo->>CLI: Return fixed-length embedding (192-dim)
CLI->>DB: Save / update speaker embedding
DB-->>User: Speaker enrolled successfully
```
Watch the Speaker Enrollment Shorts: YouTube Shorts
During speaker enrollment, the system focuses on extracting a clean and reliable voice representation for a single speaker.
Silence and noise are removed using voice activity detection, and only speech segments are used.
These speech segments are passed through a NeMo-based speaker model (TitaNet) to generate a fixed-length embedding that uniquely represents the speaker's voice.
The resulting embedding is stored in a persistent database and later reused during diarization to identify who is speaking.
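Conceptually, the embedding step reduces to something like the sketch below, assuming NeMo's pretrained `titanet_large` checkpoint and its `get_embedding` helper; the real script additionally runs VAD first and works on the concatenated speech-only audio.

```python
import numpy as np
import nemo.collections.asr as nemo_asr

# Load the pretrained TitaNet speaker model (produces 192-dim embeddings).
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("titanet_large")

# "speech_only.wav" stands in for the VAD-trimmed, concatenated enrollment audio.
embedding = speaker_model.get_embedding("speech_only.wav").squeeze().detach().cpu().numpy()

# L2-normalize so later cosine comparisons reduce to a dot product.
embedding = embedding / np.linalg.norm(embedding)
print(embedding.shape)  # expected: (192,)
```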
Below is a concise view of how speaker diarization and known-speaker identification work together once speakers have been enrolled.
```mermaid
sequenceDiagram
autonumber
actor User
participant CLI as diarize.py
participant VAD as Pyannote Segmentation + VAD
participant MSDD as NeMo MSDD
participant TitaNet as NeMo TitaNet
participant DB as Speaker DB
participant Match as Cosine Match
User->>CLI: Run diarization with identify-known
CLI->>VAD: Segment audio into speech regions
VAD-->>CLI: Speech segments (5-scale windows)
CLI->>MSDD: Cluster speech segments
MSDD-->>CLI: Speaker labels (SPEAKER_0, SPEAKER_1)
CLI->>TitaNet: Create embedding per speaker cluster
TitaNet-->>CLI: Speaker embeddings
CLI->>DB: Load enrolled speaker embeddings
DB-->>CLI: Known speaker vectors
CLI->>Match: Cosine similarity + threshold
Match-->>CLI: Speaker identity (MATCH / UNKNOWN)
CLI-->>User: Diarized output with speaker names
```
Once speakers are enrolled, the system can identify who is speaking during diarization.
The audio is first passed through Pyannote's segmentation model, which performs voice activity detection and produces speech-only regions. These regions are processed using five multi-scale windows, which allows the NeMo diarizer to remain robust to short utterances, pauses, and speaker changes.
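For reference, the five scales typically correspond to window/shift pairs like those below (the values mirror NeMo's published telephonic diarization defaults; the exact numbers used by this repo's config may differ).

```python
from omegaconf import OmegaConf

# Illustrative multi-scale embedding settings for the NeMo diarizer.
multiscale = OmegaConf.create({
    "window_length_in_sec": [1.5, 1.25, 1.0, 0.75, 0.5],     # five window sizes (seconds)
    "shift_length_in_sec": [0.75, 0.625, 0.5, 0.375, 0.25],  # corresponding hop sizes
    "multiscale_weights": [1, 1, 1, 1, 1],                   # equal weight per scale
})
print(OmegaConf.to_yaml(multiscale))
```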
Next, NeMo MSDD clusters these speech segments to determine who spoke when, assigning labels such as SPEAKER_0, SPEAKER_1, etc.
For each speaker cluster, the corresponding audio is aggregated and passed through NeMo's TitaNet speaker model to generate a high-quality 192-dimensional speaker embedding.
These cluster embeddings are then compared against the enrolled speaker embeddings stored in the speaker database using cosine similarity.
If the similarity crosses a threshold, the speaker is identified by name; otherwise, the speaker remains unknown.
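In symbols, each cluster embedding $\mathbf{e}_c$ is compared against every enrolled embedding $\mathbf{e}_s$, and a name is assigned only when the best score clears the threshold $\tau$ (whose concrete value is a configurable detail not fixed in this README):

$$
\operatorname{sim}(\mathbf{e}_c, \mathbf{e}_s) = \frac{\mathbf{e}_c \cdot \mathbf{e}_s}{\lVert \mathbf{e}_c \rVert\,\lVert \mathbf{e}_s \rVert},
\qquad
\hat{s}(c) =
\begin{cases}
\arg\max_{s} \operatorname{sim}(\mathbf{e}_c, \mathbf{e}_s) & \text{if } \max_{s} \operatorname{sim}(\mathbf{e}_c, \mathbf{e}_s) \ge \tau \\
\text{UNKNOWN} & \text{otherwise}
\end{cases}
$$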
This design cleanly separates:
- Segmentation (Pyannote)
- Clustering (NeMo MSDD with 5-scale windows)
- Speaker representation (TitaNet embeddings)
- Identity matching (cosine similarity)
The result is a robust, extensible diarization pipeline that supports both unknown speakers and known speaker identification without breaking existing diarization behavior.
This repo ships with a React frontend and a Python backend (Flask). The two services communicate over CORS.
To run the entire stack locally using Docker, use the provided docker-compose.yml, which creates two containers (frontend + backend).
Prerequisites: Docker & Docker Compose installed.
```
# From the repository root

# Option A: build then run
docker-compose build
docker-compose up

# Option B: one-shot build + run
docker-compose up --build
```

This will spin up two containers on your local system:
- Backend: Flask (handles diarization and transcription requests).
- Frontend: React + Vite interface for uploading audio, running diarization, and viewing results.
Once running, you can access the UI at http://localhost:5173 (default Vite port), which will interact with the backend automatically.
If you don't want to run it locally, you can try the hosted version here: MIE Diarization Website
- Overlapping speakers are not yet handled. A possible approach would be to separate the audio into per-speaker stems and feed each isolated speaker through the pipeline, but this would require significantly more computation.
- There might be some errors, please raise an issue if you encounter any.
- Diarization/transcription quality can vary depending on audio quality and speaker overlap.
- Improve handling of overlapping speech
This work is based on OpenAI's Whisper, Faster-Whisper, NVIDIA NeMo, Facebook's Demucs, and Pyannote for segmentation.
The overall flow is as follows:
```mermaid
sequenceDiagram
participant User
participant Demucs
participant Whisper
participant CTCAligner
participant Pyannote
participant NeMo
participant Punctuation
participant Output
User->>Demucs: Provide audio input
alt Source separation enabled
Demucs->>User: Return vocals.wav
else Skipped
User->>Whisper: Use original audio
end
User->>Whisper: Transcribe audio (Faster-Whisper)
Whisper->>CTCAligner: Generate emissions & align words
CTCAligner->>User: Word-level timestamps
User->>Pyannote: Run segmentation (VAD)
Pyannote->>NeMo: Extract embeddings (ECAPA-TDNN) & cluster (MSDD)
NeMo->>User: Speaker segment timestamps
User->>NeMo: Map words to speakers
alt Punctuation supported
User->>Punctuation: Add punctuation
Punctuation->>User: Return punctuated transcript
else Not supported
User->>User: Skip punctuation
end
User->>Output: Save outputs (.txt, .srt)
Output-->>User: Diarized and transcribed results
```
```mermaid
flowchart TD
A["Start: user provides audio file or mic input"] --> B{Source separation enabled?}
B -- Yes --> C["Demucs output: vocals.wav"]
B -- No --> D["Use original audio"]
%% Common audio input for both branches
C --> X["Audio for processing"]
D --> X
%% Transcription pipeline (single pipeline: Whisper + CTC aligner)
subgraph Transcription
direction TB
T1["Faster-Whisper (ASR)"]
T2["CTC forced aligner (word-level timestamps)"]
T1 --> T2
end
X --> T1
%% Diarization pipeline (segmentation + embeddings/clustering)
subgraph Diarization
direction TB
D1["Pyannote segmentation (VAD)"]
D2["NeMo embeddings (ECAPA-TDNN) + clustering (MSDD)"]
D1 --> D2
end
X --> D1
%% Merge branches to align speakers to words
T2 --> M["Align speakers to words (map segments to tokens)"]
D2 --> M
%% Optional punctuation
M --> P{Punctuation enabled?}
P -- Yes --> P1["Apply punctuation"]
P -- No --> P2["Skip punctuation"]
%% Outputs
P1 --> O["Write outputs: txt / srt / json"]
P2 --> O
O --> R["Return diarized + transcribed results"]
```