This is a diarization + transcription project that uses Pyannote for VAD/segmentation, NeMo (ECAPA-TDNN + MSDD) for speaker embedding/clustering, and Faster-Whisper for ASR.
It supports two processing modes:
- Diarization + Transcription (`mode=full`)
- Transcription Only (`mode=asr`)
This project is my summer internship work, which I developed as the sole contributor.
Below are some shorts/videos about the project and my research:
| Topic | Watch |
|---|---|
| Full Project Overview | Watch on YouTube |
| Pipeline Sequence Diagram Explained | Watch on YouTube |
| Whisper Model Comparison & Faster-Whisper Insights | Watch on YouTube |
Below is the description for running the diarization pipeline. This work would not have been possible without the whisper-diarization project: https://github.com/MahmoudAshraf97/whisper-diarization
Please star the project on GitHub (see the top-right corner) if you appreciate my contribution to the community!
This repository combines Faster-Whisper ASR capabilities with Pyannote's segmentation model for Voice Activity Detection (VAD) and initial segmentation, and NeMo components (ECAPA-TDNN and MSDD) for speaker embeddings and clustering, to identify the speaker for each sentence in the transcription generated by Faster-Whisper.

First, the vocals are extracted from the audio to increase speaker embedding accuracy. The transcription is then generated using Faster-Whisper, and the timestamps are corrected and aligned using ctc-forced-aligner to minimize diarization error due to time shift. The audio is passed into Pyannote's segmentation model for VAD and segmentation to exclude silences, and NeMo's ECAPA-TDNN and MSDD components are used to extract speaker embeddings and cluster speakers, identifying the speaker for each segment. Finally, the speaker segments are associated with the word-level timestamps from ctc-forced-aligner to assign a speaker to each word, and the result is realigned using punctuation models to compensate for minor time shifts.
The output is displayed to the user in the frontend, which is built with React + Vite and communicates with the Flask backend over HTTP with CORS enabled.
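For orientation, the backend wiring boils down to something like the sketch below. This is a minimal illustration, not the repo's actual app: the port, the function body, and the response shape are assumptions.

```python
# Minimal sketch of the Flask + CORS backend (illustrative only).
from flask import Flask, request, jsonify
from flask_cors import CORS

app = Flask(__name__)
CORS(app)  # let the React/Vite frontend (e.g. http://localhost:5173) call this API

@app.route("/api/diarize", methods=["POST"])
def diarize():
    audio = request.files["audio"]           # uploaded audio file
    mode = request.form.get("mode", "full")  # "full" or "asr"
    # ... run the diarization/transcription pipeline on `audio` here ...
    return jsonify({"transcript": "", "diarization_json": None})

if __name__ == "__main__":
    app.run(port=5000)  # port is an assumption; match your frontend configuration
```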
`POST /api/diarize`
- Request: `multipart/form-data` with `audio` (file) and `mode` (`full` or `asr`)
- Returns: `transcript` (text) and `diarization_json` (when available)
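As an example, the endpoint can be exercised from Python as follows. The backend address, and the assumption that the response is a JSON body with these two keys, are mine; adjust to your deployment.

```python
import requests

BACKEND_URL = "http://localhost:5000/api/diarize"  # assumed local backend address

with open("ayushandamber.mp3", "rb") as audio_file:
    response = requests.post(
        BACKEND_URL,
        files={"audio": audio_file},
        data={"mode": "full"},  # or "asr" for transcription only
    )

response.raise_for_status()
result = response.json()  # assumed JSON body with "transcript" and "diarization_json"
print(result["transcript"])
print(result.get("diarization_json"))
```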
This project now supports known speaker enrollment and identification on top of standard diarization.
The workflow is split into two phases:
A speaker can be enrolled once using a clean reference audio sample. During enrollment, the system:
- Runs VAD to remove silence
- Extracts NeMo speaker embeddings (192-dim, TitaNet-based)
- Stores the averaged embedding in a persistent JSON database
Example:
```
python3 diarize.py \
  --enroll-speaker \
  --speaker-audio "Speaker Audios/Ayush.mp3" \
  --speaker-label "Ayush" \
  --speakers-db "Speaker Audios/speakers_db.json"
```

This creates or updates `speakers_db.json` with the enrolled speaker's embedding.
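To sanity-check an enrollment, you can open the database file directly. The snippet below assumes a simple name-to-vector layout; the script may store additional metadata alongside each embedding.

```python
import json

# Assumed schema: {"Ayush": [<192 floats>], ...} — the actual file may hold more fields.
with open("Speaker Audios/speakers_db.json") as f:
    speakers_db = json.load(f)

for name, embedding in speakers_db.items():
    print(f"{name}: {len(embedding)}-dimensional embedding")
```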
Once speakers are enrolled, diarization can identify who is speaking by comparing cluster embeddings against the enrolled database.
Example:
```
python3 diarize.py \
  -a "ayushandamber.mp3" \
  --mode full \
  --identify-known \
  --candidate-labels "Ayush,Amber"
```

Internally, the pipeline:
- Runs full diarization (Pyannote + NeMo MSDD)
- Aggregates audio per diarized cluster
- Computes NeMo speaker embeddings per cluster
- Matches clusters to enrolled speakers using cosine similarity
- Assigns speaker identities when similarity crosses a threshold
If `--candidate-labels` is provided, matching is restricted to those speakers only.
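A minimal sketch of that matching step is shown below, assuming embeddings are stored as plain float lists in `speakers_db.json`; the 0.6 threshold is only a placeholder, not the pipeline's actual setting.

```python
import json
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_clusters(cluster_embeddings, db_path, candidate_labels=None, threshold=0.6):
    """Map diarized cluster labels (e.g. SPEAKER_0) to enrolled speaker names."""
    with open(db_path) as f:
        db = json.load(f)  # assumed schema: {"Ayush": [<192 floats>], ...}
    if candidate_labels:
        db = {name: emb for name, emb in db.items() if name in candidate_labels}
    assignments = {}
    for label, emb in cluster_embeddings.items():
        scores = {name: cosine(emb, ref) for name, ref in db.items()}
        best = max(scores, key=scores.get) if scores else None
        assignments[label] = best if best and scores[best] >= threshold else "UNKNOWN"
    return assignments

# Example (placeholder vectors): restrict matching to the CLI's candidate labels.
# clusters = {"SPEAKER_0": emb0, "SPEAKER_1": emb1}   # one 192-dim vector per cluster
# identify_clusters(clusters, "Speaker Audios/speakers_db.json", ["Ayush", "Amber"])
```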
The diagram below illustrates how a new speaker is enrolled into the system and how a persistent speaker embedding is generated for future identification.
```mermaid
sequenceDiagram
participant User
participant CLI as diarize.py
participant Audio as Audio Loader
participant Seg as Pyannote Segmentation Model
participant VAD as VAD Post-Processing
participant NeMo as NeMo Speaker Model (TitaNet)
participant DB as speakers_db.json
User->>CLI: Run --enroll-speaker
CLI->>Audio: Load & normalize audio (mono, 16kHz)
Audio->>Seg: Frame-level speech segmentation
Seg->>VAD: Speech probabilities per frame
VAD->>CLI: Final speech segments (timestamps)
CLI->>CLI: Concatenate speech segments (10–15s)
CLI->>NeMo: Extract speaker embedding
NeMo->>CLI: Return fixed-length embedding (192-dim)
CLI->>DB: Save / update speaker embedding
DB-->>User: Speaker enrolled successfully
```
Watch the Speaker Enrollment Shorts: YouTube Shorts
During speaker enrollment, the system focuses on extracting a clean and reliable voice representation for a single speaker.
Silence and noise are removed using voice activity detection, and only speech segments are used.
These speech segments are passed through a NeMo-based speaker model (TitaNet) to generate a fixed-length embedding that uniquely represents the speaker's voice.
The resulting embedding is stored in a persistent database and later reused during diarization to identify who is speaking.
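Conceptually, the embedding step reduces to something like the sketch below, assuming NeMo's pretrained `titanet_large` checkpoint and its `get_embedding` helper; the real script additionally runs VAD first and works on the concatenated speech-only audio.

```python
import numpy as np
import nemo.collections.asr as nemo_asr

# Load the pretrained TitaNet speaker model (produces 192-dim embeddings).
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("titanet_large")

# "speech_only.wav" stands in for the VAD-trimmed, concatenated enrollment audio.
embedding = speaker_model.get_embedding("speech_only.wav").squeeze().detach().cpu().numpy()

# L2-normalize so later cosine comparisons reduce to a dot product.
embedding = embedding / np.linalg.norm(embedding)
print(embedding.shape)  # expected: (192,)
```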
Below is a concise view of how speaker diarization and known-speaker identification work together once speakers have been enrolled.
```mermaid
sequenceDiagram
autonumber
actor User
participant CLI as diarize.py
participant VAD as Pyannote Segmentation + VAD
participant MSDD as NeMo MSDD
participant TitaNet as NeMo TitaNet
participant DB as Speaker DB
participant Match as Cosine Match
User->>CLI: Run diarization with identify-known
CLI->>VAD: Segment audio into speech regions
VAD-->>CLI: Speech segments (5-scale windows)
CLI->>MSDD: Cluster speech segments
MSDD-->>CLI: Speaker labels (SPEAKER_0, SPEAKER_1)
CLI->>TitaNet: Create embedding per speaker cluster
TitaNet-->>CLI: Speaker embeddings
CLI->>DB: Load enrolled speaker embeddings
DB-->>CLI: Known speaker vectors
CLI->>Match: Cosine similarity + threshold
Match-->>CLI: Speaker identity (MATCH / UNKNOWN)
CLI-->>User: Diarized output with speaker names
```
Once speakers are enrolled, the system can identify who is speaking during diarization.
The audio is first passed through Pyannote's segmentation model, which performs voice activity detection and produces speech-only regions. These regions are processed using five multi-scale windows, which allows the NeMo diarizer to remain robust to short utterances, pauses, and speaker changes.
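For reference, the five scales typically correspond to window/shift pairs like those below (the values mirror NeMo's published telephonic diarization defaults; the exact numbers used by this repo's config may differ).

```python
from omegaconf import OmegaConf

# Illustrative multi-scale embedding settings for the NeMo diarizer.
multiscale = OmegaConf.create({
    "window_length_in_sec": [1.5, 1.25, 1.0, 0.75, 0.5],     # five window sizes (seconds)
    "shift_length_in_sec": [0.75, 0.625, 0.5, 0.375, 0.25],  # corresponding hop sizes
    "multiscale_weights": [1, 1, 1, 1, 1],                   # equal weight per scale
})
print(OmegaConf.to_yaml(multiscale))
```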
Next, NeMo MSDD clusters these speech segments to determine who spoke when, assigning labels such as SPEAKER_0, SPEAKER_1, etc.
For each speaker cluster, the corresponding audio is aggregated and passed through NeMo's TitaNet speaker model to generate a high-quality 192-dimensional speaker embedding.
These cluster embeddings are then compared against the enrolled speaker embeddings stored in the speaker database using cosine similarity.
If the similarity crosses a threshold, the speaker is identified by name; otherwise, the speaker remains unknown.
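In symbols, each cluster embedding $\mathbf{e}_c$ is compared against every enrolled embedding $\mathbf{e}_s$, and a name is assigned only when the best score clears the threshold $\tau$ (whose concrete value is a configurable detail not fixed in this README):

$$
\operatorname{sim}(\mathbf{e}_c, \mathbf{e}_s) = \frac{\mathbf{e}_c \cdot \mathbf{e}_s}{\lVert \mathbf{e}_c \rVert\,\lVert \mathbf{e}_s \rVert},
\qquad
\hat{s}(c) =
\begin{cases}
\arg\max_{s} \operatorname{sim}(\mathbf{e}_c, \mathbf{e}_s) & \text{if } \max_{s} \operatorname{sim}(\mathbf{e}_c, \mathbf{e}_s) \ge \tau \\
\text{UNKNOWN} & \text{otherwise}
\end{cases}
$$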
This design cleanly separates:
- Segmentation (Pyannote)
- Clustering (NeMo MSDD with 5-scale windows)
- Speaker representation (TitaNet embeddings)
- Identity matching (cosine similarity)
The result is a robust, extensible diarization pipeline that supports both unknown speakers and known speaker identification without breaking existing diarization behavior.
This repo ships with a React frontend and a Python backend (Flask). The two services communicate over CORS.
To run the entire stack locally using Docker, use the provided docker-compose.yml, which creates two containers (frontend + backend).
Prerequisites: Docker & Docker Compose installed.
```
# From the repository root

# Option A: build then run
docker-compose build
docker-compose up

# Option B: one-shot build + run
docker-compose up --build
```

This will spin up two containers on your local system:
- Backend: Flask (handles diarization and transcription requests).
- Frontend: React + Vite interface for uploading audio, running diarization, and viewing results.
Once running, you can access the UI at http://localhost:5173 (default Vite port), which will interact with the backend automatically.
If you don't want to run it locally, you can try the hosted version here: MIE Diarization Website
- Overlapping speakers are not yet handled. A possible approach would be to separate the audio into per-speaker stems and feed each isolated speaker through the pipeline, but this would require significantly more computation.
- There might be some errors, please raise an issue if you encounter any.
- Diarization/transcription quality can vary depending on audio quality and speaker overlap.
- Improve handling of overlapping speech
This work is based on OpenAI's Whisper, Faster-Whisper, NVIDIA NeMo, Facebook's Demucs, and Pyannote for segmentation.
The overall flow is as follows:
```mermaid
sequenceDiagram
participant User
participant Demucs
participant Whisper
participant CTCAligner
participant Pyannote
participant NeMo
participant Punctuation
participant Output
User->>Demucs: Provide audio input
alt Source separation enabled
Demucs->>User: Return vocals.wav
else Skipped
User->>Whisper: Use original audio
end
User->>Whisper: Transcribe audio (Faster-Whisper)
Whisper->>CTCAligner: Generate emissions & align words
CTCAligner->>User: Word-level timestamps
User->>Pyannote: Run segmentation (VAD)
Pyannote->>NeMo: Extract embeddings (ECAPA-TDNN) & cluster (MSDD)
NeMo->>User: Speaker segment timestamps
User->>NeMo: Map words to speakers
alt Punctuation supported
User->>Punctuation: Add punctuation
Punctuation->>User: Return punctuated transcript
else Not supported
User->>User: Skip punctuation
end
User->>Output: Save outputs (.txt, .srt)
Output-->>User: Diarized and transcribed results
```
```mermaid
flowchart TD
A["Start: user provides audio file or mic input"] --> B{Source separation enabled?}
B -- Yes --> C["Demucs output: vocals.wav"]
B -- No --> D["Use original audio"]
%% Common audio input for both branches
C --> X["Audio for processing"]
D --> X
%% Transcription pipeline (single pipeline: Whisper + CTC aligner)
subgraph Transcription
direction TB
T1["Faster-Whisper (ASR)"]
T2["CTC forced aligner (word-level timestamps)"]
T1 --> T2
end
X --> T1
%% Diarization pipeline (segmentation + embeddings/clustering)
subgraph Diarization
direction TB
D1["Pyannote segmentation (VAD)"]
D2["NeMo embeddings (ECAPA-TDNN) + clustering (MSDD)"]
D1 --> D2
end
X --> D1
%% Merge branches to align speakers to words
T2 --> M["Align speakers to words (map segments to tokens)"]
D2 --> M
%% Optional punctuation
M --> P{Punctuation enabled?}
P -- Yes --> P1["Apply punctuation"]
P -- No --> P2["Skip punctuation"]
%% Outputs
P1 --> O["Write outputs: txt / srt / json"]
P2 --> O
O --> R["Return diarized + transcribed results"]
```