MIE_Diarization

This is a diarization + transcription project that uses Pyannote for VAD/segmentation, NeMo (ECAPA-TDNN + MSDD) for speaker embedding/clustering, and Faster-Whisper for ASR.

It supports two processing modes:

  • Diarization + Transcription (mode=full)
  • Transcription Only (mode=asr)

This project is my summer internship work, which I developed as the sole contributor.

Below are some shorts/videos about the project and my research:

🎬 Project Shorts

  • 📊 Full Project Overview: Watch on YouTube
  • 📈 Pipeline Sequence Diagram Explained: Watch on YouTube
  • 🤖 Whisper Model Comparison & Faster-Whisper Insights: Watch on YouTube

Below is a description of how to run the diarization pipeline. This work would not have been possible without the whisper-diarization project at https://github.com/MahmoudAshraf97/whisper-diarization.

Speaker Diarization using a NeMo + Pyannote Pipeline with OpenAI Whisper

Please star the project on GitHub (see the top-right corner) if you appreciate my contribution to the community!

What is it

This repository combines Faster-Whisper ASR with Pyannote's segmentation model for Voice Activity Detection (VAD) and initial segmentation, and NeMo components (ECAPA-TDNN and MSDD) for speaker embeddings and clustering, in order to identify the speaker for each sentence in the transcription generated by Faster-Whisper.

First, the vocals are extracted from the audio to improve speaker embedding accuracy, and the transcription is generated with Faster-Whisper. The timestamps are then corrected and aligned using ctc-forced-aligner to help minimize diarization error due to time shift. Next, the audio is passed into Pyannote's segmentation model for VAD and segmentation to exclude silences, and NeMo's ECAPA-TDNN and MSDD components extract speaker embeddings and cluster speakers to identify the speaker for each segment. Finally, the result is combined with the word-level timestamps from ctc-forced-aligner to assign a speaker to each word, and realigned using punctuation models to compensate for minor time shifts.

The output is displayed to the user in the frontend, which is built with React + Vite and communicates with the backend (Flask) over CORS.
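For reference, cross-origin access on the Flask side is typically enabled with the flask-cors extension. The snippet below is only a minimal sketch of that setup; the allowed origin and variable names are assumptions, not the repo's actual configuration.

from flask import Flask
from flask_cors import CORS

app = Flask(__name__)
# Allow the Vite dev server (default http://localhost:5173) to call the API across origins.
CORS(app, origins=["http://localhost:5173"])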

API

  • POST /api/diarize
    • multipart/form-data: audio (file), mode (full or asr)
    • Returns: transcript (text) and diarization_json (when available)
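As an illustration, the endpoint can be exercised with a small Python client. The host, port, and response keys below are assumptions based on the description above, not values confirmed by the repo; adjust the URL to wherever the Flask backend is actually listening.

import requests

# Hypothetical client call; replace the URL with your backend's actual address and port.
with open("meeting.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:5000/api/diarize",
        files={"audio": ("meeting.wav", f, "audio/wav")},
        data={"mode": "full"},  # or "asr" for transcription only
    )

result = resp.json()                    # assuming a JSON response with these keys
print(result["transcript"])
print(result.get("diarization_json"))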

Speaker Enrollment & Known Speaker Identification

This project now supports known speaker enrollment and identification on top of standard diarization.

The workflow is split into two phases:

1. Speaker Enrollment

A speaker can be enrolled once using a clean reference audio sample. During enrollment, the system:

  • Runs VAD to remove silence
  • Extracts NeMo speaker embeddings (192-dim, Titanet-based)
  • Stores the averaged embedding in a persistent JSON database

Example:

python3 diarize.py \
  --enroll-speaker \
  --speaker-audio "Speaker Audios/Ayush.mp3" \
  --speaker-label "Ayush" \
  --speakers-db "Speaker Audios/speakers_db.json"

This creates or updates speakers_db.json with the enrolled speaker embedding.
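Conceptually, enrollment boils down to averaging the per-segment embeddings and writing the result under the speaker label. The sketch below only illustrates that idea; the function name and JSON layout are assumptions that mirror the description above, not the actual code in diarize.py.

import json
import numpy as np

def save_enrollment(db_path, label, segment_embeddings):
    # Average the 192-dim per-segment embeddings into one reference vector.
    embedding = np.mean(np.stack(segment_embeddings), axis=0)
    try:
        with open(db_path) as f:
            db = json.load(f)
    except FileNotFoundError:
        db = {}
    db[label] = embedding.tolist()  # store as a plain list so the DB stays valid JSON
    with open(db_path, "w") as f:
        json.dump(db, f, indent=2)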

2. Known Speaker Identification during Diarization

Once speakers are enrolled, diarization can identify who is speaking by comparing cluster embeddings against the enrolled database.

Example:

python3 diarize.py \
  -a "ayushandamber.mp3" \
  --mode full \
  --identify-known \
  --candidate-labels "Ayush,Amber"

Internally, the pipeline:

  • Runs full diarization (Pyannote + NeMo MSDD)
  • Aggregates audio per diarized cluster
  • Computes NeMo speaker embeddings per cluster
  • Matches clusters to enrolled speakers using cosine similarity
  • Assigns speaker identities when similarity crosses a threshold

If --candidate-labels is provided, matching is restricted to those speakers only.
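A minimal sketch of that matching step is shown below. The 0.6 threshold and the function name are illustrative assumptions; the actual value and structure live in diarize.py.

import numpy as np

def match_cluster(cluster_emb, enrolled_db, candidate_labels=None, threshold=0.6):
    # Compare one cluster embedding against every enrolled speaker via cosine similarity.
    best_label, best_score = None, -1.0
    for label, emb in enrolled_db.items():
        if candidate_labels and label not in candidate_labels:
            continue  # --candidate-labels restricts matching to the given speakers
        emb = np.asarray(emb, dtype=np.float32)
        score = float(np.dot(cluster_emb, emb) /
                      (np.linalg.norm(cluster_emb) * np.linalg.norm(emb)))
        if score > best_score:
            best_label, best_score = label, score
    # Only accept the match when the similarity crosses the threshold.
    return (best_label if best_score >= threshold else None), best_score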

🔄 Speaker Enrollment Pipeline (Sequence Diagram)

The diagram below illustrates how a new speaker is enrolled into the system and how a persistent speaker embedding is generated for future identification.

sequenceDiagram
    participant User
    participant CLI as diarize.py
    participant Audio as Audio Loader
    participant Seg as Pyannote Segmentation Model
    participant VAD as VAD Post-Processing
    participant NeMo as NeMo Speaker Model (Titanet)
    participant DB as speakers_db.json

    User->>CLI: Run --enroll-speaker
    CLI->>Audio: Load & normalize audio (mono, 16kHz)
    Audio->>Seg: Frame-level speech segmentation
    Seg->>VAD: Speech probabilities per frame
    VAD->>CLI: Final speech segments (timestamps)
    CLI->>CLI: Concatenate speech segments (10–15s)
    CLI->>NeMo: Extract speaker embedding
    NeMo->>CLI: Return fixed-length embedding (192-dim)
    CLI->>DB: Save / update speaker embedding
    DB-->>User: Speaker enrolled successfully

🎥 Watch the Speaker Enrollment Shorts: YouTube Shorts

Explanation

During speaker enrollment, the system focuses on extracting a clean and reliable voice representation for a single speaker.
Silence and noise are removed using voice activity detection, and only speech segments are used.
These speech segments are passed through a NeMo-based speaker model (TitaNet) to generate a fixed-length embedding that uniquely represents the speaker's voice.
The resulting embedding is stored in a persistent database and later reused during diarization to identify who is speaking.
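For context, extracting such an embedding with NeMo's speaker model looks roughly like the snippet below. The pretrained checkpoint name and audio path are assumptions for illustration; the repo's own loading code may differ.

import nemo.collections.asr as nemo_asr

# Load a pretrained TitaNet speaker model and embed a speech-only clip.
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("titanet_large")
embedding = speaker_model.get_embedding("speech_only_16k.wav")  # fixed-length vector (192-dim for TitaNet)
print(embedding.shape)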


🔄 Diarization + Known Speaker Matching (End-to-End Summary)

Below is a concise view of how speaker diarization and known-speaker identification work together once speakers have been enrolled.

sequenceDiagram
  autonumber
  actor User
  participant CLI as diarize.py
  participant VAD as Pyannote Segmentation + VAD
  participant MSDD as NeMo MSDD
  participant TiTANet as NeMo TiTANet
  participant DB as Speaker DB
  participant Match as Cosine Match

  User->>CLI: Run diarization with identify-known
  CLI->>VAD: Segment audio into speech regions
  VAD-->>CLI: Speech segments (5-scale windows)

  CLI->>MSDD: Cluster speech segments
  MSDD-->>CLI: Speaker labels (SPEAKER_0, SPEAKER_1)

  CLI->>TiTANet: Create embedding per speaker cluster
  TiTANet-->>CLI: Speaker embeddings

  CLI->>DB: Load enrolled speaker embeddings
  DB-->>CLI: Known speaker vectors

  CLI->>Match: Cosine similarity + threshold
  Match-->>CLI: Speaker identity (MATCH / UNKNOWN)

  CLI-->>User: Diarized output with speaker names

Explanation

Once speakers are enrolled, the system can identify who is speaking during diarization.

The audio is first passed through Pyannote's segmentation model, which performs voice activity detection and produces speech-only regions. These regions are processed using five multi-scale windows, which allows the NeMo diarizer to remain robust to short utterances, pauses, and speaker changes.

Next, NeMo MSDD clusters these speech segments to determine who spoke when, assigning labels such as SPEAKER_0, SPEAKER_1, etc.
For each speaker cluster, the corresponding audio is aggregated and passed through NeMo's TitaNet speaker model to generate a high-quality 192-dimensional speaker embedding.

These cluster embeddings are then compared against the enrolled speaker embeddings stored in the speaker database using cosine similarity.
If the similarity crosses a threshold, the speaker is identified by name; otherwise, the speaker remains unknown.

This design cleanly separates:

  • Segmentation (Pyannote)
  • Clustering (NeMo MSDD with 5-scale windows)
  • Speaker representation (TitaNet embeddings)
  • Identity matching (cosine similarity)

The result is a robust, extensible diarization pipeline that supports both unknown speakers and known speaker identification without breaking existing diarization behavior.


Running the Full Project Locally (Docker)

This repo ships with a React frontend and a Python backend (Flask). The two services communicate over CORS.
To run the entire stack locally using Docker, use the provided docker-compose.yml, which creates two containers (frontend + backend).

Prerequisites: Docker & Docker Compose installed.

# From the repository root

# Option A: build then run
docker-compose build
docker-compose up

# Option B: one-shot build+run
docker-compose up --build

This will spin up two containers on your local system:

Backend: Flask (handles diarization and transcription requests).
  
Frontend: React + Vite interface for uploading audio, running diarization, and viewing results.

Once running, you can access the UI at http://localhost:5173 (default Vite port), which will interact with the backend automatically.

Hosted Website

If you don't want to run it locally, you can try the hosted version here: 🔗 MIE Diarization Website

⚠️ Note: The open-source MIE server only has 4 GB RAM and 4 CPU cores, so it cannot handle audio files longer than ~1 minute due to the heavy models involved. For full-length audio support, it's recommended to run the project on your own system using the docker-compose.yml file.

Known Limitations

  • Overlapping speakers are not yet handled. A possible approach would be to separate the audio and isolate one speaker at a time before feeding it into the pipeline, but this would require significantly more computation.
  • There may be errors; please raise an issue if you encounter any.
  • Diarization/transcription quality can vary depending on audio quality and speaker overlap.

Future Improvements

  • Improve handling of overlapping speech

Acknowledgements

This work is based on OpenAI's Whisper, Faster-Whisper, NVIDIA NeMo, Facebook's Demucs, and Pyannote for segmentation.

The overall flow is shown below:

🔄 Diarization + Transcription Pipeline (Sequence Diagram)

sequenceDiagram
    participant User
    participant Demucs
    participant Whisper
    participant CTCAligner
    participant Pyannote
    participant NeMo
    participant Punctuation
    participant Output

    User->>Demucs: Provide audio input
    alt Source separation enabled
        Demucs->>User: Return vocals.wav
    else Skipped
        User->>Whisper: Use original audio
    end

    User->>Whisper: Transcribe audio (Faster-Whisper)
    Whisper->>CTCAligner: Generate emissions & align words
    CTCAligner->>User: Word-level timestamps

    User->>Pyannote: Run segmentation (VAD)
    Pyannote->>NeMo: Extract embeddings (ECAPA-TDNN) & cluster (MSDD)
    NeMo->>User: Speaker segment timestamps

    User->>NeMo: Map words to speakers
    alt Punctuation supported
        User->>Punctuation: Add punctuation
        Punctuation->>User: Return punctuated transcript
    else Not supported
        User->>User: Skip punctuation
    end

    User->>Output: Save outputs (.txt, .srt)
    Output-->>User: Diarized and transcribed results

🔄 Diarization + Transcription Pipeline (Flow Diagram)

flowchart TD
    A["Start: user provides audio file or mic input"] --> B{Source separation enabled?}
    B -- Yes --> C["Demucs output: vocals.wav"]
    B -- No --> D["Use original audio"]

    %% Common audio input for both branches
    C --> X["Audio for processing"]
    D --> X

    %% Transcription pipeline (single pipeline: Whisper + CTC aligner)
    subgraph Transcription
        direction TB
        T1["Faster-Whisper (ASR)"]
        T2["CTC forced aligner (word-level timestamps)"]
        T1 --> T2
    end
    X --> T1

    %% Diarization pipeline (segmentation + embeddings/clustering)
    subgraph Diarization
        direction TB
        D1["Pyannote segmentation (VAD)"]
        D2["NeMo embeddings (ECAPA-TDNN) + clustering (MSDD)"]
        D1 --> D2
    end
    X --> D1

    %% Merge branches to align speakers to words
    T2 --> M["Align speakers to words (map segments to tokens)"]
    D2 --> M

    %% Optional punctuation
    M --> P{Punctuation enabled?}
    P -- Yes --> P1["Apply punctuation"]
    P -- No --> P2["Skip punctuation"]

    %% Outputs
    P1 --> O["Write outputs: txt / srt / json"]
    P2 --> O
    O --> R["Return diarized + transcribed results"]
