Speaker Diarization/Speaker Segmentation for call conversation #2493

Navanit-nebula · 2025-01-09T09:20:35Z

I need to build a simple speaker segmentation for a phone call conversation: transcribe the call and segment it between the user and the agent.
For the time being, I have downloaded the YouTube video https://www.youtube.com/watch?v=xbyEs7DJshw&t and converted it to WAV format, mono, 16 kHz.
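The post does not say which tool did the conversion. One common option is ffmpeg, sketched below on the assumption that it is installed and on the PATH. The input file name `downloaded_video.mp4` and the helper names are hypothetical; `-ac 1` requests one channel and `-ar 16000` a 16 kHz sample rate.

```python
import subprocess


def ffmpeg_16k_mono_cmd(src, dst):
    """Build an ffmpeg command that converts src to mono, 16 kHz audio at dst."""
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", src,       # input file (hypothetical name in the example below)
        "-ac", "1",      # one audio channel (mono)
        "-ar", "16000",  # 16 kHz sample rate
        dst,
    ]


def convert_to_16k_mono(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_16k_mono_cmd(src, dst), check=True)


cmd = ffmpeg_16k_mono_cmd(
    "downloaded_video.mp4",  # assumed name of the downloaded file
    "call_center_conversation_16k_mono.wav",
)
print(" ".join(cmd))
```

Calling `convert_to_16k_mono(...)` with the same two arguments would perform the actual conversion.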

I have tried two different techniques:

Diarization using pyannote

from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook
import torch
import pandas as pd

# Load diarization model
# pipeline = Pipeline.from_pretrained(
#     "pyannote/speaker-diarization-3.1",
#     use_auth_token="xxx")
pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="xxxx")
pipeline.to(torch.device("cuda"))

audio_path = "call_center_conversation_16k_mono.wav"

with ProgressHook() as hook:
    # With speaker-diarization-3.1 the call returns only the diarization:
    # diarization = pipeline(audio_path, hook=hook)
    # speech-separation-ami-1.0 also returns the separated sources
    diarization, sources = pipeline(audio_path, hook=hook)

diarization_data = []
speaker_segments = {}

for turn, _, speaker in diarization.itertracks(yield_label=True):
    diarization_data.append({"speaker": speaker, "start": turn.start, "end": turn.end})
    if speaker not in speaker_segments:
        speaker_segments[speaker] = []
    speaker_segments[speaker].append((turn.start, turn.end))

# Save diarization data to CSV
diarization_df = pd.DataFrame(diarization_data)
diarization_df.to_csv("audio_v2.csv", index=False)

Whisper large v3

import whisper
import pandas as pd

def whisper_inference(filename, model_type, verbose=False):
    model = whisper.load_model(model_type)
    result = model.transcribe(filename,
                              language='en',
                              task='transcribe',
                              temperature=0,
                              verbose=verbose)

    return result

def whisper_inference_with_segments_df(fname, model_type):
    
    result = whisper_inference(fname, model_type=model_type)

    all_seg_df_list = []
    
    for this_seg in result['segments']:
        if 'tokens' in this_seg.keys():
            this_seg.pop('tokens')

        this_df = pd.DataFrame.from_dict({0: this_seg}, 
                                        orient='index')
        
        all_seg_df_list.append(this_df)
        
    all_seg_df = pd.concat(all_seg_df_list, axis=0)
    all_seg_df = all_seg_df.set_index('id')
    
    return all_seg_df


if __name__ == '__main__':

    model_type = 'large-v3'
    filename = "call_center_conversation_16k_mono.wav"

    seg_df = whisper_inference_with_segments_df(filename, model_type)
    seg_df.to_csv("test_v6.csv")
    # print(seg_df)

But the timestamps from the two outputs do not match.
I am attaching the code in case anyone wants to reproduce this.

How can I get the best result out of these two?
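One common way to reconcile the two outputs is to keep both sets of timestamps and label each Whisper segment with the speaker whose diarization turns overlap it the most. The following is only a minimal sketch of that idea, not a tested solution for this recording: `overlap`, `assign_speakers`, the `UNKNOWN` label, and the toy data are all hypothetical, and the input shapes are assumed to mirror `diarization_data` and `result['segments']` from the code above.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each transcript segment with the speaker that overlaps it most.

    segments: dicts with 'start', 'end', 'text' (shape of Whisper segments)
    turns:    dicts with 'speaker', 'start', 'end' (shape of diarization_data)
    """
    labelled = []
    for seg in segments:
        # Sum the overlap per speaker, since one speaker may have several turns.
        totals = defaultdict(float)
        for turn in turns:
            totals[turn["speaker"]] += overlap(
                seg["start"], seg["end"], turn["start"], turn["end"]
            )
        best = max(totals, key=totals.get, default=None)
        # Segments that touch no diarization turn get the fallback label.
        speaker = best if best is not None and totals[best] > 0 else unknown
        labelled.append({**seg, "speaker": speaker})
    return labelled


# Toy data standing in for the two CSV files.
turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.2},
    {"speaker": "SPEAKER_01", "start": 4.5, "end": 9.0},
]
segments = [
    {"start": 0.0, "end": 4.0, "text": "Thank you for calling."},
    {"start": 4.0, "end": 8.8, "text": "Hi, I have a question."},
    {"start": 20.0, "end": 21.0, "text": "Okay."},
]

labelled = assign_speakers(segments, turns)
for row in labelled:
    print(f'{row["start"]:6.1f} {row["end"]:6.1f} {row["speaker"]}: {row["text"]}')
```

Whisper segments can span a speaker change, so segment-level labels stay coarse. Passing `word_timestamps=True` to `model.transcribe` yields per-word times, and the same overlap rule could then be applied to each word instead of each segment.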

Lakoc · 2025-03-27T08:55:45Z

Hi,

We’ve just released DiCoW v2, an upgraded version of our Diarization-Conditioned Whisper model! 🚀 This version takes diarization as input and transcribes multi-talker speech, even when speakers are speaking different languages.

Give it a try here: https://pccnect.fit.vutbr.cz/gradio-demo/ 🔗

Papers:
DiCoW: https://arxiv.org/pdf/2501.00114
TS-ASR with Whisper: https://ieeexplore.ieee.org/document/10887683
BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge: https://www.isca-archive.org/chime_2024/polok24_chime.pdf

Codebase:
Inference Pipeline: https://github.com/BUTSpeechFIT/DiCoW
DiCoW Training: https://github.com/BUTSpeechFIT/TS-ASR-Whisper

Let us know your feedback! 🚀

jonathgh · 2025-03-31T14:13:51Z

Hey @Navanit-nebula we also ran into some discrepancies between Whisper and Pyannote, and found that in general, Pyannote was more accurate than what we were seeing with Whisper itself. What has been your experience so far?

If you want to compare, here's the app: WhisperScript
