Possible solution for Speaker Diarization #763

mu4farooqi · 2022-12-29T21:30:40Z

Over the weekend, I tried to come up with a consistent approach to diarize whisper transcripts predictably. I have written a post about it. Alternatively you can check the colab.

a-ruban · 2023-01-01T18:16:52Z

Hey, @mu4farooqi
Thank you for your work!

I will try your solution on my examples and come back with results.

I also tried forced alignment, but it breaks down after a certain point.
My approach was more primitive:

Improve the stable-ts timecodes with aeneas forced alignment for each phrase. If the difference between the Whisper timecode and the aeneas timecode is more than X seconds (4?), assume that aeneas has reached its breaking point and keep the Whisper timecode.
This improves the Whisper-based timecodes a little in most cases.
Match transcribed phrases to pyannote segments by computing their time overlap. If phrase A overlaps segment X for 4 seconds and segment Y for 1 second, assume that phrase A was spoken by the speaker from segment X.
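
A minimal sketch of the two steps above, for illustration only. The function names, the `SPEAKER_X`/`SPEAKER_Y` labels, and the default threshold of 4 seconds are assumptions; only the `start`/`end`/`speaker` keys follow the snippets posted later in this thread.

```python
def choose_timecode(whisper_time, aligned_time, max_diff=4.0):
    """Step 1: trust the forced-alignment timecode only if it stays close to Whisper's."""
    if abs(aligned_time - whisper_time) < max_diff:
        return aligned_time
    return whisper_time  # alignment has drifted too far, fall back to Whisper


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, 0 if they do not intersect."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def pick_speaker(phrase, diarization_segments):
    """Step 2: return the speaker of the segment that overlaps the phrase the most."""
    best_speaker, best_overlap = None, 0.0
    for segment in diarization_segments:
        overlap = overlap_seconds(phrase["start"], phrase["end"],
                                  segment["start"], segment["end"])
        if overlap > best_overlap:
            best_speaker, best_overlap = segment["speaker"], overlap
    return best_speaker  # None when no segment overlaps the phrase at all


# Phrase A overlaps segment X for 4 seconds and segment Y for 1 second.
phrase_a = {"start": 10.0, "end": 15.0, "text": "hello there"}
segments = [
    {"start": 8.0, "end": 14.0, "speaker": "SPEAKER_X"},
    {"start": 14.0, "end": 20.0, "speaker": "SPEAKER_Y"},
]
print(pick_speaker(phrase_a, segments))
print(choose_timecode(10.0, 11.5), choose_timecode(10.0, 17.0))
```

Tracking the best overlap directly, rather than collecting overlaps in a dict, avoids any question of two segments with the same overlap overwriting each other.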

This approach has shown quite good results; I'm looking forward to comparing them with your approach.
Hope it helps a little bit!

6 replies

a-ruban Jan 2, 2023

Yeah, I will share it in the coming days; it's in a kinda messy state right now.

dgoryeo May 5, 2023

Hi @a-ruban, I was keen to hear how the above approach is coming along. Are you happy with it?

a-ruban May 5, 2023

Hi, @dgoryeo
Yep, quite happy, but I don't use aeneas for now.

The overlap calculation part of the code looks like this:


    # The enclosing function header was not part of the original snippet; the name here is illustrative.
    def assign_speakers(transcribation_segments, merged_diarization_segments):
        final_parts = []
        for transcription in transcribation_segments:
            segment_overlaps = {}
            for i, segment in enumerate(merged_diarization_segments):
                # Overlap in seconds; negative when the two intervals do not intersect.
                current_overlap = min(transcription['end'], segment['end']) - max(transcription['start'], segment['start'])
                segment_overlaps[i] = current_overlap  # keyed by segment index, so equal overlaps cannot overwrite each other

            if segment_overlaps:
                max_overlap_segment_index = max(segment_overlaps, key=segment_overlaps.get)
                final_parts.append({
                    'start': transcription['start'],
                    'end': transcription['end'],
                    'text': transcription['text'],
                    # Takes the last character of the label, so this assumes fewer than 10 speakers.
                    'speaker': int(merged_diarization_segments[max_overlap_segment_index]['speaker'][-1])
                })
        return final_parts

a-ruban May 5, 2023

The aeneas part looks like this (4 is the maximum difference in seconds between the aeneas timecode and the stable-ts timecode; beyond it we assume the aeneas timecodes have become broken):


    # Write one phrase per line; aeneas aligns each line as a separate fragment.
    with open('/content/drive/MyDrive/Colab Notebooks/UKR/aenas_input.txt', 'w') as aenas_file:
        for segment in transcriptions:
            aenas_file.write(segment['text'])
            aenas_file.write('\n')

    # Colab/IPython shell escape: run the aeneas forced aligner on the audio and the text file.
    !python -m aeneas.tools.execute_task \
        "/content/drive/MyDrive/Colab Notebooks/UKR/ukr_audio.mp3" \
        "/content/drive/MyDrive/Colab Notebooks/UKR/aenas_input.txt" \
        "task_language=ukr|os_task_file_format=json|is_text_type=plain" \
        "/content/drive/MyDrive/Colab Notebooks/UKR/aenas_output.json"

    import json
    with open('/content/drive/MyDrive/Colab Notebooks/UKR/aenas_output.json', 'r') as aenas_output_file:
        aenas_timecodes = json.load(aenas_output_file)

    # Accept an aeneas timecode only if both ends are within 4 seconds of the stable-ts timecode;
    # otherwise keep the stable-ts timecode for that phrase.
    for part, timecode in zip(transcriptions, aenas_timecodes['fragments']):
        if abs(float(timecode['begin']) - float(part['start'])) < 4 and abs(float(timecode['end']) - float(part['end'])) < 4:
            part['start'] = float(timecode['begin'])
            part['end'] = float(timecode['end'])

a-ruban May 5, 2023

@dgoryeo I think a good approach would be to take an audio file (or a few) with poor initial diarization and try to improve it separately, with and without aeneas.
The problem with aeneas is that it takes some time to process the audio.

rnehrboss · 2023-01-13T16:34:13Z

Hey @mu4farooqi
Thanks for the good example. I seem to have an issue running your colab notebook.
In the step for the NEMO clustering diarizer, I get this error:

[NeMo I 2023-01-13 16:30:54 features:267] PADDING: 16
[NeMo E 2023-01-13 16:30:54 common:506] Model instantiation failed!
Target class: nemo.collections.asr.models.label_models.EncDecSpeakerLabelModel
Error(s): __new__() missing 1 required positional argument: 'task'
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/nemo/core/classes/common.py", line 485, in from_config_dict
instance = imported_cls(cfg=config, trainer=trainer)
File "/usr/local/lib/python3.8/dist-packages/nemo/collections/asr/models/label_models.py", line 168, in __init__
self._macro_accuracy = Accuracy(num_classes=num_classes, average='macro')
TypeError: __new__() missing 1 required positional argument: 'task'

TypeError Traceback (most recent call last)
in
1 from nemo.collections.asr.models.msdd_models import ClusteringDiarizer
2
----> 3 model = ClusteringDiarizer(cfg=config)

8 frames
/usr/local/lib/python3.8/dist-packages/nemo/collections/asr/models/label_models.py in __init__(self, cfg, trainer)
166 self.decoder = EncDecSpeakerLabelModel.from_config_dict(cfg.decoder)
167
--> 168 self._macro_accuracy = Accuracy(num_classes=num_classes, average='macro')
169
170 self.labels = None

TypeError: __new__() missing 1 required positional argument: 'task'

Have you seen this and do you have any suggestions?

Thanks!

2 replies

mu4farooqi Jan 14, 2023
Author

Since I'm installing NeMo from the master branch, they may have made a breaking change. I'll try to fix it later tonight.
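
One way to narrow this down, offered as a sketch rather than a confirmed diagnosis: the `Accuracy(num_classes=..., average='macro')` call in the traceback matches the older torchmetrics call style, and torchmetrics 0.11 and later require a `task` argument for `Accuracy`. That torchmetrics is the cause here is an assumption; the thread does not confirm it. The helper names below are hypothetical.

```python
import re
from importlib import metadata


def accuracy_requires_task(version):
    """Return True if this torchmetrics version string is 0.11 or newer."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (0, 11)


def installed_torchmetrics_version():
    """Return the installed torchmetrics version, or None if it is not installed."""
    try:
        return metadata.version("torchmetrics")
    except metadata.PackageNotFoundError:
        return None


version = installed_torchmetrics_version()
if version is None:
    print("torchmetrics is not installed in this environment")
elif accuracy_requires_task(version):
    print(f"torchmetrics {version}: Accuracy(...) needs a 'task' argument")
else:
    print(f"torchmetrics {version}: the older Accuracy(num_classes=..., average=...) call should work")
```

If the check reports 0.11 or newer, the NeMo and torchmetrics versions in the environment may simply be out of step with each other.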

mu4farooqi Jan 15, 2023
Author

I'm not sure which OS/environment you are running my code in. I just ran it in the colab (https://colab.research.google.com/drive/1X5XTiob6irFq8NJM831S0ADwz5_wIS-r), and it worked without any problem.

Can you please provide more details?

rnehrboss · 2023-01-15T01:43:27Z

Wow.. thanks

0 replies

rnehrboss · 2023-01-15T05:25:05Z

The error above occurred when I ran it in colab.

0 replies

jh-modjeski · 2023-05-06T03:41:06Z

Fascinating stuff! I have been trying to take a practical approach to solving diarization for specific podcast-style scenarios. There are some paid services out there for podcast recording and transcription, but I haven't found much that is open source and useful for this scenario. I did find, however, that there are Discord bots that will easily record every speaker to a separate audio file, so I have been working on my project (https://github.com/jh-modjeski/trys) to transcribe each recording and stitch the transcripts together. With the recordings being separate for each speaker, we know exactly who is saying what. I have also been working with word timestamps to embed interjections into another speaker's line of transcription, while also recognizing cross talk as something that should be printed on a separate line.
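
As a rough sketch of the stitching idea only (this is not code from the trys project): if each speaker's recording is transcribed separately into segments with `start`, `end`, and `text`, the combined transcript is just those segments tagged with their speaker and sorted by start time. The segment keys, the speaker names, and the function name are assumptions, and interjection and cross-talk handling are left out.

```python
def stitch_transcripts(per_speaker_segments):
    """Merge per-speaker segment lists into one timeline ordered by start time."""
    merged = []
    for speaker, segments in per_speaker_segments.items():
        for segment in segments:
            # Copy the segment and record which recording it came from.
            merged.append({**segment, "speaker": speaker})
    # sorted() is stable, so segments with equal start times keep their input order.
    return sorted(merged, key=lambda segment: segment["start"])


recordings = {
    "alice": [
        {"start": 0.0, "end": 2.5, "text": "Shall we start?"},
        {"start": 6.0, "end": 8.0, "text": "Great."},
    ],
    "bob": [
        {"start": 3.0, "end": 5.5, "text": "Yes, I'm ready."},
    ],
}

for line in stitch_transcripts(recordings):
    print(f"[{line['start']:.1f}] {line['speaker']}: {line['text']}")
```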

This doesn't solve diarization for a single audio source, but if people find diarization useful from my project, maybe there will be more interest in solving the harder problems that you're working on!

My project is pretty messy and pales in comparison to what you're doing with diarization, TBH. I'm not a python developer and I've literally thrown it together with ChatGPT over a weekend + some free time this week. I'm pretty happy that it works as well as it does though, and now that it's functional, I'm trying to consider how best to rewrite the code properly. Let me know if you check it out; I'd really like to get more eyes on it.

3 replies

jh-modjeski May 6, 2023

My personal use case is that I want to transcribe my D&D games and have all of the transcripts labeled by speaker.

Codex966 May 8, 2023

Hah! This is exactly the reason I started digging into Whisper. Looking to record, transcribe-by-speaker, and then use ChatGPT to generate session notes and summaries for my GM campaign wiki. Will look into your repo!

jh-modjeski May 9, 2023

@Codex966 It's currently functioning well, but I'm looking to basically rewrite everything, so future updates may come more slowly. There's a bunch I want to do. Let me know if you try it out!

I have another, far less developed program that reads a session report from World Anvil, identifies all of the characters/locations/organizations, and then creates or updates articles for each entity, using GPT to write the article based on the session report. Nothing I'm ready to release yet, but maybe in the coming months. It would require a World Anvil access key, however, which needs to be requested.

databill86 · 2023-06-08T10:08:59Z

Thank you for the colab notebook.
I was trying to transcribe and diarize a stereo wav file (8k sample rate, 24k bit rate) with the same params as in the notebook, but I'm running into this issue and I'm not sure what the problem is:

vad:   0%|          | 0/23 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/xx/source/whisper-ui/app/asr_&_speaker_diarization_with_openai_whisper_&_nvidia_nemo.py", line 129, in <module>
    model.diarize()
  File "/home/xx/anaconda3/envs/asr/lib/python3.10/site-packages/nemo/collections/asr/models/clustering_diarizer.py", line 421, in diarize
    self._perform_speech_activity_detection()
  File "/home/xx/anaconda3/envs/asr/lib/python3.10/site-packages/nemo/collections/asr/models/clustering_diarizer.py", line 319, in _perform_speech_activity_detection
    self._run_vad(manifest_vad_input)
  File "/home/xx/anaconda3/envs/asr/lib/python3.10/site-packages/nemo/collections/asr/models/clustering_diarizer.py", line 217, in _run_vad
    for i, test_batch in enumerate(tqdm(self._vad_model.test_dataloader(), desc='vad', leave=True)):
  File "/home/xx/anaconda3/envs/asr/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/xx/anaconda3/envs/asr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/xx/anaconda3/envs/asr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/xx/anaconda3/envs/asr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/xx/anaconda3/envs/asr/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/xx/anaconda3/envs/asr/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/xx/anaconda3/envs/asr/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/home/xx/anaconda3/envs/asr/lib/python3.10/site-packages/nemo/collections/asr/data/audio_to_label.py", line 442, in vad_frame_seq_collate_fn
    return _vad_frame_seq_collate_fn(self, batch)
  File "/home/xx/anaconda3/envs/asr/lib/python3.10/site-packages/nemo/collections/asr/data/audio_to_label.py", line 184, in _vad_frame_seq_collate_fn
    sig = torch.cat((start, sig, end))
RuntimeError: Tensors must have same number of dimensions: got 1 and 2
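
The thread does not confirm the cause, but "got 1 and 2" is what one would expect if a two-channel (stereo) signal is concatenated with one-dimensional padding, so downmixing to mono before diarization is one thing to try. Below is a minimal standard-library sketch under that assumption; the function name is hypothetical and it only handles 16-bit PCM WAV files.

```python
import array
import sys
import wave


def downmix_to_mono(src_path, dst_path):
    """Downmix a 16-bit PCM WAV file to mono by averaging the channels of each frame."""
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        raw = src.readframes(src.getnframes())

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM audio")

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Samples are interleaved per frame: ch0, ch1, ch0, ch1, ...
    mono = array.array("h", (
        sum(samples[i:i + n_channels]) // n_channels
        for i in range(0, len(samples), n_channels)
    ))
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(frame_rate)
        dst.writeframes(mono.tobytes())
```

Calling `downmix_to_mono("input_stereo.wav", "input_mono.wav")` (placeholder file names) and pointing the diarizer at the mono file would show whether the channel count is the issue.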

0 replies

Majdoddin · 2023-07-21T05:17:21Z

www.lexicaps.com seamlessly adds diarization to Whisper's transcription. No 3rd party packages.
Announcement: #1537
Repo: https://github.com/Majdoddin/lexicaps

1 reply

databill86 Jul 30, 2023

@Majdoddin that's great work!
The repo is empty though, are you planning to publish your code?
