feat: Add speaker diarization functionality to Qwen3-ForcedAligner#116

Open
YuYun329 wants to merge 2 commits into QwenLM:main from YuYun329:feature/speaker_verify

YuYun329 commented Mar 7, 2026

Summary

This PR adds speaker diarization functionality to Qwen3-ForcedAligner, enabling speaker identification alongside timestamp prediction.

Key Changes

1. Speaker Diarization Support

  • Added speaker field to ForcedAlignItem dataclass to store speaker labels
  • Integrated CAM++ model for speaker embedding extraction
  • Implemented clustering-based speaker diarization using ClusterBackend
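As a rough illustration of the first bullet, an item with an optional speaker label might look like the sketch below. The `text`, `start_time`, and `end_time` field names and the string label type are assumptions made for this example; only the `speaker` field itself is described in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForcedAlignItem:
    # Assumed field names, used here for illustration only.
    text: str
    start_time: float
    end_time: float
    # Speaker label added by diarization; None when diarization is not run.
    # The label type (str) is an assumption.
    speaker: Optional[str] = None


# Without diarization the label simply stays empty.
item = ForcedAlignItem(text="hello", start_time=0.0, end_time=0.42)
# With diarization each item carries the label of its speaker cluster.
labeled = ForcedAlignItem(text="world", start_time=0.5, end_time=0.9, speaker="spk0")
```

Defaulting the field to `None` keeps existing callers that never pass a speaker working unchanged.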

2. New Components

  • cluster_backend.py: New module for speaker clustering using a spectral clustering algorithm, written with reference to FunASR
  • CAM++ model integration: Automatic model download from ModelScope or HuggingFace
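To make the clustering step concrete, here is a dependency-free sketch of grouping speaker embeddings by cosine similarity. It is deliberately not spectral clustering and not the code in cluster_backend.py: it thresholds a cosine-affinity graph and takes connected components, which is a much simpler stand-in for the same idea. The function names and the 0.7 threshold are assumptions for illustration.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_embeddings(embeddings: List[List[float]], threshold: float = 0.7) -> List[int]:
    """Label embeddings by connected components of the thresholded affinity graph."""
    n = len(embeddings)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to an earlier component
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                # An edge exists when two embeddings are similar enough.
                if labels[j] == -1 and cosine(embeddings[i], embeddings[j]) >= threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Toy 2-D "embeddings": the first two point one way, the last two another.
toy = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.0, 0.9]]
toy_labels = cluster_embeddings(toy)  # [0, 0, 1, 1]
```

Spectral clustering works from the same kind of affinity matrix but partitions it using eigenvectors of the graph Laplacian, which copes better with noisy, partially connected speakers than a hard threshold does.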

3. Enhanced Features

  • Qwen3ForcedAligner now accepts optional campplus_model parameter
  • Automatic model resolution from local path, ModelScope, or HuggingFace
  • load_wav() utility for audio preprocessing
  • Speaker embedding extraction and clustering pipeline
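The resolution order described above (local path first, then remote hubs) could be sketched as follows. This is a hypothetical helper, not the PR's code: `resolve_model_path` and the `downloaders` argument are invented for illustration, and the actual ModelScope and HuggingFace download calls are left to the caller so that no hub API is assumed here.

```python
import os
from typing import Callable, Optional, Sequence

# A downloader takes a model spec and returns a local path, or None on a miss.
Downloader = Callable[[str], Optional[str]]


def resolve_model_path(spec: str, downloaders: Sequence[Downloader] = ()) -> str:
    """Return a local path for `spec`, trying the local filesystem first."""
    if os.path.exists(spec):
        return spec  # already a usable local path
    for download in downloaders:
        try:
            path = download(spec)
        except Exception:
            continue  # this source failed; fall through to the next one
        if path and os.path.exists(path):
            return path
    raise FileNotFoundError(f"could not resolve model: {spec}")
```

A caller would pass one downloader per hub, in order of preference, so that a failure on one source falls through to the next.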

4. Files Modified

| File | Changes |
| --- | --- |
| qwen_asr/inference/qwen3_forced_aligner.py | +226 lines - Core speaker diarization logic |
| qwen_asr/inference/cluster_backend.py | +192 lines - New clustering module |
| qwen_asr/cli/demo.py | +20 lines - Demo updates |
| README.md | Documentation updates |
| pyproject.toml | Dependency updates |

Usage Example

python qwen_asr/cli/demo.py --asr-checkpoint Qwen/Qwen3-ASR-1.7B --aligner-checkpoint Qwen/Qwen3-ForcedAligner-0.6B --campplus-model FunAudioLLM/Fun-CosyVoice3-0.5B-2512/campplus.onnx
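For readers wondering how the optional flag fits in, a minimal argparse sketch with the three flags from the command above is shown here. The flag names come from the usage example; the defaults, help text, and `build_parser` function are assumptions and may differ from what demo.py actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="ASR demo with optional speaker diarization")
    parser.add_argument("--asr-checkpoint", required=True, help="ASR model id or local path")
    parser.add_argument("--aligner-checkpoint", default=None, help="forced aligner model id or local path")
    # Assumed behavior: diarization is skipped when this flag is omitted.
    parser.add_argument("--campplus-model", default=None, help="CAM++ model id or local path")
    return parser


args = build_parser().parse_args([
    "--asr-checkpoint", "Qwen/Qwen3-ASR-1.7B",
    "--aligner-checkpoint", "Qwen/Qwen3-ForcedAligner-0.6B",
    "--campplus-model", "FunAudioLLM/Fun-CosyVoice3-0.5B-2512/campplus.onnx",
])
```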

Dependencies

  • onnxruntime - For CAM++ model inference
  • scikit-learn - For spectral clustering

Testing

Tested on multi-speaker audio files: the pipeline identified and labeled the different speakers and produced accurate timestamps.


This feature enhances Qwen3-ASR's capabilities for applications like:

  • Meeting transcription with speaker attribution
  • Interview analysis
  • Podcast/Media content processing
  • Call center analytics

root and others added 2 commits March 7, 2026 19:17
…ize timestamp processing

- Added speaker diarization functionality based on CampPlus model, supporting speaker clustering for audio segments
- Optimized timestamp processing logic, added speaker label field
- Improved audio segmentation processing, supporting text segmentation based on punctuation marks
- Added model auto-download functionality, supporting downloading CampPlus model from ModelScope or HuggingFace
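The punctuation-based text segmentation mentioned above can be pictured with a small sketch like this one. The set of punctuation marks and the `split_on_punctuation` name are assumptions; the commit does not say which marks it splits on.

```python
import re
from typing import List

# A run of non-terminal characters followed by any trailing terminal punctuation.
# The set of marks (ASCII and full-width) is assumed for illustration.
_SEGMENT = re.compile(r"[^.!?;。！？；]+[.!?;。！？；]*")


def split_on_punctuation(text: str) -> List[str]:
    """Split text into segments that each end at a punctuation mark."""
    pieces = (m.strip() for m in _SEGMENT.findall(text))
    return [p for p in pieces if p]


segments = split_on_punctuation("Hello there. How are you? 好的。谢谢！")
# ['Hello there.', 'How are you?', '好的。', '谢谢！']
```

A pattern this simple also splits inside decimals such as "3.14", so real text would need extra care around numbers and abbreviations.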
…orceAligner

- Extract embedding extraction logic into _extract_embedding method
- Extract speaker clustering logic into _cluster_speakers method
- Extract label assignment logic into _assign_speaker_labels method
- Add _find_speaker_for_item and _advance_segment helper methods
- Improve code readability and maintainability
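One plausible shape for the label-assignment helpers is a two-pointer walk over time-sorted items and speaker segments, sketched below. The function names echo the commit message, but the bodies, the tuple layouts, and the midpoint rule are assumptions rather than the PR's implementation.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float, int]  # (start, end, speaker_id), sorted by start
Span = Tuple[float, float]          # (start, end) of one aligned item


def advance_segment(segments: List[Segment], idx: int, time: float) -> int:
    """Move idx forward until the segment at idx ends after `time` (or is the last)."""
    while idx < len(segments) - 1 and segments[idx][1] <= time:
        idx += 1
    return idx


def find_speaker_for_item(segments: List[Segment], idx: int, start: float, end: float) -> Optional[int]:
    """Return the speaker of segment idx if it contains the item's midpoint."""
    if not segments:
        return None
    seg_start, seg_end, speaker = segments[idx]
    midpoint = (start + end) / 2
    return speaker if seg_start <= midpoint < seg_end else None


def assign_speaker_labels(items: List[Span], segments: List[Segment]) -> List[Optional[int]]:
    """Label each item in one pass; both lists must be sorted by time."""
    labels: List[Optional[int]] = []
    idx = 0
    for start, end in items:
        idx = advance_segment(segments, idx, (start + end) / 2)
        labels.append(find_speaker_for_item(segments, idx, start, end))
    return labels


demo_segments = [(0.0, 2.0, 0), (2.0, 5.0, 1)]
demo_items = [(0.1, 0.5), (1.8, 2.4), (3.0, 3.5), (6.0, 6.2)]
demo_labels = assign_speaker_labels(demo_items, demo_segments)  # [0, 1, 1, None]
```

Because the segment index only moves forward, the pass is linear in the number of items plus segments; items that fall outside every segment are left unlabeled.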