Introduction to the two projects

Both projects serve the same purpose: they are ASR (automatic speech recognition) frameworks. They differ as follows:

  • WhisperS2T (claimed 2.3X speed improvement over WhisperX) is substantially faster than Whisper, but validation by the product line showed no gain.
  • WhisperX uses SYSTRAN/faster-whisper (Faster Whisper transcription with CTranslate2) as the backend that runs the Whisper model.
  • WhisperS2T supports several backends for the Whisper model (CTranslate2, HuggingFace Model with FlashAttention2, Original OpenAI Model).

Product-line adaptation: MindIE-Torch ModelZoo

whisperX

Adaptation code: MindIE/MindIE-Torch/built-in/audio/mindie_whisperx/readme.md · Ascend/ModelZoo-PyTorch - Gitee - OSCHINA

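  • Uses mindtorch to deploy a high-performance execution backend for the whisper-large-v3 model. The speech segmentation and automatic batching capabilities of the open-source whisperX are ported over to improve performance.
  • Patches customize the source code of the VAD (voice activity detection) model and the Whisper Large V3 model. The models are then compiled and executed with mindietorch, and a custom MindIEPipeline chains the stages into one execution flow.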
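The speech segmentation and automatic batching mentioned above can be illustrated with a minimal sketch. It is not the whisperX or MindIE code: the names `Chunk`, `merge_segments`, and `make_batches` are hypothetical, and the greedy merge rule is an assumption chosen for clarity. The 30-second default reflects the window length that Whisper processes at a time.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """A span of audio, in seconds, that will be sent to the model as one input."""
    start: float
    end: float


def merge_segments(segments: List[Tuple[float, float]], max_len: float = 30.0) -> List[Chunk]:
    """Greedily merge consecutive VAD speech segments into chunks.

    A segment joins the current chunk while the chunk's total span stays
    within max_len seconds. A single segment longer than max_len is kept
    as-is (this sketch does not split it).
    """
    chunks: List[Chunk] = []
    cur_start = cur_end = None
    for start, end in sorted(segments):
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_len:
            cur_end = max(cur_end, end)
        else:
            chunks.append(Chunk(cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append(Chunk(cur_start, cur_end))
    return chunks


def make_batches(chunks: List[Chunk], batch_size: int) -> List[List[Chunk]]:
    """Group chunks into fixed-size batches; the last batch may be smaller."""
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]


# Hypothetical VAD output: (start, end) of each detected speech region.
vad_segments = [(0, 10), (12, 25), (26, 40), (41, 50), (80, 95)]
chunks = merge_segments(vad_segments)
batches = make_batches(chunks, batch_size=2)
```

Batching whole chunks lets the accelerator process several audio windows per forward pass, which is where the throughput gain comes from.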

Community adaptation

whisperX involves four models:

  • VAD: voice activity detection
  • whisper: speech-to-text
  • Diarization: segmenting the speech by speaker
  • alignment: aligning the transcript to the audio
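To show how the outputs of the last two models combine, here is a simplified sketch that labels each aligned word with the speaker whose turn overlaps it most. It illustrates the idea behind `whisperx.assign_word_speakers` only; the data layout and the helper names `overlap` and `assign_speakers` are hypothetical, and the real implementation may differ.

```python
from typing import Dict, List, Tuple

# A diarization turn: (start seconds, end seconds, speaker label).
Turn = Tuple[float, float, str]


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Dict], turns: List[Turn]) -> List[Dict]:
    """Return copies of the words with a "speaker" key added.

    Each word gets the speaker of the turn it overlaps the longest,
    or None when it overlaps no turn at all.
    """
    labeled = []
    for word in words:
        best_speaker = None
        best_overlap = 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(word["start"], word["end"], t_start, t_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        labeled.append({**word, "speaker": best_speaker})
    return labeled


# Hypothetical alignment output (word timestamps) and diarization output (turns).
words = [
    {"word": "hello", "start": 0.0, "end": 0.5},
    {"word": "there", "start": 0.9, "end": 1.3},
    {"word": "bye", "start": 5.0, "end": 5.2},
]
turns = [(0.0, 1.0, "SPEAKER_00"), (1.0, 2.0, "SPEAKER_01")]
labeled = assign_speakers(words, turns)
```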

Project dependencies

Adaptation approach

Example code for running inference on an NPU

import whisperx
import gc 

# device = "cuda"  # original GPU setting
device = "npu:0"  # changed to run on the NPU

audio_file = "audio.mp3"
batch_size = 16 # reduce if low on device memory
compute_type = "float16" # change to "int8" if low on device memory (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# free the model if device memory is low
# (torch.npu requires torch_npu; on CUDA use torch.cuda.empty_cache() instead)
# import torch; del model; gc.collect(); torch.npu.empty_cache()

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# free the alignment model if device memory is low
# (torch.npu requires torch_npu; on CUDA use torch.cuda.empty_cache() instead)
# import torch; del model_a; gc.collect(); torch.npu.empty_cache()

# 3. Assign speaker labels
# YOUR_HF_TOKEN is a placeholder: replace it with your Hugging Face access token string
diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs
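The sample above hard-codes `device = "npu:0"`. A script that should also run on machines without an NPU could pick the device at runtime. The sketch below is one possible approach, not part of whisperX: `module_available` and `pick_device` are hypothetical helpers, and it assumes that an importable `torch_npu` package means an Ascend NPU is usable. It checks for the package only, not for a working driver.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


def pick_device(has_module=module_available, npu_index: int = 0) -> str:
    """Choose a device string to pass to the whisperx calls.

    Preference order: Ascend NPU, then CUDA GPU, then CPU.
    `has_module` is injectable so the logic can be exercised without
    the actual packages installed.
    """
    if has_module("torch_npu"):
        return f"npu:{npu_index}"
    if has_module("torch"):
        import torch  # imported lazily so the helper works without torch
        if torch.cuda.is_available():
            return "cuda"
    return "cpu"


# Simulated environments, so the example does not depend on installed packages.
on_npu_host = pick_device(lambda name: name == "torch_npu")
on_bare_host = pick_device(lambda name: False)
```

In the sample, `device = pick_device()` would then replace the hard-coded assignment.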