An Optimized Speech-to-Text Pipeline for the Whisper Model Using CTranslate2
WhisperS2T-Reborn is a modernized fork of WhisperS2T, an optimized, lightning-fast Speech-to-Text (ASR) pipeline tailored for the Whisper model. It uses the CTranslate2 backend for faster transcription and includes several heuristics to improve transcription accuracy.
Whisper is a general-purpose speech recognition model developed by OpenAI. Trained on a large dataset of diverse audio, it is a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
Installation:

```bash
pip install -U whisper-s2t-reborn
```

Basic usage:

```python
import whisper_s2t
model = whisper_s2t.load_model(model_identifier="large-v3")
files = ['audio1.wav']
lang_codes = ['en']
tasks = ['transcribe']
initial_prompts = [None]
out = model.transcribe_with_vad(files,
                                lang_codes=lang_codes,
                                tasks=tasks,
                                initial_prompts=initial_prompts,
                                batch_size=32)
print(out[0][0]) # Print first utterance for first file
"""
[Console Output]
{'text': "Let's bring in Phil Mackie who is there at the palace...",
'avg_logprob': -0.25426941679184695,
'no_speech_prob': 8.147954940795898e-05,
'start_time': 0.0,
'end_time': 24.8}
"""Passing multiple files allows segments from different files to be batched together, making better use of the GPU:
import whisper_s2t
model = whisper_s2t.load_model(model_identifier="large-v3")
files = ['audio1.wav', 'audio2.wav', 'audio3.wav']
lang_codes = ['en', 'en', 'en']
tasks = ['transcribe', 'transcribe', 'transcribe']
initial_prompts = [None, None, None]
out = model.transcribe_with_vad(files,
                                lang_codes=lang_codes,
                                tasks=tasks,
                                initial_prompts=initial_prompts,
                                batch_size=32)
# out[0] = results for audio1.wav, out[1] = results for audio2.wav, etc.
for file_idx, transcript in enumerate(out):
print(f"File {files[file_idx]}: {len(transcript)} segments")To enable word-level timestamps, load the model with:
model = whisper_s2t.load_model("large-v3", asr_options={'word_timestamps': True})| Model | Identifier |
Available models and their identifiers:

| Model | Identifier |
|---|---|
| Tiny | tiny / tiny.en |
| Base | base / base.en |
| Small | small / small.en |
| Medium | medium / medium.en |
| Large V3 | large-v3 |
| Large V3 Turbo | large-v3-turbo |
| Distil Small | distil-small.en |
| Distil Medium | distil-medium.en |
| Distil Large V3 | distil-large-v3 |
| Distil Large V3.5 | distil-large-v3.5 |
All models are available in float16, float32, and bfloat16 compute types via CTranslate2-4you on Hugging Face.
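Any identifier from the table can be passed to `load_model`. A sketch of loading a distil checkpoint with an explicit compute type; the `compute_type` keyword follows CTranslate2's naming and is an assumption here, so check the signature of your installed version:

```python
import whisper_s2t

# Distil checkpoints use the same API as the full-size models.
# compute_type is assumed to be forwarded to the CTranslate2 backend.
model = whisper_s2t.load_model(model_identifier="distil-large-v3",
                               compute_type="bfloat16")  # or "float16" / "float32"
```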
Benchmark setup:

- Model: Whisper large-v3 · FP16 · CUDA · RTX 4090
- Audio: sam_altman_lex_podcast_367.flac

Comparing openai-whisper (no batch support) against whisper-s2t-reborn:
| Backend | Batch Size | Time (s) | Speedup | Inference VRAM (MB) |
|---|---|---|---|---|
| openai-whisper | 1 | 508.5 | 1.0× | 362 |
| whisper-s2t-reborn | 1 | 372.4 | 1.4× | 560 |
| whisper-s2t-reborn | 2 | 239.6 | 2.1× | 840 |
| whisper-s2t-reborn | 4 | 145.5 | 3.5× | 1,387 |
| whisper-s2t-reborn | 8 | 95.5 | 5.3× | 2,427 |
| whisper-s2t-reborn | 16 | 69.4 | 7.3× | 4,608 |
| whisper-s2t-reborn | 32 | 57.1 | 8.9× | 8,964 |
| whisper-s2t-reborn | 64 | 49.8 | 10.2× | 17,666 |
The increased VRAM usage even at batch size 1 is largely due to the VAD model; openai-whisper doesn't use voice activity detection. The `benchmarks` folder contains the actual scripts used.
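To sanity-check these numbers on your own hardware, a timing harness can be as simple as the sketch below. This is not the actual benchmark script from the `benchmarks` folder; the audio file name and batch size are placeholders:

```python
import time
import whisper_s2t

model = whisper_s2t.load_model(model_identifier="large-v3")
files = ['sam_altman_lex_podcast_367.flac']  # placeholder audio file

start = time.perf_counter()
out = model.transcribe_with_vad(files,
                                lang_codes=['en'],
                                tasks=['transcribe'],
                                initial_prompts=[None],
                                batch_size=32)
elapsed = time.perf_counter() - start
print(f"Transcribed in {elapsed:.1f}s, {len(out[0])} segments")
```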
Acknowledgements:

- Original WhisperS2T: Thanks to shashikg for the original WhisperS2T project that this fork is based on.
- OpenAI Whisper Team: Thanks to the OpenAI Whisper Team for open-sourcing the Whisper model.
- CTranslate2 Team: Thanks to the CTranslate2 Team for providing a fast inference engine for the Transformer architecture.
- NVIDIA NeMo Team: Thanks to the NVIDIA NeMo Team for their contribution of the open-source VAD model used in this pipeline.
This project is licensed under the MIT License - see the LICENSE file for details.
