BBC-Esq/Elegant-Transcriber

4x faster than the fastest Whisper implementation AND more accurate.

  • Batch-transcribe multiple files in a directory, optionally including all sub-directories.
  • Optional timestamps with configurable segment intervals.
  • Works on GPU (CUDA) and CPU, on Windows or Linux.
  • Supported file types: AAC, AMR, ASF, AVI, FLAC, M4A, MKV, MP3, MP4, WAV, WEBM, WMA.
  • Link to article on Medium

Installation

1. Windows (installer)

Download and run Elegant_Transcriber_Setup.exe (right-click and run as administrator)

2. Windows (from source)

Download the latest release, extract it, then navigate to the directory containing main.py and run:

python -m venv .
.\Scripts\activate
python install.py
python main.py

3. Linux (from source)

Download the latest release, extract it, then navigate to the directory containing main.py and run:

python3 -m venv .
source bin/activate
python install.py
python main.py

Benchmarks (GPU)

| Library | Model | Batch | Chunk | VRAM Usage | Time | Real Time | Quality Ranking |
|---|---|---|---|---|---|---|---|
| Elegant Transcriber (NeMo) | Parakeet TDT 0.6B v2 | 1 | 90s | ~3.3 GB | 14.9s | 580x | #8 |
| Transformers | Whisper Large v3 | 32 | Default | ~12.4 GB | 52.2s | 166x | #32 |
| WhisperS2T Reborn (Ctranslate2) | Whisper Large v3 | 32 | Default | ~13.4 GB | 66.9s | 129x | #32 |
| Faster-Whisper (Ctranslate2) | Whisper Large v3 | 32 | Default | ~12.5 GB | 75.9s | 114x | #32 |
| WhisperX (Ctranslate2) | Whisper Large v3 | 32 | Default | ~12.8 GB | 71.8s | 120x | #32 |
| Transformers | Granite 4.0 1B Speech | 12 | 30s | ~6.3 GB | 97.7s | 88x | #1 |
| Elegant Transcriber (NeMo) | Canary-Qwen-2.5b | 1 | 40s | ~11.1 GB | 639.8s | 13.5x | #2 |

All models were run in bfloat16.
All VRAM measurements include model weights and inference overhead and subtract background usage.
All parameters were chosen to maximize throughput, yielding ~90% CUDA core usage on an RTX 4090.
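"Subtract background usage" is a delta measurement: record a baseline before loading the model, then the peak during inference, and report the difference. The same pattern, sketched with Python's stdlib tracemalloc for host RAM (the benchmarks above measured GPU VRAM, presumably via CUDA tooling, so this is an analogy rather than the actual benchmark script):

```python
import tracemalloc

def peak_extra_mb(fn):
    """Run fn and report peak allocation above the pre-existing baseline, in MB."""
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()  # background usage to subtract
    fn()  # e.g. load the model and transcribe
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (peak - baseline) / 1024**2
```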

Benchmarks (CPU)

  • ~13-minute private audio file.
  • CPU tests use a shorter audio sample than the GPU tests to keep runtimes manageable.
| Library | Model | Batch | Chunk | RAM Usage | Time | Real Time | Quality Ranking |
|---|---|---|---|---|---|---|---|
| Elegant Transcriber | Parakeet TDT 0.6B v2 | 1 | 90s | ~5.6 GB | 29.0s | 26.8x | #8 |
| Faster-Whisper (Ctranslate2) | Whisper Large v3 | 1 | Default | ~6.5 GB | 211.8s | 3.67x | #32 |
| WhisperS2T Reborn (Ctranslate2) | Whisper Large v3 | 1 | Default | ~6.6 GB | 257.9s | 3.02x | #32 |
| Transformers | Whisper Large v3 | 1 | Default | ~6.6 GB | 311.1s | 2.50x | #32 |
| Elegant Transcriber (NeMo) | Canary-Qwen-2.5b | 1 | 40s | ~11.1 GB | 370.1s | 2.1x | #2 |
| WhisperX (Ctranslate2) | Whisper Large v3 | 1 | Default | ~7.3 GB | 396.4s | 1.96x | #32 |

All models were loaded in float32 for CPU compatibility.
20 threads were used on an Intel 13900K, resulting in ~90% CPU usage.
I couldn't get Granite Speech to run on CPU...
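The Real Time column is simply audio duration divided by wall-clock transcription time. As a quick sanity check against the CPU table above (assuming the ~13-minute test file is about 777 seconds of audio):

```python
def real_time_factor(audio_seconds, elapsed_seconds):
    """How many seconds of audio are transcribed per second of compute."""
    return audio_seconds / elapsed_seconds

# ~13-minute file (~777 s) transcribed by Parakeet on CPU in 29.0 s:
rtf = real_time_factor(777, 29.0)  # ~26.8x, matching the Parakeet row above
```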

Special Thanks

  • Nvidia for the Parakeet models, which are hands down the best balance of accuracy and compute time for most people, IMHO.
  • IBM for the Granite Speech models, which, as of March 2026, rank #1 on the ASR leaderboard in terms of accuracy. I'll include them in a later release.
  • OpenAI for the older Whisper models, which set the gold standard for so many years.