A comprehensive benchmarking suite for real-time Automatic Speech Recognition (ASR) models, optimized for edge devices (Raspberry Pi 5).
## Features

- Multi-Model Support:
  - `vosk`: Lightweight, offline, fast.
  - `faster-whisper`: Optimized Whisper implementation (CTranslate2).
  - `whisper.cpp`: High-performance C++ implementation (via `pywhispercpp`).
- Modes:
  - Real-time Simulation: Streams audio chunks from a file to simulate varying network latency.
  - Live Audio: Real-time transcription using microphone input (requires PortAudio).
  - Automated Suite: Sequential validation across multiple models and languages.
- Advanced Metrics:
  - RTF (Real-Time Factor): Processing speed ratio.
  - Latency: Average and P90 delay per chunk.
  - TTFT (Time To First Token): Responsiveness metric.
  - WER/CER: Word/Character Error Rates.
  - SemSim (Semantic Similarity): Meaning preservation score (0-1) using `sentence-transformers`.
- Resource Monitoring: Peak Memory (MB) and CPU Usage (%).
- VAD Support: Integration with Voice Activity Detection for optimized processing.
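The core metrics reduce to a few formulas. A minimal sketch, assuming per-chunk latencies in milliseconds and whitespace-tokenized transcripts (these helpers are illustrative, not the suite's actual code):

```python
import statistics


def rtf(processing_time_s: float, audio_duration_s: float) -> float:
    """Real-Time Factor: values below 1.0 mean faster than real time."""
    return processing_time_s / audio_duration_s


def latency_stats(chunk_latencies_ms: list[float]) -> tuple[float, float]:
    """Average and P90 latency per chunk."""
    avg = statistics.fmean(chunk_latencies_ms)
    # quantiles(n=10) returns 9 cut points; the last one is the 90th percentile
    p90 = statistics.quantiles(chunk_latencies_ms, n=10)[-1]
    return avg, p90


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)
```

CER is the same edit-distance ratio computed over characters instead of words.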
## Requirements

- Python 3.10+
- `uv` (recommended)
- `portaudio19-dev` (for live audio; e.g., `sudo apt install portaudio19-dev` on Debian/Ubuntu)
## Installation

```sh
git clone https://github.com/rizalbuilds/asr-benchmark.git
cd asr-benchmark
uv sync                        # Install dependencies
uv run src/download_models.py  # Pre-download all models
```

Models are cached at:

- Faster-Whisper / Semantic: `~/.cache/huggingface/hub/`
- Vosk: `~/.cache/vosk/`
- Whisper.cpp: `~/.local/share/pywhispercpp/models/`
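To confirm the downloads landed, you can check the cache directories listed above. A quick sanity check (hypothetical helper, not part of the project):

```python
from pathlib import Path

# Cache locations as documented above.
CACHE_DIRS = {
    "faster-whisper / semantic": Path.home() / ".cache/huggingface/hub",
    "vosk": Path.home() / ".cache/vosk",
    "whisper.cpp": Path.home() / ".local/share/pywhispercpp/models",
}


def cache_status() -> dict[str, bool]:
    """Return whether each model cache directory exists."""
    return {name: path.is_dir() for name, path in CACHE_DIRS.items()}


if __name__ == "__main__":
    for name, present in cache_status().items():
        print(f"{name:28s} {'OK' if present else 'missing'}")
```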
## Usage

### Automated Suite

Run benchmarks on all models against files in `audio_files/`:

```sh
# Organize audio: audio_files/en/*.wav, audio_files/id/*.wav
uv run src/benchmark_suite.py
```

Results are saved to `benchmark_results.json`.
### Single Run

```sh
# File-based transcription
uv run src/main.py --runner faster-whisper --model tiny.en --audio test.wav

# Live microphone input with VAD
uv run src/main.py --runner faster-whisper --model tiny.en --live --vad
```

| Argument | Description | Default |
|---|---|---|
| `--runner` | `vosk`, `faster-whisper`, or `whisper-cpp` | Required |
| `--model` | Model name/path | Required |
| `--audio` | Path to audio file (ignored if `--live`) | Required (unless live) |
| `--live` | Use microphone input | False |
| `--vad` | Enable Voice Activity Detection | False |
| `--chunk-ms` | Chunk duration in ms | 1000 |
| `--reference` | Path to ground-truth text file | None |
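In real-time simulation mode, the file is split into `--chunk-ms` sized pieces that are released at wall-clock rate, so the recognizer sees the same pacing it would from a live stream. A minimal sketch of that idea for a 16-bit PCM WAV (illustrative only, not the suite's implementation):

```python
import time
import wave
from collections.abc import Iterator


def stream_chunks(path: str, chunk_ms: int = 1000,
                  realtime: bool = True) -> Iterator[bytes]:
    """Yield raw PCM chunks of chunk_ms duration from a WAV file.

    With realtime=True, sleeps between chunks to mimic live input pacing.
    """
    with wave.open(path, "rb") as wf:
        frames_per_chunk = wf.getframerate() * chunk_ms // 1000
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:
                break
            if realtime:
                time.sleep(chunk_ms / 1000)  # pace delivery at real-time speed
            yield data
```

Each yielded chunk would be fed to the selected runner; per-chunk latency is then measured from chunk arrival to transcript emission.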
## Performance Tips (Raspberry Pi 5)

- Quantization: Use `--quantization int8` (default).
- Threads: `whisper.cpp` and `faster-whisper` defaults are tuned for 4 threads.
- VAD: Enable `--vad` to save compute on silent chunks.
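The compute saving from VAD comes from skipping inference on chunks with no speech. As a stand-in for whatever VAD backend the suite uses (not specified here), a crude energy gate illustrates the idea:

```python
import math
import struct


def is_speech(pcm16: bytes, threshold: float = 500.0) -> bool:
    """Crude energy gate: True if the RMS of 16-bit mono PCM exceeds threshold.

    Real VAD models (e.g. trained neural detectors) are far more robust;
    this is only to show where the check slots into the pipeline.
    """
    if not pcm16:
        return False
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold
```

In a streaming loop, chunks where `is_speech()` returns False would bypass the recognizer entirely, which is where the savings on mostly-silent audio come from.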
## License

MIT