A comprehensive benchmarking platform for automatic speech recognition (ASR) systems. This tool provides controlled audio degradation, audio enhancement, loudness normalization, multiple evaluation metrics, and comparative analysis across languages and model architectures.
Try the interactive visualization dashboard online:
👉 ASR.lab Demo on Hugging Face Spaces
ASR.lab enables systematic evaluation of speech recognition engines under various acoustic conditions. It supports multiple ASR frameworks, applies configurable audio degradations, tests audio enhancement algorithms, and generates detailed performance reports with interactive visualizations.
- Multi-Engine Support: Compare performance across different ASR frameworks (Whisper, Wav2Vec2, NeMo, Vosk, SeamlessM4T, Moonshine, SenseVoice, etc.)
- Audio Degradation: Apply controlled acoustic degradations (reverb, noise, compression) via VST3 plugins
- Audio Enhancement: Test denoising/enhancement algorithms (Demucs, DeepFilterNet) on degraded audio
- Loudness Normalization: Grid search across different LUFS normalization levels (EBU R128 compliant)
- Evaluation Metrics: WER, CER, MER, WIL, WIP for comprehensive transcription analysis
- Interactive Reports: HTML reports with JSON-driven client-side visualizations (no heavy Pandas/Plotly dependency), lazy JS character-diffs, sortable tables, and multi-filter dropdowns
- Multilingual: Support for multiple languages and language-specific models
- Extensible: Plugin architecture for adding new engines and metrics
- Grid Search: Automatic Cartesian product of all test parameters (degradation × enhancement × normalization)
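As a rough illustration of what the loudness-normalization stage does, here is a minimal gain-based sketch. It uses plain RMS as a stand-in for loudness; real EBU R128/LUFS measurement applies K-weighting and gating (typically via a dedicated library), so this is not the tool's actual implementation:

```python
import math

def rms_dbfs(samples):
    """RMS level of float samples (range -1..1) in dBFS."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

def normalize_to_target(samples, target_dbfs=-23.0):
    """Apply a single linear gain so the RMS level hits target_dbfs."""
    gain_db = target_dbfs - rms_dbfs(samples)
    gain = 10 ** (gain_db / 20)
    return [s * gain for s in samples]
```

The same gain-to-target idea underlies the -23.0 LUFS "broadcast" preset shown in the configuration section below.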
The benchmark pipeline processes audio in the following order:
Original Audio → Degradation (VST3) → Enhancement (Demucs) → Normalization (LUFS) → ASR Engine → Metrics
Each stage is optional and configurable. Multiple options at each stage create a grid search.
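The grid expansion can be sketched with a Cartesian product; the option names below are illustrative placeholders, not the tool's actual identifiers (None stands for a skipped stage):

```python
from itertools import product

# Hypothetical option lists for each pipeline stage.
degradations = [None, "cathedral"]
enhancements = [None, "demucs"]
normalizations = ["broadcast", "no_norm"]
engines = ["whisper-tiny", "wav2vec2-fr"]

# Every combination becomes one benchmark run.
runs = list(product(degradations, enhancements, normalizations, engines))
print(len(runs))  # 2 * 2 * 2 * 2 = 16 runs
```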
| Engine | Status | Notes |
|---|---|---|
| Whisper (OpenAI) | ✅ Tested | Multilingual, long-form transcription support |
| Wav2Vec 2.0 (Meta) | ✅ Tested | Language-specific fine-tuning, outputs normalized to lowercase |
| SeamlessM4T (Meta) | ✅ Tested | Detects v2 models, selects appropriate model class, and caps generation tokens (max_new_tokens=256) |
| NeMo (NVIDIA) | ✅ Tested | Windows support via runtime SIGKILL compatibility patch; runtime setup handled automatically |
| Vosk | ✅ Tested | Offline recognition; models auto-downloaded and extracted when needed |
| Moonshine (Useful Sensors) | ✅ Tested | English-only; on-device, very low footprint; models: moonshine-tiny, moonshine-base |
| SenseVoice (Alibaba / FunAudio) | ✅ Tested | Multilingual (zh, en, ja, ko, yue), auto language detection, emotion & event detection |
| Metric | Name | Description |
|---|---|---|
| WER | Word Error Rate | Standard ASR metric: (S+D+I)/N |
| CER | Character Error Rate | Better for CJK languages |
| MER | Match Error Rate | Bounded version of WER (0-1) |
| WIL | Word Information Lost | Proportion of information lost |
| WIP | Word Information Preserved | Complement of WIL |
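The (S+D+I)/N formula can be made concrete with a self-contained word-level edit-distance implementation (in practice a library such as jiwer is normally used; this sketch is for illustration only):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.333
```

CER is the same computation over characters instead of words, which is why it is more informative for CJK languages.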
Text normalization is applied systematically as a grid search dimension. Each transcription generates two results:
| Preset | Description |
|---|---|
| raw | Raw text, no transformation |
| normalized | Lowercase + punctuation removed + normalized whitespace |
Normalized preset applies:
- Lowercase conversion: Case-insensitive comparison
- Punctuation removal: Ignores punctuation differences
- Whitespace normalization: Collapses multiple spaces, trims
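The three steps above amount to something like the following sketch (the tool's actual normalizer may differ in details such as Unicode punctuation handling):

```python
import re
import string

def normalize_text(text: str) -> str:
    """Lowercase, strip ASCII punctuation, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("Hello,  World!"))  # hello world
```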
In the interactive report, use the "Texte" dropdown to:
- View both raw and normalized results side-by-side
- Filter to raw only to see exact ASR output
- Filter to normalized only for standard ASR evaluation
Visual Encoding in Interactive Reports:
- Symbol = Degradation type (circle = original, diamond = reverb, etc.)
- Color = Engine (whisper, nemo, etc.)
- Size = Text normalization (Large = Normalized, Small = Raw)
- Python 3.12 or higher
- Optional CUDA-capable GPU
```shell
git clone https://github.com/berangerthomas/ASR.lab.git
cd ASR.lab
uv sync
```

Benchmarks are defined in YAML configuration files located in `configs/`. See `configs/example.yaml` for a complete configuration reference.
```yaml
benchmark:
  name: "my_benchmark"
  description: "Description of the benchmark"

data:
  audio_source_dir: "data/audio"  # Directory, file, or glob pattern
  processed_dir: "data/processed"

audio_processing:
  sample_rate: 16000
  channels: 1

# Loudness normalization (grid search)
normalizations:
  - name: "broadcast"
    enabled: true
    method: "lufs"
    target_loudness: -23.0
  - name: "no_norm"
    enabled: true
    method: "none"

# Audio degradation via VST3 plugins
degradations:
  vst_plugin_path: "path/to/reverb.vst3"
  presets:
    - name: "cathedral"
      preset_name: "Cathedral"

# Audio enhancement/denoising
enhancements:
  - name: "demucs"
    enabled: true
    method: "demucs"
    model_name: "htdemucs"

engines:
  whisper:
    - id: "whisper-tiny"
      model_id: "openai/whisper-tiny"
      enabled: true
      chunk_length_s: 30
  wav2vec2:
    - id: "wav2vec2-fr"
      model_id: "facebook/wav2vec2-large-xlsr-53-french"
      enabled: true

metrics:
  - name: "wer"
    enabled: true
  - name: "cer"
    enabled: true
```

```shell
python main.py run -c configs/default.yaml
```

This will:
- Run transcriptions with all configured engines
- Compute metrics for both raw and normalized text (grid search dimension)
- Generate an interactive HTML report with dropdown filters including text normalization
The generated report_interactive.html includes:
- Multi-filter dropdowns: Filter by language, engine, degradation, enhancement, audio normalization, and text normalization
- Performance Overview: Scatter plot (time vs. metric) with subplots per language
- Metrics Visualization: Heatmap of all metrics + configurable box plots (group/color by engine, language, degradation, etc.)
- Cross-Language Analysis: Grouped bar chart (engine × language), engine × language heatmap, language consistency chart, and aggregated statistics table (mean ± std)
- Transcription Analysis: Side-by-side reference vs. hypothesis with word-level and character-level diff highlighting
- Results Summary: Sortable data table with CSV export
- Interactive Metric Normalization: Toggle between raw and normalized text — metrics update via filter
Results are saved to results/reports/<config>/:
- report_interactive.html: Interactive HTML report with embedded metrics variants
- results.csv: Results in CSV format
- raw_results.json: JSON backup with reference and hypothesis text
Open results/reports/<config>/report_interactive.html in a web browser. Use the sticky filter bar to slice results by language, engine, degradation, enhancement, or normalization. All tabs update in sync.
Manifest files are JSON files listing audio samples and their reference transcriptions.
The audio_source_dir config value supports three modes:
| Mode | Example | Behavior |
|---|---|---|
| Directory | audio_source_dir: "data/audio" | Loads all *.json files in the directory |
| Single file | audio_source_dir: "data/audio/manifest_en.json" | Loads that file only |
| Glob pattern | audio_source_dir: "data/audio/manifest_fr*.json" | Loads all matching .json files |
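The three modes can be resolved with pathlib, as in this illustrative sketch (not the tool's actual loader):

```python
from pathlib import Path

def resolve_manifests(audio_source_dir: str) -> list[Path]:
    """Resolve a directory, single file, or glob pattern to manifest paths."""
    p = Path(audio_source_dir)
    if p.is_dir():
        return sorted(p.glob("*.json"))  # directory: all *.json inside
    if p.is_file():
        return [p]                       # single file: just that manifest
    # otherwise treat it as a glob pattern relative to its parent directory
    return sorted(p.parent.glob(p.name))
```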
Each manifest is a JSON array:
```json
[
  {
    "audio_filepath": "sample1.wav",
    "text": "This is a sample transcription.",
    "lang": "en"
  },
  {
    "audio_filepath": "sample2.wav",
    "text": "Ceci est une transcription.",
    "lang": "fr"
  }
]
```

Required fields:
- audio_filepath: Path to the audio file (relative to the manifest location or absolute)
- text: Reference transcription (ground truth)
- lang: Language code (ISO 639-1: en, fr, de, es, etc.)
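A minimal sketch of loading and validating such a manifest (field presence checks only; the tool itself may validate more):

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"audio_filepath", "text", "lang"}

def load_manifest(path: str) -> list[dict]:
    """Load a manifest and verify each entry carries the required fields."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    for i, entry in enumerate(entries):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing fields: {sorted(missing)}")
    return entries
```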
Audio format requirements:
- WAV format recommended
- 16kHz sample rate (or configure via audio_processing.sample_rate)
- Mono channel recommended
Most models automatically use CUDA if available. Check GPU usage:
```python
import torch
print(torch.cuda.is_available())
```

If you run out of GPU memory, reduce the batch size or use smaller models. For Whisper, use the tiny or base variants.
Ensure audio files are:
- WAV format
- 16kHz sample rate (or configure accordingly)
- Mono channel
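A quick way to check a file against these requirements is the standard-library wave module, as in this sketch (it inspects the header only, so it works for uncompressed WAV files):

```python
import wave

def check_wav(path: str, expected_rate: int = 16000) -> list[str]:
    """Return a list of problems found in a WAV file's header."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != expected_rate:
            problems.append(f"sample rate is {w.getframerate()}, expected {expected_rate}")
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels, expected mono")
    return problems
```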
Engine-specific prerequisites are handled automatically when running a benchmark:
- Vosk: Models listed in the configuration (model_path) are downloaded and extracted on the fly if not already present.
- NeMo on Windows: The signal.SIGKILL compatibility patch is applied at runtime (exp_manager.py).
Both mechanisms are implemented in src/asr_lab/setup (utilities: engine_setup, nemo_patch, vosk_setup). The benchmark runner calls ensure_engines_ready(...) before engine initialization so most manual setup steps are no longer necessary.
See LICENSE file for details.
Contributions are welcome. Please submit pull requests or open issues for bugs and feature requests.