A comprehensive benchmarking platform for automatic speech recognition (ASR) systems. This tool provides controlled audio degradation, audio enhancement, loudness normalization, multiple evaluation metrics, and comparative analysis across languages and model architectures.
Try the interactive visualization dashboard online:
👉 ASR.lab Demo on Hugging Face Spaces
ASR.lab enables systematic evaluation of speech recognition engines under various acoustic conditions. It supports multiple ASR frameworks, applies configurable audio degradations, tests audio enhancement algorithms, and generates detailed performance reports with interactive visualizations.
- Multi-Engine Support: Compare performance across different ASR frameworks (Whisper, Wav2Vec2, NeMo, Vosk, SeamlessM4T, Moonshine, SenseVoice, etc.)
- Audio Degradation: Apply controlled acoustic degradations (reverb, noise, compression) via VST3 plugins
- Audio Enhancement: Test denoising/enhancement algorithms (Demucs, DeepFilterNet) on degraded audio
- Loudness Normalization: Grid search across different LUFS normalization levels (EBU R128 compliant)
- Evaluation Metrics: WER, CER, MER, WIL, WIP for comprehensive transcription analysis
- Interactive Reports: HTML reports with JSON-driven client-side visualizations (no heavy Pandas/Plotly dependency), lazy JS character-diffs, sortable tables, and multi-filter dropdowns
- Multilingual: Support for multiple languages and language-specific models
- Extensible: Plugin architecture for adding new engines and metrics
- Grid Search: Automatic Cartesian product of all test parameters (degradation × enhancement × normalization)
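As a rough illustration of what the loudness-normalization stage does, here is a minimal gain-based sketch. It uses plain RMS as a stand-in for loudness; real EBU R128/LUFS measurement applies K-weighting and gating (typically via a dedicated library), so this is not the tool's actual implementation:

```python
import math

def rms_dbfs(samples):
    """RMS level of float samples (range -1..1) in dBFS."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

def normalize_to_target(samples, target_dbfs=-23.0):
    """Apply a single linear gain so the RMS level hits target_dbfs."""
    gain_db = target_dbfs - rms_dbfs(samples)
    gain = 10 ** (gain_db / 20)
    return [s * gain for s in samples]
```

The same gain-to-target idea underlies the -23.0 LUFS "broadcast" preset shown in the configuration section below.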
The benchmark pipeline processes audio in the following order:
Original Audio → Degradation (VST3) → Enhancement (Demucs) → Normalization (LUFS) → ASR Engine → Metrics
Each stage is optional and configurable. Multiple options at each stage create a grid search.
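The grid expansion can be sketched with a Cartesian product; the option names below are illustrative placeholders, not the tool's actual identifiers (None stands for a skipped stage):

```python
from itertools import product

# Hypothetical option lists for each pipeline stage.
degradations = [None, "cathedral"]
enhancements = [None, "demucs"]
normalizations = ["broadcast", "no_norm"]
engines = ["whisper-tiny", "wav2vec2-fr"]

# Every combination becomes one benchmark run.
runs = list(product(degradations, enhancements, normalizations, engines))
print(len(runs))  # 2 * 2 * 2 * 2 = 16 runs
```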
| Engine | Status | Notes |
|---|---|---|
| Whisper (OpenAI) | ✅ Tested | Multilingual, long-form transcription support |
| Wav2Vec 2.0 (Meta) | ✅ Tested | Language-specific fine-tuning, outputs normalized to lowercase |
| SeamlessM4T (Meta) | ✅ Tested | Detects v2 models, selects appropriate model class, and caps generation tokens (max_new_tokens=256) |
| NeMo (NVIDIA) | ✅ Tested | Windows support via runtime SIGKILL compatibility patch; runtime setup handled automatically |
| Vosk | ✅ Tested | Offline recognition; models auto-downloaded and extracted when needed |
| Moonshine (Useful Sensors) | ✅ Tested | English-only; on-device, very low footprint; models: moonshine-tiny, moonshine-base |
| SenseVoice (Alibaba / FunAudio) | ✅ Tested | Multilingual (zh, en, ja, ko, yue), auto language detection, emotion & event detection |
| Metric | Name | Description |
|---|---|---|
| WER | Word Error Rate | Standard ASR metric: (S+D+I)/N |
| CER | Character Error Rate | Better for CJK languages |
| MER | Match Error Rate | Bounded version of WER (0-1) |
| WIL | Word Information Lost | Proportion of information lost |
| WIP | Word Information Preserved | Complement of WIL |
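The (S+D+I)/N formula can be made concrete with a self-contained word-level edit-distance implementation (in practice a library such as jiwer is normally used; this sketch is for illustration only):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.333
```

CER is the same computation over characters instead of words, which is why it is more informative for CJK languages.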
Text normalization is applied systematically as a grid search dimension. Each transcription generates two results:
| Preset | Description |
|---|---|
| raw | Raw text, no transformation |
| normalized | Lowercase + punctuation removed + normalized whitespace |
Normalized preset applies:
- Lowercase conversion: Case-insensitive comparison
- Punctuation removal: Ignores punctuation differences
- Whitespace normalization: Collapses multiple spaces, trims
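The three steps above amount to something like the following sketch (the tool's actual normalizer may differ in details such as Unicode punctuation handling):

```python
import re
import string

def normalize_text(text: str) -> str:
    """Lowercase, strip ASCII punctuation, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("Hello,  World!"))  # hello world
```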
In the interactive report, use the "Texte" dropdown to:
- View both raw and normalized results side-by-side
- Filter to raw only to see exact ASR output
- Filter to normalized only for standard ASR evaluation
Visual Encoding in Interactive Reports:
- Symbol = Degradation type (circle = original, diamond = reverb, etc.)
- Color = Engine (whisper, nemo, etc.)
- Size = Text normalization (Large = Normalized, Small = Raw)
- Python 3.12 or higher
- Optional CUDA-capable GPU
```shell
git clone https://github.com/berangerthomas/ASR.lab.git
cd ASR.lab
uv sync
```

Benchmarks are defined in YAML configuration files located in `configs/`. See `configs/example.yaml` for a complete configuration reference.
```yaml
benchmark:
  name: "my_benchmark"
  description: "Description of the benchmark"

data:
  audio_source_dir: "data/audio"  # Directory, file, or glob pattern
  processed_dir: "data/processed"

audio_processing:
  sample_rate: 16000
  channels: 1

# Loudness normalization (grid search)
normalizations:
  - name: "broadcast"
    enabled: true
    method: "lufs"
    target_loudness: -23.0
  - name: "no_norm"
    enabled: true
    method: "none"

# Audio degradation via VST3 plugins
degradations:
  vst_plugin_path: "path/to/reverb.vst3"
  presets:
    - name: "cathedral"
      preset_name: "Cathedral"

# Audio enhancement/denoising
enhancements:
  - name: "demucs"
    enabled: true
    method: "demucs"
    model_name: "htdemucs"

engines:
  whisper:
    - id: "whisper-tiny"
      model_id: "openai/whisper-tiny"
      enabled: true
      chunk_length_s: 30
  wav2vec2:
    - id: "wav2vec2-fr"
      model_id: "facebook/wav2vec2-large-xlsr-53-french"
      enabled: true

metrics:
  - name: "wer"
    enabled: true
  - name: "cer"
    enabled: true
```

```shell
python main.py run -c configs/default.yaml
```

This will:
- Run transcriptions with all configured engines
- Compute metrics for both raw and normalized text (grid search dimension)
- Generate an interactive HTML report with dropdown filters including text normalization
The generated report_interactive.html includes:
- Multi-filter dropdowns: Filter by language, engine, degradation, enhancement, audio normalization, and text normalization
- Performance Overview: Scatter plot (time vs. metric) with subplots per language
- Metrics Visualization: Heatmap of all metrics + configurable box plots (group/color by engine, language, degradation, etc.)
- Cross-Language Analysis: Grouped bar chart (engine × language), engine × language heatmap, language consistency chart, and aggregated statistics table (mean ± std)
- Transcription Analysis: Side-by-side reference vs. hypothesis with word-level and character-level diff highlighting
- Results Summary: Sortable data table with CSV export
- Interactive Metric Normalization: Toggle between raw and normalized text — metrics update via filter
Results are saved to results/reports/<config>/:
- report_interactive.html: Interactive HTML report with embedded metrics variants
- results.csv: Results in CSV format
- raw_results.json: JSON backup with reference and hypothesis text
Open results/reports/<config>/report_interactive.html in a web browser. Use the sticky filter bar to slice results by language, engine, degradation, enhancement, or normalization. All tabs update in sync.
Manifest files are JSON files listing audio samples and their reference transcriptions.
The audio_source_dir config value supports three modes:
| Mode | Example | Behavior |
|---|---|---|
| Directory | audio_source_dir: "data/audio" | Loads all *.json files in the directory |
| Single file | audio_source_dir: "data/audio/manifest_en.json" | Loads that file only |
| Glob pattern | audio_source_dir: "data/audio/manifest_fr*.json" | Loads all matching .json files |
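The three modes can be resolved with pathlib, as in this illustrative sketch (not the tool's actual loader):

```python
from pathlib import Path

def resolve_manifests(audio_source_dir: str) -> list[Path]:
    """Resolve a directory, single file, or glob pattern to manifest paths."""
    p = Path(audio_source_dir)
    if p.is_dir():
        return sorted(p.glob("*.json"))  # directory: all *.json inside
    if p.is_file():
        return [p]                       # single file: just that manifest
    # otherwise treat it as a glob pattern relative to its parent directory
    return sorted(p.parent.glob(p.name))
```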
Each manifest is a JSON array:
```json
[
  {
    "audio_filepath": "sample1.wav",
    "text": "This is a sample transcription.",
    "lang": "en"
  },
  {
    "audio_filepath": "sample2.wav",
    "text": "Ceci est une transcription.",
    "lang": "fr"
  }
]
```

Required fields:
- audio_filepath: Path to the audio file (relative to the manifest location or absolute)
- text: Reference transcription (ground truth)
- lang: Language code (ISO 639-1: en, fr, de, es, etc.)
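A minimal sketch of loading and validating such a manifest (field presence checks only; the tool itself may validate more):

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"audio_filepath", "text", "lang"}

def load_manifest(path: str) -> list[dict]:
    """Load a manifest and verify each entry carries the required fields."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    for i, entry in enumerate(entries):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing fields: {sorted(missing)}")
    return entries
```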
Audio format requirements:
- WAV format recommended
- 16kHz sample rate (or configure via audio_processing.sample_rate)
- Mono channel recommended
Most models automatically use CUDA if available. Check GPU usage:
```python
import torch
print(torch.cuda.is_available())
```

If you run out of GPU memory, reduce the batch size or use smaller models. For Whisper, use the tiny or base variants.
Ensure audio files are:
- WAV format
- 16kHz sample rate (or configure accordingly)
- Mono channel
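A quick way to check a file against these requirements is the standard-library wave module, as in this sketch (it inspects the header only, so it works for uncompressed WAV files):

```python
import wave

def check_wav(path: str, expected_rate: int = 16000) -> list[str]:
    """Return a list of problems found in a WAV file's header."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != expected_rate:
            problems.append(f"sample rate is {w.getframerate()}, expected {expected_rate}")
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels, expected mono")
    return problems
```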
Engine-specific prerequisites are handled automatically when running a benchmark:
- Vosk: Models listed in the configuration (model_path) are downloaded and extracted on the fly if not already present.
- NeMo on Windows: The signal.SIGKILL compatibility patch is applied at runtime (exp_manager.py).
Both mechanisms are implemented in src/asr_lab/setup (utilities: engine_setup, nemo_patch, vosk_setup). The benchmark runner calls ensure_engines_ready(...) before engine initialization so most manual setup steps are no longer necessary.
See LICENSE file for details.
Contributions are welcome. Please submit pull requests or open issues for bugs and feature requests.