- Code cleanup: Removed unused API-based cloud engines (`azure.py`, `deepgram.py`, `google.py`) and the inactive engine (`usm.py`).
- Unused metrics: Cleaned up orphaned or irrelevant metrics modules (`bleu.py`, `perplexity.py`). Retained `rtf.py` and `latency.py` for potential future scaling.
- Legacy reporting: Removed the old non-interactive static HTML reporting script and template (`report.py`, `report_template.html`), as well as leftover backup files, completing the transition to the interactive `report_interactive.py`.
- Moonshine engine (`src/asr_lab/engines/moonshine.py`): new `MoonshineEngine` supporting `UsefulSensors/moonshine-base` and `UsefulSensors/moonshine-tiny`. Uses the Hugging Face Transformers pipeline. English-only; no additional dependencies required.
- SenseVoice engine (`src/asr_lab/engines/sensevoice.py`): new `SenseVoiceEngine` supporting `FunAudioLLM/SenseVoiceSmall` via the FunASR library. Multilingual with automatic language detection (`auto`, `zh`, `en`, `ja`, `ko`, `yue`). Applies Inverse Text Normalization by default.
- Cross-language analysis: new "Cross-Language Analysis" tab in the interactive report with a grouped bar chart (engine × language), an engine × language heatmap, a language consistency chart, and an aggregated statistics table.
- Language filter in the global filter bar.
- Language column in the results summary table and language badge in transcription cards.
- Language as a "Group By" / "Color By" option in box plots.
- Language included in heatmap row labels.
- `aggregator.py`: statistics module (`aggregate_by`, `cross_language_matrix`, `language_consistency`) for multi-file and cross-language aggregation.
- HTML report, "Detailed transcription analysis" tab: sort buttons and time metric.
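The three `aggregator.py` functions are named above, but their signatures are not. The following is a minimal sketch that makes two assumptions. First, each result is a flat dict with `engine`, `language` and `wer` keys. Second, "language consistency" means the population standard deviation of an engine's per-language mean WER. Neither the record shape nor that definition is taken from the actual module.

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate_by(records, key, metric="wer"):
    """Mean of `metric` for each distinct value of `key` (e.g. "engine" or "language")."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec[metric])
    return {group: mean(values) for group, values in groups.items()}


def cross_language_matrix(records, metric="wer"):
    """Nested dict engine -> language -> mean metric, i.e. the heatmap cells."""
    cells = defaultdict(lambda: defaultdict(list))
    for rec in records:
        cells[rec["engine"]][rec["language"]].append(rec[metric])
    return {
        engine: {lang: mean(vals) for lang, vals in langs.items()}
        for engine, langs in cells.items()
    }


def language_consistency(records, metric="wer"):
    """Per engine, spread of its per-language means (assumed: population std dev; lower = more consistent)."""
    matrix = cross_language_matrix(records, metric)
    return {engine: pstdev(list(by_lang.values())) for engine, by_lang in matrix.items()}
```

Averaging inside each (engine, language) cell before computing the spread keeps a language with many files from dominating the consistency score.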
- Client-Side Rendering: Replaced server-side Plotly/Pandas plotting with JSON chart data (`chart_data_json`) for responsive client-side UI rendering.
- Reporting System: `InteractiveReportGenerator` now prepares a JSON-serializable records list instead of generating an HTML Plotly div.
- Engine Metrics: Updated `engine_registry`, `nemo`, `vosk`, `aggregator`, and `export` to expose the explicit metadata/metrics required by the new client-side visualizations.
- CSV export now includes `enhancement`, `audio_norm`, and `text_norm` columns and reads `language` directly from results instead of parsing it from the dataset name.
- Scatter chart `customdata` includes language for filter support.
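`InteractiveReportGenerator` now emits a JSON-serializable records list, so the same records can feed both `chart_data_json` and the CSV export. This is a minimal sketch that assumes flat dict records. The helpers `to_chart_data_json` and `to_csv` are hypothetical, and so is the column order. The `</` escape matters only if the JSON is inlined in a `<script>` tag, which is also an assumption about the template.

```python
import csv
import io
import json

# Hypothetical column order; the real export may differ.
CSV_COLUMNS = ["engine", "language", "enhancement", "audio_norm", "text_norm", "wer"]


def to_chart_data_json(records):
    """Serialize records for embedding inside an inline <script> tag."""
    payload = json.dumps(records, ensure_ascii=False)
    # A literal "</" (e.g. a transcript containing "</script>") would close the
    # script element early; "<\/" is an equivalent escape inside JSON strings.
    return payload.replace("</", "<\\/")


def to_csv(records):
    """Write the same records as CSV, reading `language` straight from each record."""
    buffer = io.StringIO()
    # Extra keys such as transcripts are ignored; missing columns are left empty.
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Reading `language` from the record, rather than parsing the dataset name, keeps both outputs correct when one dataset mixes languages.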
- `audio_source_dir` now accepts a directory, a single file, or a glob pattern:
  - Directory (e.g. `"data/audio"`): loads all `*.json` manifest files in the directory.
  - Single file (e.g. `"data/audio/manifest_en.json"`): loads that manifest only.
  - Glob pattern (e.g. `"data/audio/manifest_fr*.json"`): loads all matching `.json` files.

  Previously, only a directory was supported and a single hardcoded `manifest.json` was expected.
- Relative audio paths in manifests are now resolved from the manifest's parent directory (instead of from `audio_source_dir`).
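The three accepted forms of `audio_source_dir` and the new relative-path rule can be sketched with `pathlib` and `glob`. The sketch assumes each manifest is a JSON array of objects with an `audio_filepath` field. That field name and the helpers `resolve_manifests` and `load_entries` are hypothetical.

```python
import glob
import json
from pathlib import Path


def resolve_manifests(audio_source_dir):
    """Return manifest paths for a directory, a single file, or a glob pattern."""
    source = Path(audio_source_dir)
    if source.is_dir():
        return sorted(source.glob("*.json"))
    if source.is_file():
        return [source]
    # Anything else is treated as a glob pattern such as "data/audio/manifest_fr*.json".
    return sorted(Path(p) for p in glob.glob(str(audio_source_dir)) if p.endswith(".json"))


def load_entries(manifest_path):
    """Load one manifest, resolving relative audio paths against its parent directory."""
    manifest_path = Path(manifest_path)
    entries = json.loads(manifest_path.read_text(encoding="utf-8"))
    for entry in entries:
        audio = Path(entry["audio_filepath"])  # hypothetical field name
        if not audio.is_absolute():
            entry["audio_filepath"] = str(manifest_path.parent / audio)
    return entries
```

The function checks for an existing directory or file before falling back to glob matching. That way, a plain path is never reinterpreted as a pattern.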
- Fix box plot controls: the "Normalization" option was not wired to any data attribute; replaced it with distinct "Audio Norm" and "Text Norm" options matching the existing JS switch cases.
- Convert Demucs-separated vocals to mono before writing (average channels) and simplify saving logic to write a 1-D waveform. This ensures ASR pipelines receive mono audio and avoids incorrect transposes.
- Import `os` and close the file descriptor returned by `tempfile.mkstemp` immediately, to avoid descriptor leaks and to allow the downloader to open the temp file by path.
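The downmix and the `mkstemp` fix fit together in one stdlib-only sketch. It assumes the separated vocals arrive as a channels-first list of float sample lists in [-1, 1]. Real Demucs output is a tensor, where the equivalent downmix is a mean over the channel axis. `downmix_to_mono` and `write_mono_wav` are hypothetical helpers, and 16-bit PCM is an arbitrary choice for the example.

```python
import os
import struct
import tempfile
import wave


def downmix_to_mono(channels):
    """Average a channels-first signal (equal-length sample lists) into one 1-D list."""
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]


def write_mono_wav(samples, sample_rate):
    """Write 1-D float samples to a temporary 16-bit mono WAV and return its path."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)  # close the OS-level descriptor at once; the file is reopened by path
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    pcm = b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        out.setframerate(sample_rate)
        out.writeframes(pcm)
    return path
```

Because the writer only ever receives a 1-D sequence, no transpose is needed: each sample maps to exactly one mono frame.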
- Server-side diffs: Removed server-side character-level alignment and heavy Pandas usage in favor of a lazy JavaScript char-diff implementation.
- Delete unused `visualizer.py` (Matplotlib/Seaborn) and `visualizer_plotly.py`: all visualizations are now handled by the interactive report template via client-side JS.
- Introduce a pre-flight engine setup system and related fixes/enhancements:
  - Add `src/asr_lab/setup` (`engine_setup`, `nemo_patch`, `vosk_setup` and the package `__init__`) to prepare engine-specific prerequisites at runtime: the NeMo Windows SIGKILL compatibility patch and automatic Vosk model download and extraction. `BenchmarkRunner` calls `ensure_engines_ready(...)` so engines are prepared before initialization.
  - Update interactive report template: make the filters bar sticky with backdrop blur and add an `IntersectionObserver` to toggle its shadow on scroll.
  - Improve `SeamlessM4T` engine: detect v2 models, select the appropriate model class, and cap generation tokens (`max_new_tokens=256`); minor import adjustments.
  - Move `deepfilternet` into an optional dependency group (`deepfilter`) in `pyproject.toml`.
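`ensure_engines_ready(...)` is named above, but its signature is not. One way to structure such a pre-flight step is a registry of idempotent per-engine hooks that `BenchmarkRunner` invokes before constructing engines. The sketch assumes that design. `SETUP_HOOKS`, `setup_hook` and the placeholder Vosk hook are hypothetical.

```python
from typing import Callable, Iterable

# Hypothetical registry of per-engine setup hooks; engines without one need no preparation.
SETUP_HOOKS: dict[str, Callable[[], None]] = {}
_done: set[str] = set()


def setup_hook(engine_name: str):
    """Decorator that registers a pre-flight step for one engine."""
    def register(func: Callable[[], None]) -> Callable[[], None]:
        SETUP_HOOKS[engine_name] = func
        return func
    return register


def ensure_engines_ready(engine_names: Iterable[str]) -> list[str]:
    """Run each requested engine's hook at most once; return the engines prepared by this call."""
    prepared = []
    for name in engine_names:
        hook = SETUP_HOOKS.get(name)
        if hook is None or name in _done:
            continue
        hook()  # e.g. apply a compatibility patch or fetch a missing model
        _done.add(name)
        prepared.append(name)
    return prepared


@setup_hook("vosk")
def _prepare_vosk() -> None:
    # Placeholder: a real hook would check for the model directory and
    # download/extract the model only when it is missing.
    pass
```

Keeping hooks idempotent means repeated benchmark runs in one process pay the setup cost only once.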
- Systematic text normalization: each transcription generates two results (raw + normalized).
- Normalized preset: lowercase + remove punctuation + normalize spaces.
- Diff view shows both raw and normalized texts, if selected.
- Configurable text transforms: `ToLowerCase()`, `RemovePunctuation()`, `ExpandCommonEnglishContractions()`.
- Metrics and transforms computed consistently with the jiwer library.
- Symbol = Degradation type (circle = original, diamond = reverb, etc.)
- Color = Engine (whisper = blue, nemo = purple, etc.)
- Size = Text normalization (normalized = large, raw = small)
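The raw + normalized pairing from the text-normalization entries can be approximated without jiwer. This minimal sketch assumes results are dicts with `reference` and `hypothesis` fields (hypothetical names). Punctuation here means any Unicode character whose category starts with "P", and whitespace runs collapse to single spaces. It stands in for the project's jiwer `ToLowerCase()` / `RemovePunctuation()` chain rather than copying it, and it does not expand contractions.

```python
import unicodedata


def remove_punctuation(text: str) -> str:
    """Drop every character whose Unicode category starts with "P" (punctuation)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def normalize_text(text: str) -> str:
    """Approximate the normalized preset: lowercase, strip punctuation, collapse spaces."""
    return " ".join(remove_punctuation(text.lower()).split())


def with_normalized_variant(result: dict) -> list[dict]:
    """Turn one transcription result into a raw + normalized pair."""
    raw = {**result, "text_norm": "raw"}
    normalized = {
        **result,
        "text_norm": "normalized",
        "reference": normalize_text(result["reference"]),
        "hypothesis": normalize_text(result["hypothesis"]),
    }
    return [raw, normalized]
```

Normalizing reference and hypothesis with the same function is what keeps the normalized scores comparable across engines that differ only in casing or punctuation style.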
- wav2vec2 engine: Output normalized to lowercase (was outputting uppercase)
0.8.0 - 2026-02-04
- Modular ASR Engine System (`src/asr_lab/engines/`)
- Abstract base class `ASREngine` for a unified engine interface
- Whisper engine (OpenAI)
- Vosk engine (offline recognition)
- Wav2Vec2 engine (Meta)
- Seamless M4T engine (Meta)
- Extensible API system (`engines/api/`)
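A unified `ASREngine` interface might look like the following sketch, in which engines register themselves by name when subclassed. Several pieces are assumptions for illustration: the method names `load` and `transcribe`, the `Transcription` dataclass (a stand-in for the project's Pydantic `TranscriptionResult`), the `__init_subclass__` registration, and the toy `EchoEngine`.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass

# Hypothetical name -> class registry, filled automatically by subclassing.
ENGINE_REGISTRY: dict[str, type[ASREngine]] = {}


@dataclass
class Transcription:
    """Hypothetical stand-in for the project's TranscriptionResult model."""
    text: str
    engine: str
    processing_time: float = 0.0


class ASREngine(ABC):
    """Interface every engine implements (method names are assumptions)."""

    name: str = "base"

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        ENGINE_REGISTRY[cls.name] = cls  # register each engine class by name on definition

    @abstractmethod
    def load(self) -> None:
        """Load model weights or other heavy resources."""

    @abstractmethod
    def transcribe(self, audio_path: str, language: str | None = None) -> Transcription:
        """Transcribe one audio file."""


class EchoEngine(ASREngine):
    """Toy engine that only demonstrates the contract: it 'transcribes' the file name."""

    name = "echo"

    def load(self) -> None:
        pass

    def transcribe(self, audio_path: str, language: str | None = None) -> Transcription:
        return Transcription(text=audio_path.rsplit("/", 1)[-1], engine=self.name)
```

Because `ASREngine` is abstract, it cannot be instantiated. Every registered engine is therefore guaranteed to implement both methods.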
- Word Error Rate (WER)
- Character Error Rate (CER)
- Match Error Rate (MER)
- Word Information Lost (WIL)
- Word Information Preserved (WIP)
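These five metrics all derive from one Levenshtein alignment between reference and hypothesis. Let H be hits, S substitutions, D deletions and I insertions. The usual definitions are:

- WER = (S + D + I) / (H + S + D)
- MER = (S + D + I) / (H + S + D + I)
- WIP = H / (H + S + D) × H / (H + S + I)
- WIL = 1 − WIP

CER is WER computed over characters instead of words. The stdlib sketch below implements those definitions. It is not jiwer's code, it assumes a non-empty reference, and it breaks alignment ties in one fixed order, so MER, WIL and WIP can differ from jiwer on ambiguous alignments.

```python
def align_counts(reference: list[str], hypothesis: list[str]) -> dict[str, int]:
    """Count hits, substitutions, deletions and insertions via Levenshtein alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = edit distance between reference[:i] and hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back from the bottom-right corner, classifying each step.
    counts = {"hits": 0, "sub": 0, "del": 0, "ins": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1]):
            counts["hits" if reference[i - 1] == hypothesis[j - 1] else "sub"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["del"] += 1
            i -= 1
        else:
            counts["ins"] += 1
            j -= 1
    return counts


def error_rates(reference: str, hypothesis: str) -> dict[str, float]:
    """Word-level WER, MER, WIL and WIP plus character-level CER (reference must be non-empty)."""
    c = align_counts(reference.split(), hypothesis.split())
    h, s, d, i = c["hits"], c["sub"], c["del"], c["ins"]
    n_ref, n_hyp = h + s + d, h + s + i
    wip = (h / n_ref) * (h / n_hyp) if n_ref and n_hyp else 0.0
    chars = align_counts(list(reference), list(hypothesis))
    return {
        "wer": (s + d + i) / n_ref,
        "mer": (s + d + i) / (h + s + d + i),
        "wip": wip,
        "wil": 1 - wip,
        "cer": (chars["sub"] + chars["del"] + chars["ins"]) / len(reference),
    }
```

For example, "the cat sat on the mat" against "the cat sit on mat" aligns as H = 4, S = 1, D = 1, I = 0. That gives WER = 2/6 and WIP = (4/6)(4/5).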
- Audio degradation via VST plugins
- Audio enhancement: Demucs, DeepFilterNet
- Audio loader, normalization (EBU R128), processor, segmentation
- BenchmarkRunner, DataManager, Dataset, Experiment
- YAML-based configuration loader
- Dynamic engine and metric registries
- Result storage, aggregation, export (JSON, CSV)
- Interactive HTML reports (Plotly)
- Jinja2 templates
- Click-based command-line interface
- Pydantic models: EngineConfig, TranscriptionResult
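EBU R128 loudness normalization reduces to a gain. Measure the integrated loudness in LUFS, then scale the signal by the difference from the target. Below is a minimal sketch of that last step, assuming the measurement is already available. Measuring itself requires ITU-R BS.1770 K-weighting and gating, which libraries such as pyloudnorm implement, and is not shown. The -23 LUFS default is the EBU R128 programme target; whether this project uses that target is an assumption. Hard clipping is a simplification where a limiter might be used instead.

```python
EBU_R128_TARGET_LUFS = -23.0  # EBU R128 programme loudness target


def loudness_gain(measured_lufs: float, target_lufs: float = EBU_R128_TARGET_LUFS) -> float:
    """Linear amplitude factor that moves the measured integrated loudness to the target."""
    gain_db = target_lufs - measured_lufs  # a difference of 1 LU corresponds to 1 dB of gain
    return 10 ** (gain_db / 20)


def apply_gain(samples: list[float], factor: float) -> list[float]:
    """Scale float samples and hard-clip them to [-1, 1]."""
    return [max(-1.0, min(1.0, s * factor)) for s in samples]
```

For instance, a clip measured at -29 LUFS needs +6 dB, a linear factor of about 1.995.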
- NeMo engine: does not work on Windows (SIGKILL)
- HuBERT engine: uses Wav2Vec2 tokenizer fallback
- NeMo: `signal.SIGKILL` not available on Windows
- HuBERT: uses Wav2Vec2 tokenizer fallback
- HuBERT: outputs uppercase text (fixed in 1.0.0: lowercase applied)
- No text normalization before metric computation (fixed in 1.0.0)
- Multi-language support (en, fr, de, es)
- YAML configuration
- GPU acceleration (CUDA)
- Batch processing
- Plugin architecture
- Python 3.12+
- PyTorch backend
- Type hints