Commit f4f8fcd
Author: skitsanos

feat: Auto-detect Moonshine & SenseVoice models, update docs

sherpa-onnx engine now auto-detects model architecture from files:

- Whisper (encoder + decoder ONNX)
- Moonshine (preprocess + encode + cached/uncached decode)
- SenseVoice (single model.onnx)

Also: glob-based model resolver, SenseVoice language config, fix whisper.cpp `set_detect_language` bug, comprehensive doc updates with benchmark results across all engines.

1 parent 2f06dab · commit f4f8fcd

File tree: 9 files changed (+252, -42 lines)

Cargo.lock

Lines changed: 1 addition & 1 deletion
(generated file; diff not rendered)

README.md

Lines changed: 14 additions & 3 deletions
@@ -17,6 +17,9 @@ Accepts any audio or video format — FFmpeg handles conversion automatically.
 # Build (reads SHERPA_ONNX_LIB_DIR from .env automatically via build.rs)
 cargo build --release

+# Build without sherpa-onnx (no shared library dependency needed)
+cargo build --release --no-default-features
+
 # Download a GGML model (default format, for --provider local)
 transcribeit download-model -s base

@@ -29,9 +32,15 @@ transcribeit list-models
 # Transcribe with local whisper.cpp (model alias resolves from MODEL_CACHE_DIR)
 transcribeit run -i recording.mp3 -m base

-# Transcribe with sherpa-onnx (auto-segments at ≤30s boundaries)
+# Transcribe with sherpa-onnx Whisper (auto-segments at ≤30s boundaries)
 transcribeit run -p sherpa-onnx -i recording.mp3 -m base

+# Transcribe with sherpa-onnx Moonshine (auto-detected from model files)
+transcribeit run -p sherpa-onnx -i recording.mp3 -m moonshine-base
+
+# Transcribe with sherpa-onnx SenseVoice (auto-detected from model files)
+transcribeit run -p sherpa-onnx -i recording.mp3 -m sense-voice
+
 # Or pass an explicit model path
 transcribeit run -i recording.mp3 -m .cache/ggml-base.bin

@@ -59,11 +68,13 @@ transcribeit run -i recording.wav -m base --language en --normalize

 - **Any input format** — MP3, MP4, WAV, FLAC, OGG, etc. FFmpeg converts to mono 16kHz WAV automatically.
 - **4 providers** — Local whisper.cpp, sherpa-onnx, OpenAI API, Azure OpenAI. Extensible via the `Transcriber` trait.
-- **Model aliases** — `-m base`, `-m tiny`, etc. resolve from `MODEL_CACHE_DIR` for both `local` and `sherpa-onnx` providers.
+- **3 model architectures via sherpa-onnx** — Whisper, Moonshine, and SenseVoice are auto-detected from the model directory contents. Just point `--model` at any supported model directory.
+- **Model aliases** — `-m base`, `-m tiny`, etc. resolve from `MODEL_CACHE_DIR` for both `local` and `sherpa-onnx` providers. The sherpa-onnx resolver also supports glob matching (e.g., `-m moonshine-base`, `-m sense-voice`).
 - **Language hinting** — Pass `--language` to force local and API transcription language.
 - **FFmpeg audio normalization** — Optional `--normalize` to apply loudnorm before transcription.
 - **Silence-based segmentation** — Splits long audio at silence boundaries for better accuracy and API compatibility.
 - **sherpa-onnx auto-segmentation** — Whisper ONNX models only support ≤30s per call; segmentation is enabled automatically.
+- **sherpa-onnx is optional** — Enabled by default as a Cargo feature. Build without it: `cargo build --no-default-features`.
 - **Auto-split for API limits** — Files exceeding 25MB are automatically segmented when using remote providers.
 - **Progress spinner** — Shows live terminal feedback during transcription (single file and segmented mode).
 - **Parallel API segment transcription** — Multiple segment requests can be processed concurrently with `--segment-concurrency`.

@@ -101,4 +112,4 @@ See the [docs](docs/) folder for detailed documentation:
 - [CLI Reference](docs/cli-reference.md) — All commands, options, and examples
 - [Provider behavior](docs/provider-behavior.md) — OpenAI vs Azure argument differences
 - [Troubleshooting](docs/troubleshooting.md) — Common setup/runtime issues and fixes
-- [Performance benchmarks](docs/performance-benchmarks.md) — Reproducible measurement plan and templates
+- [Performance benchmarks](docs/performance-benchmarks.md) — Measurement plan, reference results, and templates
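The ≤30s auto-segmentation behavior listed in the features above can be sketched as a small helper. This is an illustrative sketch only; `Provider` and `effective_max_segment_secs` are hypothetical names, not the actual transcribeit API:

```rust
// Hypothetical sketch of the <=30s cap the sherpa-onnx provider applies.
// `Provider` and `effective_max_segment_secs` are illustrative names.
#[allow(dead_code)]
enum Provider {
    Local,
    SherpaOnnx,
    OpenAi,
    Azure,
}

/// Whisper ONNX models accept at most 30 s of audio per call, so the
/// sherpa-onnx provider clamps any user-supplied segment length to 30.
fn effective_max_segment_secs(provider: &Provider, requested: u32) -> u32 {
    match provider {
        Provider::SherpaOnnx => requested.min(30),
        _ => requested,
    }
}

fn main() {
    // A user asking for 60 s segments still gets 30 s with sherpa-onnx,
    // while the local provider keeps the requested value.
    println!("{}", effective_max_segment_secs(&Provider::SherpaOnnx, 60));
    println!("{}", effective_max_segment_secs(&Provider::Local, 60));
}
```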

docs/architecture.md

Lines changed: 30 additions & 4 deletions
@@ -19,7 +19,7 @@ src/
 │   └── manifest.rs      # JSON manifest writer
 └── engines/
     ├── whisper_local.rs # Local whisper.cpp via whisper-rs
-    ├── sherpa_onnx.rs   # Local sherpa-onnx engine (Whisper ONNX models)
+    ├── sherpa_onnx.rs   # Local sherpa-onnx engine (auto-detects Whisper, Moonshine, SenseVoice)
     ├── openai_api.rs    # OpenAI-compatible REST API
     ├── azure_openai.rs  # Azure OpenAI REST API
     ├── rate_limit.rs    # Retry logic and 429 handling

@@ -114,17 +114,35 @@ Caches whether the endpoint supports `verbose_json` via an `AtomicU8` flag to sk

 ### Sherpa-ONNX (`sherpa_onnx.rs`)

-Local inference using [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) with Whisper ONNX models. Uses a **dedicated worker thread pattern**: the `OfflineRecognizer` is created on a plain `std::thread` (not on the Tokio runtime) and stays there for its entire lifetime. Transcription requests are sent to the thread via an `mpsc` channel and results come back through `tokio::sync::oneshot` channels. This design avoids:
+Local inference using [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) with automatic model architecture detection. Uses a **dedicated worker thread pattern**: the `OfflineRecognizer` is created on a plain `std::thread` (not on the Tokio runtime) and stays there for its entire lifetime. Transcription requests are sent to the thread via an `mpsc` channel and results come back through `tokio::sync::oneshot` channels. This design avoids:

 - Blocking the async runtime during inference.
 - Thread-safety issues with the C FFI recognizer, which is neither `Send` nor `Sync`.

 Model initialization also happens on the worker thread, with errors propagated back through a sync channel so callers get a clear error if the model directory is invalid.

-The engine prefers `int8` quantized ONNX files when available (`encoder.int8.onnx`, `decoder.int8.onnx`) for lower memory usage, falling back to full-precision variants. A `tokens.txt` file must be present in the model directory.
+#### Auto-detected model architectures
+
+The engine auto-detects the model architecture by inspecting the files present in the model directory:
+
+| Architecture | Required files | Config used |
+|---|---|---|
+| **Whisper** | `encoder.onnx` + `decoder.onnx` | `OfflineWhisperModelConfig` |
+| **Moonshine** | `preprocess.onnx` + `encode.onnx` + `uncached_decode.onnx` + `cached_decode.onnx` | `OfflineMoonshineModelConfig` |
+| **SenseVoice** | `model.onnx` (single file) | `OfflineSenseVoiceModelConfig` |
+
+All architectures also require a `tokens.txt` (or `*-tokens.txt`) file in the model directory. The engine prefers `int8` quantized ONNX files when available (e.g., `encoder.int8.onnx`) for lower memory usage, falling back to full-precision variants.
+
+The model resolver supports glob-based directory matching, so you can use partial names like `-m moonshine-base` or `-m sense-voice` to find models in the cache directory.
+
+**SenseVoice limitation:** SenseVoice models can detect emotions and audio events (laughter, applause, music), but these tags are stripped by the sherpa-onnx C API and are not available in the transcription output.

 Whisper ONNX models only support audio chunks of 30 seconds or less, so the pipeline automatically enables segmentation and caps `--max-segment-secs` at 30 when using this provider.

+#### C++ stderr suppression
+
+During `recognizer.decode()`, the sherpa-onnx C++ library prints warnings to stderr. The engine temporarily redirects stderr to `/dev/null` via `libc::dup`/`dup2` during decode calls and restores it immediately after, keeping the terminal output clean.
+
 ### Rate limiting (`rate_limit.rs`)

 Shared retry logic for both API engines. On 429 responses:

@@ -149,7 +167,7 @@ Both API engines can send file uploads directly and choose the correct container

 ## Build requirements

-The `sherpa-onnx` crate requires the sherpa-onnx shared libraries at both compile time and runtime. The `build.rs` script loads a `.env` file and reads `SHERPA_ONNX_LIB_DIR` to configure the linker search path and embed an `rpath` so the binary can find the dylibs at runtime.
+The `sherpa-onnx` Cargo feature is **enabled by default**. It requires the sherpa-onnx shared libraries at both compile time and runtime. The `build.rs` script loads a `.env` file and reads `SHERPA_ONNX_LIB_DIR` to configure the linker search path and embed an `rpath` so the binary can find the dylibs at runtime.

 Set `SHERPA_ONNX_LIB_DIR` in your `.env` file or environment before building:

@@ -158,6 +176,14 @@ Set `SHERPA_ONNX_LIB_DIR` in your `.env` file or environment before building:
 SHERPA_ONNX_LIB_DIR=/path/to/sherpa-onnx/lib
 ```

+To build without the sherpa-onnx dependency entirely:
+
+```bash
+cargo build --release --no-default-features
+```
+
+This removes the sherpa-onnx provider and eliminates the need for `SHERPA_ONNX_LIB_DIR`.
+
 ## Adding a new engine

 1. Create `src/engines/your_engine.rs`
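The file-based detection rules this commit documents for `sherpa_onnx.rs` can be sketched as a small function. This is an illustrative sketch under the documented file rules only; `ModelArch` and `detect_arch` are hypothetical names, and the real engine additionally wires up the matching `Offline*ModelConfig` and handles `tokens.txt` resolution:

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum ModelArch {
    Whisper,
    Moonshine,
    SenseVoice,
}

/// Detect the model architecture from the files present in `dir`.
/// A file counts as present if either the full-precision or the
/// int8-quantized variant exists (e.g. encoder.onnx / encoder.int8.onnx).
fn detect_arch(dir: &Path) -> Option<ModelArch> {
    let has = |stem: &str| {
        dir.join(format!("{stem}.onnx")).exists()
            || dir.join(format!("{stem}.int8.onnx")).exists()
    };
    if has("encoder") && has("decoder") {
        Some(ModelArch::Whisper)
    } else if has("preprocess")
        && has("encode")
        && has("uncached_decode")
        && has("cached_decode")
    {
        Some(ModelArch::Moonshine)
    } else if has("model") {
        Some(ModelArch::SenseVoice)
    } else {
        None
    }
}

fn main() {
    // A directory with encoder.onnx + decoder.onnx would yield
    // Some(Whisper); a directory with no recognizable files yields None.
    println!("{:?}", detect_arch(Path::new("/nonexistent-model-dir")));
}
```

Checking Whisper first matters: a Whisper directory could in principle also contain an unrelated `model.onnx`, and the more specific file sets should win over the single-file SenseVoice rule.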

docs/cli-reference.md

Lines changed: 16 additions & 4 deletions
@@ -60,9 +60,15 @@ Model aliases auto-resolve from the `MODEL_CACHE_DIR` cache directory (default `

 | Option | Description | Default |
 |--------|-------------|---------|
-| `-m, --model` | Path to ONNX model directory or alias (e.g. `tiny`, `base.en`) | required |
+| `-m, --model` | Path to ONNX model directory or partial name (e.g. `tiny`, `base.en`, `moonshine-base`, `sense-voice`) | required |

-The model directory must contain `encoder.onnx` (or `encoder.int8.onnx`), `decoder.onnx` (or `decoder.int8.onnx`), and `tokens.txt`. When an alias like `base.en` is given, the cache is searched for a directory named `sherpa-onnx-whisper-base.en` under `MODEL_CACHE_DIR`.
+The engine auto-detects the model architecture from files in the directory:
+
+- **Whisper** -- `encoder.onnx` + `decoder.onnx` (or int8 variants) + `tokens.txt`
+- **Moonshine** -- `preprocess.onnx` + `encode.onnx` + `uncached_decode.onnx` + `cached_decode.onnx` + `tokens.txt`
+- **SenseVoice** -- `model.onnx` + `tokens.txt`
+
+When an alias like `base.en` is given, the cache is searched for a directory named `sherpa-onnx-whisper-base.en` under `MODEL_CACHE_DIR`. The resolver also supports glob matching, so partial names like `-m moonshine-base` or `-m sense-voice` will match any directory in the cache containing that string.

 Sherpa-ONNX automatically enables segmentation and caps segment length at 30 seconds due to the Whisper ONNX model limitation.

@@ -176,10 +182,16 @@ transcribeit run -i recording.mp3 -m base
 transcribeit run -i recording.mp3 -m .cache/ggml-base.bin
 transcribeit run -i meeting.mp4 -m .cache/ggml-small.en.bin

-# Process with sherpa-onnx provider (auto-segments at 30s)
+# Process with sherpa-onnx Whisper (auto-segments at 30s)
 transcribeit run -p sherpa-onnx -i recording.mp3 -m base.en
 transcribeit run -p sherpa-onnx -i lecture.mp4 -m tiny -f vtt -o ./output

+# Process with sherpa-onnx Moonshine (auto-detected from model files)
+transcribeit run -p sherpa-onnx -i recording.mp3 -m moonshine-base
+
+# Process with sherpa-onnx SenseVoice (auto-detected from model files)
+transcribeit run -p sherpa-onnx -i recording.mp3 -m sense-voice
+
 # Process a directory
 transcribeit run --input samples/ --output-dir ./output

@@ -215,7 +227,7 @@ transcribeit run -p azure -i recording.wav \
 ### Provider behavior

 - **Local** (`-p local`) runs whisper.cpp in-process using GGML models.
-- **Sherpa-ONNX** (`-p sherpa-onnx`) runs sherpa-onnx in-process using Whisper ONNX models. Always auto-segments at 30s.
+- **Sherpa-ONNX** (`-p sherpa-onnx`) runs sherpa-onnx in-process. Auto-detects Whisper, Moonshine, and SenseVoice models from directory contents. Always auto-segments at 30s.
 - **OpenAI-compatible** (`-p openai`) uses `--remote-model` and calls `POST {base-url}/v1/audio/transcriptions`.
 - **Azure** (`-p azure`) uses `--azure-deployment` and calls:
   `POST {base-url}/openai/deployments/{deployment}/audio/transcriptions?api-version={version}`.
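The glob-style alias matching this commit adds to the CLI reference can be sketched roughly as follows. This is an illustrative sketch, not the project's actual resolver; `resolve_model` and the exact-match-first ordering are assumptions:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Resolve a model name against directories in the cache, mimicking
/// the documented behavior: try the exact `sherpa-onnx-whisper-<alias>`
/// form first, then fall back to any cached directory whose name
/// contains the given string (e.g. "moonshine-base", "sense-voice").
fn resolve_model(cache_dir: &Path, name: &str) -> Option<PathBuf> {
    let exact = cache_dir.join(format!("sherpa-onnx-whisper-{name}"));
    if exact.is_dir() {
        return Some(exact);
    }
    // Partial-name fallback: first directory whose name contains `name`.
    fs::read_dir(cache_dir)
        .ok()?
        .filter_map(|entry| entry.ok())
        .map(|entry| entry.path())
        .find(|path| {
            path.is_dir()
                && path
                    .file_name()
                    .and_then(|n| n.to_str())
                    .map_or(false, |n| n.contains(name))
        })
}

fn main() {
    // With an empty or missing cache, resolution fails cleanly.
    println!("{:?}", resolve_model(Path::new("/no-such-cache"), "moonshine-base"));
}
```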

docs/performance-benchmarks.md

Lines changed: 26 additions & 1 deletion
@@ -27,9 +27,15 @@ time transcribeit run -i <input_file> -m base -f text -o ./output
 time transcribeit run -i <input_file> -m small -f text -o ./output
 time transcribeit run -i <input_file> -m small.en -f text -o ./output

-# sherpa-onnx (ONNX) — auto-segments at 30s
+# sherpa-onnx Whisper (ONNX) — auto-segments at 30s
 time transcribeit run -p sherpa-onnx -i <input_file> -m base -f text -o ./output
 time transcribeit run -p sherpa-onnx -i <input_file> -m small.en -f text -o ./output
+
+# sherpa-onnx Moonshine
+time transcribeit run -p sherpa-onnx -i <input_file> -m moonshine-base -f text -o ./output
+
+# sherpa-onnx SenseVoice
+time transcribeit run -p sherpa-onnx -i <input_file> -m sense-voice -f text -o ./output
 ```

 Record:

@@ -92,6 +98,25 @@ Output size: 4.6 MB

 Keep rows in a simple table (date + commit hash + environment + results) in your preferred tracker so regressions are easy to catch.

+## Reference benchmark results
+
+These results were measured on a 5-minute medical interview recording.
+
+| Engine / Model | Wall clock | Realtime factor | Notes |
+|---|---|---|---|
+| Local whisper.cpp `base` | 3.6s | 83x RT | Best speed/quality trade-off |
+| SenseVoice 2024 | 6.6s | 46x RT | Good quality, 50+ languages |
+| Sherpa-ONNX Whisper `base` | 10.9s | 27x RT | |
+| Moonshine `base` | 14.1s | 21x RT | |
+| Local whisper.cpp `large-v3-turbo` | 33.7s | 8.9x RT | Highest transcription quality |
+| Sherpa-ONNX Whisper `turbo` | 47.2s | 6.4x RT | |
+
+**Notes:**
+- Local whisper.cpp (GGML) is consistently the fastest engine for a given model size.
+- SenseVoice 2024 offers excellent speed with good quality. **Avoid the SenseVoice 2025 model** -- it is a regression in quality.
+- Moonshine provides a compact alternative but is slower than Whisper at the same size tier.
+- For highest quality where speed is not critical, use `large-v3-turbo` with local whisper.cpp.
+
 ## CI/automatable baseline

 For now, treat these as manual benchmarks in a fixed environment.
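For reference, "realtime factor" in the benchmark table is simply audio duration divided by wall-clock time. Assuming the clip is exactly 300 s (it is described only as "5-minute", so the real length may differ slightly), the factors recompute to within a rounding step of the table's values:

```rust
/// Realtime factor: seconds of audio transcribed per second of wall time.
fn realtime_factor(audio_secs: f64, wall_secs: f64) -> f64 {
    audio_secs / wall_secs
}

fn main() {
    // Wall-clock times from the reference table, assuming a 300 s clip.
    let rows = [
        ("whisper.cpp base", 3.6),
        ("SenseVoice 2024", 6.6),
        ("sherpa-onnx Whisper base", 10.9),
        ("whisper.cpp large-v3-turbo", 33.7),
    ];
    for (name, wall) in rows {
        println!("{name}: {:.1}x RT", realtime_factor(300.0, wall));
    }
}
```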

docs/provider-behavior.md

Lines changed: 12 additions & 4 deletions
@@ -14,14 +14,22 @@ This project supports four providers. They share the same input/output surface,
 ## Sherpa-ONNX (`-p sherpa-onnx`)

 - Input audio/video is converted with FFmpeg to 16 kHz mono WAV.
+- The engine **auto-detects model architecture** from the files in the model directory:
+  - **Whisper** -- `encoder.onnx` + `decoder.onnx` + `tokens.txt`
+  - **Moonshine** -- `preprocess.onnx` + `encode.onnx` + `uncached_decode.onnx` + `cached_decode.onnx` + `tokens.txt`
+  - **SenseVoice** -- `model.onnx` + `tokens.txt`
 - Model loading uses `--model` resolved from:
-  - explicit filesystem path to a directory containing `encoder.onnx`, `decoder.onnx`, and `tokens.txt`
-  - or cache alias (`tiny`, `base.en`, `small`, etc.) resolved under `MODEL_CACHE_DIR` as `sherpa-onnx-whisper-<alias>/`.
+  - explicit filesystem path to a model directory
+  - cache alias (`tiny`, `base.en`, `small`, etc.) resolved under `MODEL_CACHE_DIR` as `sherpa-onnx-whisper-<alias>/`
+  - or glob-based partial name matching (e.g., `-m moonshine-base`, `-m sense-voice`) against directories in `MODEL_CACHE_DIR`.
 - The engine prefers `int8` quantized ONNX files when available for lower memory usage.
 - Transcription runs in-process on a dedicated worker thread using the sherpa-onnx C library via FFI.
+- C++ stderr warnings from the sherpa-onnx library are suppressed during inference to keep terminal output clean.
 - Whisper ONNX models only support audio of 30 seconds or less per call. The pipeline automatically enables segmentation and caps `--max-segment-secs` at 30, regardless of user-supplied values.
+- **SenseVoice limitation:** emotion and audio event detection tags are stripped by the sherpa-onnx C API and are not available in the output.
 - Segment concurrency is always 1 (sequential processing).
 - No external API key is required.
+- The `sherpa-onnx` feature is enabled by default. Build without it using `cargo build --no-default-features`.
 - Requires `SHERPA_ONNX_LIB_DIR` to be set at build time (see [Architecture](architecture.md#build-requirements)).

 ## OpenAI-compatible (`-p openai`)

@@ -57,8 +65,8 @@ This project supports four providers. They share the same input/output surface,

 Both are local engines that run without network access. They differ in the model format and inference backend:

-- **Local** uses GGML models via `whisper.cpp` (`whisper-rs` binding). Supports all model sizes.
-- **Sherpa-ONNX** uses ONNX models via the `sherpa-onnx` C library. Supports all sizes except `large-v3`. Requires auto-segmentation at 30s due to Whisper ONNX limitations.
+- **Local** uses GGML models via `whisper.cpp` (`whisper-rs` binding). Supports all Whisper model sizes.
+- **Sherpa-ONNX** uses ONNX models via the `sherpa-onnx` C library. Supports three model architectures (Whisper, Moonshine, SenseVoice) with automatic detection. Whisper ONNX supports all sizes except `large-v3`. Requires auto-segmentation at 30s due to Whisper ONNX limitations. The `sherpa-onnx` feature is optional (enabled by default); build without it using `cargo build --no-default-features`.

 ### OpenAI vs Azure

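The stderr suppression mentioned for this provider works by temporarily swapping file descriptor 2. A minimal sketch, assuming a Unix target; the POSIX calls are declared directly here so the snippet is dependency-free, whereas the engine itself goes through the `libc` crate:

```rust
// Sketch of suppressing C-library stderr output around a call,
// assuming a Unix target. Illustrative only; error handling omitted.
use std::ffi::CString;
use std::os::raw::c_char;

extern "C" {
    fn dup(fd: i32) -> i32;
    fn dup2(oldfd: i32, newfd: i32) -> i32;
    fn open(path: *const c_char, flags: i32) -> i32;
    fn close(fd: i32) -> i32;
}

const STDERR_FD: i32 = 2;
const O_WRONLY: i32 = 1;

/// Run `f` with stderr redirected to /dev/null, restoring it afterwards.
fn with_stderr_suppressed<T>(f: impl FnOnce() -> T) -> T {
    unsafe {
        let saved = dup(STDERR_FD); // remember the real stderr
        let devnull = CString::new("/dev/null").unwrap();
        let null_fd = open(devnull.as_ptr(), O_WRONLY);
        dup2(null_fd, STDERR_FD); // stderr now points at /dev/null
        close(null_fd);
        let out = f(); // noisy C++ decode call would run here
        dup2(saved, STDERR_FD); // restore the original stderr
        close(saved);
        out
    }
}

fn main() {
    with_stderr_suppressed(|| eprintln!("this warning is swallowed"));
    eprintln!("stderr is back");
    println!("done");
}
```

Restoring from a saved duplicate (rather than reopening the terminal) keeps the original stderr destination intact even when it is a pipe or a file.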
docs/troubleshooting.md

Lines changed: 25 additions & 2 deletions
@@ -43,12 +43,28 @@ cargo build --release
 Symptoms:
 - `ONNX model not found for '<name>'`
 - `encoder.onnx (or encoder.int8.onnx) not found in ...`
+- `Could not detect model architecture in ...`
 - `tokens.txt not found in ...`

 Fix:
-- Ensure the model directory contains `encoder.onnx` (or `encoder.int8.onnx`), `decoder.onnx` (or `decoder.int8.onnx`), and `tokens.txt` (or `*-tokens.txt`).
-- Download ONNX models with: `transcribeit download-model -f onnx -s <size>`
+- The sherpa-onnx engine auto-detects the model architecture. Ensure the model directory contains the correct files for one of:
+  - **Whisper:** `encoder.onnx` + `decoder.onnx` (or int8 variants) + `tokens.txt`
+  - **Moonshine:** `preprocess.onnx` + `encode.onnx` + `uncached_decode.onnx` + `cached_decode.onnx` + `tokens.txt`
+  - **SenseVoice:** `model.onnx` + `tokens.txt`
+- Download Whisper ONNX models with: `transcribeit download-model -f onnx -s <size>`
+- For Moonshine and SenseVoice models, download from the [sherpa-onnx model releases](https://github.com/k2-fsa/sherpa-onnx/releases/tag/asr-models) and extract into `MODEL_CACHE_DIR`.
 - Verify with: `transcribeit list-models` (ONNX models appear with an `[onnx]` tag)
+- The model resolver supports partial name matching (e.g., `-m moonshine-base`, `-m sense-voice`).
+
+### Building without sherpa-onnx
+
+If you do not need the sherpa-onnx provider and want to avoid installing the shared libraries:
+
+```bash
+cargo build --release --no-default-features
+```
+
+This disables the `sherpa-onnx` Cargo feature (which is enabled by default) and removes the dependency on `SHERPA_ONNX_LIB_DIR`.

 ### Model download fails

@@ -140,10 +156,17 @@ Fix:
 Common causes:
 - Language mismatch (auto-detection failed on very short clips)
 - Excessive background noise
+- Previously, a `whisper-rs` bug with `set_detect_language(true)` caused 0 segments when `--language` was not specified. This has been fixed; if you encounter this on an older build, rebuild with the latest code.

 Fix:
 - Provide `--language` hint (for example `--language en`).
 - Use `--segment` and tune silence thresholds:
   - raise (less negative) `--silence-threshold` for more aggressive splits
   - lower `--min-silence-duration` for noisy recordings
 - Try the same file with a different model (for example `base.en`, `small`, `small.en`).
+
+### SenseVoice emotion/event tags missing
+
+SenseVoice models are capable of detecting emotions and audio events (laughter, applause, music, etc.), but the sherpa-onnx C API strips these tags from the output. Only the transcription text is available. This is a limitation of the sherpa-onnx C-level bindings, not of transcribeit.
+
+Additionally, the SenseVoice 2025 model is a quality regression compared to the 2024 version. Prefer using the 2024 SenseVoice model for best results.
