Commit f4f8fcd
Author: skitsanos

feat: Auto-detect Moonshine & SenseVoice models, update docs

sherpa-onnx engine now auto-detects model architecture from files:

- Whisper (encoder + decoder ONNX)
- Moonshine (preprocess + encode + cached/uncached decode)
- SenseVoice (single model.onnx)

Also: glob-based model resolver, SenseVoice language config, fix whisper.cpp `set_detect_language` bug, comprehensive doc updates with benchmark results across all engines.

1 parent 2f06dab · commit f4f8fcd

File tree: 9 files changed (+252, -42 lines)

Cargo.lock

Lines changed: 1 addition & 1 deletion
(generated file; diff not rendered)

README.md

Lines changed: 14 additions & 3 deletions
@@ -17,6 +17,9 @@ Accepts any audio or video format — FFmpeg handles conversion automatically.
 # Build (reads SHERPA_ONNX_LIB_DIR from .env automatically via build.rs)
 cargo build --release

+# Build without sherpa-onnx (no shared library dependency needed)
+cargo build --release --no-default-features
+
 # Download a GGML model (default format, for --provider local)
 transcribeit download-model -s base

@@ -29,9 +32,15 @@ transcribeit list-models
 # Transcribe with local whisper.cpp (model alias resolves from MODEL_CACHE_DIR)
 transcribeit run -i recording.mp3 -m base

-# Transcribe with sherpa-onnx (auto-segments at ≤30s boundaries)
+# Transcribe with sherpa-onnx Whisper (auto-segments at ≤30s boundaries)
 transcribeit run -p sherpa-onnx -i recording.mp3 -m base

+# Transcribe with sherpa-onnx Moonshine (auto-detected from model files)
+transcribeit run -p sherpa-onnx -i recording.mp3 -m moonshine-base
+
+# Transcribe with sherpa-onnx SenseVoice (auto-detected from model files)
+transcribeit run -p sherpa-onnx -i recording.mp3 -m sense-voice
+
 # Or pass an explicit model path
 transcribeit run -i recording.mp3 -m .cache/ggml-base.bin

@@ -59,11 +68,13 @@ transcribeit run -i recording.wav -m base --language en --normalize

 - **Any input format** — MP3, MP4, WAV, FLAC, OGG, etc. FFmpeg converts to mono 16kHz WAV automatically.
 - **4 providers** — Local whisper.cpp, sherpa-onnx, OpenAI API, Azure OpenAI. Extensible via the `Transcriber` trait.
-- **Model aliases** — `-m base`, `-m tiny`, etc. resolve from `MODEL_CACHE_DIR` for both `local` and `sherpa-onnx` providers.
+- **3 model architectures via sherpa-onnx** — Whisper, Moonshine, and SenseVoice are auto-detected from the model directory contents. Just point `--model` at any supported model directory.
+- **Model aliases** — `-m base`, `-m tiny`, etc. resolve from `MODEL_CACHE_DIR` for both `local` and `sherpa-onnx` providers. The sherpa-onnx resolver also supports glob matching (e.g., `-m moonshine-base`, `-m sense-voice`).
 - **Language hinting** — Pass `--language` to force local and API transcription language.
 - **FFmpeg audio normalization** — Optional `--normalize` to apply loudnorm before transcription.
 - **Silence-based segmentation** — Splits long audio at silence boundaries for better accuracy and API compatibility.
 - **sherpa-onnx auto-segmentation** — Whisper ONNX models only support ≤30s per call; segmentation is enabled automatically.
+- **sherpa-onnx is optional** — Enabled by default as a Cargo feature. Build without it: `cargo build --no-default-features`.
 - **Auto-split for API limits** — Files exceeding 25MB are automatically segmented when using remote providers.
 - **Progress spinner** — Shows live terminal feedback during transcription (single file and segmented mode).
 - **Parallel API segment transcription** — Multiple segment requests can be processed concurrently with `--segment-concurrency`.

@@ -101,4 +112,4 @@ See the [docs](docs/) folder for detailed documentation:
 - [CLI Reference](docs/cli-reference.md) — All commands, options, and examples
 - [Provider behavior](docs/provider-behavior.md) — OpenAI vs Azure argument differences
 - [Troubleshooting](docs/troubleshooting.md) — Common setup/runtime issues and fixes
-- [Performance benchmarks](docs/performance-benchmarks.md) — Reproducible measurement plan and templates
+- [Performance benchmarks](docs/performance-benchmarks.md) — Measurement plan, reference results, and templates
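The ≤30s auto-segmentation behavior listed in the features above can be sketched as a small helper. This is an illustrative sketch only; `Provider` and `effective_max_segment_secs` are hypothetical names, not the actual transcribeit API:

```rust
// Hypothetical sketch of the <=30s cap the sherpa-onnx provider applies.
// `Provider` and `effective_max_segment_secs` are illustrative names.
#[allow(dead_code)]
enum Provider {
    Local,
    SherpaOnnx,
    OpenAi,
    Azure,
}

/// Whisper ONNX models accept at most 30 s of audio per call, so the
/// sherpa-onnx provider clamps any user-supplied segment length to 30.
fn effective_max_segment_secs(provider: &Provider, requested: u32) -> u32 {
    match provider {
        Provider::SherpaOnnx => requested.min(30),
        _ => requested,
    }
}

fn main() {
    // A user asking for 60 s segments still gets 30 s with sherpa-onnx,
    // while the local provider keeps the requested value.
    println!("{}", effective_max_segment_secs(&Provider::SherpaOnnx, 60));
    println!("{}", effective_max_segment_secs(&Provider::Local, 60));
}
```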

docs/architecture.md

Lines changed: 30 additions & 4 deletions
@@ -19,7 +19,7 @@ src/
 │   └── manifest.rs      # JSON manifest writer
 └── engines/
     ├── whisper_local.rs # Local whisper.cpp via whisper-rs
-    ├── sherpa_onnx.rs   # Local sherpa-onnx engine (Whisper ONNX models)
+    ├── sherpa_onnx.rs   # Local sherpa-onnx engine (auto-detects Whisper, Moonshine, SenseVoice)
     ├── openai_api.rs    # OpenAI-compatible REST API
     ├── azure_openai.rs  # Azure OpenAI REST API
     ├── rate_limit.rs    # Retry logic and 429 handling

@@ -114,17 +114,35 @@ Caches whether the endpoint supports `verbose_json` via an `AtomicU8` flag to sk

 ### Sherpa-ONNX (`sherpa_onnx.rs`)

-Local inference using [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) with Whisper ONNX models. Uses a **dedicated worker thread pattern**: the `OfflineRecognizer` is created on a plain `std::thread` (not on the Tokio runtime) and stays there for its entire lifetime. Transcription requests are sent to the thread via an `mpsc` channel and results come back through `tokio::sync::oneshot` channels. This design avoids:
+Local inference using [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) with automatic model architecture detection. Uses a **dedicated worker thread pattern**: the `OfflineRecognizer` is created on a plain `std::thread` (not on the Tokio runtime) and stays there for its entire lifetime. Transcription requests are sent to the thread via an `mpsc` channel and results come back through `tokio::sync::oneshot` channels. This design avoids:

 - Blocking the async runtime during inference.
 - Thread-safety issues with the C FFI recognizer, which is neither `Send` nor `Sync`.

 Model initialization also happens on the worker thread, with errors propagated back through a sync channel so callers get a clear error if the model directory is invalid.

-The engine prefers `int8` quantized ONNX files when available (`encoder.int8.onnx`, `decoder.int8.onnx`) for lower memory usage, falling back to full-precision variants. A `tokens.txt` file must be present in the model directory.
+#### Auto-detected model architectures
+
+The engine auto-detects the model architecture by inspecting the files present in the model directory:
+
+| Architecture | Required files | Config used |
+|---|---|---|
+| **Whisper** | `encoder.onnx` + `decoder.onnx` | `OfflineWhisperModelConfig` |
+| **Moonshine** | `preprocess.onnx` + `encode.onnx` + `uncached_decode.onnx` + `cached_decode.onnx` | `OfflineMoonshineModelConfig` |
+| **SenseVoice** | `model.onnx` (single file) | `OfflineSenseVoiceModelConfig` |
+
+All architectures also require a `tokens.txt` (or `*-tokens.txt`) file in the model directory. The engine prefers `int8` quantized ONNX files when available (e.g., `encoder.int8.onnx`) for lower memory usage, falling back to full-precision variants.
+
+The model resolver supports glob-based directory matching, so you can use partial names like `-m moonshine-base` or `-m sense-voice` to find models in the cache directory.
+
+**SenseVoice limitation:** SenseVoice models can detect emotions and audio events (laughter, applause, music), but these tags are stripped by the sherpa-onnx C API and are not available in the transcription output.

 Whisper ONNX models only support audio chunks of 30 seconds or less, so the pipeline automatically enables segmentation and caps `--max-segment-secs` at 30 when using this provider.

+#### C++ stderr suppression
+
+During `recognizer.decode()`, the sherpa-onnx C++ library prints warnings to stderr. The engine temporarily redirects stderr to `/dev/null` via `libc::dup`/`dup2` during decode calls and restores it immediately after, keeping the terminal output clean.
+
 ### Rate limiting (`rate_limit.rs`)

 Shared retry logic for both API engines. On 429 responses:

@@ -149,7 +167,7 @@ Both API engines can send file uploads directly and choose the correct container

 ## Build requirements

-The `sherpa-onnx` crate requires the sherpa-onnx shared libraries at both compile time and runtime. The `build.rs` script loads a `.env` file and reads `SHERPA_ONNX_LIB_DIR` to configure the linker search path and embed an `rpath` so the binary can find the dylibs at runtime.
+The `sherpa-onnx` Cargo feature is **enabled by default**. It requires the sherpa-onnx shared libraries at both compile time and runtime. The `build.rs` script loads a `.env` file and reads `SHERPA_ONNX_LIB_DIR` to configure the linker search path and embed an `rpath` so the binary can find the dylibs at runtime.

 Set `SHERPA_ONNX_LIB_DIR` in your `.env` file or environment before building:

@@ -158,6 +176,14 @@ Set `SHERPA_ONNX_LIB_DIR` in your `.env` file or environment before building:
 SHERPA_ONNX_LIB_DIR=/path/to/sherpa-onnx/lib
 ```

+To build without the sherpa-onnx dependency entirely:
+
+```bash
+cargo build --release --no-default-features
+```
+
+This removes the sherpa-onnx provider and eliminates the need for `SHERPA_ONNX_LIB_DIR`.
+
 ## Adding a new engine

 1. Create `src/engines/your_engine.rs`
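The file-based detection rules this commit documents for `sherpa_onnx.rs` can be sketched as a small function. This is an illustrative sketch under the documented file rules only; `ModelArch` and `detect_arch` are hypothetical names, and the real engine additionally wires up the matching `Offline*ModelConfig` and handles `tokens.txt` resolution:

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum ModelArch {
    Whisper,
    Moonshine,
    SenseVoice,
}

/// Detect the model architecture from the files present in `dir`.
/// A file counts as present if either the full-precision or the
/// int8-quantized variant exists (e.g. encoder.onnx / encoder.int8.onnx).
fn detect_arch(dir: &Path) -> Option<ModelArch> {
    let has = |stem: &str| {
        dir.join(format!("{stem}.onnx")).exists()
            || dir.join(format!("{stem}.int8.onnx")).exists()
    };
    if has("encoder") && has("decoder") {
        Some(ModelArch::Whisper)
    } else if has("preprocess")
        && has("encode")
        && has("uncached_decode")
        && has("cached_decode")
    {
        Some(ModelArch::Moonshine)
    } else if has("model") {
        Some(ModelArch::SenseVoice)
    } else {
        None
    }
}

fn main() {
    // A directory with encoder.onnx + decoder.onnx would yield
    // Some(Whisper); a directory with no recognizable files yields None.
    println!("{:?}", detect_arch(Path::new("/nonexistent-model-dir")));
}
```

Checking Whisper first matters: a Whisper directory could in principle also contain an unrelated `model.onnx`, and the more specific file sets should win over the single-file SenseVoice rule.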

docs/cli-reference.md

Lines changed: 16 additions & 4 deletions
@@ -60,9 +60,15 @@ Model aliases auto-resolve from the `MODEL_CACHE_DIR` cache directory (default `

 | Option | Description | Default |
 |--------|-------------|---------|
-| `-m, --model` | Path to ONNX model directory or alias (e.g. `tiny`, `base.en`) | required |
+| `-m, --model` | Path to ONNX model directory or partial name (e.g. `tiny`, `base.en`, `moonshine-base`, `sense-voice`) | required |

-The model directory must contain `encoder.onnx` (or `encoder.int8.onnx`), `decoder.onnx` (or `decoder.int8.onnx`), and `tokens.txt`. When an alias like `base.en` is given, the cache is searched for a directory named `sherpa-onnx-whisper-base.en` under `MODEL_CACHE_DIR`.
+The engine auto-detects the model architecture from files in the directory:
+
+- **Whisper** -- `encoder.onnx` + `decoder.onnx` (or int8 variants) + `tokens.txt`
+- **Moonshine** -- `preprocess.onnx` + `encode.onnx` + `uncached_decode.onnx` + `cached_decode.onnx` + `tokens.txt`
+- **SenseVoice** -- `model.onnx` + `tokens.txt`
+
+When an alias like `base.en` is given, the cache is searched for a directory named `sherpa-onnx-whisper-base.en` under `MODEL_CACHE_DIR`. The resolver also supports glob matching, so partial names like `-m moonshine-base` or `-m sense-voice` will match any directory in the cache containing that string.

 Sherpa-ONNX automatically enables segmentation and caps segment length at 30 seconds due to the Whisper ONNX model limitation.

@@ -176,10 +182,16 @@ transcribeit run -i recording.mp3 -m base
 transcribeit run -i recording.mp3 -m .cache/ggml-base.bin
 transcribeit run -i meeting.mp4 -m .cache/ggml-small.en.bin

-# Process with sherpa-onnx provider (auto-segments at 30s)
+# Process with sherpa-onnx Whisper (auto-segments at 30s)
 transcribeit run -p sherpa-onnx -i recording.mp3 -m base.en
 transcribeit run -p sherpa-onnx -i lecture.mp4 -m tiny -f vtt -o ./output

+# Process with sherpa-onnx Moonshine (auto-detected from model files)
+transcribeit run -p sherpa-onnx -i recording.mp3 -m moonshine-base
+
+# Process with sherpa-onnx SenseVoice (auto-detected from model files)
+transcribeit run -p sherpa-onnx -i recording.mp3 -m sense-voice
+
 # Process a directory
 transcribeit run --input samples/ --output-dir ./output

@@ -215,7 +227,7 @@ transcribeit run -p azure -i recording.wav \
 ### Provider behavior

 - **Local** (`-p local`) runs whisper.cpp in-process using GGML models.
-- **Sherpa-ONNX** (`-p sherpa-onnx`) runs sherpa-onnx in-process using Whisper ONNX models. Always auto-segments at 30s.
+- **Sherpa-ONNX** (`-p sherpa-onnx`) runs sherpa-onnx in-process. Auto-detects Whisper, Moonshine, and SenseVoice models from directory contents. Always auto-segments at 30s.
 - **OpenAI-compatible** (`-p openai`) uses `--remote-model` and calls `POST {base-url}/v1/audio/transcriptions`.
 - **Azure** (`-p azure`) uses `--azure-deployment` and calls:
   `POST {base-url}/openai/deployments/{deployment}/audio/transcriptions?api-version={version}`.
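The glob-style alias matching this commit adds to the CLI reference can be sketched roughly as follows. This is an illustrative sketch, not the project's actual resolver; `resolve_model` and the exact-match-first ordering are assumptions:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Resolve a model name against directories in the cache, mimicking
/// the documented behavior: try the exact `sherpa-onnx-whisper-<alias>`
/// form first, then fall back to any cached directory whose name
/// contains the given string (e.g. "moonshine-base", "sense-voice").
fn resolve_model(cache_dir: &Path, name: &str) -> Option<PathBuf> {
    let exact = cache_dir.join(format!("sherpa-onnx-whisper-{name}"));
    if exact.is_dir() {
        return Some(exact);
    }
    // Partial-name fallback: first directory whose name contains `name`.
    fs::read_dir(cache_dir)
        .ok()?
        .filter_map(|entry| entry.ok())
        .map(|entry| entry.path())
        .find(|path| {
            path.is_dir()
                && path
                    .file_name()
                    .and_then(|n| n.to_str())
                    .map_or(false, |n| n.contains(name))
        })
}

fn main() {
    // With an empty or missing cache, resolution fails cleanly.
    println!("{:?}", resolve_model(Path::new("/no-such-cache"), "moonshine-base"));
}
```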

docs/performance-benchmarks.md

Lines changed: 26 additions & 1 deletion
@@ -27,9 +27,15 @@ time transcribeit run -i <input_file> -m base -f text -o ./output
 time transcribeit run -i <input_file> -m small -f text -o ./output
 time transcribeit run -i <input_file> -m small.en -f text -o ./output

-# sherpa-onnx (ONNX) — auto-segments at 30s
+# sherpa-onnx Whisper (ONNX) — auto-segments at 30s
 time transcribeit run -p sherpa-onnx -i <input_file> -m base -f text -o ./output
 time transcribeit run -p sherpa-onnx -i <input_file> -m small.en -f text -o ./output
+
+# sherpa-onnx Moonshine
+time transcribeit run -p sherpa-onnx -i <input_file> -m moonshine-base -f text -o ./output
+
+# sherpa-onnx SenseVoice
+time transcribeit run -p sherpa-onnx -i <input_file> -m sense-voice -f text -o ./output
 ```

 Record:

@@ -92,6 +98,25 @@ Output size: 4.6 MB

 Keep rows in a simple table (date + commit hash + environment + results) in your preferred tracker so regressions are easy to catch.

+## Reference benchmark results
+
+These results were measured on a 5-minute medical interview recording.
+
+| Engine / Model | Wall clock | Realtime factor | Notes |
+|---|---|---|---|
+| Local whisper.cpp `base` | 3.6s | 83x RT | Best speed/quality trade-off |
+| SenseVoice 2024 | 6.6s | 46x RT | Good quality, 50+ languages |
+| Sherpa-ONNX Whisper `base` | 10.9s | 27x RT | |
+| Moonshine `base` | 14.1s | 21x RT | |
+| Local whisper.cpp `large-v3-turbo` | 33.7s | 8.9x RT | Highest transcription quality |
+| Sherpa-ONNX Whisper `turbo` | 47.2s | 6.4x RT | |
+
+**Notes:**
+- Local whisper.cpp (GGML) is consistently the fastest engine for a given model size.
+- SenseVoice 2024 offers excellent speed with good quality. **Avoid the SenseVoice 2025 model** -- it is a regression in quality.
+- Moonshine provides a compact alternative but is slower than Whisper at the same size tier.
+- For highest quality where speed is not critical, use `large-v3-turbo` with local whisper.cpp.
+
 ## CI/automatable baseline

 For now, treat these as manual benchmarks in a fixed environment.
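For reference, "realtime factor" in the benchmark table is simply audio duration divided by wall-clock time. Assuming the clip is exactly 300 s (it is described only as "5-minute", so the real length may differ slightly), the factors recompute to within a rounding step of the table's values:

```rust
/// Realtime factor: seconds of audio transcribed per second of wall time.
fn realtime_factor(audio_secs: f64, wall_secs: f64) -> f64 {
    audio_secs / wall_secs
}

fn main() {
    // Wall-clock times from the reference table, assuming a 300 s clip.
    let rows = [
        ("whisper.cpp base", 3.6),
        ("SenseVoice 2024", 6.6),
        ("sherpa-onnx Whisper base", 10.9),
        ("whisper.cpp large-v3-turbo", 33.7),
    ];
    for (name, wall) in rows {
        println!("{name}: {:.1}x RT", realtime_factor(300.0, wall));
    }
}
```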

docs/provider-behavior.md

Lines changed: 12 additions & 4 deletions
@@ -14,14 +14,22 @@ This project supports four providers. They share the same input/output surface,
 ## Sherpa-ONNX (`-p sherpa-onnx`)

 - Input audio/video is converted with FFmpeg to 16 kHz mono WAV.
+- The engine **auto-detects model architecture** from the files in the model directory:
+  - **Whisper** -- `encoder.onnx` + `decoder.onnx` + `tokens.txt`
+  - **Moonshine** -- `preprocess.onnx` + `encode.onnx` + `uncached_decode.onnx` + `cached_decode.onnx` + `tokens.txt`
+  - **SenseVoice** -- `model.onnx` + `tokens.txt`
 - Model loading uses `--model` resolved from:
-  - explicit filesystem path to a directory containing `encoder.onnx`, `decoder.onnx`, and `tokens.txt`
-  - or cache alias (`tiny`, `base.en`, `small`, etc.) resolved under `MODEL_CACHE_DIR` as `sherpa-onnx-whisper-<alias>/`.
+  - explicit filesystem path to a model directory
+  - cache alias (`tiny`, `base.en`, `small`, etc.) resolved under `MODEL_CACHE_DIR` as `sherpa-onnx-whisper-<alias>/`
+  - or glob-based partial name matching (e.g., `-m moonshine-base`, `-m sense-voice`) against directories in `MODEL_CACHE_DIR`.
 - The engine prefers `int8` quantized ONNX files when available for lower memory usage.
 - Transcription runs in-process on a dedicated worker thread using the sherpa-onnx C library via FFI.
+- C++ stderr warnings from the sherpa-onnx library are suppressed during inference to keep terminal output clean.
 - Whisper ONNX models only support audio of 30 seconds or less per call. The pipeline automatically enables segmentation and caps `--max-segment-secs` at 30, regardless of user-supplied values.
+- **SenseVoice limitation:** emotion and audio event detection tags are stripped by the sherpa-onnx C API and are not available in the output.
 - Segment concurrency is always 1 (sequential processing).
 - No external API key is required.
+- The `sherpa-onnx` feature is enabled by default. Build without it using `cargo build --no-default-features`.
 - Requires `SHERPA_ONNX_LIB_DIR` to be set at build time (see [Architecture](architecture.md#build-requirements)).

 ## OpenAI-compatible (`-p openai`)

@@ -57,8 +65,8 @@ This project supports four providers. They share the same input/output surface,

 Both are local engines that run without network access. They differ in the model format and inference backend:

-- **Local** uses GGML models via `whisper.cpp` (`whisper-rs` binding). Supports all model sizes.
-- **Sherpa-ONNX** uses ONNX models via the `sherpa-onnx` C library. Supports all sizes except `large-v3`. Requires auto-segmentation at 30s due to Whisper ONNX limitations.
+- **Local** uses GGML models via `whisper.cpp` (`whisper-rs` binding). Supports all Whisper model sizes.
+- **Sherpa-ONNX** uses ONNX models via the `sherpa-onnx` C library. Supports three model architectures (Whisper, Moonshine, SenseVoice) with automatic detection. Whisper ONNX supports all sizes except `large-v3`. Requires auto-segmentation at 30s due to Whisper ONNX limitations. The `sherpa-onnx` feature is optional (enabled by default); build without it using `cargo build --no-default-features`.

 ### OpenAI vs Azure

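The stderr suppression mentioned for this provider works by temporarily swapping file descriptor 2. A minimal sketch, assuming a Unix target; the POSIX calls are declared directly here so the snippet is dependency-free, whereas the engine itself goes through the `libc` crate:

```rust
// Sketch of suppressing C-library stderr output around a call,
// assuming a Unix target. Illustrative only; error handling omitted.
use std::ffi::CString;
use std::os::raw::c_char;

extern "C" {
    fn dup(fd: i32) -> i32;
    fn dup2(oldfd: i32, newfd: i32) -> i32;
    fn open(path: *const c_char, flags: i32) -> i32;
    fn close(fd: i32) -> i32;
}

const STDERR_FD: i32 = 2;
const O_WRONLY: i32 = 1;

/// Run `f` with stderr redirected to /dev/null, restoring it afterwards.
fn with_stderr_suppressed<T>(f: impl FnOnce() -> T) -> T {
    unsafe {
        let saved = dup(STDERR_FD); // remember the real stderr
        let devnull = CString::new("/dev/null").unwrap();
        let null_fd = open(devnull.as_ptr(), O_WRONLY);
        dup2(null_fd, STDERR_FD); // stderr now points at /dev/null
        close(null_fd);
        let out = f(); // noisy C++ decode call would run here
        dup2(saved, STDERR_FD); // restore the original stderr
        close(saved);
        out
    }
}

fn main() {
    with_stderr_suppressed(|| eprintln!("this warning is swallowed"));
    eprintln!("stderr is back");
    println!("done");
}
```

Restoring from a saved duplicate (rather than reopening the terminal) keeps the original stderr destination intact even when it is a pipe or a file.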
docs/troubleshooting.md

Lines changed: 25 additions & 2 deletions
@@ -43,12 +43,28 @@ cargo build --release
 Symptoms:
 - `ONNX model not found for '<name>'`
 - `encoder.onnx (or encoder.int8.onnx) not found in ...`
+- `Could not detect model architecture in ...`
 - `tokens.txt not found in ...`

 Fix:
-- Ensure the model directory contains `encoder.onnx` (or `encoder.int8.onnx`), `decoder.onnx` (or `decoder.int8.onnx`), and `tokens.txt` (or `*-tokens.txt`).
-- Download ONNX models with: `transcribeit download-model -f onnx -s <size>`
+- The sherpa-onnx engine auto-detects the model architecture. Ensure the model directory contains the correct files for one of:
+  - **Whisper:** `encoder.onnx` + `decoder.onnx` (or int8 variants) + `tokens.txt`
+  - **Moonshine:** `preprocess.onnx` + `encode.onnx` + `uncached_decode.onnx` + `cached_decode.onnx` + `tokens.txt`
+  - **SenseVoice:** `model.onnx` + `tokens.txt`
+- Download Whisper ONNX models with: `transcribeit download-model -f onnx -s <size>`
+- For Moonshine and SenseVoice models, download from the [sherpa-onnx model releases](https://github.com/k2-fsa/sherpa-onnx/releases/tag/asr-models) and extract into `MODEL_CACHE_DIR`.
 - Verify with: `transcribeit list-models` (ONNX models appear with an `[onnx]` tag)
+- The model resolver supports partial name matching (e.g., `-m moonshine-base`, `-m sense-voice`).
+
+### Building without sherpa-onnx
+
+If you do not need the sherpa-onnx provider and want to avoid installing the shared libraries:
+
+```bash
+cargo build --release --no-default-features
+```
+
+This disables the `sherpa-onnx` Cargo feature (which is enabled by default) and removes the dependency on `SHERPA_ONNX_LIB_DIR`.

 ### Model download fails

@@ -140,10 +156,17 @@ Fix:
 Common causes:
 - Language mismatch (auto-detection failed on very short clips)
 - Excessive background noise
+- Previously, a `whisper-rs` bug with `set_detect_language(true)` caused 0 segments when `--language` was not specified. This has been fixed; if you encounter this on an older build, rebuild with the latest code.

 Fix:
 - Provide `--language` hint (for example `--language en`).
 - Use `--segment` and tune silence thresholds:
   - raise (less negative) `--silence-threshold` for more aggressive splits
   - lower `--min-silence-duration` for noisy recordings
 - Try the same file with a different model (for example `base.en`, `small`, `small.en`).
+
+### SenseVoice emotion/event tags missing
+
+SenseVoice models are capable of detecting emotions and audio events (laughter, applause, music, etc.), but the sherpa-onnx C API strips these tags from the output. Only the transcription text is available. This is a limitation of the sherpa-onnx C-level bindings, not of transcribeit.
+
+Additionally, the SenseVoice 2025 model is a quality regression compared to the 2024 version. Prefer using the 2024 SenseVoice model for best results.
