
Commit c6d60d9

Merge pull request #1 from transcriptintel/develop
v1.2.0 — Speaker diarization, VAD segmentation, setup command
2 parents: f4f8fcd + 9857591

23 files changed: +1848 −291 lines

Cargo.lock

Lines changed: 466 additions & 225 deletions

Cargo.toml

Lines changed: 6 additions & 5 deletions
@@ -1,7 +1,8 @@
 [package]
 name = "transcribeit"
-version = "1.1.0"
+version = "1.2.0"
 edition = "2024"
+license-file = "LICENSE"

 [profile.release]
 opt-level = 3
@@ -25,16 +26,16 @@ dotenvy = "0.15"
 futures-util = "0.3"
 hound = "3.5"
 glob = "0.3"
-indicatif = "0.17"
-reqwest = { version = "0.12", features = ["json", "multipart", "stream"] }
+indicatif = "0.18"
+reqwest = { version = "0.13", features = ["json", "multipart", "stream"] }
 serde = { version = "1", features = ["derive"] }
 serde_json = "1"
 tempfile = "3"
 regex = "1"
 tokio = { version = "1", features = ["full"] }
 sherpa-onnx = { version = "0.1", optional = true }
 tar = "0.4"
-bzip2 = "0.5"
+bzip2 = "0.6"
 libc = "0.2"
-whisper-rs = "0.12"
+whisper-rs = "0.16"
 bytes = "1.11.1"

LICENSE

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+Business Source License 1.1
+
+License text copyright (c) 2017 MariaDB Corporation Ab, All Rights Reserved.
+"Business Source License" is a trademark of MariaDB Corporation Ab.
+
+Parameters
+
+Licensor: TranscriptIntel
+Licensed Work: transcribeit
+The Licensed Work is (c) 2026 TranscriptIntel
+Additional Use Grant: You may use the Licensed Work for non-commercial
+and evaluation purposes without a license.
+Production use in a commercial setting requires
+a separate commercial license from the Licensor.
+Change Date: 2030-03-16
+Change License: Apache License, Version 2.0
+
+Terms
+
+The Licensor hereby grants you the right to copy, modify, create derivative
+works, redistribute, and make non-production use of the Licensed Work. The
+Licensor may make an Additional Use Grant, above, permitting limited
+production use.
+
+Effective on the Change Date, or the fourth anniversary of the first publicly
+available distribution of a specific version of the Licensed Work under this
+License, whichever comes first, the Licensor hereby grants you rights under
+the terms of the Change License, and the rights granted in the paragraph
+above terminate.
+
+If your use of the Licensed Work does not comply with the requirements
+currently in effect as described in this License, you must purchase a
+commercial license from the Licensor, its affiliated entities, or authorized
+resellers, or you must refrain from using the Licensed Work.
+
+All copies of the original and modified Licensed Work, and derivative works
+of the Licensed Work, are subject to this License. This License applies
+separately for each version of the Licensed Work and the Change Date may vary
+for each version of the Licensed Work released by Licensor.
+
+You must conspicuously display this License on each original or modified copy
+of the Licensed Work. If you receive the Licensed Work in original or
+modified form from a third party, the terms and conditions set forth in this
+License apply to your use of that work.
+
+Any use of the Licensed Work in violation of this License will automatically
+terminate your rights under this License for the current and all other
+versions of the Licensed Work.
+
+This License does not grant you any right in any trademark or logo of
+Licensor or its affiliates (provided that you may use a trademark or logo of
+Licensor as expressly required by this License).
+
+TO THE EXTENT PERMITTED BY APPLICABLE LAW, THE LICENSED WORK IS PROVIDED ON
+AN "AS IS" BASIS. LICENSOR HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS,
+EXPRESS OR IMPLIED, INCLUDING (WITHOUT LIMITATION) WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, AND
+TITLE.

README.md

Lines changed: 21 additions & 1 deletion
@@ -62,6 +62,14 @@ transcribeit run -p azure -i recording.mp3 \

 # Force language and normalize before transcription
 transcribeit run -i recording.wav -m base --language en --normalize
+
+# VAD-based segmentation (speech-aware, avoids mid-word cuts)
+transcribeit run -p sherpa-onnx -m base -i recording.mp3 --vad-model .cache/silero_vad.onnx
+
+# Speaker diarization (2 speakers)
+transcribeit run -i interview.mp3 -m base --speakers 2 \
+  --diarize-segmentation-model .cache/sherpa-onnx-pyannote-segmentation-3-0/model.onnx \
+  --diarize-embedding-model .cache/wespeaker_en_voxceleb_CAM++.onnx
 ```

 ## Features
@@ -72,7 +80,8 @@ transcribeit run -i recording.wav -m base --language en --normalize
 - **Model aliases** — `-m base`, `-m tiny`, etc. resolve from `MODEL_CACHE_DIR` for both `local` and `sherpa-onnx` providers. The sherpa-onnx resolver also supports glob matching (e.g., `-m moonshine-base`, `-m sense-voice`).
 - **Language hinting** — Pass `--language` to force local and API transcription language.
 - **FFmpeg audio normalization** — Optional `--normalize` to apply loudnorm before transcription.
-- **Silence-based segmentation** — Splits long audio at silence boundaries for better accuracy and API compatibility.
+- **VAD-based segmentation** — Speech-aware segmentation via Silero VAD (sherpa-onnx). Detects speech boundaries with padding and gap merging to avoid mid-word cuts. Use `--vad-model .cache/silero_vad.onnx`.
+- **Silence-based segmentation** — Fallback segmentation via FFmpeg `silencedetect` for API providers or when VAD model is not available.
 - **sherpa-onnx auto-segmentation** — Whisper ONNX models only support ≤30s per call; segmentation is enabled automatically.
 - **sherpa-onnx is optional** — Enabled by default as a Cargo feature. Build without it: `cargo build --no-default-features`.
 - **Auto-split for API limits** — Files exceeding 25MB are automatically segmented when using remote providers.
@@ -102,8 +111,19 @@ TRANSCRIBEIT_MAX_RETRIES=5
 TRANSCRIBEIT_REQUEST_TIMEOUT_SECS=120
 TRANSCRIBEIT_RETRY_WAIT_BASE_SECS=10
 TRANSCRIBEIT_RETRY_WAIT_MAX_SECS=120
+VAD_MODEL=.cache/silero_vad.onnx
+DIARIZE_SEGMENTATION_MODEL=.cache/sherpa-onnx-pyannote-segmentation-3-0/model.onnx
+DIARIZE_EMBEDDING_MODEL=.cache/wespeaker_en_voxceleb_CAM++.onnx
 ```

+## License
+
+This project is licensed under the [Business Source License 1.1](LICENSE).
+
+- **Free** for non-commercial and evaluation use
+- **Commercial/production use** requires a separate license — contact [TranscriptIntel](https://github.com/transcriptintel)
+- Converts to **Apache 2.0** on March 16, 2030
+
 ## Documentation

 See the [docs](docs/) folder for detailed documentation:

docs/architecture.md

Lines changed: 55 additions & 10 deletions
@@ -12,11 +12,15 @@ src/
 ├── audio/
 │   ├── extract.rs        # FFmpeg audio conversion
 │   ├── segment.rs        # Silence detection and audio splitting
+│   ├── vad.rs            # VAD-based speech segmentation (Silero VAD via sherpa-onnx)
 │   └── wav.rs            # WAV reading and encoding (shared)
+├── diarize/
+│   ├── mod.rs            # Speaker diarization engine and speaker assignment
+│   └── ffi.rs            # Raw C FFI bindings for sherpa-onnx speaker diarization
 ├── output/
-│   ├── vtt.rs            # WebVTT subtitle writer
-│   ├── srt.rs            # SRT subtitle writer
-│   └── manifest.rs       # JSON manifest writer
+│   ├── vtt.rs            # WebVTT subtitle writer (supports <v Speaker N> tags)
+│   ├── srt.rs            # SRT subtitle writer (supports [Speaker N] labels)
+│   └── manifest.rs       # JSON manifest writer (includes speaker labels)
 └── engines/
     ├── whisper_local.rs  # Local whisper.cpp via whisper-rs
     ├── sherpa_onnx.rs    # Local sherpa-onnx engine (auto-detects Whisper, Moonshine, SenseVoice)
@@ -67,21 +71,32 @@ Input file (any format)
 │  └─ Auto: sherpa-onnx provider (always segments; max 30s per chunk)

 ├─ If segmenting:
-│  ├─ detect_silence() via FFmpeg silencedetect filter
-│  ├─ compute_segments() at silence midpoints
-│  ├─ split_audio() into temp WAV files
-│  └─ Transcribe each segment, offset timestamps (concurrently for API providers)
+│  ├─ VAD path (when --vad-model is set and sherpa-onnx feature is enabled):
+│  │  ├─ read_wav_bytes() → f32 PCM samples
+│  │  ├─ vad_segment(): detect speech → pad 250ms → merge gaps <200ms → split long chunks at low-energy points
+│  │  ├─ Extract chunk samples directly from memory
+│  │  └─ Transcribe each chunk via transcribe(), offset timestamps
+│  ├─ FFmpeg fallback (no VAD model, or sherpa-onnx feature disabled):
+│  │  ├─ detect_silence() via FFmpeg silencedetect filter
+│  │  ├─ compute_segments() at silence midpoints
+│  │  ├─ split_audio() into temp WAV files
+│  │  └─ Transcribe each segment, offset timestamps (concurrently for API providers)

 ├─ If not segmenting:
 │  ├─ Local: read_wav() → transcribe() directly
 │  └─ API: transcribe_path() with prepared file

 ├─ normalize_audio? ──→ optional loudnorm filter in ffmpeg conversion pipeline
+├─ Speaker diarization? (when --speakers N is set)
+│  ├─ read audio samples for diarization
+│  ├─ Diarizer.diarize() → speaker-labeled time spans
+│  └─ assign_speakers() overlays speaker labels onto transcript segments
+
 └─ Output:
    ├─ Text to stdout or `<input_stem>.txt`
-   ├─ VTT to file or stdout
-   ├─ SRT to file or stdout
-   └─ JSON manifest to output directory
+   ├─ VTT to file or stdout (with `<v Speaker N>` tags when diarized)
+   ├─ SRT to file or stdout (with `[Speaker N]` labels when diarized)
+   └─ JSON manifest to output directory (includes speaker field per segment)
 ```

 Temporary files use the `tempfile` crate and are cleaned up automatically on drop.
@@ -184,6 +199,36 @@ cargo build --release --no-default-features

 This removes the sherpa-onnx provider and eliminates the need for `SHERPA_ONNX_LIB_DIR`.

+## VAD-based segmentation (`audio/vad.rs`)
+
+When `--vad-model` is set and the `sherpa-onnx` feature is enabled, the pipeline uses Silero VAD (via sherpa-onnx) for speech-aware segmentation instead of FFmpeg's `silencedetect` filter. This avoids the main problem with silence-based splitting: mid-word cuts.
+
+The VAD pipeline (`vad_segment()`) has four stages:
+
+1. **Detect speech** -- Silero VAD processes 512-sample frames (~32ms at 16kHz) to find speech boundaries with sample-level precision.
+2. **Pad 250ms** -- Each speech chunk is extended by 250ms on both sides to protect word boundaries at the edges.
+3. **Merge gaps <200ms** -- Adjacent chunks separated by less than 200ms are merged to avoid splitting within short pauses.
+4. **Split long chunks** -- Chunks exceeding `--max-segment-secs` are split at the lowest-energy point within a 1-second search window around the target cut point.
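Stages 2 and 3 are simple interval arithmetic over sample offsets. A std-only Rust sketch, using the documented 250ms/200ms thresholds at 16kHz (the `Chunk` type and function name are hypothetical stand-ins, not the actual `vad.rs` internals):

```rust
const SAMPLE_RATE: usize = 16_000;
const PAD: usize = SAMPLE_RATE / 4;     // 250ms padding (stage 2)
const MIN_GAP: usize = SAMPLE_RATE / 5; // merge gaps shorter than 200ms (stage 3)

#[derive(Debug, Clone, Copy, PartialEq)]
struct Chunk {
    start: usize, // sample offset, inclusive
    end: usize,   // sample offset, exclusive
}

fn pad_and_merge(raw: &[Chunk], total_samples: usize) -> Vec<Chunk> {
    let mut out: Vec<Chunk> = Vec::new();
    for c in raw {
        // Stage 2: extend each speech chunk by 250ms on both sides, clamped to the file.
        let padded = Chunk {
            start: c.start.saturating_sub(PAD),
            end: (c.end + PAD).min(total_samples),
        };
        // Stage 3: merge with the previous chunk when the remaining gap is under 200ms.
        match out.last_mut() {
            Some(prev) if padded.start <= prev.end + MIN_GAP => {
                prev.end = prev.end.max(padded.end);
            }
            _ => out.push(padded),
        }
    }
    out
}

fn main() {
    // Two speech regions 100ms apart: after padding, the <200ms rule merges them.
    let raw = [
        Chunk { start: 16_000, end: 32_000 },
        Chunk { start: 33_600, end: 48_000 },
    ];
    let merged = pad_and_merge(&raw, 80_000);
    println!("{merged:?}");
    assert_eq!(merged, vec![Chunk { start: 12_000, end: 52_000 }]);
}
```

Working in sample offsets rather than seconds keeps the arithmetic exact and matches how the chunks are later sliced out of the in-memory PCM buffer.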
+
+The VAD approach works directly on in-memory PCM samples, so there is no need for intermediate temp files during segmentation. Each chunk is transcribed via `engine.transcribe()` with sample slices, and timestamps are offset by the chunk start time.
+
+When `--vad-model` is not set, segmentation falls back to FFmpeg `silencedetect` (the original behavior).
+
+## Speaker diarization (`diarize/`)
+
+Speaker diarization identifies which speaker is talking at each point in the audio. It requires the `sherpa-onnx` feature and two ONNX models:
+
+- **Segmentation model** (`--diarize-segmentation-model`): a pyannote segmentation ONNX model that detects speaker change points.
+- **Embedding model** (`--diarize-embedding-model`): a speaker embedding ONNX model that clusters voice characteristics.
+
+The `Diarizer` follows the same dedicated worker thread pattern as `SherpaOnnxEngine`: the C FFI types are not `Send`/`Sync`, so they live on a plain `std::thread` and communicate via channels. Diarization requests are sent through `mpsc` and results come back through `tokio::sync::oneshot`.
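The worker thread pattern can be sketched with std primitives alone. This is an illustrative stand-in, not the actual `Diarizer` code: the `!Send` FFI handle is simulated with a raw-pointer field, and a second `std::sync::mpsc` channel substitutes for the `tokio::sync::oneshot` reply to stay dependency-free:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for a C FFI handle. The raw pointer field makes this type
// automatically !Send and !Sync, like the real sherpa-onnx handles.
struct FfiDiarizer {
    _handle: *mut (),
}

impl FfiDiarizer {
    fn new() -> Self {
        FfiDiarizer { _handle: std::ptr::null_mut() }
    }
    // Fake result: one (start_secs, end_secs, speaker_id) span covering the input.
    fn diarize(&self, n_samples: usize) -> Vec<(f32, f32, u32)> {
        vec![(0.0, n_samples as f32 / 16_000.0, 0)]
    }
}

// A request carries the samples plus a one-shot reply channel
// (the real code uses tokio::sync::oneshot here).
struct Request {
    samples: Vec<f32>,
    reply: mpsc::Sender<Vec<(f32, f32, u32)>>,
}

fn spawn_worker() -> mpsc::Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        // The !Send handle is created *inside* the thread and never crosses it;
        // only the (Send) request/reply channels cross thread boundaries.
        let diarizer = FfiDiarizer::new();
        for req in rx {
            let spans = diarizer.diarize(req.samples.len());
            let _ = req.reply.send(spans);
        }
    });
    tx
}

fn main() {
    let worker = spawn_worker();
    let (reply_tx, reply_rx) = mpsc::channel();
    worker
        .send(Request { samples: vec![0.0; 32_000], reply: reply_tx })
        .unwrap();
    let spans = reply_rx.recv().unwrap();
    println!("{spans:?}");
    assert_eq!(spans, vec![(0.0, 2.0, 0)]);
}
```

The key property is that the non-`Send` handle never leaves the worker thread; callers only ever hold the channel sender, which is freely cloneable and shareable.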
+
+After transcription completes, `assign_speakers()` overlays speaker labels onto transcript segments by finding the diarization segment with the maximum time overlap for each transcript segment. Speaker labels appear as:
+
+- **VTT**: `<v Speaker 0>text</v>`
+- **SRT**: `[Speaker 0] text`
+- **Manifest JSON**: `"speaker": "Speaker 0"` field on each segment
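The max-overlap assignment reduces to an interval-intersection argmax. A sketch under illustrative names (this is not the real `diarize` module API):

```rust
#[derive(Debug)]
struct DiarSpan {
    start: f32,
    end: f32,
    speaker: u32,
}

// Length of the intersection of [a0, a1] and [b0, b1], or 0.0 if disjoint.
fn overlap(a0: f32, a1: f32, b0: f32, b1: f32) -> f32 {
    (a1.min(b1) - a0.max(b0)).max(0.0)
}

// Pick the speaker whose diarization span overlaps the transcript segment most.
// Returns None when no span overlaps at all.
fn assign_speaker(seg_start: f32, seg_end: f32, spans: &[DiarSpan]) -> Option<u32> {
    spans
        .iter()
        .map(|s| (overlap(seg_start, seg_end, s.start, s.end), s.speaker))
        .filter(|(ov, _)| *ov > 0.0)
        // Overlaps are finite and non-negative, so partial_cmp never fails here.
        .max_by(|a, b| a.0.partial_cmp(&b.0).unwrap())
        .map(|(_, speaker)| speaker)
}

fn main() {
    let spans = vec![
        DiarSpan { start: 0.0, end: 4.0, speaker: 0 },
        DiarSpan { start: 4.0, end: 9.0, speaker: 1 },
    ];
    // Segment 3.0–6.0 overlaps speaker 0 by 1s and speaker 1 by 2s → speaker 1 wins.
    assert_eq!(assign_speaker(3.0, 6.0, &spans), Some(1));
    // Segment fully inside speaker 0's span.
    assert_eq!(assign_speaker(0.5, 2.5, &spans), Some(0));
    println!("ok");
}
```

Taking the maximum overlap rather than the span containing the segment midpoint makes the assignment robust when a transcript segment straddles a speaker change.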
+
 ## Adding a new engine

 1. Create `src/engines/your_engine.rs`

docs/cli-reference.md

Lines changed: 40 additions & 1 deletion
@@ -119,9 +119,22 @@ These options apply to OpenAI/Azure providers:
 | `--min-silence-duration` | Minimum silence duration in seconds | `0.8` |
 | `--max-segment-secs` | Maximum segment length in seconds | `600` |
 | `--segment-concurrency` | Max parallel segment requests (API providers only) | `2` |
+| `--vad-model` | Path to Silero VAD ONNX model (`silero_vad.onnx`) for speech-aware segmentation | `VAD_MODEL` env var |

 When using `openai` or `azure` providers, files exceeding 25MB are automatically segmented even without `--segment`. When using `sherpa-onnx`, segmentation is always enabled with a maximum segment length of 30 seconds.

+When `--vad-model` is set and segmentation is needed, VAD-based segmentation is used instead of FFmpeg `silencedetect`. VAD detects actual speech boundaries using Silero VAD, avoiding mid-word cuts. It pads chunks by 250ms, merges gaps shorter than 200ms, and splits long chunks at low-energy points. This requires the `sherpa-onnx` feature to be enabled. When `--vad-model` is not set, the original FFmpeg silence-based segmentation is used as a fallback.
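The low-energy split mentioned above amounts to an argmin over short-term frame energies near the target cut point. A hypothetical Rust sketch (names and structure are illustrative, not the actual implementation; frame size matches Silero VAD's 512 samples at 16kHz):

```rust
const SAMPLE_RATE: usize = 16_000;
const FRAME: usize = 512; // ~32ms at 16kHz, same frame size Silero VAD uses

// Short-term energy: sum of squared samples in one frame.
fn frame_energy(frame: &[f32]) -> f32 {
    frame.iter().map(|s| s * s).sum()
}

/// Return a frame-aligned sample index near `target` where the audio is
/// quietest, searching a 1-second window (±0.5s) clamped to the buffer.
fn best_cut(samples: &[f32], target: usize) -> usize {
    let half_window = SAMPLE_RATE / 2;
    let lo = target.saturating_sub(half_window);
    let hi = (target + half_window).min(samples.len().saturating_sub(FRAME));
    (lo..=hi)
        .step_by(FRAME)
        .min_by(|&a, &b| {
            frame_energy(&samples[a..a + FRAME])
                .partial_cmp(&frame_energy(&samples[b..b + FRAME]))
                .unwrap()
        })
        .unwrap_or(target)
}

fn main() {
    // 2s of loud audio with a quiet dip starting near 1.2s.
    let mut samples = vec![0.5f32; 2 * SAMPLE_RATE];
    let dip = 19_200; // 1.2s in samples
    for s in &mut samples[dip..dip + 2 * FRAME] {
        *s = 0.0;
    }
    // Target cut at 1.0s; the search window finds the dip instead.
    let cut = best_cut(&samples, SAMPLE_RATE);
    println!("cut at {:.2}s", cut as f32 / SAMPLE_RATE as f32);
    assert!(cut >= dip - FRAME && cut <= dip + 2 * FRAME);
}
```

Cutting at a local energy minimum rather than exactly at `--max-segment-secs` is what keeps forced splits away from mid-word positions.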
+
+#### Speaker diarization options
+
+| Option | Description | Default |
+|--------|-------------|---------|
+| `--speakers` | Number of speakers for diarization | disabled |
+| `--diarize-segmentation-model` | Path to pyannote segmentation ONNX model | `DIARIZE_SEGMENTATION_MODEL` env var |
+| `--diarize-embedding-model` | Path to speaker embedding ONNX model | `DIARIZE_EMBEDDING_MODEL` env var |
+
+When `--speakers N` is set, speaker diarization runs after transcription to label each segment with a speaker identity. Both `--diarize-segmentation-model` and `--diarize-embedding-model` are required. Speaker labels appear in VTT output as `<v Speaker 0>`, in SRT output as `[Speaker 0]`, and in manifest JSON as a `"speaker"` field on each segment. Requires the `sherpa-onnx` feature.
+
 ## Output behavior

 During transcription, the CLI shows an animated spinner in the terminal so you can see progress while waiting for Whisper/API calls to complete.
@@ -155,6 +168,9 @@ When `--input` resolves to multiple files (directory or glob), all files are pro
 | `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint URL | none |
 | `AZURE_DEPLOYMENT_NAME` | Azure deployment name | `whisper` |
 | `AZURE_API_VERSION` | Azure API version | `2024-06-01` |
+| `VAD_MODEL` | Path to Silero VAD ONNX model for speech-aware segmentation | none |
+| `DIARIZE_SEGMENTATION_MODEL` | Path to pyannote segmentation ONNX model for speaker diarization | none |
+| `DIARIZE_EMBEDDING_MODEL` | Path to speaker embedding ONNX model for speaker diarization | none |
 | `TRANSCRIBEIT_MAX_RETRIES` | Maximum 429 retries | `5` |
 | `TRANSCRIBEIT_REQUEST_TIMEOUT_SECS` | API request timeout in seconds | `120` |
 | `TRANSCRIBEIT_RETRY_WAIT_BASE_SECS` | Base retry wait time in seconds | `10` |
@@ -211,6 +227,28 @@ transcribeit run -i lecture.mp4 -m base -f srt -o ./output
 transcribeit run -i noisy.wav -m .cache/ggml-base.bin \
   --segment --silence-threshold -30 --min-silence-duration 0.5

+# VAD-based segmentation (avoids mid-word cuts)
+transcribeit run -p sherpa-onnx -i lecture.mp4 -m base.en \
+  --vad-model /path/to/silero_vad.onnx -f vtt -o ./output
+
+# VAD with env var (set VAD_MODEL in .env)
+VAD_MODEL=/path/to/silero_vad.onnx transcribeit run -p sherpa-onnx -i recording.mp3 -m base.en
+
+# Speaker diarization (2 speakers)
+transcribeit run -p sherpa-onnx -i meeting.mp4 -m base.en \
+  --speakers 2 \
+  --diarize-segmentation-model /path/to/segmentation.onnx \
+  --diarize-embedding-model /path/to/embedding.onnx \
+  -f vtt -o ./output
+
+# VAD + speaker diarization combined
+transcribeit run -p sherpa-onnx -i interview.wav -m base.en \
+  --vad-model /path/to/silero_vad.onnx \
+  --speakers 2 \
+  --diarize-segmentation-model /path/to/segmentation.onnx \
+  --diarize-embedding-model /path/to/embedding.onnx \
+  -f srt -o ./output
+
 # OpenAI API
 OPENAI_API_KEY=sk-... transcribeit run -p openai -i recording.mp3
@@ -267,7 +305,8 @@ When `--output-dir` is specified, the following files are created:
       "index": 0,
       "start_secs": 0.0,
       "end_secs": 5.25,
-      "text": "Hello, welcome to the meeting."
+      "text": "Hello, welcome to the meeting.",
+      "speaker": "Speaker 0"
     }
   ],
   "stats": {

docs/performance-benchmarks.md

Lines changed: 19 additions & 1 deletion
@@ -61,16 +61,23 @@ Record:
 ### 3. Segmentation impact

 ```bash
+# FFmpeg silencedetect segmentation
 time transcribeit run -p openai -i <long_file> --segment --segment-concurrency 2 -f text -o ./output
 time transcribeit run -p openai -i <long_file> --segment --segment-concurrency 1 --max-segment-secs 300 -f text -o ./output
-# sherpa-onnx always segments at 30s max
+
+# sherpa-onnx with FFmpeg silencedetect (default, always segments at 30s max)
 time transcribeit run -p sherpa-onnx -i <long_file> -m base -f text -o ./output
+
+# sherpa-onnx with VAD-based segmentation
+time transcribeit run -p sherpa-onnx -i <long_file> -m base --vad-model /path/to/silero_vad.onnx -f text -o ./output
 ```

 Record:
 - total segment count
 - max queue wait
 - request-level retry counts
+- segmentation method used (VAD vs silencedetect)
+- transcript quality at segment boundaries (check for mid-word cuts)

 ### 4. I/O + conversion overhead
7683

@@ -117,6 +124,17 @@ These results were measured on a 5-minute medical interview recording.
117124
- Moonshine provides a compact alternative but is slower than Whisper at the same size tier.
118125
- For highest quality where speed is not critical, use `large-v3-turbo` with local whisper.cpp.
119126

127+
### VAD vs FFmpeg silencedetect segmentation
128+
129+
VAD-based segmentation (Silero VAD via `--vad-model`) and FFmpeg `silencedetect` produce different segment boundaries. Key differences to observe when benchmarking:
130+
131+
- **Segment boundary quality:** VAD detects speech regions directly, so segment boundaries align with actual speech. FFmpeg `silencedetect` splits at silence midpoints, which can cut mid-word if silence gaps are short or thresholds are mistuned.
132+
- **Segment count:** VAD typically produces more segments (one per speech region after merging) while `silencedetect` produces fewer, longer segments based on silence gaps.
133+
- **Processing overhead:** VAD runs on the audio samples in-memory (fast, no subprocess). FFmpeg `silencedetect` runs as a subprocess and requires parsing its stderr output.
134+
- **Transcript quality:** VAD-segmented transcripts tend to have fewer artifacts at segment boundaries because chunks start and end at speech boundaries with 250ms padding, rather than at arbitrary silence midpoints.
135+
136+
When comparing, use the same audio file and model to isolate the effect of the segmentation method on overall transcript quality and timing.
137+
120138
## CI/automatable baseline
121139

122140
For now, treat these as manual benchmarks in a fixed environment.
