├── audio/
│   ├── extract.rs       # FFmpeg audio conversion
│   ├── segment.rs       # Silence detection and audio splitting
│   ├── vad.rs           # VAD-based speech segmentation (Silero VAD via sherpa-onnx)
│   └── wav.rs           # WAV reading and encoding (shared)
├── diarize/
│   ├── mod.rs           # Speaker diarization engine and speaker assignment
│   └── ffi.rs           # Raw C FFI bindings for sherpa-onnx speaker diarization
├── output/
│   ├── vtt.rs           # WebVTT subtitle writer (supports <v Speaker N> tags)
│   ├── srt.rs           # SRT subtitle writer (supports [Speaker N] labels)
│   └── manifest.rs      # JSON manifest writer (includes speaker labels)
└── engines/
    ├── whisper_local.rs # Local whisper.cpp via whisper-rs
    ├── sherpa_onnx.rs   # Local sherpa-onnx engine (auto-detects Whisper, Moonshine, SenseVoice)
```

```
│  └─ Auto: sherpa-onnx provider (always segments; max 30s per chunk)
│
├─ If segmenting:
│  ├─ VAD path (when --vad-model is set and the sherpa-onnx feature is enabled):
│  │  ├─ read_wav_bytes() → f32 PCM samples
│  │  ├─ vad_segment(): detect speech → pad 250ms → merge gaps <200ms → split long chunks at low-energy points
│  │  ├─ Extract chunk samples directly from memory
│  │  └─ Transcribe each chunk via transcribe(), offset timestamps
│  └─ FFmpeg fallback (no VAD model, or sherpa-onnx feature disabled):
│     ├─ detect_silence() via FFmpeg silencedetect filter
│     ├─ compute_segments() at silence midpoints
│     ├─ split_audio() into temp WAV files
│     └─ Transcribe each segment, offset timestamps (concurrently for API providers)
│
├─ If not segmenting:
│  ├─ Local: read_wav() → transcribe() directly
│  └─ API: transcribe_path() with prepared file
│
├─ normalize_audio? ──→ optional loudnorm filter in FFmpeg conversion pipeline
├─ Speaker diarization? (when --speakers N is set)
│  ├─ Read audio samples for diarization
│  ├─ Diarizer.diarize() → speaker-labeled time spans
│  └─ assign_speakers() overlays speaker labels onto transcript segments
│
└─ Output:
   ├─ Text to stdout or `<input_stem>.txt`
   ├─ VTT to file or stdout (with <v Speaker N> tags when diarized)
   ├─ SRT to file or stdout (with [Speaker N] labels when diarized)
   └─ JSON manifest to output directory (includes speaker field per segment)
```
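Both segmentation paths end the same way: each chunk's transcript carries timestamps relative to the chunk, which must be shifted into the full-file timeline. A minimal sketch of that step; the `Segment` struct and function name here are hypothetical, not the crate's actual types:

```rust
// Hypothetical segment type; the real pipeline's struct may differ.
#[derive(Debug, PartialEq)]
struct Segment {
    start: f64, // seconds, relative to the chunk
    end: f64,
    text: String,
}

/// Shift chunk-relative timestamps into the timeline of the full input.
fn offset_segments(mut segments: Vec<Segment>, chunk_start_secs: f64) -> Vec<Segment> {
    for s in &mut segments {
        s.start += chunk_start_secs;
        s.end += chunk_start_secs;
    }
    segments
}

fn main() {
    let segs = vec![Segment { start: 0.0, end: 1.5, text: "hello".into() }];
    let shifted = offset_segments(segs, 30.0);
    assert_eq!(shifted[0].start, 30.0);
    assert_eq!(shifted[0].end, 31.5);
    println!("{:?}", shifted[0]);
}
```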
Temporary files are created with the `tempfile` crate and cleaned up automatically on drop.
```
cargo build --release --no-default-features
```

This removes the sherpa-onnx provider and eliminates the need for `SHERPA_ONNX_LIB_DIR`.

## VAD-based segmentation (`audio/vad.rs`)

When `--vad-model` is set and the `sherpa-onnx` feature is enabled, the pipeline uses Silero VAD (via sherpa-onnx) for speech-aware segmentation instead of FFmpeg's `silencedetect` filter. This avoids the main problem with silence-based splitting: mid-word cuts.

The VAD pipeline (`vad_segment()`) has four stages:

1. **Detect speech** -- Silero VAD processes 512-sample frames (~32ms at 16kHz) to find speech boundaries with sample-level precision.
2. **Pad 250ms** -- Each speech chunk is extended by 250ms on both sides to protect word boundaries at the edges.
3. **Merge gaps <200ms** -- Adjacent chunks separated by less than 200ms are merged to avoid splitting within short pauses.
4. **Split long chunks** -- Chunks exceeding `--max-segment-secs` are split at the lowest-energy point within a 1-second search window around the target cut point.

The VAD approach works directly on in-memory PCM samples, so there is no need for intermediate temp files during segmentation. Each chunk is transcribed via `engine.transcribe()` with sample slices, and timestamps are offset by the chunk start time.

When `--vad-model` is not set, segmentation falls back to FFmpeg `silencedetect` (the original behavior).
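Stages 2 and 3 reduce to simple arithmetic over `(start, end)` sample spans. A simplified sketch under the documented 16 kHz / 250 ms / 200 ms parameters; the function names are hypothetical, and the real `vad_segment()` additionally performs detection and long-chunk splitting:

```rust
const SAMPLE_RATE: usize = 16_000;

/// Stage 2: extend each speech span by `pad` samples on both sides,
/// clamping to the bounds of the audio.
fn pad_spans(spans: &[(usize, usize)], pad: usize, total: usize) -> Vec<(usize, usize)> {
    spans
        .iter()
        .map(|&(s, e)| (s.saturating_sub(pad), (e + pad).min(total)))
        .collect()
}

/// Stage 3: merge adjacent (sorted) spans whose gap is under `min_gap` samples.
fn merge_spans(spans: &[(usize, usize)], min_gap: usize) -> Vec<(usize, usize)> {
    let mut out: Vec<(usize, usize)> = Vec::new();
    for &(s, e) in spans {
        match out.last_mut() {
            Some(last) if s <= last.1 + min_gap => last.1 = last.1.max(e),
            _ => out.push((s, e)),
        }
    }
    out
}

fn main() {
    let total = 10 * SAMPLE_RATE;
    let speech = [(8_000, 16_000), (19_000, 40_000)];
    let pad = SAMPLE_RATE / 4;     // 250 ms
    let min_gap = SAMPLE_RATE / 5; // 200 ms
    let padded = pad_spans(&speech, pad, total);
    // After padding, the spans overlap, so the short pause is absorbed.
    let merged = merge_spans(&padded, min_gap);
    assert_eq!(merged, vec![(4_000, 44_000)]);
    println!("{:?}", merged);
}
```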
## Speaker diarization (`diarize/`)

Speaker diarization identifies which speaker is talking at each point in the audio. It requires the `sherpa-onnx` feature and two ONNX models:

- **Segmentation model** (`--diarize-segmentation-model`): a pyannote segmentation ONNX model that detects speaker change points.
- **Embedding model** (`--diarize-embedding-model`): a speaker embedding ONNX model that extracts voice characteristics for clustering.

The `Diarizer` follows the same dedicated worker-thread pattern as `SherpaOnnxEngine`: the C FFI types are not `Send`/`Sync`, so they live on a plain `std::thread` and communicate via channels. Diarization requests are sent through `mpsc` and results come back through `tokio::sync::oneshot`.
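The worker-thread pattern can be sketched as follows. This is a self-contained illustration, not the actual `Diarizer`: an `Rc` stands in for the non-`Send` FFI handle, and a std `mpsc` channel stands in for the `tokio::sync::oneshot` reply so the example needs no external crates:

```rust
use std::rc::Rc;
use std::sync::mpsc;
use std::thread;

enum Request {
    Diarize {
        samples: Vec<f32>,
        // The real code replies over tokio::sync::oneshot; a std mpsc
        // sender plays that role in this sketch.
        reply: mpsc::Sender<usize>,
    },
    Shutdown,
}

fn spawn_worker() -> mpsc::Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        // Rc is not Send, like the FFI handle: it is created on the
        // worker thread and never leaves it.
        let handle: Rc<&str> = Rc::new("ffi-session");
        for req in rx {
            match req {
                Request::Diarize { samples, reply } => {
                    let _ = &handle;
                    // Stand-in for real diarization: just count samples.
                    let _ = reply.send(samples.len());
                }
                Request::Shutdown => break,
            }
        }
    });
    tx
}

fn main() {
    let tx = spawn_worker();
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Request::Diarize { samples: vec![0.0; 480], reply: reply_tx }).unwrap();
    assert_eq!(reply_rx.recv().unwrap(), 480);
    tx.send(Request::Shutdown).unwrap();
    println!("worker replied");
}
```

Keeping the handle's owner on one thread sidesteps `Send`/`Sync` entirely; only the channel endpoints, which are `Send`, cross thread boundaries.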

After transcription completes, `assign_speakers()` overlays speaker labels onto transcript segments by finding the diarization segment with the maximum time overlap for each transcript segment. Speaker labels appear as:

- **VTT**: `<v Speaker 0>text</v>`
- **SRT**: `[Speaker 0] text`
- **Manifest JSON**: a `"speaker": "Speaker 0"` field on each segment
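The maximum-overlap rule can be illustrated in a few lines; the types and names here are hypothetical, not the actual `assign_speakers()` signature:

```rust
#[derive(Clone, Copy)]
struct Span {
    start: f64,
    end: f64,
    speaker: usize,
}

/// Overlap in seconds between two time intervals; zero if disjoint.
fn overlap(a: (f64, f64), b: (f64, f64)) -> f64 {
    (a.1.min(b.1) - a.0.max(b.0)).max(0.0)
}

/// Pick the speaker whose diarization span overlaps the transcript
/// segment the most; None when nothing overlaps.
fn assign_speaker(seg: (f64, f64), diar: &[Span]) -> Option<usize> {
    diar.iter()
        .map(|d| (overlap(seg, (d.start, d.end)), d.speaker))
        .filter(|&(ov, _)| ov > 0.0)
        .max_by(|a, b| a.0.partial_cmp(&b.0).unwrap())
        .map(|(_, spk)| spk)
}

fn main() {
    let diar = [
        Span { start: 0.0, end: 4.0, speaker: 0 },
        Span { start: 4.0, end: 10.0, speaker: 1 },
    ];
    // Segment 3.0..6.0 overlaps speaker 0 for 1s and speaker 1 for 2s.
    assert_eq!(assign_speaker((3.0, 6.0), &diar), Some(1));
    assert_eq!(assign_speaker((11.0, 12.0), &diar), None);
    println!("ok");
}
```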

## Adding a new engine

1. Create `src/engines/your_engine.rs`