A Rust CLI for speech-to-text transcription. Supports local inference via whisper.cpp or sherpa-onnx, and remote transcription via OpenAI-compatible APIs or Azure OpenAI.
Accepts any audio or video format that FFmpeg can decode; conversion happens automatically.
- Rust 1.85+ (edition 2024)
- FFmpeg installed and on PATH
- C/C++ toolchain and CMake (for building whisper.cpp)
- sherpa-onnx shared libraries (if using the sherpa-onnx provider); set SHERPA_ONNX_LIB_DIR in .env to the directory containing them
# Build (reads SHERPA_ONNX_LIB_DIR from .env automatically via build.rs)
cargo build --release
# Build without sherpa-onnx (no shared library dependency needed)
cargo build --release --no-default-features
# Download a GGML model (default format, for --provider local)
transcribeit download-model -s base
# Download an ONNX model (for --provider sherpa-onnx)
transcribeit download-model -s base -f onnx
# List all downloaded models (GGML and ONNX)
transcribeit list-models
# Transcribe with local whisper.cpp (model alias resolves from MODEL_CACHE_DIR)
transcribeit run -i recording.mp3 -m base
# Transcribe with sherpa-onnx Whisper (auto-segments at ≤30s boundaries)
transcribeit run -p sherpa-onnx -i recording.mp3 -m base
# Transcribe with sherpa-onnx Moonshine (auto-detected from model files)
transcribeit run -p sherpa-onnx -i recording.mp3 -m moonshine-base
# Transcribe with sherpa-onnx SenseVoice (auto-detected from model files)
transcribeit run -p sherpa-onnx -i recording.mp3 -m sense-voice
# Or pass an explicit model path
transcribeit run -i recording.mp3 -m .cache/ggml-base.bin
# Process a directory (default output format is vtt)
transcribeit run -i samples/ -m base -o ./output
# Process a glob
transcribeit run --input "samples/**/*.{mp3,wav,mp4}" -p azure -o ./output
# Choose output format: text, vtt (default), or srt
transcribeit run -i meeting.mp4 -m base -f srt -o ./output
# Transcribe via OpenAI API
transcribeit run -p openai -i recording.mp3
# Transcribe via Azure OpenAI
transcribeit run -p azure -i recording.mp3 \
  --azure-deployment my-whisper -b https://myresource.openai.azure.com
# Force language and normalize before transcription
transcribeit run -i recording.wav -m base --language en --normalize
# VAD-based segmentation (speech-aware, avoids mid-word cuts)
transcribeit run -p sherpa-onnx -m base -i recording.mp3 --vad-model .cache/silero_vad.onnx
# Speaker diarization (2 speakers)
transcribeit run -i interview.mp3 -m base --speakers 2 \
  --diarize-segmentation-model .cache/sherpa-onnx-pyannote-segmentation-3-0/model.onnx \
  --diarize-embedding-model .cache/wespeaker_en_voxceleb_CAM++.onnx

- Any input format: MP3, MP4, WAV, FLAC, OGG, etc. FFmpeg converts to mono 16 kHz WAV automatically.
- 4 providers: Local whisper.cpp, sherpa-onnx, OpenAI API, Azure OpenAI. Extensible via the Transcriber trait.
- 3 model architectures via sherpa-onnx: Whisper, Moonshine, and SenseVoice are auto-detected from the model directory contents. Just point --model at any supported model directory.
- Model aliases: -m base, -m tiny, etc. resolve from MODEL_CACHE_DIR for both the local and sherpa-onnx providers. The sherpa-onnx resolver also supports glob matching (e.g., -m moonshine-base, -m sense-voice).
- Language hinting: Pass --language to force the transcription language for both local and API providers.
- FFmpeg audio normalization: Optional --normalize applies loudnorm before transcription.
- VAD-based segmentation: Speech-aware segmentation via Silero VAD (sherpa-onnx). Detects speech boundaries with padding and gap merging to avoid mid-word cuts. Use --vad-model .cache/silero_vad.onnx.
- Silence-based segmentation: Fallback segmentation via FFmpeg silencedetect for API providers or when the VAD model is not available.
- sherpa-onnx auto-segmentation: Whisper ONNX models only support ≤30s per call, so segmentation is enabled automatically.
- sherpa-onnx is optional: Enabled by default as a Cargo feature. Build without it: cargo build --no-default-features.
- Auto-split for API limits: Files exceeding 25 MB are automatically segmented when using remote providers.
- Progress spinner: Shows live terminal feedback during transcription (single file and segmented mode).
- Parallel API segment transcription: Multiple segment requests can be processed concurrently with --segment-concurrency.
- VTT output (default): WebVTT subtitle files with timestamps.
- SRT output: SubRip subtitle files with timestamps.
- Text output: Writes the plain-text transcript to stdout by default, or to <input>.txt when --output-dir is specified.
- JSON manifest: Processing metadata, segment details, and statistics.
- Model caching: Loaded whisper models are cached in memory for batch processing.
- Model management: Download and list both GGML and ONNX models. Use --format ggml (default) or --format onnx with download-model.

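The "padding and gap merging" step of VAD-based segmentation can be pictured with a small sketch. This is not transcribeit's actual code: the `Span` type, the `pad_and_merge` function, and the padding and gap values are hypothetical, chosen only to illustrate how padded speech regions that sit close together might be joined so that a cut never lands inside a word.

```rust
/// A speech region in seconds, as a VAD might report it (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: f64,
    end: f64,
}

/// Pad each span by `pad` seconds, clamp it to `[0, duration]`, then merge
/// neighbouring spans whose gap is at most `max_gap` seconds.
/// Assumes `spans` is sorted by start time.
fn pad_and_merge(spans: &[Span], pad: f64, max_gap: f64, duration: f64) -> Vec<Span> {
    let mut merged: Vec<Span> = Vec::new();
    for s in spans {
        let padded = Span {
            start: (s.start - pad).max(0.0),
            end: (s.end + pad).min(duration),
        };
        if let Some(prev) = merged.last_mut() {
            // Close enough to the previous span (or overlapping it): extend it.
            if padded.start - prev.end <= max_gap {
                prev.end = prev.end.max(padded.end);
                continue;
            }
        }
        merged.push(padded);
    }
    merged
}

fn main() {
    // Illustrative values only; a real VAD would produce these from audio.
    let speech = [
        Span { start: 0.5, end: 2.0 },
        Span { start: 2.25, end: 4.0 },
        Span { start: 9.0, end: 10.0 },
    ];
    for span in pad_and_merge(&speech, 0.25, 0.5, 10.0) {
        println!("{:.2}s - {:.2}s", span.start, span.end);
    }
}
```

With these inputs the first two regions overlap after padding and collapse into one segment, while the third stays separate and is clamped to the end of the audio.
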
Create a .env file in the project root:
HF_TOKEN=hf_your_token_here
MODEL_CACHE_DIR=.cache
SHERPA_ONNX_LIB_DIR=/path/to/sherpa-onnx/lib
OPENAI_API_KEY=sk-your_key_here
AZURE_API_KEY=your_azure_key_here
AZURE_OPENAI_ENDPOINT=https://myresource.openai.azure.com
AZURE_DEPLOYMENT_NAME=whisper
AZURE_API_VERSION=2024-06-01
TRANSCRIBEIT_MAX_RETRIES=5
TRANSCRIBEIT_REQUEST_TIMEOUT_SECS=120
TRANSCRIBEIT_RETRY_WAIT_BASE_SECS=10
TRANSCRIBEIT_RETRY_WAIT_MAX_SECS=120
VAD_MODEL=.cache/silero_vad.onnx
DIARIZE_SEGMENTATION_MODEL=.cache/sherpa-onnx-pyannote-segmentation-3-0/model.onnx
DIARIZE_EMBEDDING_MODEL=.cache/wespeaker_en_voxceleb_CAM++.onnx

Pre-built binaries can be deployed without Rust or build tools. The binary needs FFmpeg on PATH and the sherpa-onnx shared libraries alongside it:
transcribeit                      # binary
lib/                              # sherpa-onnx shared libraries
  libsherpa-onnx-c-api.dylib
  libonnxruntime.dylib
On first run, use transcribeit setup to download models and additional components. The binary looks for shared libraries in lib/ relative to itself — no environment variables needed at runtime.
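As a rough illustration of "lib/ relative to itself", the sketch below derives a sibling `lib/` directory from the running executable's path using `std::env::current_exe`. The helper name `lib_dir_for` is hypothetical, and the real binary may rely on a linker rpath rather than runtime code; treat this only as a picture of the directory layout shown above.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Given the path of an executable, return the `lib/` directory next to it.
/// Returns `None` if the path has no parent directory.
/// (Hypothetical helper, not part of transcribeit.)
fn lib_dir_for(exe: &Path) -> Option<PathBuf> {
    exe.parent().map(|dir| dir.join("lib"))
}

fn main() {
    match env::current_exe() {
        Ok(exe) => match lib_dir_for(&exe) {
            Some(dir) => println!("shared libraries expected in {}", dir.display()),
            None => println!("executable path has no parent directory"),
        },
        Err(err) => eprintln!("could not determine executable path: {err}"),
    }
}
```
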
To build a distributable binary:
cargo build --release
# Copy binary + libs (create dist/lib first so cp has a target directory)
mkdir -p dist/lib
cp target/release/transcribeit dist/
cp vendor/sherpa-onnx-*/lib/lib*.dylib dist/lib/

To build without sherpa-onnx (no shared library dependency):
cargo build --release --no-default-features

This project is licensed under the Business Source License 1.1.
- Free for non-commercial and evaluation use
- Commercial/production use requires a separate license — contact TranscriptIntel
- Converts to Apache 2.0 on March 16, 2030
See the docs folder for detailed documentation:
- Architecture — Project structure, trait design, processing pipeline
- CLI Reference — All commands, options, and examples
- Provider behavior — OpenAI vs Azure argument differences
- Troubleshooting — Common setup/runtime issues and fixes
- Performance benchmarks — Measurement plan, reference results, and templates