A standalone, fully local voice-to-text system using OpenAI's Whisper (open-source). All models run on your machine—no API keys, no cloud calls; audio never leaves your device. This version supports Dual Output (Original + Translation) and Local-First Speaker Diarization (no tokens required).
- Docker and Docker Compose installed.
- FFmpeg (only if running locally without Docker).
- Make (installed by default on most Linux systems).
We provide a Makefile to simplify the Docker commands.
```bash
make build
make run AUDIO=audio/mix.mp3
make translate AUDIO=audio/mix.mp3
```

To identify different speakers and label the transcript (`SPEAKER_00: Hello`):
```bash
make diarize AUDIO=audio/mix.mp3
make all AUDIO=audio/mix.mp3
```

For persistent model loading and fast transcription via HTTP:
```bash
make server
```

The server loads models once at startup, so subsequent requests skip the model-loading delay.
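The load-once behavior boils down to caching the expensive model load so it runs only on the first request. A minimal sketch of the pattern (hypothetical illustration, not the server's actual code; `load_model` stands in for the real Whisper loader):

```python
from functools import lru_cache

LOAD_COUNT = 0  # track how many times the expensive load actually runs


@lru_cache(maxsize=None)
def load_model(name: str = "base"):
    """Stand-in for the expensive Whisper model load."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return f"<model:{name}>"  # a real loader would return the model object


def transcribe(audio: str, model_name: str = "base") -> str:
    model = load_model(model_name)  # cached after the first call
    return f"{model} transcribed {audio}"


# The first request pays the load cost; later requests reuse the cached model.
transcribe("a.wav")
transcribe("b.wav")
```

With `lru_cache`, both calls share one model load, which is why only the first request to a freshly started server is slow.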
```bash
make docs
```

Or visit: http://localhost:8000/docs
```bash
curl -X POST "http://localhost:8000/transcribe?translate=true&diarize=true" \
  -F "file=@audio/multi_person.mp3"
```

Query params: `diarize_threshold`, `max_speakers`, `use_silhouette` (estimate speaker count from embeddings).
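The same endpoint can be called from Python with the `requests` library (a sketch assuming the server above is running on port 8000; the response JSON shape depends on the server):

```python
import requests  # third-party: pip install requests

SERVER = "http://localhost:8000"


def build_params(translate=False, diarize=False, max_speakers=None):
    """Assemble the query parameters accepted by /transcribe."""
    params = {}
    if translate:
        params["translate"] = "true"
    if diarize:
        params["diarize"] = "true"
    if max_speakers is not None:
        params["max_speakers"] = str(max_speakers)
    return params


def transcribe(path, **opts):
    """POST an audio file to the server and return the parsed JSON response."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{SERVER}/transcribe",
            params=build_params(**opts),
            files={"file": f},
        )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    print(transcribe("audio/multi_person.mp3", translate=True, diarize=True))
```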
If you don't have `make` installed:
```bash
docker compose build

# Basic
docker compose run --rm whisper audio.wav

# Dual output (original + translation)
docker compose run --rm whisper audio.wav --translate

# Diarization (no token required!)
docker compose run --rm whisper audio.wav --diarize
```

- Transcripts are saved to `transcripts/`.
- If diarization is enabled, output follows the format: `SPEAKER_XX: [Text segment]`.
- If translation is enabled, files contain both the original language and the English translation.
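A transcript in this format is easy to post-process, e.g. splitting it back into (speaker, text) turns. A minimal sketch, assuming one `SPEAKER_XX:` turn per line:

```python
import re

# One diarized turn per line: "SPEAKER_00: Hello there"
TURN = re.compile(r"^(SPEAKER_\d+):\s*(.*)$")


def parse_turns(transcript: str):
    """Split a diarized transcript into (speaker, text) pairs."""
    turns = []
    for line in transcript.splitlines():
        m = TURN.match(line.strip())
        if m:
            turns.append((m.group(1), m.group(2)))
    return turns
```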
- Install FFmpeg:

  ```bash
  sudo apt update && sudo apt install ffmpeg libsndfile1
  ```

- Set up the environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  ```

- Run:

  ```bash
  python3 transcribe.py audio.wav --translate --diarize
  ```
- `--model`: Whisper model size (`tiny`, `base`, `small`, `medium`, `large`). Default: `base`.
- `--whisper-backend`: `openai-whisper` (default) or `transformers` (Hugging Face). Both run locally.
- `--translate`: Translate non-English audio to English.
- `--diarize`: Enable speaker diarization (SpeechBrain + sliding-window sub-segments + temporal smoothing).
- `--diarize-threshold`: Clustering distance (lower = more speakers). Default: `0.35`. Ignored if `--max-speakers` is set.
- `--max-speakers`: Fix the number of speakers (e.g. `2` for a two-person dialogue). Overrides `--diarize-threshold`.
- `--use-silhouette`: Estimate the number of speakers from embeddings when `--max-speakers` is not set.
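How `--diarize-threshold` and `--max-speakers` interact can be illustrated with hierarchical clustering over speaker embeddings. This is an illustrative sketch using SciPy, not the project's actual SpeechBrain pipeline: a fixed speaker count cuts the cluster tree into exactly N groups, while the threshold stops merging once clusters are farther apart than the given cosine distance.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_speakers(embeddings, threshold=0.35, max_speakers=None):
    """Group embeddings into speakers; a fixed count overrides the threshold."""
    Z = linkage(embeddings, method="average", metric="cosine")
    if max_speakers is not None:
        # --max-speakers: cut the tree into exactly N clusters
        return fcluster(Z, t=max_speakers, criterion="maxclust")
    # --diarize-threshold: stop merging once cosine distance exceeds it
    return fcluster(Z, t=threshold, criterion="distance")


# Two synthetic "voices": embeddings near orthogonal directions
rng = np.random.default_rng(0)
voice_a = rng.normal([1.0, 0.0, 0.0], 0.01, size=(5, 3))
voice_b = rng.normal([0.0, 1.0, 0.0], 0.01, size=(5, 3))
X = np.vstack([voice_a, voice_b])

by_threshold = cluster_speakers(X, threshold=0.35)  # finds 2 speakers
by_count = cluster_speakers(X, max_speakers=2)      # forced to 2 speakers
```

Lowering the threshold splits clusters more aggressively (more speakers); raising it merges them (fewer speakers), which matches the "lower = more speakers" note above.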
Server: set `WHISPER_BACKEND=transformers` or `WHISPER_MODEL=small` in the environment to change the loaded model.
- `voice_to_text/` — main package: `config`, `io_utils`, `diarization`, `pipeline`, `cli`, `backends/` (openai + transformers Whisper).
- `transcribe.py` — CLI entrypoint.
- `server.py` — FastAPI server.
- `docs/` — architecture and structure (see `docs/ARCHITECTURE.md`, `docs/PROJECT_STRUCTURE.md`).
See `docs/ARCHITECTURE.md` for data flow and `docs/PROJECT_STRUCTURE.md` for a file-by-file reference.
We use Ruff for linting and formatting.
```bash
make lint
make format
```