Record audio from your microphone and transcribe it to text using whisper.cpp. The program records until you press Enter, then runs the recording through Whisper and prints the transcription.
Platform: macOS (capture uses AVFoundation). Linux/Windows would require changing the ffmpeg input in code.
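Porting the capture step means swapping the ffmpeg input device per OS. A minimal Go sketch of that selection is below; only the macOS `avfoundation`/`:0` case comes from this project, and the `alsa`/`dshow` device names (and the `captureArgs` helper itself) are illustrative assumptions, not what `cmd/vtt` actually does.

```go
package main

import (
	"fmt"
	"runtime"
)

// captureArgs returns hypothetical ffmpeg input arguments per platform.
// Only the darwin case mirrors this project; the Linux/Windows device
// names are common defaults, not tested values.
func captureArgs(goos string) []string {
	switch goos {
	case "darwin":
		// ":0" selects the default audio device (no video) under AVFoundation.
		return []string{"-f", "avfoundation", "-i", ":0"}
	case "linux":
		return []string{"-f", "alsa", "-i", "default"}
	case "windows":
		return []string{"-f", "dshow", "-i", "audio=Microphone"}
	default:
		return nil
	}
}

func main() {
	fmt.Println(captureArgs(runtime.GOOS))
}
```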
- Go 1.25 or later
- ffmpeg (install via Homebrew: `brew install ffmpeg`)
- whisper.cpp (included as a submodule; must be built and the model downloaded)
- Clone with submodules:

  ```sh
  git clone --recurse-submodules <repo-url>
  cd voice-to-text
  ```

  If the repo is already cloned:

  ```sh
  git submodule update --init --recursive
  ```
- Build whisper.cpp

  From the project root:

  ```sh
  cd third_party/whisper
  mkdir -p build && cd build
  cmake ..
  make
  cd ../../..
  ```

  Ensure the CLI binary is at `third_party/whisper/build/bin/whisper-cli`. If your build puts the binary elsewhere (e.g. `third_party/whisper/main`), copy or symlink it to that path.

- Download a Whisper model

  The app expects the large-v3-turbo model by default at `third_party/whisper/models/ggml-large-v3-turbo.bin`. Download it using the whisper.cpp script (from the project root):

  ```sh
  mkdir -p third_party/whisper/models
  third_party/whisper/models/download-ggml-model.sh large-v3-turbo third_party/whisper/models
  ```

  Other models: you can point the app at a different model by setting `WHISPER_MODEL`, e.g. `third_party/whisper/models/ggml-base.en.bin` for the smaller English base model. For a lighter, faster option use `ggml-base.en.bin` or `ggml-small.en.bin` and set `WHISPER_MODEL` accordingly. Model files are large (large-v3-turbo is ~1.5 GB) and are gitignored.
Run from the project root (paths to the CLI and model are relative to it):

```sh
go run ./cmd/vtt
```

Or build and run the binary:

```sh
go build -o vtt ./cmd/vtt
./vtt
```

- The program starts and prints "Recording... Press ENTER to stop."
- Speak; when done, press Enter to stop recording.
- The audio is saved to `cmd/vtt/input.wav` and then transcribed.
- The transcription is printed under "--- TRANSCRIPTION ---".

Recordings and generated .wav files are gitignored.
- `cmd/vtt/main.go` – entrypoint: recording via ffmpeg, then transcription via whisper-cli
- `third_party/whisper` – whisper.cpp as a git submodule (build output and models are local only)
- Recording: ffmpeg captures from the default system microphone (`:0`) with AVFoundation, applies light noise reduction, and outputs 16 kHz mono for Whisper.
- Transcription: the saved WAV is passed to `whisper-cli` with `-sns` (suppress non-speech tokens) and `-l en` for cleaner English output.