A standalone, fully local voice-to-text system using OpenAI's Whisper (open-source). All models run on your machine—no API keys, no cloud calls; audio never leaves your device. This version supports Dual Output (Original + Translation) and Local-First Speaker Diarization (no tokens required).
- Docker and Docker Compose installed.
- FFmpeg (only if running locally without Docker).
- Make (installed by default on most Linux systems).
We provide a Makefile to simplify the Docker commands.
```bash
make build
make run AUDIO=audio/mix.mp3
make translate AUDIO=audio/mix.mp3
```

To identify different speakers and label the transcript (`SPEAKER_00: Hello`):
```bash
make diarize AUDIO=audio/mix.mp3
make all AUDIO=audio/mix.mp3
```

For persistent model loading and fast transcription via HTTP:
```bash
make server
```

The server loads models once at startup, so subsequent requests skip the model-loading delay.
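The load-once behavior boils down to caching the expensive model load so it runs only on the first request. A minimal sketch of the pattern (hypothetical illustration, not the server's actual code; `load_model` stands in for the real Whisper loader):

```python
from functools import lru_cache

LOAD_COUNT = 0  # track how many times the expensive load actually runs


@lru_cache(maxsize=None)
def load_model(name: str = "base"):
    """Stand-in for the expensive Whisper model load."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return f"<model:{name}>"  # a real loader would return the model object


def transcribe(audio: str, model_name: str = "base") -> str:
    model = load_model(model_name)  # cached after the first call
    return f"{model} transcribed {audio}"


# The first request pays the load cost; later requests reuse the cached model.
transcribe("a.wav")
transcribe("b.wav")
```

With `lru_cache`, both calls share one model load, which is why only the first request to a freshly started server is slow.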
```bash
make docs
```

Or visit: http://localhost:8000/docs
```bash
curl -X POST "http://localhost:8000/transcribe?translate=true&diarize=true" \
  -F "file=@audio/multi_person.mp3"
```

Query params: `diarize_threshold`, `max_speakers`, `use_silhouette` (estimate speaker count from embeddings).
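The same endpoint can be called from Python with the `requests` library (a sketch assuming the server above is running on port 8000; the response JSON shape depends on the server):

```python
import requests  # third-party: pip install requests

SERVER = "http://localhost:8000"


def build_params(translate=False, diarize=False, max_speakers=None):
    """Assemble the query parameters accepted by /transcribe."""
    params = {}
    if translate:
        params["translate"] = "true"
    if diarize:
        params["diarize"] = "true"
    if max_speakers is not None:
        params["max_speakers"] = str(max_speakers)
    return params


def transcribe(path, **opts):
    """POST an audio file to the server and return the parsed JSON response."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{SERVER}/transcribe",
            params=build_params(**opts),
            files={"file": f},
        )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    print(transcribe("audio/multi_person.mp3", translate=True, diarize=True))
```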
If you don't have `make` installed:
```bash
docker compose build

# Basic
docker compose run --rm whisper audio.wav

# Dual output (original + translation)
docker compose run --rm whisper audio.wav --translate

# Diarization (no token required!)
docker compose run --rm whisper audio.wav --diarize
```

- Transcripts are saved to `transcripts/`.
- If diarization is enabled, output follows the format: `SPEAKER_XX: [Text segment]`.
- If translation is enabled, files contain both the original language and the English translation.
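A transcript in this format is easy to post-process, e.g. splitting it back into (speaker, text) turns. A minimal sketch, assuming one `SPEAKER_XX:` turn per line:

```python
import re

# One diarized turn per line: "SPEAKER_00: Hello there"
TURN = re.compile(r"^(SPEAKER_\d+):\s*(.*)$")


def parse_turns(transcript: str):
    """Split a diarized transcript into (speaker, text) pairs."""
    turns = []
    for line in transcript.splitlines():
        m = TURN.match(line.strip())
        if m:
            turns.append((m.group(1), m.group(2)))
    return turns
```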
- Install FFmpeg:

  ```bash
  sudo apt update && sudo apt install ffmpeg libsndfile1
  ```

- Set up the environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  ```

- Run:

  ```bash
  python3 transcribe.py audio.wav --translate --diarize
  ```
- `--model`: Whisper model size (`tiny`, `base`, `small`, `medium`, `large`). Default: `base`.
- `--whisper-backend`: `openai-whisper` (default) or `transformers` (Hugging Face). Both run locally.
- `--translate`: Translate non-English audio to English.
- `--diarize`: Enable speaker diarization (SpeechBrain + sliding-window sub-segments + temporal smoothing).
- `--diarize-threshold`: Clustering distance (lower = more speakers). Default: `0.35`. Ignored if `--max-speakers` is set.
- `--max-speakers`: Fix the number of speakers (e.g. `2` for a two-person dialogue). Overrides `--diarize-threshold`.
- `--use-silhouette`: Estimate the number of speakers from embeddings when `--max-speakers` is not set.
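How `--diarize-threshold` and `--max-speakers` interact can be illustrated with hierarchical clustering over speaker embeddings. This is an illustrative sketch using SciPy, not the project's actual SpeechBrain pipeline: a fixed speaker count cuts the cluster tree into exactly N groups, while the threshold stops merging once clusters are farther apart than the given cosine distance.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_speakers(embeddings, threshold=0.35, max_speakers=None):
    """Group embeddings into speakers; a fixed count overrides the threshold."""
    Z = linkage(embeddings, method="average", metric="cosine")
    if max_speakers is not None:
        # --max-speakers: cut the tree into exactly N clusters
        return fcluster(Z, t=max_speakers, criterion="maxclust")
    # --diarize-threshold: stop merging once cosine distance exceeds it
    return fcluster(Z, t=threshold, criterion="distance")


# Two synthetic "voices": embeddings near orthogonal directions
rng = np.random.default_rng(0)
voice_a = rng.normal([1.0, 0.0, 0.0], 0.01, size=(5, 3))
voice_b = rng.normal([0.0, 1.0, 0.0], 0.01, size=(5, 3))
X = np.vstack([voice_a, voice_b])

by_threshold = cluster_speakers(X, threshold=0.35)  # finds 2 speakers
by_count = cluster_speakers(X, max_speakers=2)      # forced to 2 speakers
```

Lowering the threshold splits clusters more aggressively (more speakers); raising it merges them (fewer speakers), which matches the "lower = more speakers" note above.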
Server: set `WHISPER_BACKEND=transformers` or `WHISPER_MODEL=small` in the environment to change the loaded model.
- `voice_to_text/` — main package: `config`, `io_utils`, `diarization`, `pipeline`, `cli`, `backends/` (openai + transformers Whisper).
- `transcribe.py` — CLI entrypoint.
- `server.py` — FastAPI server.
- `docs/` — architecture and structure (see `docs/ARCHITECTURE.md`, `docs/PROJECT_STRUCTURE.md`).
See `docs/ARCHITECTURE.md` for data flow and `docs/PROJECT_STRUCTURE.md` for a file-by-file reference.
We use Ruff for linting and formatting.
```bash
make lint
make format
```