git clone https://github.com/fltman/transcriber.git
cd transcriber
bash install.sh # Automated installer
bash start.sh # Start all services

The script handles everything below automatically. Read on if you prefer manual setup or need to troubleshoot.
- Ubuntu 22.04+, Debian 12+, or similar (other distros work with adjusted package commands)
- Docker and Docker Compose
- Python 3.11+
- Node.js 18+
- FFmpeg
- CMake and build-essential
- Ollama, or an OpenRouter API key
- Hugging Face account with access to pyannote/speaker-diarization-3.1
If you have an NVIDIA GPU, install the CUDA Toolkit and NVIDIA drivers for faster transcription. Without a GPU, whisper.cpp runs on CPU (slower but works fine).
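If you want the setup to adapt to either machine automatically, a small sketch like this can pick the whisper.cpp CUDA flag (used in the build step below) based on whether an NVIDIA GPU is visible. The helper name is ours; it only assumes that `nvidia-smi` is on PATH when the NVIDIA driver is installed.

```shell
# Sketch: choose whisper.cpp build flags based on GPU presence.
# Assumes nvidia-smi is on PATH when the NVIDIA driver is installed.
whisper_cmake_flags() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "-DWHISPER_CUDA=ON"
  else
    echo ""  # CPU-only build
  fi
}
```

Then `cmake -B build $(whisper_cmake_flags)` covers both cases with one command.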
# Ubuntu/Debian
sudo apt update
sudo apt install -y git build-essential cmake ffmpeg python3 python3-venv python3-pip curl
# Fedora
sudo dnf install -y git gcc-c++ cmake ffmpeg python3 python3-pip curl  # ffmpeg may require the RPM Fusion repository
# Arch
sudo pacman -S git base-devel cmake ffmpeg python python-pip curl

Install Node.js 18+ (if not already installed):

curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs

git clone https://github.com/fltman/transcriber.git
cd transcriber

git clone https://github.com/ggerganov/whisper.cpp.git ../whisper.cpp
cd ../whisper.cpp
# CPU only
cmake -B build
cmake --build build --config Release
# OR with CUDA (if you have an NVIDIA GPU)
# cmake -B build -DWHISPER_CUDA=ON
# cmake --build build --config Release
cd ../transcriber

The binary will be at ../whisper.cpp/build/bin/whisper-cli.
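A quick sanity check that the build succeeded can save a confusing failure later. The helper name and default path below are this guide's conventions; adjust the path if your layout differs.

```shell
# Sketch: confirm the whisper-cli binary exists and is executable.
check_whisper_cli() {
  [ -x "${1:-../whisper.cpp/build/bin/whisper-cli}" ]
}

check_whisper_cli || echo "whisper-cli not found - rerun the cmake build" >&2
```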
mkdir -p models
# Medium model (main transcription, higher quality)
curl -L -o models/kb_whisper_ggml_medium.bin \
https://huggingface.co/KBLab/kb-whisper-medium/resolve/main/ggml-model.bin
# Small model (live transcription, faster)
curl -L -o models/kb_whisper_ggml_small.bin \
https://huggingface.co/KBLab/kb-whisper-small/resolve/main/ggml-model.bin

docker compose up -d

This starts:
- PostgreSQL on port 5433
- Redis on port 6380
Verify they're running:
docker compose ps

cat > .env << 'EOF'
DATABASE_URL=postgresql://transcriber:transcriber@localhost:5433/transcriber
REDIS_URL=redis://localhost:6380/0
# LLM provider: "ollama" or "openrouter"
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=qwen3:8b
# Alternative: OpenRouter (uncomment and fill in)
# LLM_PROVIDER=openrouter
# OPENROUTER_API_KEY=your_key_here
# OPENROUTER_MODEL=anthropic/claude-sonnet-4
# Paths to whisper.cpp (adjust to your setup)
WHISPER_CLI_PATH=../whisper.cpp/build/bin/whisper-cli
WHISPER_MODEL_PATH=./models/kb_whisper_ggml_medium.bin
WHISPER_SMALL_MODEL_PATH=./models/kb_whisper_ggml_small.bin
STORAGE_PATH=./storage
# Hugging Face token (needed for pyannote.audio speaker diarization)
# Get yours at https://huggingface.co/settings/tokens
# You must accept the model terms at https://huggingface.co/pyannote/speaker-diarization-3.1
HF_AUTH_TOKEN=hf_your_token_here
EOF

Edit the file and fill in your actual paths and tokens.
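A common failure mode is a half-filled .env, so a quick check for missing keys is worth running before starting the services. This is a sketch: the helper name is ours, and the key list mirrors the .env template above.

```shell
# Sketch: list required keys missing from an env file (key names from the .env above).
check_env_file() {
  for key in DATABASE_URL REDIS_URL WHISPER_CLI_PATH WHISPER_MODEL_PATH HF_AUTH_TOKEN; do
    grep -q "^${key}=" "$1" || echo "missing: $key"
  done
}
```

Run it as `check_env_file .env`; no output means every required key is present (it does not validate the values, such as a placeholder HF token).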
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

If you have an NVIDIA GPU, install the CUDA version of PyTorch first:

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

Note: the first install downloads several GB of model files for pyannote.audio and SpeechBrain.
cd frontend
npm install
cd ..

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:8b

Start all four services. Use separate terminals or a terminal multiplexer like tmux:
# Terminal 1 - Backend API
source venv/bin/activate
uvicorn main:app --port 8000 --reload
# Terminal 2 - Celery worker (background processing)
source venv/bin/activate
celery -A tasks.celery_app worker --loglevel=info --pool=solo
# Terminal 3 - Frontend
cd frontend
npm run dev
# Terminal 4 - Ollama (if using local LLM)
ollama serve

Open http://localhost:5174 in your browser.
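To check that the services are actually listening before opening the browser, a small poll helper works well. This is a bash-only sketch (it relies on bash's `/dev/tcp` redirection); the function name is ours, and the ports are the defaults used in this guide.

```shell
# Sketch: wait until a TCP port accepts connections, or give up after N tries.
wait_for() {
  local host=$1 port=$2 tries=${3:-10} i
  for i in $(seq "$tries"); do
    if (echo > "/dev/tcp/$host/$port") 2>/dev/null; then
      return 0  # port is accepting connections
    fi
    sleep 1
  done
  return 1
}
```

For example, `wait_for localhost 8000 && wait_for localhost 5174 && echo ready` confirms the backend and frontend are both up.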
tmux new-session -d -s transcriber
# Backend
tmux send-keys 'source venv/bin/activate && uvicorn main:app --port 8000 --reload' Enter
# Celery
tmux split-window -v
tmux send-keys 'source venv/bin/activate && celery -A tasks.celery_app worker --loglevel=info --pool=solo' Enter
# Frontend
tmux split-window -v
tmux send-keys 'cd frontend && npm run dev' Enter
# Ollama
tmux split-window -v
tmux send-keys 'ollama serve' Enter
tmux select-layout tiled
tmux attach -t transcriber

Add your user to the docker group:
sudo usermod -aG docker $USER
# Log out and back in for it to take effect

Make sure you have Python 3.11+. On Ubuntu 22.04, you may need:
sudo apt install python3.11 python3.11-venv
python3.11 -m venv venv

Your browser needs microphone permission. If running on a headless server, you'll need to access the app via HTTPS or localhost, since browsers block microphone access on insecure origins.
CPU-only transcription with the medium model can take 2-5x the audio length. Consider using the small model (edit WHISPER_MODEL_PATH in .env) for faster results, or use a machine with an NVIDIA GPU.
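To make the 2-5x factor concrete, a tiny helper (ours, not part of the project) turns an audio length into an expected processing range:

```shell
# Rough worst/best-case ETA for CPU transcription, using the 2-5x factor above.
eta_range() {
  local audio_min=$1
  echo "$(( audio_min * 2 ))-$(( audio_min * 5 )) minutes"
}
```

So a 30-minute recording can take anywhere from an hour to two and a half hours on CPU with the medium model.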
Check that Redis is running (docker compose ps) and that the REDIS_URL in .env is correct.
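When checking the connection manually, it helps to pull the host and port out of the same REDIS_URL the app uses, so you test exactly what's configured. A sketch, assuming the `redis://host:port/db` format from the .env above (helper names are ours):

```shell
# Sketch: extract host and port from a redis://host:port/db URL.
redis_host() { echo "$1" | sed -E 's#redis://([^:/]+).*#\1#'; }
redis_port() { echo "$1" | sed -E 's#redis://[^:/]+:([0-9]+).*#\1#'; }
```

Then `redis-cli -h "$(redis_host "$REDIS_URL")" -p "$(redis_port "$REDIS_URL")" ping` should answer PONG if Redis is reachable.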
The first transcription downloads pyannote and SpeechBrain model files (several GB). Subsequent runs use cached models.