Self-hosted AI transcription and intelligent note-taking platform
Documentation • Quick Start • Screenshots • Docker Hub • Releases
Speakr transforms your audio recordings into organized, searchable, and intelligent notes. Built for privacy-conscious groups and individuals, it runs entirely on your own infrastructure, ensuring your sensitive conversations remain completely private.
- Smart Recording & Upload - Record directly in browser or upload existing audio files
- AI Transcription - High-accuracy transcription with speaker identification
- Voice Profiles - AI-powered speaker recognition with voice embeddings (requires WhisperX ASR service)
- REST API v1 - Complete API with Swagger UI for automation tools (n8n, Zapier, Make) and dashboard widgets
- Single Sign-On - Authenticate with any OIDC provider (Keycloak, Azure AD, Google, Auth0, Pocket ID)
- Audio-Transcript Sync - Click transcript to jump to audio, auto-highlight current text, follow mode for hands-free playback
- Interactive Chat - Ask questions about your recordings and get AI-powered answers
- Inquire Mode - Semantic search across all recordings using natural language
- Internationalization - Full support for English, Spanish, French, German, Chinese, and Russian
- Beautiful Themes - Light and dark modes with customizable color schemes
- Internal Sharing - Share recordings with specific users with granular permissions (view/edit/reshare)
- Group Management - Create groups with automatic sharing via group-scoped tags
- Public Sharing - Generate secure links to share recordings externally (admin-controlled)
- Group Tags - Tags that automatically share recordings with all group members
- Smart Tagging - Organize with tags that include custom AI prompts and ASR settings
- Tag Prompt Stacking - Combine multiple tags to layer AI instructions for powerful transformations
- Tag Protection - Prevent specific recordings from being auto-deleted
- Group Retention Policies - Set custom retention periods per group tag
- Auto-Deletion - Automatic cleanup of old recordings with flexible retention policies
Different people use Speakr's collaboration and retention features in different ways:
| Use Case | Setup | What It Does |
|---|---|---|
| Family memories | Create "Family" group with protected tag | Everyone gets access to trips and events automatically, recordings preserved forever |
| Book club discussions | "Book Club" group, tag monthly meetings | All members auto-share discussions, can add personal notes about what resonated |
| Work project group | Share individually with 3 teammates | Temporary collaboration, easy to revoke when project ends |
| Daily group standups | Group tag with 14-day retention | Auto-share with group, auto-cleanup of routine meetings |
| Architecture decisions | Engineering group tag, protected from deletion | Technical discussions automatically shared, preserved permanently as reference |
| Client consultations | Individual share with view-only permission | Controlled external access, clients can't accidentally edit |
| Research interviews | Protected tag + Obsidian export | Preserve recordings indefinitely, transcripts auto-import to note-taking system |
| Legal consultations | Group tag with 7-year retention | Automatic sharing with legal group, compliance-based retention |
| Sales calls | Group tag with 1-year retention | Whole sales group learns from each call, cleanup after sales cycle |
Tags with custom prompts transform raw recordings into exactly what you need:
- Recipe recordings: Record yourself cooking while narrating - tag with "Recipe" to convert messy speech into formatted recipes with ingredient lists and numbered steps
- Lecture notes: Students tag lectures with "Study Notes" to get organized outlines with concepts, examples, and definitions instead of raw transcripts
- Code reviews: "Code Review" tag extracts issues, suggested changes, and action items in technical language developers can use directly
- Meeting summaries: "Action Items" tag ignores discussion and returns just decisions, tasks, and deadlines
Stack multiple tags to layer instructions:
- "Recipe" + "Gluten Free" = Formatted recipe with gluten substitution suggestions
- "Lecture" + "Biology 301" = Study notes format focused on biological terminology
- "Client Meeting" + "Legal Review" = Client requirements plus legal implications highlighted
The order can matter - start with format tags, then add focus tags for best results.
- Obsidian/Logseq: Enable auto-export to write completed transcripts directly to your vault using your custom template - no manual export needed
- Documentation wikis: Map auto-export to your wiki's import folder for seamless transcript publishing
- Content creation: Create SRT subtitle templates from your audio recordings for podcasts or video content
- Project management: Extract action items with custom tag prompts, then auto-export for automated task creation
```bash
# Create project directory
mkdir speakr && cd speakr

# Download the docker-compose configuration
wget https://raw.githubusercontent.com/murtaza-nasir/speakr/master/config/docker-compose.example.yml -O docker-compose.yml

# Download the environment template
wget https://raw.githubusercontent.com/murtaza-nasir/speakr/master/config/env.transcription.example -O .env

# Configure your API keys and launch
nano .env
docker compose up -d

# Access at http://localhost:8899
```

Lightweight image: Use `learnedmachine/speakr:lite` for a smaller image (~725MB vs ~4.4GB) that skips PyTorch. All features work normally; only Inquire Mode's semantic search falls back to basic text search.
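For orientation, the downloaded compose file boils down to a single service. The sketch below is illustrative only; the image tag and port come from this quick start, while the volume paths are assumptions, so consult the downloaded `docker-compose.example.yml` for the actual mounts:

```yaml
# Illustrative sketch; the downloaded docker-compose.example.yml is authoritative.
services:
  app:
    image: learnedmachine/speakr:latest
    container_name: speakr
    ports:
      - "8899:8899"        # Web UI port used in the quick start
    env_file:
      - .env               # API keys and connector settings
    volumes:
      - ./uploads:/data/uploads       # assumed path for audio files
      - ./instance:/data/instance     # assumed path for the SQLite database
    restart: unless-stopped
```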
Required API Keys:
- `TRANSCRIPTION_API_KEY` - For speech-to-text (OpenAI), or `ASR_BASE_URL` for self-hosted
- `TEXT_MODEL_API_KEY` - For summaries, titles, and chat (OpenRouter or OpenAI)
Speakr uses a connector-based architecture that auto-detects your transcription provider:
| Option | Setup | Speaker Diarization | Voice Profiles |
|---|---|---|---|
| OpenAI Transcribe | Just API key | ✅ `gpt-4o-transcribe-diarize` | ❌ |
| WhisperX ASR | GPU container | ✅ Best quality | ✅ |
| Mistral Voxtral | Just API key | ✅ Built-in | ❌ |
| VibeVoice ASR | Self-hosted (vLLM) | ✅ Built-in | ❌ |
| Legacy Whisper | Just API key | ❌ | ❌ |
Simplest setup (OpenAI with diarization):

```env
TRANSCRIPTION_API_KEY=sk-your-openai-key
TRANSCRIPTION_MODEL=gpt-4o-transcribe-diarize
```

Best quality (self-hosted WhisperX):

```env
ASR_BASE_URL=http://whisperx-asr:9000
ASR_RETURN_SPEAKER_EMBEDDINGS=true  # Enable voice profiles
```

Requires the WhisperX ASR Service container with GPU.

Mistral Voxtral (cloud diarization):

```env
TRANSCRIPTION_CONNECTOR=mistral
TRANSCRIPTION_API_KEY=your-mistral-key
TRANSCRIPTION_MODEL=voxtral-mini-latest
```

VibeVoice ASR (self-hosted, no cloud dependency):

```env
TRANSCRIPTION_CONNECTOR=vibevoice
TRANSCRIPTION_BASE_URL=http://your-vllm-server:8000
TRANSCRIPTION_MODEL=vibevoice
```

Requires VibeVoice served via vLLM with GPU.
⚠️ PyTorch 2.6 Users: If you encounter a "Weights only load failed" error with WhisperX, add `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=true` to your ASR container. See troubleshooting for details.
View Full Installation Guide →
Complete documentation is available at murtaza-nasir.github.io/speakr
- Getting Started - Quick setup guide
- User Guide - Learn all features
- Admin Guide - Administration and configuration
- Troubleshooting - Common issues and solutions
- FAQ - Frequently asked questions
New Transcription Connectors, Upload API Improvements & Bug Fixes
- Mistral/Voxtral Connector - Cloud-based transcription with built-in speaker diarization via Mistral's Voxtral models, with admin-configurable default hotwords
- VibeVoice Connector - Self-hosted transcription via vLLM with speaker diarization, automatic chunking for long files, and no cloud dependency
- Upload API: title & meeting_date - Optional `title` and `meeting_date` fields on the upload API so integrations can set metadata directly
- Regenerate Title - New button to regenerate a recording's title with AI after transcription
- Default Transcription Language - Users can set a default language that auto-fills on upload and reprocess forms
- Tag-Driven Auto-Processing - Watch folders can now auto-apply tags and trigger processing via API
- Configurable LLM Timeouts - Adjust timeout and retry settings for slower local models
Bug Fixes - Azure inquire mode crash on empty streaming chunks, chat API returning non-serializable objects, user deletion failing on NOT NULL foreign keys, duration-based chunking ignoring connector limits
Fullscreen Video, Custom Vocabulary & Localization
- Fullscreen Video Mode - Double-click or use the expand button to enter a fullscreen video player with auto-hiding controls, live subtitles showing speaker names, and full keyboard shortcuts
- Custom Vocabulary (Hotwords) - Comma-separated words to improve recognition of domain-specific terms, configurable per user, per tag, or per folder
- Initial Prompt - Provide context to steer the transcription model's style and vocabulary
- Video Passthrough - New `VIDEO_PASSTHROUGH_ASR=true` option sends raw video files directly to ASR backends that support video input, skipping audio extraction
- Upload Disclaimer Modal - Configurable disclaimer shown before uploads, with custom banner text in admin settings
- Complete Localization - All recent feature strings (incognito mode, hotwords, upload disclaimer, fullscreen video, groups, SSO, color schemes) now fully localized across all six languages
Bug Fixes - Upload notification ordering, speaker snippet extraction for video files with AAC audio, chat textarea staying focused during AI responses, upload queue blocking when adding files while processing, duplicate detection hashing before conversion, markdown list formatting
Video Retention Fix - Fixed large video files silently losing their video stream during upload when `VIDEO_RETENTION=true`. Probe timeout now scales with file size and falls back to extension detection if probing fails.
Lightweight Docker Image
- Lite Image - New `learnedmachine/speakr:lite` tag (~725MB vs ~4.4GB) skips PyTorch/sentence-transformers for users who don't need semantic search
- Multi-Stage Dockerfile - Optimized build with static ffmpeg binaries and a smaller final image for both variants
- Improved Text Search - Better fallback search with stop word filtering, keyword-focused query enrichment, and match ranking
Thanks to sakowicz for the suggestion
Export Templates & Localization
- Customizable Export Templates - Create markdown templates for exports with variables (`{{title}}`, `{{summary}}`, `{{notes}}`) and conditionals for optional sections
- Localized Labels - Use `{{label.metadata}}`, `{{label.summary}}`, etc. for automatically translated labels based on the user's UI language
- Localized Date Formatting - Export dates formatted per user's language preference (e.g., "15. Januar 2026" for German)
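Putting those variables together, a minimal export template could look like the sketch below. This is illustrative only: `{{label.notes}}` is an assumed label variable following the `{{label.*}}` pattern, and the conditional syntax for optional sections is covered in the documentation rather than shown here.

```markdown
# {{title}}

## {{label.summary}}

{{summary}}

## {{label.notes}}

{{notes}}
```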
Improvements - Opt-in ASR chunking, speaker ID remapping across chunks, simplified About page transcription display
Bug Fixes - ASR empty text validation, cascade delete for recording relationships, missing model imports
Folders & Automation
- Folders Organization - Organize recordings into folders with custom prompts and ASR settings per folder
- Auto Speaker Labeling - Automatic speaker identification using voice embedding matching
- Per-User Auto-Summarization - User-configurable automatic summary generation
- Azure OpenAI Connector - New transcription connector for Azure OpenAI (experimental, community testing welcome)
- HTTPS Validation - Clear error messages when attempting to record on non-HTTPS connections
Improvements - Legacy ASR code removed (fully migrated to the connector architecture), audio codec fallback to MP3, share page click-to-seek, new `READABLE_PUBLIC_LINKS` option for server-rendered transcripts (LLM/scraper accessible)
Bug Fixes - PostgreSQL boolean defaults in migrations, folders feature detection, audio player visibility for incognito recordings
Incognito Mode Enhancements & Compatibility Fixes
- Incognito Mode for In-App Recordings - The incognito toggle now works for microphone recordings, not just uploads
- Default Incognito Mode - New `INCOGNITO_MODE_DEFAULT=true` option to start with incognito enabled by default
- LLM Streaming Compatibility - New `ENABLE_STREAM_OPTIONS=false` option for LLM servers that don't support OpenAI's `stream_options` parameter
Bulk Operations & Privacy Features
- Multi-Select Mode - Select multiple recordings in sidebar for batch operations (delete, tag, reprocess, toggle inbox/highlight)
- Incognito Mode - Session-only transcription processing with no database storage (enable with `ENABLE_INCOGNITO_MODE=true`)
- Playback Speed Control - Adjustable 0.5x to 3x speed on all audio players with persistent preference
Bug Fixes - Fixed language selection not being passed to ASR service, improved reprocess modal
Naming Templates
- Custom Title Formatting - Create templates with variables like `{{ai_title}}` and `{{date}}` plus custom regex patterns
- Tag-Based or User Default - Assign templates to tags or set a user-wide default
- Token Savings - Templates without `{{ai_title}}` skip the AI call entirely
- API v1 Upload - New `/api/v1/upload` endpoint for programmatic recording uploads
Improvements - Tag drag-and-drop reordering, registration domain restriction, event delete button, WebM seeking fix
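The `/api/v1/upload` endpoint above, combined with the optional `title` and `meeting_date` fields from the later release notes, lets integrations push recordings programmatically. Below is a hedged standard-library Python sketch: the Bearer-token auth scheme and the `file` part name are assumptions to verify against the Swagger UI.

```python
"""Sketch of calling Speakr's /api/v1/upload endpoint.

Assumptions (check the Swagger UI): Bearer-token auth, a multipart
"file" part, and optional "title" / "meeting_date" form fields.
"""
import io
import mimetypes
import urllib.request
import uuid


def build_multipart(fields: dict, file_name: str, file_bytes: bytes):
    """Encode form fields plus one file part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(f"{value}\r\n".encode())
    ctype = mimetypes.guess_type(file_name)[0] or "application/octet-stream"
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        (
            f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
            f"Content-Type: {ctype}\r\n\r\n"
        ).encode()
    )
    buf.write(file_bytes + f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"


def upload(base_url: str, token: str, path: str, title: str, meeting_date: str):
    """POST a recording with optional metadata; returns the HTTP response."""
    with open(path, "rb") as f:
        body, content_type = build_multipart(
            {"title": title, "meeting_date": meeting_date}, path, f.read()
        )
    req = urllib.request.Request(
        f"{base_url}/api/v1/upload",
        data=body,
        headers={"Authorization": f"Bearer {token}", "Content-Type": content_type},
        method="POST",
    )
    return urllib.request.urlopen(req)
```

The multipart body is built by hand so the sketch has no third-party dependencies; in a real integration you would likely reach for `requests` instead.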
Transcription Usage Tracking
- Per-User Budgets - Set monthly transcription limits (in minutes) with 80% warning and 100% blocking
- Usage Dashboard - Track minutes, costs, and per-user breakdowns in Admin panel
- Cost Estimation - Automatic pricing for OpenAI Whisper/Transcribe and self-hosted ASR
Bug Fixes
- Diarization for Long Files - Fixed speaker diarization for chunked files with OpenAI's `gpt-4o-transcribe-diarize`
- Empty Segment Filtering - Removed empty transcript segments from diarized output
Cloud Diarization & REST API
- Speaker Diarization Without GPU - Use OpenAI's `gpt-4o-transcribe-diarize` for speaker identification with just an API key
- REST API v1 - Full-featured API for automation tools (n8n, Zapier, Make) and dashboard widgets
- Connector Architecture - Modular transcription providers with simplified configuration
- Virtual Scrolling - Performance optimization for handling 4500+ transcript segments smoothly
- Audio Player Improvements - Drag-to-seek, independent modal players, improved theme support
- File Date Handling - Uses original recording date from file metadata instead of upload time
- Codec Configuration - Configure unsupported audio codecs with automatic conversion
- PostgreSQL Support - Added the `psycopg2-binary` driver for the PostgreSQL database option
- Audio Download Button - Explicit download button next to the audio player, works on mobile
- Job Queue Race Condition Fix - Fixed issue where multiple workers could claim the same job
Thanks to sakowicz, JadedBlueEyes, and Daabramov
- SSO Authentication - Sign in with any OIDC provider (Keycloak, Azure AD, Google, Auth0, Pocket ID)
- Account Linking/Unlinking - Link or unlink SSO from Account settings
- Enforce SSO-only - Disable password login for regular users
Contributed by Dmitry Abramov | SSO Setup Guide
⚠️ IMPORTANT: v0.5.9 introduced significant architectural changes. If upgrading from earlier versions, backup your data first and review the configuration guide.
- Complete Internal Sharing System - Share recordings with users with granular permissions (view/edit/reshare)
- Group Management & Collaboration - Create groups with auto-sharing via group tags and custom retention policies
- Speaker Voice Profiles - AI-powered speaker identification with 256-dimensional voice embeddings
- Audio-Transcript Synchronization - Click-to-jump, auto-highlight, and follow mode for interactive navigation
- Auto-Deletion & Retention System - Flexible retention policies with global and group-level controls
- Automated Export - Auto-export transcriptions to markdown for Obsidian, Logseq, and other note-taking apps
- Permission System - Fine-grained access control throughout the application
- Modular Architecture - Backend refactored into blueprints, frontend composables for maintainability
- UI/UX Enhancements - Compact controls, inline editing, unified toast notifications, improved badges
- Enhanced Internationalization - 29 new tooltip translations across all supported languages
- Main Screen with Chat
- Video Playback with Transcript
- AI-Powered Semantic Search
- Interactive Transcription & Chat
View Full Screenshot Gallery →
- Backend: Python/Flask with SQLAlchemy
- Frontend: Vue.js 3 with Tailwind CSS
- AI/ML: OpenAI Whisper, OpenRouter, Ollama support
- Database: SQLite (default) or PostgreSQL
- Deployment: Docker, Docker Compose
- ✅ Speaker voice profiles with AI-powered identification (v0.5.9)
- ✅ Group workspaces with shared recordings (v0.5.9)
- ✅ PWA enhancements with offline support and background sync (v0.5.10)
- ✅ Multi-user job queue with fair scheduling (v0.6.0)
- ✅ SSO integration with OIDC providers (v0.7.0)
- ✅ Token usage tracking and per-user budgets (v0.7.2)
- ✅ Connector-based transcription architecture with auto-detection (v0.8.0)
- ✅ Comprehensive REST API with Swagger UI documentation (v0.8.0)
- ✅ Video retention with in-browser video playback (v0.8.11)
- ✅ Parallel uploads with duplicate detection (v0.8.11)
- ✅ Fullscreen video mode with live subtitles (v0.8.14)
- ✅ Custom vocabulary and transcription hints (v0.8.14)
- Quick language switching for transcription
- Automated workflow triggers
- Plugin system for custom integrations
- End-to-end encryption option
This project is dual-licensed:
- GNU Affero General Public License v3.0 (AGPLv3)

  Speakr is offered under the AGPLv3 as its open-source license. You are free to use, modify, and distribute this software under the terms of the AGPLv3. A key condition of the AGPLv3 is that if you run a modified version on a network server and provide access to it for others, you must also make the source code of your modified version available to those users under the AGPLv3.

  - You must create a file named `LICENSE` (or `COPYING`) in the root of your repository and paste the full text of the GNU AGPLv3 license into it.
  - Read the full license text carefully to understand your rights and obligations.

- Commercial License

  For users or organizations who cannot or do not wish to comply with the terms of the AGPLv3 (for example, if you want to integrate Speakr into a proprietary commercial product or service without being obligated to share your modifications under the AGPLv3), a separate commercial license is available.

  Please contact the Speakr maintainers for details on obtaining a commercial license.
You must choose one of these licenses under which to use, modify, or distribute this software. If you are using or distributing the software without a commercial license agreement, you must adhere to the terms of the AGPLv3.
We welcome contributions to Speakr! There are many ways to help:
- Bug Reports & Feature Requests: Open an issue
- Discussions: Share ideas and ask questions
- Documentation: Help improve our docs
- Translations: Contribute translations for internationalization
By submitting a pull request, you agree to our Contributor License Agreement (CLA). This ensures we can maintain our dual-license model (AGPLv3 and Commercial). You retain copyright ownership of your contribution — the CLA simply grants us permission to include it in both the open source and commercial versions of Speakr. Our bot will post a reminder when you open a PR.
See our Contributing Guide for complete details on:
- How the CLA works and why we need it
- Step-by-step contribution process
- Development setup instructions
- Coding standards and best practices



