voxvibe/README.md at main · jdcockrill/voxvibe

Voice dictation that just works. Press Super+X anywhere in GNOME, speak your thoughts, and watch your words appear exactly where you need them.

VoxVibe integrates directly with GNOME providing dictation functionality to any application.

VoxVibe superpowers your Linux workflow with local AI transcription - fast and accurate speech-to-text that works where you are.

Built with Claude Code and Windsurf.

Installation

Install system dependencies:

# Ubuntu/Debian
sudo apt install -y portaudio19-dev pipx

# Fedora
sudo dnf install portaudio portaudio-devel pipx
# Fedora users also need AppIndicator extension for system tray:
# https://github.com/ubuntu/gnome-shell-extension-appindicator

Download and run the installer:

curl -sSL https://raw.githubusercontent.com/jdcockrill/voxvibe/main/install.sh | bash

Reload GNOME Shell:
- X11: Press Alt+F2, type r, press Enter
- Wayland: Log out and log back in

That's it! Use Super+X to start voice dictation.

Using VoxVibe

Primary Usage

Press Super+X: Start/stop recording (main method)
Speak your thoughts: Audio is captured and transcribed
Text appears: Automatically pasted where you were typing

System Tray

VoxVibe also runs in the system tray:

Provides an alternative way to start/stop recording
Access to settings and profiles
Access to recent transcriptions that can be copied again

Configuration

VoxVibe works great with default settings, but you can customize its behavior through a configuration file. The application will automatically create a default configuration file on first run.

Accessing Settings

The easiest way to modify settings is through the system tray:

Click the VoxVibe icon in your system tray
Select Settings from the menu
Your default text editor will open the configuration file

Alternatively, you can directly edit the configuration file located at:

~/.config/voxvibe/config.toml

Configuration Options

The configuration file uses the TOML format and is organized into sections:

Transcription Settings

[transcription]
model = "base"              # Whisper model size (see options below)
language = "en"             # Language code or "auto" for auto-detection  
device = "auto"             # Processing device: "auto", "cpu", or "cuda"
compute_type = "auto"       # Precision: "auto", "int8", "int16", "float16", "float32"

Model Options (from fastest/least accurate to slowest/most accurate):

tiny, tiny.en – Smallest model (~39 M parameters). Fastest (≈3× quicker than base) and lowest memory use; suitable for quick voice commands but noticeably less accurate on long or technical sentences.
base, base.en – Default size (~74 M parameters). Good all-rounder offering real-time dictation speed on most CPUs/GPUs with solid accuracy for everyday language.
small, small.en, distil-small.en – Medium-sized (~244 M / 122 M distilled). ~3-4 % WER improvement over base with ~1.5× slower inference; distilled variant halves memory and recovers much of the speed.
medium, medium.en, distil-medium.en – Large mid-tier (~769 M / 384 M distilled). Near large-model accuracy while ~2× slower than small; GPU recommended for comfortable real-time use.
large-v1, large-v2, large-v3 – Full-size Whisper models (~1.5 B parameters). Highest accuracy but also the slowest and most resource-intensive.
distil-large-v2, distil-large-v3 – Distilled versions (~50 % fewer parameters). Around 1.7-2× faster than the full large models with only a very small drop in accuracy (≈0.5–1 % WER).
large-v3-turbo, turbo – "Turbo" variant of large-v3 that applies aggressive quantisation and decoder optimisations. Up to 4× faster than large-v3 while roughly matching the accuracy of the distilled models.

For most users, base provides the best balance. Use small or medium if you need higher accuracy and don't mind slower processing.

Audio Settings

[audio]
sample_rate = 16000         # Audio sample rate (16kHz recommended)
channels = 1                # Mono (1) or stereo (2) recording

User Interface

[ui]
startup_delay = 2.0         # Delay before starting services
show_notifications = true   # Show system notifications
minimize_to_tray = true     # Minimize to system tray

Window Management

[window_manager]
strategy = "auto"           # Window management: "auto", "dbus"
paste_delay = 0.1           # Delay before pasting text (seconds)

Hotkey Management

[hotkeys]
strategy = "auto"           # Hotkey handling: "auto", "dbus", "qt"

Strategy Options:

"auto" - Automatically choose best available strategy
"dbus" - Use GNOME extension for global hotkeys (recommended)
"qt" - Use Qt's global hotkey system (fallback)

System Tray

[ui]
startup_delay = 2.0         # Delay before starting services
show_notifications = true   # Show system notifications
minimize_to_tray = true     # Minimize to system tray

Logging

[logging]
level = "INFO"              # Log level: "DEBUG", "INFO", "WARNING", "ERROR"
file = "~/.local/share/voxvibe/voxvibe.log"

Post-Processing

[post_processing]
enabled = true              # Enable LLM-based text improvement
model = "openai/gpt-4.1-mini"  # LLM model for post-processing
temperature = 0.3           # Temperature for LLM generation
setenv = {}                 # Environment variables for LLM providers

VoxVibe includes intelligent post-processing that uses AI to improve transcribed text by:

Fixing transcription errors and typos
Improving formatting and readability
Adding proper punctuation and capitalization
Converting lists into bullet points when appropriate
Correcting common speech-to-text errors (homophones, word boundaries)
Maintaining original meaning and tone

See Profiles to customize post-processing based on the active application.

Configuration Tips

Most settings work well as defaults - only change what you need
Model size is the most common setting to adjust based on your hardware and accuracy needs
Changes take effect after restarting VoxVibe (quit and restart from system tray)
Invalid configuration values will fall back to defaults
The configuration file includes helpful comments explaining each option

History

VoxVibe saves your transcriptions locally for easy access and reuse.

Accessing History

System tray menu: Recent transcriptions appear in the tray menu
Click to copy: Click any history item to copy it to your clipboard
Storage: Transcriptions are saved to ~/.local/share/voxvibe/history.db

Configuration

[history]
enabled = true              # Store transcription history
max_entries = 20            # Maximum entries to keep
storage_path = "~/.local/share/voxvibe/history.db"

Profiles

Profiles automatically apply custom post-processing prompts based on the active application.

How Profiles Work

When recording starts, VoxVibe detects the focused window
Window information is matched against configured regex patterns
If a match is found, the profile's custom prompt is used
Otherwise, the default post-processing is applied

Configuration

Edit ~/.config/voxvibe/profiles.toml:

# Define a profile
[[profile]]
name = "software_development"
prompt = '''
You are a senior software engineer. Transform voice notes into clear technical content suitable for development environments - whether that's code comments, documentation, commit messages, issue descriptions, or technical discussions. Use precise technical terminology and proper formatting.
'''

# Link profile to applications
[[profile_matcher]]
profile_name = "software_development"
wm_class_matcher = "Code|IntelliJ|Windsurf"

Matching Options

wm_class_matcher: Matches application name (e.g., "Code", "Firefox")
title_matcher: Matches window title (e.g., "Gmail", "main.py")

Both use regex patterns and are case-insensitive.

Example Profiles

# Email formatting
[[profile]]
name = "email"
prompt = "Format as professional email content with proper structure."

[[profile_matcher]]
profile_name = "email"
wm_class_matcher = "Thunderbird|Mail"
title_matcher = "Gmail|Outlook"

# Casual messaging
[[profile]]
name = "casual"
prompt = "Format as casual conversation. Keep tone informal and friendly."

[[profile_matcher]]
profile_name = "casual"
title_matcher = "WhatsApp|Discord|Slack"

Managing Profiles

Edit: Modify ~/.config/voxvibe/profiles.toml directly or use the system tray "Profiles" menu
Debug: Enable debug logging to see which profiles match for each window
Reload: Restart VoxVibe to apply profile changes

Architecture

VoxVibe has two main components:

┌─────────────────────────────┐         ┌─────────────────────────────┐
│      GNOME Extension        │         │      Python Service         │
│     (extension.js)          │◄────────┤       (voxvibe)             │
│                             │  DBus   │                             │
│ • Global hotkey capture     │         │ • Audio recording           │
│ • System tray indicator     │         │ • Speech transcription      │
│ • Window focus tracking     │         │ • Post-processing           │
│ • Text pasting              │         │ • History storage           │
└─────────────────────────────┘         │ • State management          │
                                        └─────────────────────────────┘

Components

GNOME Extension: Handles desktop integration including global hotkeys, system tray, window tracking, and text pasting.

Python Service: Manages audio recording, transcription, post-processing, and history storage. Runs as a background service.

Data Flow:

Hotkey pressed → Extension signals Python service
Python service captures focused window and starts recording
Audio is recorded and transcribed using Whisper
Text is post-processed and saved to history
Extension pastes text to the original window

Why Two Components?

The split architecture was chosen for reliability and cross-compatibility:

Reliable Keyboard Shortcuts: The GNOME extension provides a robust way to capture global hotkeys (Super+X) that works consistently across the desktop. Python libraries for this can be unreliable with different window managers.
Wayland & X11 Compatibility: By letting the GNOME extension handle window interactions, we get a single solution that works for both Wayland and X11 display servers.
Focus on Strengths: Each component does what it's best at. The extension handles deep GNOME integration, while the Python application manages audio processing and AI transcription.

Development

Directory Structure

voxvibe/
├── app/         # Python transcription app (source, pyproject.toml)
├── extension/   # GNOME Shell extension (extension.js, metadata.json)
├── README.md    # This file

For further development details on specific components, you can explore their respective directories. The app/README.md can be used for more detailed notes on Python app development, and extension/README.md for extension-specific development notes.

Contributions

A special thanks to the following people for their contributions:

dkh99: For significant contributions to the design of the user experience.

Release

To release a new version, follow these steps:

Update Version: Update the version in app/pyproject.toml and extension/metadata.json.
Build Release: Run make release to create the release package.
Push Release: Push the release tag to GitHub: git push origin v$(VERSION).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Installation

Using VoxVibe

Primary Usage

System Tray

Configuration

Accessing Settings

Configuration Options

Transcription Settings

Audio Settings

User Interface

Window Management

Hotkey Management

System Tray

Logging

Post-Processing

Configuration Tips

History

Accessing History

Configuration

Profiles

How Profiles Work

Configuration

Matching Options

Example Profiles

Managing Profiles

Architecture

Components

Why Two Components?

Development

Directory Structure

Contributions

Release

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Installation

Using VoxVibe

Primary Usage

System Tray

Configuration

Accessing Settings

Configuration Options

Transcription Settings

Audio Settings

User Interface

Window Management

Hotkey Management

System Tray

Logging

Post-Processing

Configuration Tips

History

Accessing History

Configuration

Profiles

How Profiles Work

Configuration

Matching Options

Example Profiles

Managing Profiles

Architecture

Components

Why Two Components?

Development

Directory Structure

Contributions

Release