feat: add real-time streaming transcription with live overlay #217

Open
wrt54gl wants to merge 8 commits into OpenWhispr:main from wrt54gl:socket-transcribe

wrt54gl commented Feb 7, 2026

Summary

  • Add real-time streaming transcription with live text overlay during recording, supporting four backends: AssemblyAI (WebSocket), Deepgram, NVIDIA Parakeet (local via sherpa-onnx), and OpenAI Realtime API
  • Add streaming provider selector in Settings with auto-detection and per-provider configuration
  • Fix localStorage serialization bug where useLocalStorage hook used JSON.stringify for string settings, causing double-quoted values that broke provider matching in audioManager.js
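
The double-quoting described in the last bullet can be reproduced outside the app. The sketch below is only an illustration, not the actual hook: the `Map`-backed `store` and the helper names `writeSettingBuggy`, `writeSettingRaw`, and `isProvider` are hypothetical stand-ins for `localStorage`, `useLocalStorage`, and the comparison in `audioManager.js`.

```typescript
// Hypothetical in-memory stand-in for window.localStorage.
const store = new Map<string, string>();

// Buggy write path: JSON.stringify wraps a plain string in an extra pair of quotes.
function writeSettingBuggy(key: string, value: string): void {
  store.set(key, JSON.stringify(value));
}

// Fixed write path: string settings are stored verbatim.
function writeSettingRaw(key: string, value: string): void {
  store.set(key, value);
}

// Consumer that reads the raw stored string and compares it directly.
function isProvider(key: string, expected: string): boolean {
  return store.get(key) === expected;
}

writeSettingBuggy("streamingProvider", "parakeet");
// Stored value is the 10-character string "parakeet" including the quotes, so this is false.
const buggyMatch = isProvider("streamingProvider", "parakeet");

writeSettingRaw("streamingProvider", "parakeet");
// Stored value is the bare 8-character string, so this is true.
const fixedMatch = isProvider("streamingProvider", "parakeet");
```

A reader that calls `JSON.parse` on the stored value would also work with the buggy writer, but only if every reader does so consistently, which is why storing plain strings verbatim is the simpler invariant.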

Dependencies

This PR includes commits from #202 and #203 for Linux compatibility. Those PRs should be merged first — the overlapping commits will be no-ops on merge.

Streaming architecture

  • Renderer: AudioWorklet (pcm-streaming-processor.js) captures 16kHz PCM and sends chunks via IPC
  • Main process: Provider-specific handlers manage WebSocket connections and broadcast partial transcripts back to renderer
  • UI: LiveTranscriptOverlay component shows real-time text; main window auto-resizes during streaming
  • Parakeet: Chunked re-transcription every 2s against local sherpa-onnx WebSocket server (accumulates full audio buffer)
  • Provider selection: getStreamingProvider() in audioManager.js reads streamingProvider from localStorage with fallback logic per mode (local/cloud/BYOK)
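
To make the Parakeet bullet concrete, here is a minimal sketch of "accumulate everything, re-transcribe every 2 s". It assumes 16 kHz mono int16 PCM as stated above; the class name `AccumulatingBuffer` and its methods are hypothetical and are not taken from the PR's code.

```typescript
const SAMPLE_RATE = 16000;      // 16 kHz PCM, per the PR description
const RETRANSCRIBE_EVERY_S = 2; // re-transcription interval, per the PR description

// Hypothetical helper: keeps the full recording and signals when 2 s of new audio has arrived.
class AccumulatingBuffer {
  private chunks: Int16Array[] = [];
  private totalSamples = 0;
  private samplesAtLastSnapshot = 0;

  // Append a chunk; returns true once at least 2 s of audio has arrived since the last snapshot.
  append(chunk: Int16Array): boolean {
    this.chunks.push(chunk);
    this.totalSamples += chunk.length;
    const fresh = this.totalSamples - this.samplesAtLastSnapshot;
    return fresh >= SAMPLE_RATE * RETRANSCRIBE_EVERY_S;
  }

  // Return ALL audio captured so far (not only the newest 2 s) and reset the interval counter.
  snapshot(): Int16Array {
    const out = new Int16Array(this.totalSamples);
    let offset = 0;
    for (const c of this.chunks) {
      out.set(c, offset);
      offset += c.length;
    }
    this.samplesAtLastSnapshot = this.totalSamples;
    return out;
  }
}
```

A caller would send `snapshot()` to the local server whenever `append()` returns true. Because each snapshot contains the whole recording, the per-request cost grows with recording length, which is the trade-off of this approach compared with true incremental decoding.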

Commits

| Commit | Description |
| --- | --- |
| 16ebeee | fix(linux): prefer ydotool over xdotool on GNOME Wayland (from #202) |
| 1b125d4 | fix(linux): use AT-SPI2 for terminal detection on GNOME Wayland (from #202) |
| 605c0a4 | feat: add real-time streaming transcription with live overlay |
| 9ee03c3 | fix(linux): fix transparent window flicker and GTK 3/4 symbol crash (from #203) |
| f8f1569 | fix: move useAudioRecording call before showTranscript to fix TDZ crash |
| 7a847cd | fix: correct OpenAI Realtime transcription API schema and buffer error |
| 758e6d5 | fix: fix Parakeet streaming by correcting status check and localStorage serialization |

Test plan

  • Verify streaming toggle appears in Settings → Transcription with provider selector
  • Test AssemblyAI streaming (requires OpenWhispr Cloud sign-in)
  • Test Deepgram streaming with BYOK API key
  • Test Parakeet local streaming (requires sherpa-onnx binary + downloaded model)
  • Test OpenAI Realtime streaming with BYOK API key
  • Verify live transcript overlay appears during streaming recording and disappears on stop
  • Verify final transcription is correct after streaming stop
  • Verify non-streaming (regular) transcription still works when streaming is disabled
  • Test on Linux GNOME Wayland: transparent window, hotkey, paste

🤖 Generated with Claude Code

Wendel Toews and others added 8 commits February 6, 2026 17:29
On GNOME Wayland, xdotool can only interact with XWayland windows.
When OpenWhispr (an Electron/XWayland app) tries to paste, xdotool
targets OpenWhispr's own window instead of the focused native Wayland
window, silently reports success, and prevents fallback to ydotool.

This commit reorders the paste tool candidates on GNOME Wayland to
try ydotool first. It also uses Ctrl+Shift+V with ydotool since
terminal detection via xdotool/kdotool fails for native Wayland
windows. Ctrl+Shift+V works correctly in both terminals (paste) and
other apps (paste without formatting).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On GNOME Wayland, xdotool always returns OpenWhispr's own XWayland
window instead of the actual focused window, making terminal detection
fail. This caused ydotool to always send Ctrl+Shift+V (which doesn't
work in apps like GNOME Text Editor) or always Ctrl+V (which doesn't
work in terminals).

Use the AT-SPI2 accessibility API to detect the active application,
which works for both native Wayland and XWayland windows. This lets
ydotool send the correct keystroke: Ctrl+Shift+V for terminals,
Ctrl+V for everything else.
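
The keystroke decision described above could look roughly like the sketch below. It covers only the final mapping step and does not show the AT-SPI2 lookup itself. The function name `pasteKeysFor`, the `TERMINAL_APPS` list, and the choice to fall back to Ctrl+V when detection fails are all assumptions made for illustration.

```typescript
type PasteKeys = "ctrl+v" | "ctrl+shift+v";

// Hypothetical list of terminal application names; the PR's actual list is not shown here.
const TERMINAL_APPS = ["gnome-terminal", "kgx", "konsole", "alacritty", "kitty"];

// Map the detected active application name to the keystroke ydotool should send.
function pasteKeysFor(activeApp: string | null): PasteKeys {
  // Assumption: when detection fails, fall back to the plain paste shortcut.
  if (activeApp === null) return "ctrl+v";
  const name = activeApp.toLowerCase();
  return TERMINAL_APPS.some((t) => name.includes(t)) ? "ctrl+shift+v" : "ctrl+v";
}
```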

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add multi-provider streaming transcription that displays live text
beside the floating mic icon during recording. Supports 4 backends:

- AssemblyAI (existing, now wired to UI overlay)
- Deepgram Nova-3 (new WebSocket client, ~$0.0043/min)
- OpenAI Realtime API (transcription-only mode, 16→24kHz resampling)
- Parakeet local (chunked re-transcription every 2s, fully offline)
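
The 16→24kHz step for the OpenAI backend can be sketched with linear interpolation. The commit message does not say which resampling method is used, so linear interpolation is an assumption here, and `resampleLinear` is a hypothetical name.

```typescript
// Resample int16 PCM by linear interpolation between neighbouring input samples.
function resampleLinear(input: Int16Array, fromRate: number, toRate: number): Int16Array {
  if (input.length === 0) return new Int16Array(0);
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    // Position of output sample i on the input timeline.
    const pos = (i * fromRate) / toRate;
    const left = Math.floor(pos);
    // Clamp so the last output samples reuse the final input sample.
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = Math.round(input[left] * (1 - frac) + input[right] * frac);
  }
  return out;
}

// Example: 4 samples at 16 kHz become 6 samples at 24 kHz.
const upsampled = resampleLinear(new Int16Array([0, 300, 600, 900]), 16000, 24000);
```

Linear interpolation is cheap and adequate for a sketch, but it applies no low-pass filtering, so a production resampler may prefer a filtered approach.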

Key changes:
- AudioWorklet (pcm-streaming-processor.js) converts mic float32→int16 PCM
- LiveTranscriptOverlay component with auto-scroll and click-through
- Deterministic window resize priority (transcript > menu > toast > base)
- Multi-provider dispatch in audioManager.js (getStreamingProvider)
- Deepgram API key persistence via PERSISTED_KEYS in environment.js
- Settings UI with provider dropdown and conditional API key input
- TypeScript declarations for all new IPC channels
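
The float32→int16 conversion mentioned in the first bullet is a standard step. A minimal sketch follows, written as a plain function rather than as the worklet itself; `floatToInt16` is a hypothetical name and the real `pcm-streaming-processor.js` may differ.

```typescript
// Convert Web Audio float samples in [-1, 1] to signed 16-bit PCM.
function floatToInt16(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping around.
    const s = Math.max(-1, Math.min(1, input[i]));
    // Negative samples scale by 32768 and positive ones by 32767 to cover the full int16 range.
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

Assigning to an `Int16Array` element truncates the fractional part, so no explicit rounding is applied in this sketch.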

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add Chromium flags to fix two Linux-specific issues:

- --enable-transparent-visuals + --disable-gpu-compositing + 300ms
  startup delay to prevent transparent window flickering on X11/Wayland
- --gtk-version=3 to prevent Chromium from dlopen'ing libgtk-4, which
  crashes on GTK 4.18+ (refuses to coexist with GTK 3 in same process)
- Ozone platform hints for native Wayland rendering
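
As a rough sketch, the switches listed above could be collected in one place as shown below. The function name `linuxSwitches` and the `ChromiumSwitch` type are hypothetical, and the exact Ozone switch (`ozone-platform-hint=auto`) is an assumption, since the commit message does not name it. The 300ms startup delay is not covered.

```typescript
interface ChromiumSwitch {
  name: string;
  value?: string;
}

// Build the list of Chromium switches to apply; returns nothing on non-Linux platforms.
function linuxSwitches(platform: string): ChromiumSwitch[] {
  if (platform !== "linux") return [];
  return [
    { name: "enable-transparent-visuals" },
    { name: "disable-gpu-compositing" },
    // Pin GTK 3 so Chromium does not try to load libgtk-4 in the same process.
    { name: "gtk-version", value: "3" },
    // Assumption: the commit message only says "Ozone platform hints".
    { name: "ozone-platform-hint", value: "auto" },
  ];
}
```

In an Electron main process, each entry would typically be passed to `app.commandLine.appendSwitch(name, value)` before the app's `ready` event fires.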

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
isStreaming and partialTranscript were referenced at line 125 before
being destructured from useAudioRecording at line 149, causing a
"Cannot access before initialization" error that crashed the entire
React app and broke the dictation overlay rendering.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use transcription_session.update event type (not session.update)
- Use flat session fields (input_audio_format, input_audio_transcription)
  matching the ?intent=transcription endpoint schema
- Handle transcription_session.created/updated server events
- Remove input_audio_buffer.commit on disconnect (empty buffer error)
- Suppress non-critical "buffer too small" errors from user toast
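
For reference, the message shape described by the first three bullets can be sketched as follows. The helper names `buildSessionUpdate` and `isSessionAck` are hypothetical, and the model name is left as a caller-supplied parameter because the commit message does not state which model is used.

```typescript
// Client event for the ?intent=transcription endpoint, with flat session fields.
interface TranscriptionSessionUpdate {
  type: "transcription_session.update";
  session: {
    input_audio_format: "pcm16";
    input_audio_transcription: { model: string; language?: string };
  };
}

// Build the session configuration message; model and language are supplied by the caller.
function buildSessionUpdate(model: string, language?: string): TranscriptionSessionUpdate {
  return {
    type: "transcription_session.update",
    session: {
      input_audio_format: "pcm16",
      input_audio_transcription: language ? { model, language } : { model },
    },
  };
}

// Server events that acknowledge a transcription session.
function isSessionAck(event: { type: string }): boolean {
  return (
    event.type === "transcription_session.created" ||
    event.type === "transcription_session.updated"
  );
}
```

The result of `buildSessionUpdate` would be serialized with `JSON.stringify` and sent over the WebSocket once the connection opens.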

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ge serialization

Two bugs prevented Parakeet live streaming from working:

1. useSettings stored streamingProvider via JSON.stringify, wrapping the
   value in extra quotes ('"parakeet"' instead of 'parakeet'). audioManager's
   getStreamingProvider() comparison never matched, silently disabling streaming.

2. parakeet-streaming-start handler checked status.ready but
   ParakeetWsServer.getStatus() returns 'running', not 'ready', so it
   always returned NO_SERVER and fell back to regular recording.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merges upstream commits including inline AudioWorklet blob URL fix,
AssemblyAI keep-alive pings, runtime .env support, auth hardening,
and language selector positioning fix.

Resolved conflicts in audioManager.js (multi-provider streaming stop,
worklet blob URL adoption) and useSettings.ts (deepgramApiKey setter).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>