feat(live-preview): streaming transcription with hint path #435
Draft
guicheffer wants to merge 23 commits into main from
Conversation
…cription

Implements rolling window transcription to prevent CPU growth during long recordings.

Changes:
- Add a contextPrompt parameter to transcribe() and snapshotAndTranscribe()
- Use a 5-second rolling window instead of the full buffer for live preview
- Pass accumulated text as a context prompt to Whisper for better accuracy
- Implement smart text merging to detect overlaps and append only new content
- Reset the accumulated text when live transcription stops

Benefits:
- Constant CPU usage regardless of recording duration
- Reduced hallucinations via context prompting
- Smoother incremental text updates
- No redundant processing of already-transcribed audio

Related: app-vox/specs#67

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
Comprehensive tests covering:
- Empty accumulated text
- No overlap detected (simple append)
- Overlap detection (1-5 words)
- Case-insensitive overlap
- Identical snapshots (deduplication)
- Edge cases with short text

All tests pass, verifying that the smart merge logic works correctly.

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
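The merge behavior these tests exercise can be sketched as a pure function. This is a hypothetical reconstruction from the cases listed above; the function name, the 1-5 word overlap cap, and the exact tokenization are assumptions, not the PR's actual code:

```typescript
// Hypothetical sketch of the overlap-aware merge: find the longest suffix of
// the accumulated text (up to 5 words) that matches, case-insensitively, a
// prefix of the new snapshot, then append only the non-overlapping remainder.
function mergeTranscriptions(accumulated: string, snapshot: string): string {
  const accWords = accumulated.trim().split(/\s+/).filter(Boolean);
  const snapWords = snapshot.trim().split(/\s+/).filter(Boolean);
  if (accWords.length === 0) return snapshot.trim();
  if (snapWords.length === 0) return accumulated.trim();

  // Identical snapshot: deduplicate, nothing new to append.
  if (accumulated.trim().toLowerCase() === snapshot.trim().toLowerCase()) {
    return accumulated.trim();
  }

  const maxOverlap = Math.min(5, accWords.length, snapWords.length);
  for (let n = maxOverlap; n >= 1; n--) {
    const accTail = accWords.slice(-n).join(' ').toLowerCase();
    const snapHead = snapWords.slice(0, n).join(' ').toLowerCase();
    if (accTail === snapHead) {
      return [...accWords, ...snapWords.slice(n)].join(' ');
    }
  }
  // No overlap found: plain append.
  return [...accWords, ...snapWords].join(' ');
}
```

Preferring the longest matching overlap first avoids falsely matching a single repeated word when a longer run actually overlaps.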
…d accuracy

Critical improvements:
- Increase the window from 5s to 10s for better context and accuracy
- Increase the first snapshot delay from 400ms to 700ms (prevents timing issues)
- Limit the context prompt to the last 30 words (prevents performance degradation)
- Fix a critical bug: use accumulatedText (not lastSnapshotText) as the hint for the final transcription
- Add debug logging to track window size, context, and merging behavior

Performance benefits:
- Smart context windowing reduces Whisper processing overhead
- The full accumulated text is used as a hint for the final transcription (skips redundant work)
- Better timing prevents empty snapshots

Accuracy benefits:
- The 10s window provides more context for Whisper
- Using the full accumulated text in the final transcription improves quality

Related: app-vox/specs#67

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
Major optimizations:
- Reduce the rolling window from 10s to 6s (faster processing, less CPU)
- Increase the tick interval from 1s to 1.5s (prevents backlog)
- Reduce context from 30 to 20 words (lighter prompts)
- Smart logging: only log when the text changes (reduces noise)
- Add timing logs for the hint path to measure performance

Performance improvements:
- 6s window = faster Whisper processing
- 1.5s interval ensures Whisper finishes before the next tick
- Less context = less prompt-processing overhead
- Reduced logging = less I/O overhead

Better observability:
- Log hint usage and character count on stop
- Show incremental text changes (chars added)
- Track hint path timing (should be <100ms vs seconds for full Whisper)

Related: app-vox/specs#67

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
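Capping the context at the last N words, as described above, is a one-liner. A minimal sketch, with an illustrative helper name (the PR's actual identifier may differ):

```typescript
// Keep only the last `maxWords` words of the accumulated transcript so the
// Whisper context prompt stays small regardless of recording length.
function buildContextPrompt(accumulated: string, maxWords = 20): string {
  const words = accumulated.trim().split(/\s+/).filter(Boolean);
  return words.slice(-maxWords).join(' ');
}
```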
…ng DOM

Root causes of CPU growth found and fixed:

1. Audio buffer growing unboundedly (main fix):
   - The _recChunks array accumulated ALL chunks for the entire recording
   - snapshot() iterated all chunks for a totalLength reduce() on every tick
   - After 5 min: ~1172 chunks, growing O(n) with recording duration
   - Fix: trimBuffer(30s) called after each snapshot keeps the buffer at a constant size
   - 30s is retained as a fallback for Whisper if the hint path fails

2. DOM spans growing unbounded (hud.ts):
   - updateTextPanel() created a new <span> per word and never removed any
   - tpScrollToBottom() forced a DOM reflow on EVERY word (every 60ms)
   - Fix: cap the DOM at 150 words (drop the oldest), scroll only once at the end

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
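The trimBuffer(30s) idea can be sketched as a pure function over the chunk array. This is an assumption-laden illustration (a later commit reverts this trimming): the chunk layout, sample rate handling, and the fact that it returns a new array rather than mutating _recChunks are all illustrative, not the PR's actual code:

```typescript
// Keep only the most recent `keepSeconds` of audio so every per-tick pass over
// the chunk array stays bounded instead of growing with recording duration.
function trimBuffer(
  chunks: Float32Array[],
  keepSeconds: number,
  sampleRate: number,
): Float32Array[] {
  const keepSamples = keepSeconds * sampleRate;
  let total = 0;
  let start = chunks.length;
  // Walk backwards until we have accumulated keepSeconds worth of samples.
  while (start > 0 && total < keepSamples) {
    start--;
    total += chunks[start].length;
  }
  return chunks.slice(start);
}
```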
- Revert the DOM word cap: it was cutting visible text and breaking auto-scroll.
  The audio buffer trimBuffer(30s) is the real CPU fix; the DOM was secondary.
- Restore tpScrollToBottom() after each word for continuous scroll behavior.

Colors: align the text panel with the HUD theme instead of the purple-heavy look.
- Dark mode: rgba(24,24,27) matches the HUD dark gray, with a subtle indigo border
- Light mode: clean white with a soft indigo border

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
Dark mode: desaturated gray-purple border (less blue, more tonal).
Light mode: soft, airy lavender (very subtle, not as vivid as dark mode).

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
…r calls

Three root causes fixed:

1. Multiple RAF callbacks per frame (tpScrollToBottom)
   - Each word added (every 60ms) scheduled a separate requestAnimationFrame
   - All fired in the same frame → N × scrollHeight reads (forced reflow) per frame
   - DOM with 500 spans: O(500) layout × N words per frame = O(n²) growth
   - Fix: a tpScrollPending flag ensures only ONE RAF fires per animation frame

2. Unbounded DOM growth causing expensive layout calculations
   - The panel is 74px tall (~4 visible lines), so old words scroll off-screen immediately
   - 500+ spans in the DOM = expensive scrollHeight recalculation every frame
   - Fix: cap the DOM at 120 words via a tpWordCount counter + O(1) firstElementChild removal
   - Words above the viewport are invisible, so the user never sees the cap

3. Whisper called during silence (biggest CPU waste)
   - Every 1.5s, Whisper processed 6s of audio even if the user was pausing or thinking
   - A typical conversation is 40-60% silence → 40-60% of Whisper calls were wasted
   - Fix: RMS check of the last 250ms in the snapshot() tail branch
   - Returns null (skip Whisper) if RMS < 0.004 (below the speech threshold)
   - Only applies when the buffer > maxSeconds (after the first 6s of recording)

Expected result: CPU stays roughly constant during recording and spikes only during speech.

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
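The RMS silence check in fix 3 reduces to a few lines. A minimal sketch, assuming the 0.004 threshold from the commit message; the function name and signature are illustrative (a later commit removes this check in favor of content-based backoff):

```typescript
// Root-mean-square of the audio tail; below the speech threshold we can skip
// the Whisper call for this tick entirely.
function isSilent(tail: Float32Array, threshold = 0.004): boolean {
  if (tail.length === 0) return true;
  let sumSquares = 0;
  for (const s of tail) sumSquares += s * s;
  const rms = Math.sqrt(sumSquares / tail.length);
  return rms < threshold;
}
```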
…apitalization

Silence detection removed: it was cutting real speech during quiet moments.
Replaced with a smarter content-based backoff that doesn't risk audio loss.

Performance:
- Window 6s → 4s: Whisper processes 33% less audio per tick (~33% faster)
- Interval 1.5s → 2s: more breathing room between calls
- Context 20 → 15 words: lighter Whisper prompt
- Stale backoff: if Whisper returns no new words 2× in a row, the next tick runs at 4s
  (handles pauses without cutting audio; Whisper still runs, just less often)

Capitalization fix:
- Whisper always capitalizes the first word of each 4s snippet
- When appending mid-sentence, normalize the first word to lowercase
- Exception: after . ! ? (sentence end), keep the capitalization

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
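The capitalization rule above is small enough to sketch in full. The helper name is hypothetical; the sentence-end exception follows the commit's ". ! ?" rule:

```typescript
// Whisper capitalizes the first word of every snippet. When appending
// mid-sentence, lowercase that first word, unless the accumulated text ends a
// sentence (. ! ?) or is empty, in which case a capital is correct.
function normalizeSnippetStart(accumulated: string, snippet: string): string {
  const trimmed = snippet.trimStart();
  if (trimmed.length === 0) return snippet;
  if (accumulated.trim() === '' || /[.!?]\s*$/.test(accumulated)) return snippet;
  return trimmed[0].toLowerCase() + trimmed.slice(1);
}
```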
The DOM cap (120 words) was causing words to disappear from the visible panel,
which the user experienced as "transcribed audio being cut". The cap was
introduced as a CPU optimization but is not needed:
- The main CPU gains come from the 4s window + 2s interval + stale backoff
- RAF dedup (tpScrollPending) already prevents multiple reflows per frame
- scrollHeight reads are incremental in Chromium, so DOM size has minimal impact
- The full scroll history remains accessible (the user can scroll up to see all speech)

Keep: the tpScrollPending RAF deduplication (still a good batching optimization)
Remove: the TP_MAX_DOM_WORDS cap, the tpWordCount counter, the tpTrimDom() helper

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
… hint

Previously, the snapshot loop only ran when the live preview UI was enabled.
This meant users with the preview off (or who closed it) got no hint built up,
forcing a full Whisper re-transcription on stop.

Now the loop ALWAYS runs, regardless of preview visibility:
- showUI=true: snapshot loop + HUD panel updates (preview visible)
- showUI=false: snapshot loop only, no HUD updates (preview hidden/off)

Key behavior changes:
- closeLivePreview(): keeps the timer running, sets liveTranscriptionShowUI=false
- restoreLivePreview(): sets showUI=true, shows the full accumulatedText immediately
  without the word-by-word animation, then resumes animating new words
- startLiveTranscription(pipeline, gen, showUI): new optional showUI param

Result: stopAndProcessWithHint() can skip Whisper for ALL users (not just those
with live preview on), making the final transcription much faster.

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
…fallback

trimBuffer(30s) was called after each snapshot to keep _recChunks small. But
when detectGarbage(accumulatedText) triggered the Whisper fallback,
recorder.stop() returned only the last 30s of audio, losing everything before it.

The O(n) reduce over _recChunks was the justification for trimming, but the cost
is negligible: a 5-minute recording is ~1172 chunks, and each iteration is a
single addition (~0.001ms). The real CPU wins (4s window, 2s interval, stale
backoff) are unaffected.

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
…mulatedText

onRecordingStop() was calling stopLiveTranscription() first, which resets
accumulatedText = ''. As a result, the hint read at line 1113 was always empty,
forcing a full Whisper transcription every time.

Fix: capture livePreviewHint = this.accumulatedText before the clear.

This is the core of the whole optimization: without it, the hint is always 0 chars.
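The ordering bug distills to a few lines. The class shape below is hypothetical; only the capture-before-clear ordering is taken from the commit:

```typescript
// Minimal model of the fix: read accumulatedText BEFORE stopLiveTranscription()
// clears it. Swapping these two statements reproduces the "hint always 0 chars" bug.
class LiveTranscriptionState {
  accumulatedText = '';

  stopLiveTranscription(): void {
    this.accumulatedText = ''; // reset for the next recording
  }

  onRecordingStop(): string {
    const livePreviewHint = this.accumulatedText; // capture first
    this.stopLiveTranscription();                 // then clear
    return livePreviewHint;
  }
}
```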
…nal words

The last snapshot fires at most 2s before stop, leaving the final 0-2s of
speech out of accumulatedText. enrichHintWithTail() fixes this:
1. Extract the last 3s from the full recording buffer (already in memory)
2. Transcribe it with the last 15 words of the hint as context (~0.3-0.5s Whisper call)
3. Overlap-merge the result into the hint (same 6-word overlap detection logic)
4. Send the enriched hint to the LLM instead of the incomplete accumulated text

No garbage fallback triggered, no full Whisper, just a tiny tail pass.
Logs: 'Hint enriched: 485 → 523 chars (420ms)'
…lback

Two issues fixed:

1. In-flight snapshot ignored on stop
   - When the user stops, a Whisper call may be mid-flight (liveTranscriptionRunning=true)
   - Previously, the hint was read while the snapshot was still processing → missed words
   - Fix: poll liveTranscriptionRunning every 30ms before reading accumulatedText
   - Cost: a 0-2s wait (only if Whisper is actively running at stop time)

2. enrichHintWithTail: the no-overlap case discarded tail content
   - When the tail had no overlapping words with the hint end, new words were thrown away
   - Fix: append the normalized tail (same behavior as the mergeTranscriptions fallback)
   - detectGarbage already filters hallucinations before we get here

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
…uble-tap cancel

Root cause of hallucinations/repetitions: enrichHintWithTail ran a separate
Whisper call on the last 3s. When no overlap was found between the hint and the
tail, it appended the full tail, causing duplicates when Whisper transcribed the
same audio slightly differently (transcription is non-deterministic).

Changes:
- Remove enrichHintWithTail from the pipeline entirely (clean approach)
- Add an explicit final snapshot in onRecordingStop (same merge path as the live loop);
  this captures words spoken after the last tick without any new Whisper logic
- Delay the first snapshot 800ms → 1500ms to avoid hallucinations on very short audio
- Quick double-tap cancel: a new recording within 600ms of the previous stop = cancel
  (a common accidental press when the user immediately re-presses after stopping)
- Track lastRecordingStopTime for cancel detection

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
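The double-tap cancel is pure timing logic. A minimal sketch with timestamps injected for testability; the function name and null-for-no-previous-stop convention are assumptions:

```typescript
// A recording that starts within 600ms of the previous stop is treated as an
// accidental re-press and cancelled instead of started.
const DOUBLE_TAP_CANCEL_MS = 600;

function isAccidentalDoubleTap(
  nowMs: number,
  lastRecordingStopTimeMs: number | null,
): boolean {
  if (lastRecordingStopTimeMs === null) return false; // no previous recording
  return nowMs - lastRecordingStopTimeMs < DOUBLE_TAP_CANCEL_MS;
}
```

In the real manager the caller would pass Date.now() and the tracked lastRecordingStopTime.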
…hint path
- overrideBadge: cursor: default + pointer-events: none.
  Only the X button (overrideClear) should be interactive, not the label.
- onTranscriptionComplete: start the enhancing effect immediately.
  Previously it was only triggered via onStage('enhancing') when the LLM started.
  With the hint path (no Whisper), transcription is instant, so the blur/shuffle
  should start the moment the text is ready, not wait for the LLM to begin.
  Safe for the Whisper path too: startEnhancingEffect() is idempotent.

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
Previously, the sound and state change happened AFTER the in-flight snapshot
wait and the final snapshot (a ~2-3s delay), so the user saw no feedback while
processing. Now, as soon as the stop is detected (duration check passes):
1. playCue(stopCue): the sound plays immediately
2. setState('transcribing'): the HUD updates instantly
3. startEnhancingEffect(): morphing/blur starts on the live preview text
Background work (the in-flight wait, final snapshot, and pipeline) continues
asynchronously while the user already sees the visual transition.
Added to COMMON_HALLUCINATIONS (full-text rejection) and INLINE_HALLUCINATION_RE
(stripped when mixed with valid content in snapshots).

Also confirmed: typo fixing is already covered in LLM_SYSTEM_PROMPT line 57:
'Fix speech recognition errors and typos (e.g., their vs there)'
Previously, whisperMs=0 on the hint path because Whisper was 'skipped', and the
final snapshot Whisper call (in onRecordingStop) went unmeasured. Now:
measure finalSnapshotMs in the manager, pass it as the 2nd argument to
stopAndProcessWithHint, and forward it to processFromTranscription as
whisperTimeMs. The Pipeline Timing log now shows the actual last-snapshot
duration instead of 0.
… recorder.stop() time
CI Summary

✅ MegaLinter analysis: Success
See detailed reports in MegaLinter artifacts

Summary
Replaces the static live preview with a true streaming experience: VAD-triggered Whisper snapshots accumulate into a running transcript, and that transcript is reused as a hint on stop — skipping the full Whisper transcription and going straight to LLM enhancement.
How it works
```mermaid
sequenceDiagram
    participant U as User
    participant A as Analyser (30fps)
    participant V as VAD Loop (150ms)
    participant M as ShortcutManager
    participant W as Whisper
    participant P as Pipeline
    participant L as LLM
    U->>M: Start recording
    M->>A: Start audio analyser
    M->>V: Start VAD loop
    loop VAD polling (every 150ms, zero IPC)
        V->>A: isSpeechActive(700ms)?
        A-->>V: level > threshold?
        note over V: speech→silence = pause detected
        note over V: after 1 pause OR 6s safety → trigger end
        V->>M: Trigger snapshot
        M->>W: snapshot(timeSinceLastSnapshot + 3s, ...lastSnapshotText)
        W-->>M: transcribed text
        M->>M: word cap + merge with overlap detection
        M-->>U: update live preview (adaptive animation speed)
    end
    U->>M: Stop recording (shortcut/click)
    M-->>U: sound + transcribing state + blur effect (immediate)
    M->>W: final snapshot (if speech active in last 3s)
    W-->>M: last words → merge into livePreviewHint
    M->>P: stopAndProcessWithHint(hint)
    alt hint valid (not empty/garbage)
        P->>L: enhance(hint)
        L-->>U: paste
    else hint empty/garbage
        P->>W: transcribe(full audio)
        W-->>P: transcript
        P->>L: enhance(transcript)
        L-->>U: paste
    end
```

Key design decisions
- …lastSnapshotText as context … signals continuation
- WHISPER_PROMPT "Português, English" vs meta-instructions that get echoed

Benchmark
Benchmark text (~35s read aloud):
- … (whisperMs)
- … (whisperMs)

Test plan
- `npm run typecheck` passes
- `npm run lint` passes
- `npm test` passes (592 tests)

Code standards
🤖 Generated with Claude Code