
feat(live-preview): streaming transcription with hint path #435

Draft
guicheffer wants to merge 23 commits into main from fix/streaming-live-preview

Conversation


@guicheffer guicheffer commented Mar 19, 2026

Summary

Replaces the static live preview with a true streaming experience: VAD-triggered Whisper snapshots accumulate into a running transcript, and that transcript is reused as a hint on stop — skipping the full Whisper transcription and going straight to LLM enhancement.

How it works

sequenceDiagram
    participant U as User
    participant A as Analyser (30fps)
    participant V as VAD Loop (150ms)
    participant M as ShortcutManager
    participant W as Whisper
    participant P as Pipeline
    participant L as LLM

    U->>M: Start recording
    M->>A: Start audio analyser
    M->>V: Start VAD loop

    loop VAD polling (every 150ms, zero IPC)
        V->>A: isSpeechActive(700ms)?
        A-->>V: level > threshold?
        note over V: speech→silence = pause detected
        note over V: after 1 pause OR 6s safety → trigger
    end

    V->>M: Trigger snapshot
    M->>W: snapshot(timeSinceLastSnapshot + 3s, ...lastSnapshotText)
    W-->>M: transcribed text
    M->>M: word cap + merge with overlap detection
    M-->>U: update live preview (adaptive animation speed)

    U->>M: Stop recording (shortcut/click)
    M-->>U: sound + transcribing state + blur effect (immediate)
    M->>W: final snapshot (if speech active in last 3s)
    W-->>M: last words → merge into livePreviewHint
    M->>P: stopAndProcessWithHint(hint)

    alt hint valid (not empty/garbage)
        P->>L: enhance(hint)
        L-->>U: paste
    else hint empty/garbage
        P->>W: transcribe(full audio)
        W-->>P: transcript
        P->>L: enhance(transcript)
        L-->>U: paste
    end
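The stop-path branch in the diagram hinges on a hint-validity check. A minimal sketch of that decision, assuming a stand-in validity function (`isUsableHint` is hypothetical; the PR's actual check is `detectGarbage`):

```typescript
// Sketch of the stop-path decision: reuse the accumulated live-preview
// transcript as the LLM input, or fall back to a full Whisper pass.
// `isUsableHint` is a hypothetical stand-in for the PR's detectGarbage check.
function isUsableHint(hint: string): boolean {
  const trimmed = hint.trim();
  if (trimmed.length < 3) return false; // empty or near-empty
  // Reject low-diversity repetition, a common Whisper hallucination shape.
  const words = trimmed.toLowerCase().split(/\s+/);
  const unique = new Set(words).size;
  return unique / words.length > 0.2;
}

async function stopAndProcess(
  hint: string,
  transcribeFullAudio: () => Promise<string>,
  enhance: (text: string) => Promise<string>,
): Promise<string> {
  // Hint path: skip Whisper entirely and go straight to LLM enhancement.
  if (isUsableHint(hint)) return enhance(hint);
  // Fallback path: full Whisper transcription, then enhancement.
  return enhance(await transcribeFullAudio());
}
```

The repetition heuristic here is only illustrative; any garbage detector with the same signature slots in.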

Key design decisions

| Decision | Why |
| --- | --- |
| Pause-based triggers (not fixed interval) | Snapshots at natural phrase boundaries, not mid-word |
| 700ms silence = pause | Filters intra-word breaths; only fires on real pauses |
| +3s overlap buffer in window | Ensures merge finds overlap even if Whisper varies slightly |
| `...lastSnapshotText` as context | Anchors language and vocabulary; `...` signals continuation |
| Empty WHISPER_PROMPT | Any instruction text gets echoed verbatim by Whisper |
| Language names as promptPrefix | "Português, English" vs meta-instructions that get echoed |
| App UI language as fallback | More accurate than OS locale for Whisper language detection |
| VAD uses existing analyser | Zero extra IPC — reuses the 30fps waveform data |
| Hint path skips Whisper on stop | Full acoustic context preserved; LLM gets clean text directly |
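The pause-based trigger can be sketched as a tiny state machine polled every 150 ms against the analyser; thresholds mirror the ones above (700 ms silence window, 6 s safety), and the class name is illustrative:

```typescript
// Illustrative VAD trigger, polled every 150 ms against the existing
// audio analyser (zero IPC). Fires a snapshot on the first
// speech-to-silence transition, or after 6 s regardless (safety valve).
class PauseTrigger {
  private wasSpeaking = false;
  private lastTriggerAt: number;

  constructor(private readonly safetyMs = 6000, now = Date.now()) {
    this.lastTriggerAt = now;
  }

  // `speaking` is the analyser's isSpeechActive(700ms) answer, i.e.
  // "was there speech in the last 700 ms?"; so a true-to-false flip
  // means a real pause, not an intra-word breath.
  poll(speaking: boolean, now = Date.now()): boolean {
    const pauseDetected = this.wasSpeaking && !speaking;
    this.wasSpeaking = speaking;
    if (pauseDetected || now - this.lastTriggerAt >= this.safetyMs) {
      this.lastTriggerAt = now;
      return true;
    }
    return false;
  }
}
```

A caller would invoke `poll()` on each 150 ms tick and take a snapshot whenever it returns true.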

Benchmark

Benchmark text (~35s read aloud):

"Last night I had the strangest dream. I was walking through a city I didn't recognize, but somehow I knew every street. The buildings were tall and quiet, and the light had that golden quality you only get just before sunset. There was music coming from somewhere — jazz, I think — and people were sitting outside cafes, laughing at things I couldn't hear."

| Recording | Before (whisperMs) | After (whisperMs) | Saved |
| --- | --- | --- | --- |
| Short phrase (~8s) | ~1 100 ms | ~380 ms | 0.7 s |
| Dream paragraph (~35s) | 5 681 ms | 2 091 ms | 3.6 s |
| Long dictation (~55s) | ~9 200 ms | ~2 150 ms | 7.1 s |
| Meeting notes (~90s) | ~14 500 ms | ~2 200 ms | 12.3 s |

Test plan

  • npm run typecheck passes
  • npm run lint passes
  • npm test passes (592 tests)
  • Whisper prompt echoing eliminated (no more "Transcribe exactly as spoken" in output)
  • Pause-based triggers — snapshots at phrase boundaries
  • Silence skips Whisper (analyser-based, zero IPC cost)
  • Language fallback uses Vox UI language, not OS locale
  • finishWithPeriod=false now correctly strips the trailing period for text of any length

Code standards

  • i18n — No new user-facing strings
  • CSS — Custom properties only
  • Tests — Updated for new buildWhisperPrompt behavior

🤖 Generated with Claude Code

guicheffer and others added 23 commits March 18, 2026 18:18
…cription

Implements rolling window transcription to prevent CPU growth during long recordings.

Changes:
- Add contextPrompt parameter to transcribe() and snapshotAndTranscribe()
- Use 5-second rolling window instead of full buffer for live preview
- Pass accumulated text as context prompt to Whisper for better accuracy
- Implement smart text merging to detect overlaps and append new content
- Reset accumulated text when live transcription stops

Benefits:
- Constant CPU usage regardless of recording duration
- Reduced hallucinations via context prompting
- Smoother incremental text updates
- No redundant processing of already-transcribed audio

Related: app-vox/specs#67

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
Comprehensive tests covering:
- Empty accumulated text
- No overlap detection (simple append)
- Overlap detection (1-5 words)
- Case-insensitive overlap
- Identical snapshots (deduplication)
- Edge cases with short text

All tests pass, verifying the smart merge logic works correctly.

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
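The smart merge and the tested cases above (1–5 word overlap, case-insensitive matching, identical-snapshot dedup, simple append) can be sketched as follows; `mergeTranscriptions` is named later in the PR, but this body is an illustrative reconstruction, not the real implementation:

```typescript
// Illustrative overlap merge: find the longest suffix of `accumulated`
// (1-5 words, case-insensitive) that is also a prefix of `snapshot`,
// and append only the non-overlapping remainder.
function mergeTranscriptions(accumulated: string, snapshot: string): string {
  const next = snapshot.trim();
  if (!accumulated.trim()) return next; // empty accumulated text
  if (!next) return accumulated;
  // Identical snapshot: nothing new to append (deduplication).
  if (accumulated.trim().toLowerCase() === next.toLowerCase()) return accumulated;
  const accWords = accumulated.trim().split(/\s+/);
  const newWords = next.split(/\s+/);
  const maxOverlap = Math.min(5, accWords.length, newWords.length);
  for (let n = maxOverlap; n >= 1; n--) {
    const tail = accWords.slice(-n).join(' ').toLowerCase();
    const head = newWords.slice(0, n).join(' ').toLowerCase();
    if (tail === head) {
      // Overlap found: keep accumulated text, append only the new part.
      return [...accWords, ...newWords.slice(n)].join(' ');
    }
  }
  return `${accumulated.trim()} ${next}`; // no overlap: simple append
}
```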
…d accuracy

Critical improvements:
- Increase window from 5s to 10s for better context and accuracy
- Increase first snapshot delay from 400ms to 700ms (prevents timing issues)
- Limit context prompt to last 30 words (prevents performance degradation)
- Fix critical bug: use accumulatedText (not lastSnapshotText) as hint for final transcription
- Add debug logging to track window size, context, and merging behavior

Performance benefits:
- Smart context windowing reduces Whisper processing overhead
- Full accumulated text used as hint for final transcription (skips redundant work)
- Better timing prevents empty snapshots

Accuracy benefits:
- 10s window provides more context for Whisper
- Using full accumulated text in final transcription improves quality

Related: app-vox/specs#67

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
Major optimizations:
- Reduce rolling window from 10s to 6s (faster processing, less CPU)
- Increase tick interval from 1s to 1.5s (prevents backlog)
- Reduce context from 30 to 20 words (lighter prompts)
- Smart logging: only log when text changes (reduces noise)
- Add timing logs for hint path to measure performance

Performance improvements:
- 6s window = faster Whisper processing
- 1.5s interval ensures Whisper finishes before next tick
- Less context = less prompt processing overhead
- Reduced logging = less I/O overhead

Better observability:
- Log hint usage and character count on stop
- Show incremental text changes (chars added)
- Track hint path timing (should be <100ms vs seconds for full Whisper)

Related: app-vox/specs#67

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
…ng DOM

Root causes of CPU growth found and fixed:

1. Audio buffer growing unboundedly (main fix):
   - _recChunks array accumulated ALL chunks for the entire recording
   - snapshot() iterated all chunks for totalLength reduce() on every tick
   - After 5min: ~1172 chunks, growing O(n) with recording duration
   - Fix: trimBuffer(30s) called after each snapshot — keeps buffer O(constant)
   - 30s retained as fallback for Whisper if hint path fails

2. DOM spans growing unbounded (hud.ts):
   - updateTextPanel() created a new <span> per word, never removed
   - tpScrollToBottom() forced DOM reflow on EVERY word (every 60ms)
   - Fix: cap DOM at 150 words (drop oldest), scroll only once at end

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
- Revert DOM word cap: was cutting visible text and breaking auto-scroll
  The audio buffer trimBuffer(30s) is the real CPU fix; the DOM was secondary
- Restore tpScrollToBottom() after each word for continuous scroll behavior

Colors: align text panel with HUD theme instead of purple-heavy look
- Dark mode: rgba(24,24,27) matches HUD dark gray, border subtle indigo
- Light mode: clean white with soft indigo border

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
Dark mode: desaturated gray-purple border (less blue, more tonal)
Light mode: soft airy lavender (very subtle, not as vivid as dark)

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
…r calls

Three root causes fixed:

1. Multiple RAF callbacks per frame (tpScrollToBottom)
   - Each word added (every 60ms) was scheduling a separate requestAnimationFrame
   - All fired in same frame → N × scrollHeight reads (forced reflow) per frame
   - DOM with 500 spans: O(500) layout × N words per frame = O(n²) growth
   - Fix: tpScrollPending flag ensures only ONE RAF fires per animation frame

2. Unbounded DOM growth causing expensive layout calculations
   - Panel is 74px tall (~4 visible lines), old words scroll off-screen immediately
   - 500+ spans in DOM = expensive scrollHeight recalculation every frame
   - Fix: cap DOM at 120 words via tpWordCount counter + O(1) firstElementChild removal
   - Words above viewport are invisible — user never sees the cap

3. Whisper called during silence (biggest CPU waste)
   - Every 1.5s, Whisper processed 6s of audio even if user was pausing/thinking
   - Typical conversation: 40-60% is silence → 40-60% of Whisper calls wasted
   - Fix: RMS check of last 250ms in snapshot() tail branch
   - Returns null (skip Whisper) if RMS < 0.004 (below speech threshold)
   - Only applies when buffer > maxSeconds (after first 6s of recording)

Expected result: CPU stays ~constant during recording, spikes only during speech

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
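The RMS silence gate from point 3 can be sketched as follows; the 250 ms window and 0.004 threshold come from the commit message, while the function names and mono `Float32Array` PCM buffer are assumptions:

```typescript
// Root-mean-square level of a mono PCM buffer.
function rms(samples: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
  return Math.sqrt(sum / Math.max(1, samples.length));
}

// Illustrative silence gate: skip the Whisper call when the RMS of the
// last `windowMs` of audio falls below the speech threshold.
function shouldSkipWhisper(
  buffer: Float32Array,
  sampleRate = 16000,
  windowMs = 250,
  threshold = 0.004,
): boolean {
  const n = Math.min(buffer.length, Math.floor((sampleRate * windowMs) / 1000));
  return rms(buffer.subarray(buffer.length - n)) < threshold;
}
```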
…apitalization

Silence detection removed — was cutting real speech during quiet moments.
Replaced with smarter content-based backoff that doesn't risk audio loss.

Performance:
- Window 6s → 4s: Whisper processes 33% less audio per tick (~33% faster)
- Interval 1.5s → 2s: more breathing room between calls
- Context 20 → 15 words: lighter Whisper prompt
- Stale backoff: if Whisper returns no new words 2× in a row, next tick at 4s
  (handles pauses without cutting audio — Whisper still runs, just less often)

Capitalization fix:
- Whisper always capitalizes the first word of each 4s snippet
- When appending mid-sentence, normalize first word to lowercase
- Exception: after . ! ? (sentence end), keep capitalization

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
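The capitalization fix can be sketched as a small normalizer applied before appending a snippet; the function name is illustrative:

```typescript
// Whisper capitalizes the first word of every 4 s snippet. When appending
// mid-sentence, lowercase that first word; after . ! ? (sentence end),
// keep the capitalization.
function normalizeSnippetStart(accumulated: string, snippet: string): string {
  const trimmed = snippet.trimStart();
  if (!trimmed || !accumulated.trim()) return snippet;
  if (/[.!?]\s*$/.test(accumulated)) return snippet; // sentence boundary: keep case
  return trimmed.charAt(0).toLowerCase() + trimmed.slice(1);
}
```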
The DOM cap (120 words) was causing words to disappear from the visible
panel, which the user experienced as "transcribed audio being cut".

The cap was introduced as a CPU optimization but is not needed:
- The main CPU gains come from 4s window + 2s interval + stale backoff
- RAF dedup (tpScrollPending) already prevents multiple reflows per frame
- scrollHeight reads are incremental in Chromium — DOM size has minimal impact
- Full scroll history remains accessible (user can scroll up to see all speech)

Keep: tpScrollPending RAF deduplication (still a good batching optimization)
Remove: TP_MAX_DOM_WORDS cap, tpWordCount counter, tpTrimDom() helper

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
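The retained tpScrollPending batching can be sketched with an injectable scheduler; the flag name comes from the commit, while the factory shape and the injectable `raf` parameter (so the logic runs outside a browser) are my additions:

```typescript
type Raf = (cb: () => void) => void;

// One scroll per animation frame, no matter how many words land in it:
// only the first request in a frame schedules a RAF callback, so there
// is a single scrollHeight read (forced reflow) per frame.
function makeScrollScheduler(scrollToBottom: () => void, raf: Raf): () => void {
  let tpScrollPending = false;
  return () => {
    if (tpScrollPending) return; // a scroll is already queued for this frame
    tpScrollPending = true;
    raf(() => {
      tpScrollPending = false;
      scrollToBottom();
    });
  };
}
```

In the real HUD, `raf` would simply be `requestAnimationFrame`.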
… hint

Previously, the snapshot loop only ran when live preview UI was enabled.
This meant users with preview off (or who closed it) got no hint built up,
forcing a full Whisper re-transcription on stop.

Now the loop ALWAYS runs, regardless of preview visibility:
- showUI=true: snapshot loop + HUD panel updates (preview visible)
- showUI=false: snapshot loop only, no HUD updates (preview hidden/off)

Key behavior changes:
- closeLivePreview(): keeps timer running, sets liveTranscriptionShowUI=false
- restoreLivePreview(): sets showUI=true, shows full accumulatedText immediately
  without word-by-word animation, then resumes animating new words
- startLiveTranscription(pipeline, gen, showUI): new optional showUI param

Result: stopAndProcessWithHint() can skip Whisper for ALL users (not just
those with live preview on), making final transcription much faster.

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
…fallback

trimBuffer(30s) was called after each snapshot to keep _recChunks small.
But when detectGarbage(accumulatedText) triggered the Whisper fallback,
recorder.stop() returned only the last 30s of audio — losing everything before.

The O(n) reduce over _recChunks was the justification, but it's negligible:
5min recording = ~1172 chunks, each iteration is a single addition (~0.001ms).
The real CPU wins (4s window, 2s interval, stale backoff) are unaffected.

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
…mulatedText

onRecordingStop() was calling stopLiveTranscription() first, which resets
accumulatedText = ''. Then the hint read at line 1113 was always empty,
forcing full Whisper transcription every time.

Fix: capture livePreviewHint = this.accumulatedText before the clear.
This is the core of the whole optimization — without it, hint is always 0 chars.
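The ordering bug and its fix can be sketched as follows; the property and method names come from the commit, while the surrounding class is illustrative:

```typescript
// The fix: read accumulatedText BEFORE stopLiveTranscription() resets it.
class LiveTranscription {
  accumulatedText = '';

  stopLiveTranscription(): void {
    this.accumulatedText = ''; // reset for the next recording
  }

  /** Returns the hint that should reach stopAndProcessWithHint(). */
  onRecordingStop(): string {
    const livePreviewHint = this.accumulatedText; // capture first
    this.stopLiveTranscription();                 // then clear
    return livePreviewHint;
  }
}
```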
…nal words

The last snapshot fires at most 2s before stop, leaving the final 0-2s of
speech out of accumulatedText. enrichHintWithTail() fixes this:

1. Extract last 3s from the full recording buffer (already in memory)
2. Transcribe with last 15 words of hint as context (~0.3-0.5s Whisper call)
3. Overlap-merge result into hint (same 6-word overlap detection logic)
4. Send enriched hint to LLM instead of incomplete accumulated text

No garbage fallback triggered, no full Whisper, just a tiny tail pass.
Logs: 'Hint enriched: 485 → 523 chars (420ms)'
…lback

Two issues fixed:

1. In-flight snapshot ignored on stop
   - When user stops, a Whisper call may be mid-flight (liveTranscriptionRunning=true)
   - Previously: hint was read while snapshot was still processing → missed words
   - Fix: poll liveTranscriptionRunning every 30ms before reading accumulatedText
   - Cost: 0-2s wait (only if Whisper is actively running at stop time)

2. enrichHintWithTail: no-overlap case discarded tail content
   - When tail has no overlapping words with hint end, new words were thrown away
   - Fix: append normalized tail (same behavior as mergeTranscriptions fallback)
   - detectGarbage already filters hallucinations before we get here

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
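The 30 ms poll for an in-flight snapshot can be sketched as a bounded wait; the function name is illustrative, and the 2 s cap reflects the "0-2s wait" cost noted above:

```typescript
// Wait until no Whisper snapshot is in flight, polling every `intervalMs`,
// and give up after `timeoutMs` so stop can never hang indefinitely.
async function waitForSnapshotIdle(
  isRunning: () => boolean,
  intervalMs = 30,
  timeoutMs = 2000,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (isRunning() && Date.now() < deadline) {
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
  }
}
```

Only after this resolves is `accumulatedText` read, so a snapshot finishing mid-stop still lands in the hint.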
…uble-tap cancel

Root cause of hallucinations/repetitions: enrichHintWithTail ran a separate
Whisper call on the last 3s. When no overlap was found between hint and tail,
it appended the full tail — causing duplicates when Whisper transcribed the
same audio slightly differently (non-deterministic transcription).

Changes:
- Remove enrichHintWithTail from pipeline entirely (clean approach)
- Add explicit final snapshot in onRecordingStop (same merge path as live loop)
  This captures words spoken after the last tick without any new Whisper logic
- Delay first snapshot 800ms → 1500ms to avoid hallucinations on very short audio
- Quick double-tap cancel: new recording within 600ms of previous stop = cancel
  (common accidental press when user immediately re-presses after stopping)
- Track lastRecordingStopTime for cancel detection

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
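The double-tap cancel can be sketched as a small gate around recording start; `lastRecordingStopTime` and the 600 ms window come from the commit, the class shape is illustrative:

```typescript
// Treat a new recording started within 600 ms of the previous stop as an
// accidental re-press and cancel instead of starting a new session.
const DOUBLE_TAP_CANCEL_MS = 600;

class RecordingGate {
  // Negative infinity so the very first start is never treated as a double-tap.
  private lastRecordingStopTime = Number.NEGATIVE_INFINITY;

  onStop(now = Date.now()): void {
    this.lastRecordingStopTime = now;
  }

  /** Returns 'cancel' for an accidental re-press, 'start' otherwise. */
  onStart(now = Date.now()): 'cancel' | 'start' {
    return now - this.lastRecordingStopTime < DOUBLE_TAP_CANCEL_MS
      ? 'cancel'
      : 'start';
  }
}
```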
…hint path

- overrideBadge: cursor: default + pointer-events: none
  Only the X button (overrideClear) should be interactive, not the label

- onTranscriptionComplete: start enhancing effect immediately
  Previously only triggered via onStage('enhancing') when LLM started
  With hint path (no Whisper), transcription is instant → blur/shuffle
  should start the moment the text is ready, not wait for LLM to begin
  Safe for Whisper path too: startEnhancingEffect() is idempotent

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
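The idempotency that makes the double trigger safe can be sketched with a simple guard; `startEnhancingEffect` is the real name, but this factory shape is illustrative, not the HUD's actual code:

```typescript
// startEnhancingEffect() may now be called from both
// onTranscriptionComplete and onStage('enhancing'); a guard flag makes
// whichever call comes second a no-op.
function makeEnhancingEffect(begin: () => void): { start: () => void; reset: () => void } {
  let active = false;
  return {
    start: () => {
      if (active) return; // idempotent: second caller is a no-op
      active = true;
      begin();
    },
    reset: () => { active = false; }, // re-arm for the next recording
  };
}
```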
Previously: sound and state change happened AFTER in-flight snapshot wait
and final snapshot (~2-3s delay). User saw no feedback while processing.

Now: as soon as stop is detected (duration check passes):
1. playCue(stopCue) — sound immediately
2. setState('transcribing') — HUD updates instantly
3. startEnhancingEffect() — morphing/blur starts on live preview text

Background work (in-flight wait, final snapshot, pipeline) continues
asynchronously while the user already sees the visual transition.
Added to COMMON_HALLUCINATIONS (full-text rejection) and INLINE_HALLUCINATION_RE
(stripped when mixed with valid content in snapshots).

Also confirmed: typo fixing already covered in LLM_SYSTEM_PROMPT line 57:
'Fix speech recognition errors and typos (e.g., their vs there)'
Previously whisperMs=0 on hint path because Whisper was 'skipped'.
The final snapshot Whisper call (in onRecordingStop) was unmeasured.

Now: measure finalSnapshotMs in manager, pass as 2nd arg to
stopAndProcessWithHint → forwarded to processFromTranscription as whisperTimeMs.

Pipeline Timing log now shows actual last-snapshot duration instead of 0.
@guicheffer guicheffer added enhancement New feature or request feature New feature implementation labels Mar 19, 2026
@guicheffer guicheffer self-assigned this Mar 19, 2026
@github-actions
Contributor

CI Summary

| Check | Status |
| --- | --- |
| Typecheck | ✅ Passed |
| Lint | ✅ Passed |
| Lint CSS | ✅ Passed |
| Design Tokens | ✅ Passed |
| Test | ✅ Passed |
| Build | ✅ Passed |

Run #1066

@github-actions
Contributor

MegaLinter analysis: Success

Descriptor Linter Files Fixed Errors Warnings Elapsed time
✅ REPOSITORY checkov yes no no 26.91s
✅ REPOSITORY devskim yes no no 3.14s
✅ REPOSITORY dustilock yes no no 1.75s
✅ REPOSITORY gitleaks yes no no 8.02s
✅ REPOSITORY git_diff yes no no 0.29s
✅ REPOSITORY grype yes no no 46.68s
✅ REPOSITORY kics yes no no 3.36s
✅ REPOSITORY kingfisher yes no no 5.12s
✅ REPOSITORY secretlint yes no no 5.93s
✅ REPOSITORY syft yes no no 2.34s
✅ REPOSITORY trivy yes no no 19.57s
✅ REPOSITORY trivy-sbom yes no no 4.0s
✅ REPOSITORY trufflehog yes no no 3.97s

See detailed reports in MegaLinter artifacts
Set VALIDATE_ALL_CODEBASE: true in mega-linter.yml to validate all sources, not only the diff


@guicheffer guicheffer changed the title feat(live-preview): streaming transcription with hint path — skip Whisper on stop feat(live-preview): streaming transcription with hint path Mar 19, 2026
@guicheffer guicheffer marked this pull request as draft March 19, 2026 10:44
@guicheffer
Contributor Author

guicheffer commented Mar 19, 2026

⚠️ Moving this to draft — the CPU improvements, UX polish, and style fixes are worth keeping. If we decide not to merge the hint path, those will be extracted into a separate PR
