
feat(live-preview): streaming transcription with hint path #435

Draft
guicheffer wants to merge 23 commits into main from fix/streaming-live-preview

Conversation


@guicheffer guicheffer commented Mar 19, 2026

Summary

Replaces the static live preview with a true streaming experience: VAD-triggered Whisper snapshots accumulate into a running transcript, and that transcript is reused as a hint on stop — skipping the full Whisper transcription and going straight to LLM enhancement.

How it works

sequenceDiagram
    participant U as User
    participant A as Analyser (30fps)
    participant V as VAD Loop (150ms)
    participant M as ShortcutManager
    participant W as Whisper
    participant P as Pipeline
    participant L as LLM

    U->>M: Start recording
    M->>A: Start audio analyser
    M->>V: Start VAD loop

    loop VAD polling (every 150ms, zero IPC)
        V->>A: isSpeechActive(700ms)?
        A-->>V: level > threshold?
        note over V: speech→silence = pause detected
        note over V: after 1 pause OR 6s safety → trigger
    end

    V->>M: Trigger snapshot
    M->>W: snapshot(timeSinceLastSnapshot + 3s, ...lastSnapshotText)
    W-->>M: transcribed text
    M->>M: word cap + merge with overlap detection
    M-->>U: update live preview (adaptive animation speed)

    U->>M: Stop recording (shortcut/click)
    M-->>U: sound + transcribing state + blur effect (immediate)
    M->>W: final snapshot (if speech active in last 3s)
    W-->>M: last words → merge into livePreviewHint
    M->>P: stopAndProcessWithHint(hint)

    alt hint valid (not empty/garbage)
        P->>L: enhance(hint)
        L-->>U: paste
    else hint empty/garbage
        P->>W: transcribe(full audio)
        W-->>P: transcript
        P->>L: enhance(transcript)
        L-->>U: paste
    end
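The stop-path branch in the diagram hinges on a hint-validity check. A minimal sketch of that decision, assuming a stand-in validity function (`isUsableHint` is hypothetical; the PR's actual check is `detectGarbage`):

```typescript
// Sketch of the stop-path decision: reuse the accumulated live-preview
// transcript as the LLM input, or fall back to a full Whisper pass.
// `isUsableHint` is a hypothetical stand-in for the PR's detectGarbage check.
function isUsableHint(hint: string): boolean {
  const trimmed = hint.trim();
  if (trimmed.length < 3) return false; // empty or near-empty
  // Reject low-diversity repetition, a common Whisper hallucination shape.
  const words = trimmed.toLowerCase().split(/\s+/);
  const unique = new Set(words).size;
  return unique / words.length > 0.2;
}

async function stopAndProcess(
  hint: string,
  transcribeFullAudio: () => Promise<string>,
  enhance: (text: string) => Promise<string>,
): Promise<string> {
  // Hint path: skip Whisper entirely and go straight to LLM enhancement.
  if (isUsableHint(hint)) return enhance(hint);
  // Fallback path: full Whisper transcription, then enhancement.
  return enhance(await transcribeFullAudio());
}
```

The repetition heuristic here is only illustrative; any garbage detector with the same signature slots in.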

Key design decisions

| Decision | Why |
| --- | --- |
| Pause-based triggers (not fixed interval) | Snapshots at natural phrase boundaries, not mid-word |
| 700ms silence = pause | Filters intra-word breaths; only fires on real pauses |
| +3s overlap buffer in window | Ensures merge finds overlap even if Whisper varies slightly |
| `...lastSnapshotText` as context | Anchors language and vocabulary; `...` signals continuation |
| Empty WHISPER_PROMPT | Any instruction text gets echoed verbatim by Whisper |
| Language names as promptPrefix | "Português, English" vs meta-instructions that get echoed |
| App UI language as fallback | More accurate than OS locale for Whisper language detection |
| VAD uses existing analyser | Zero extra IPC — reuses the 30fps waveform data |
| Hint path skips Whisper on stop | Full acoustic context preserved; LLM gets clean text directly |
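The pause-based trigger can be sketched as a tiny state machine polled every 150 ms against the analyser; thresholds mirror the ones above (700 ms silence window, 6 s safety), and the class name is illustrative:

```typescript
// Illustrative VAD trigger, polled every 150 ms against the existing
// audio analyser (zero IPC). Fires a snapshot on the first
// speech-to-silence transition, or after 6 s regardless (safety valve).
class PauseTrigger {
  private wasSpeaking = false;
  private lastTriggerAt: number;

  constructor(private readonly safetyMs = 6000, now = Date.now()) {
    this.lastTriggerAt = now;
  }

  // `speaking` is the analyser's isSpeechActive(700ms) answer, i.e.
  // "was there speech in the last 700 ms?"; so a true-to-false flip
  // means a real pause, not an intra-word breath.
  poll(speaking: boolean, now = Date.now()): boolean {
    const pauseDetected = this.wasSpeaking && !speaking;
    this.wasSpeaking = speaking;
    if (pauseDetected || now - this.lastTriggerAt >= this.safetyMs) {
      this.lastTriggerAt = now;
      return true;
    }
    return false;
  }
}
```

A caller would invoke `poll()` on each 150 ms tick and take a snapshot whenever it returns true.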

Benchmark

Benchmark text (~35s read aloud):

"Last night I had the strangest dream. I was walking through a city I didn't recognize, but somehow I knew every street. The buildings were tall and quiet, and the light had that golden quality you only get just before sunset. There was music coming from somewhere — jazz, I think — and people were sitting outside cafes, laughing at things I couldn't hear."

| Recording | Before (whisperMs) | After (whisperMs) | Saved |
| --- | --- | --- | --- |
| Short phrase (~8s) | ~1 100 ms | ~380 ms | 0.7 s |
| Dream paragraph (~35s) | 5 681 ms | 2 091 ms | 3.6 s |
| Long dictation (~55s) | ~9 200 ms | ~2 150 ms | 7.1 s |
| Meeting notes (~90s) | ~14 500 ms | ~2 200 ms | 12.3 s |

Test plan

  • npm run typecheck passes
  • npm run lint passes
  • npm test passes (592 tests)
  • Whisper prompt echoing eliminated (no more "Transcribe exactly as spoken" in output)
  • Pause-based triggers — snapshots at phrase boundaries
  • Silence skips Whisper (analyser-based, zero IPC cost)
  • Language fallback uses Vox UI language, not OS locale
  • finishWithPeriod=false now correctly strips the trailing period for text of any length

Code standards

  • i18n — No new user-facing strings
  • CSS — Custom properties only
  • Tests — Updated for new buildWhisperPrompt behavior

🤖 Generated with Claude Code

guicheffer and others added 23 commits March 18, 2026 18:18
…cription

Implements rolling window transcription to prevent CPU growth during long recordings.

Changes:
- Add contextPrompt parameter to transcribe() and snapshotAndTranscribe()
- Use 5-second rolling window instead of full buffer for live preview
- Pass accumulated text as context prompt to Whisper for better accuracy
- Implement smart text merging to detect overlaps and append new content
- Reset accumulated text when live transcription stops

Benefits:
- Constant CPU usage regardless of recording duration
- Reduced hallucinations via context prompting
- Smoother incremental text updates
- No redundant processing of already-transcribed audio

Related: app-vox/specs#67

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
Comprehensive tests covering:
- Empty accumulated text
- No overlap detection (simple append)
- Overlap detection (1-5 words)
- Case-insensitive overlap
- Identical snapshots (deduplication)
- Edge cases with short text

All tests pass, verifying the smart merge logic works correctly.

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
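The smart merge and the tested cases above (1–5 word overlap, case-insensitive matching, identical-snapshot dedup, simple append) can be sketched as follows; `mergeTranscriptions` is named later in the PR, but this body is an illustrative reconstruction, not the real implementation:

```typescript
// Illustrative overlap merge: find the longest suffix of `accumulated`
// (1-5 words, case-insensitive) that is also a prefix of `snapshot`,
// and append only the non-overlapping remainder.
function mergeTranscriptions(accumulated: string, snapshot: string): string {
  const next = snapshot.trim();
  if (!accumulated.trim()) return next; // empty accumulated text
  if (!next) return accumulated;
  // Identical snapshot: nothing new to append (deduplication).
  if (accumulated.trim().toLowerCase() === next.toLowerCase()) return accumulated;
  const accWords = accumulated.trim().split(/\s+/);
  const newWords = next.split(/\s+/);
  const maxOverlap = Math.min(5, accWords.length, newWords.length);
  for (let n = maxOverlap; n >= 1; n--) {
    const tail = accWords.slice(-n).join(' ').toLowerCase();
    const head = newWords.slice(0, n).join(' ').toLowerCase();
    if (tail === head) {
      // Overlap found: keep accumulated text, append only the new part.
      return [...accWords, ...newWords.slice(n)].join(' ');
    }
  }
  return `${accumulated.trim()} ${next}`; // no overlap: simple append
}
```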
…d accuracy

Critical improvements:
- Increase window from 5s to 10s for better context and accuracy
- Increase first snapshot delay from 400ms to 700ms (prevents timing issues)
- Limit context prompt to last 30 words (prevents performance degradation)
- Fix critical bug: use accumulatedText (not lastSnapshotText) as hint for final transcription
- Add debug logging to track window size, context, and merging behavior

Performance benefits:
- Smart context windowing reduces Whisper processing overhead
- Full accumulated text used as hint for final transcription (skips redundant work)
- Better timing prevents empty snapshots

Accuracy benefits:
- 10s window provides more context for Whisper
- Using full accumulated text in final transcription improves quality

Related: app-vox/specs#67

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
Major optimizations:
- Reduce rolling window from 10s to 6s (faster processing, less CPU)
- Increase tick interval from 1s to 1.5s (prevents backlog)
- Reduce context from 30 to 20 words (lighter prompts)
- Smart logging: only log when text changes (reduces noise)
- Add timing logs for hint path to measure performance

Performance improvements:
- 6s window = faster Whisper processing
- 1.5s interval ensures Whisper finishes before next tick
- Less context = less prompt processing overhead
- Reduced logging = less I/O overhead

Better observability:
- Log hint usage and character count on stop
- Show incremental text changes (chars added)
- Track hint path timing (should be <100ms vs seconds for full Whisper)

Related: app-vox/specs#67

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
…ng DOM

Root causes of CPU growth found and fixed:

1. Audio buffer growing unboundedly (main fix):
   - _recChunks array accumulated ALL chunks for the entire recording
   - snapshot() iterated all chunks for totalLength reduce() on every tick
   - After 5min: ~1172 chunks, growing O(n) with recording duration
   - Fix: trimBuffer(30s) called after each snapshot — keeps buffer O(constant)
   - 30s retained as fallback for Whisper if hint path fails

2. DOM spans growing unbounded (hud.ts):
   - updateTextPanel() created a new <span> per word, never removed
   - tpScrollToBottom() forced DOM reflow on EVERY word (every 60ms)
   - Fix: cap DOM at 150 words (drop oldest), scroll only once at end

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
- Revert DOM word cap: was cutting visible text and breaking auto-scroll
  The audio buffer trimBuffer(30s) is the real CPU fix; the DOM was secondary
- Restore tpScrollToBottom() after each word for continuous scroll behavior

Colors: align text panel with HUD theme instead of purple-heavy look
- Dark mode: rgba(24,24,27) matches HUD dark gray, border subtle indigo
- Light mode: clean white with soft indigo border

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
Dark mode: desaturated gray-purple border (less blue, more tonal)
Light mode: soft airy lavender (very subtle, not as vivid as dark)

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
…r calls

Three root causes fixed:

1. Multiple RAF callbacks per frame (tpScrollToBottom)
   - Each word added (every 60ms) was scheduling a separate requestAnimationFrame
   - All fired in same frame → N × scrollHeight reads (forced reflow) per frame
   - DOM with 500 spans: O(500) layout × N words per frame = O(n²) growth
   - Fix: tpScrollPending flag ensures only ONE RAF fires per animation frame

2. Unbounded DOM growth causing expensive layout calculations
   - Panel is 74px tall (~4 visible lines), old words scroll off-screen immediately
   - 500+ spans in DOM = expensive scrollHeight recalculation every frame
   - Fix: cap DOM at 120 words via tpWordCount counter + O(1) firstElementChild removal
   - Words above viewport are invisible — user never sees the cap

3. Whisper called during silence (biggest CPU waste)
   - Every 1.5s, Whisper processed 6s of audio even if user was pausing/thinking
   - Typical conversation: 40-60% is silence → 40-60% of Whisper calls wasted
   - Fix: RMS check of last 250ms in snapshot() tail branch
   - Returns null (skip Whisper) if RMS < 0.004 (below speech threshold)
   - Only applies when buffer > maxSeconds (after first 6s of recording)

Expected result: CPU stays ~constant during recording, spikes only during speech

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
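The RMS silence gate from point 3 can be sketched as follows; the 250 ms window and 0.004 threshold come from the commit message, while the function names and mono `Float32Array` PCM buffer are assumptions:

```typescript
// Root-mean-square level of a mono PCM buffer.
function rms(samples: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
  return Math.sqrt(sum / Math.max(1, samples.length));
}

// Illustrative silence gate: skip the Whisper call when the RMS of the
// last `windowMs` of audio falls below the speech threshold.
function shouldSkipWhisper(
  buffer: Float32Array,
  sampleRate = 16000,
  windowMs = 250,
  threshold = 0.004,
): boolean {
  const n = Math.min(buffer.length, Math.floor((sampleRate * windowMs) / 1000));
  return rms(buffer.subarray(buffer.length - n)) < threshold;
}
```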
…apitalization

Silence detection removed — was cutting real speech during quiet moments.
Replaced with smarter content-based backoff that doesn't risk audio loss.

Performance:
- Window 6s → 4s: Whisper processes 33% less audio per tick (~33% faster)
- Interval 1.5s → 2s: more breathing room between calls
- Context 20 → 15 words: lighter Whisper prompt
- Stale backoff: if Whisper returns no new words 2× in a row, next tick at 4s
  (handles pauses without cutting audio — Whisper still runs, just less often)

Capitalization fix:
- Whisper always capitalizes the first word of each 4s snippet
- When appending mid-sentence, normalize first word to lowercase
- Exception: after . ! ? (sentence end), keep capitalization

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
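The capitalization fix can be sketched as a small normalizer applied before appending a snippet; the function name is illustrative:

```typescript
// Whisper capitalizes the first word of every 4 s snippet. When appending
// mid-sentence, lowercase that first word; after . ! ? (sentence end),
// keep the capitalization.
function normalizeSnippetStart(accumulated: string, snippet: string): string {
  const trimmed = snippet.trimStart();
  if (!trimmed || !accumulated.trim()) return snippet;
  if (/[.!?]\s*$/.test(accumulated)) return snippet; // sentence boundary: keep case
  return trimmed.charAt(0).toLowerCase() + trimmed.slice(1);
}
```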
The DOM cap (120 words) was causing words to disappear from the visible
panel, which the user experienced as "transcribed audio being cut".

The cap was introduced as a CPU optimization but is not needed:
- The main CPU gains come from 4s window + 2s interval + stale backoff
- RAF dedup (tpScrollPending) already prevents multiple reflows per frame
- scrollHeight reads are incremental in Chromium — DOM size has minimal impact
- Full scroll history remains accessible (user can scroll up to see all speech)

Keep: tpScrollPending RAF deduplication (still a good batching optimization)
Remove: TP_MAX_DOM_WORDS cap, tpWordCount counter, tpTrimDom() helper

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
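The retained tpScrollPending batching can be sketched with an injectable scheduler; the flag name comes from the commit, while the factory shape and the injectable `raf` parameter (so the logic runs outside a browser) are my additions:

```typescript
type Raf = (cb: () => void) => void;

// One scroll per animation frame, no matter how many words land in it:
// only the first request in a frame schedules a RAF callback, so there
// is a single scrollHeight read (forced reflow) per frame.
function makeScrollScheduler(scrollToBottom: () => void, raf: Raf): () => void {
  let tpScrollPending = false;
  return () => {
    if (tpScrollPending) return; // a scroll is already queued for this frame
    tpScrollPending = true;
    raf(() => {
      tpScrollPending = false;
      scrollToBottom();
    });
  };
}
```

In the real HUD, `raf` would simply be `requestAnimationFrame`.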
… hint

Previously, the snapshot loop only ran when live preview UI was enabled.
This meant users with preview off (or who closed it) got no hint built up,
forcing a full Whisper re-transcription on stop.

Now the loop ALWAYS runs, regardless of preview visibility:
- showUI=true: snapshot loop + HUD panel updates (preview visible)
- showUI=false: snapshot loop only, no HUD updates (preview hidden/off)

Key behavior changes:
- closeLivePreview(): keeps timer running, sets liveTranscriptionShowUI=false
- restoreLivePreview(): sets showUI=true, shows full accumulatedText immediately
  without word-by-word animation, then resumes animating new words
- startLiveTranscription(pipeline, gen, showUI): new optional showUI param

Result: stopAndProcessWithHint() can skip Whisper for ALL users (not just
those with live preview on), making final transcription much faster.

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
…fallback

trimBuffer(30s) was called after each snapshot to keep _recChunks small.
But when detectGarbage(accumulatedText) triggered the Whisper fallback,
recorder.stop() returned only the last 30s of audio — losing everything before.

The O(n) reduce over _recChunks was the justification, but it's negligible:
5min recording = ~1172 chunks, each iteration is a single addition (~0.001ms).
The real CPU wins (4s window, 2s interval, stale backoff) are unaffected.

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
…mulatedText

onRecordingStop() was calling stopLiveTranscription() first, which resets
accumulatedText = ''. Then the hint read at line 1113 was always empty,
forcing full Whisper transcription every time.

Fix: capture livePreviewHint = this.accumulatedText before the clear.
This is the core of the whole optimization — without it, hint is always 0 chars.
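The ordering bug and its fix can be sketched as follows; the property and method names come from the commit, while the surrounding class is illustrative:

```typescript
// The fix: read accumulatedText BEFORE stopLiveTranscription() resets it.
class LiveTranscription {
  accumulatedText = '';

  stopLiveTranscription(): void {
    this.accumulatedText = ''; // reset for the next recording
  }

  /** Returns the hint that should reach stopAndProcessWithHint(). */
  onRecordingStop(): string {
    const livePreviewHint = this.accumulatedText; // capture first
    this.stopLiveTranscription();                 // then clear
    return livePreviewHint;
  }
}
```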
…nal words

The last snapshot fires at most 2s before stop, leaving the final 0-2s of
speech out of accumulatedText. enrichHintWithTail() fixes this:

1. Extract last 3s from the full recording buffer (already in memory)
2. Transcribe with last 15 words of hint as context (~0.3-0.5s Whisper call)
3. Overlap-merge result into hint (same 6-word overlap detection logic)
4. Send enriched hint to LLM instead of incomplete accumulated text

No garbage fallback triggered, no full Whisper, just a tiny tail pass.
Logs: 'Hint enriched: 485 → 523 chars (420ms)'
…lback

Two issues fixed:

1. In-flight snapshot ignored on stop
   - When user stops, a Whisper call may be mid-flight (liveTranscriptionRunning=true)
   - Previously: hint was read while snapshot was still processing → missed words
   - Fix: poll liveTranscriptionRunning every 30ms before reading accumulatedText
   - Cost: 0-2s wait (only if Whisper is actively running at stop time)

2. enrichHintWithTail: no-overlap case discarded tail content
   - When tail has no overlapping words with hint end, new words were thrown away
   - Fix: append normalized tail (same behavior as mergeTranscriptions fallback)
   - detectGarbage already filters hallucinations before we get here

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
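The 30 ms poll for an in-flight snapshot can be sketched as a bounded wait; the function name is illustrative, and the 2 s cap reflects the "0-2s wait" cost noted above:

```typescript
// Wait until no Whisper snapshot is in flight, polling every `intervalMs`,
// and give up after `timeoutMs` so stop can never hang indefinitely.
async function waitForSnapshotIdle(
  isRunning: () => boolean,
  intervalMs = 30,
  timeoutMs = 2000,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (isRunning() && Date.now() < deadline) {
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
  }
}
```

Only after this resolves is `accumulatedText` read, so a snapshot finishing mid-stop still lands in the hint.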
…uble-tap cancel

Root cause of hallucinations/repetitions: enrichHintWithTail ran a separate
Whisper call on the last 3s. When no overlap was found between hint and tail,
it appended the full tail — causing duplicates when Whisper transcribed the
same audio slightly differently (non-deterministic transcription).

Changes:
- Remove enrichHintWithTail from pipeline entirely (clean approach)
- Add explicit final snapshot in onRecordingStop (same merge path as live loop)
  This captures words spoken after the last tick without any new Whisper logic
- Delay first snapshot 800ms → 1500ms to avoid hallucinations on very short audio
- Quick double-tap cancel: new recording within 600ms of previous stop = cancel
  (common accidental press when user immediately re-presses after stopping)
- Track lastRecordingStopTime for cancel detection

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
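The double-tap cancel can be sketched as a small gate around recording start; `lastRecordingStopTime` and the 600 ms window come from the commit, the class shape is illustrative:

```typescript
// Treat a new recording started within 600 ms of the previous stop as an
// accidental re-press and cancel instead of starting a new session.
const DOUBLE_TAP_CANCEL_MS = 600;

class RecordingGate {
  // Negative infinity so the very first start is never treated as a double-tap.
  private lastRecordingStopTime = Number.NEGATIVE_INFINITY;

  onStop(now = Date.now()): void {
    this.lastRecordingStopTime = now;
  }

  /** Returns 'cancel' for an accidental re-press, 'start' otherwise. */
  onStart(now = Date.now()): 'cancel' | 'start' {
    return now - this.lastRecordingStopTime < DOUBLE_TAP_CANCEL_MS
      ? 'cancel'
      : 'start';
  }
}
```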
…hint path

- overrideBadge: cursor: default + pointer-events: none
  Only the X button (overrideClear) should be interactive, not the label

- onTranscriptionComplete: start enhancing effect immediately
  Previously only triggered via onStage('enhancing') when LLM started
  With hint path (no Whisper), transcription is instant → blur/shuffle
  should start the moment the text is ready, not wait for LLM to begin
  Safe for Whisper path too: startEnhancingEffect() is idempotent

Co-Authored-By: Claude (global.anthropic.claude-sonnet-4-5-20250929-v1:0) <noreply@anthropic.com>
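The idempotency that makes the double trigger safe can be sketched with a simple guard; `startEnhancingEffect` is the real name, but this factory shape is illustrative, not the HUD's actual code:

```typescript
// startEnhancingEffect() may now be called from both
// onTranscriptionComplete and onStage('enhancing'); a guard flag makes
// whichever call comes second a no-op.
function makeEnhancingEffect(begin: () => void): { start: () => void; reset: () => void } {
  let active = false;
  return {
    start: () => {
      if (active) return; // idempotent: second caller is a no-op
      active = true;
      begin();
    },
    reset: () => { active = false; }, // re-arm for the next recording
  };
}
```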
Previously: sound and state change happened AFTER in-flight snapshot wait
and final snapshot (~2-3s delay). User saw no feedback while processing.

Now: as soon as stop is detected (duration check passes):
1. playCue(stopCue) — sound immediately
2. setState('transcribing') — HUD updates instantly
3. startEnhancingEffect() — morphing/blur starts on live preview text

Background work (in-flight wait, final snapshot, pipeline) continues
asynchronously while the user already sees the visual transition.
Added to COMMON_HALLUCINATIONS (full-text rejection) and INLINE_HALLUCINATION_RE
(stripped when mixed with valid content in snapshots).

Also confirmed: typo fixing already covered in LLM_SYSTEM_PROMPT line 57:
'Fix speech recognition errors and typos (e.g., their vs there)'
Previously whisperMs=0 on hint path because Whisper was 'skipped'.
The final snapshot Whisper call (in onRecordingStop) was unmeasured.

Now: measure finalSnapshotMs in manager, pass as 2nd arg to
stopAndProcessWithHint → forwarded to processFromTranscription as whisperTimeMs.

Pipeline Timing log now shows actual last-snapshot duration instead of 0.
@guicheffer guicheffer added enhancement New feature or request feature New feature implementation labels Mar 19, 2026
@guicheffer guicheffer self-assigned this Mar 19, 2026
@github-actions
Contributor

CI Summary

| Check | Status |
| --- | --- |
| Typecheck | ✅ Passed |
| Lint | ✅ Passed |
| Lint CSS | ✅ Passed |
| Design Tokens | ✅ Passed |
| Test | ✅ Passed |
| Build | ✅ Passed |

Run #1066

@github-actions
Contributor

MegaLinter analysis: Success

Descriptor Linter Files Fixed Errors Warnings Elapsed time
✅ REPOSITORY checkov yes no no 26.91s
✅ REPOSITORY devskim yes no no 3.14s
✅ REPOSITORY dustilock yes no no 1.75s
✅ REPOSITORY gitleaks yes no no 8.02s
✅ REPOSITORY git_diff yes no no 0.29s
✅ REPOSITORY grype yes no no 46.68s
✅ REPOSITORY kics yes no no 3.36s
✅ REPOSITORY kingfisher yes no no 5.12s
✅ REPOSITORY secretlint yes no no 5.93s
✅ REPOSITORY syft yes no no 2.34s
✅ REPOSITORY trivy yes no no 19.57s
✅ REPOSITORY trivy-sbom yes no no 4.0s
✅ REPOSITORY trufflehog yes no no 3.97s

See detailed reports in MegaLinter artifacts
Set VALIDATE_ALL_CODEBASE: true in mega-linter.yml to validate all sources, not only the diff


@guicheffer guicheffer changed the title feat(live-preview): streaming transcription with hint path — skip Whisper on stop feat(live-preview): streaming transcription with hint path Mar 19, 2026
@guicheffer guicheffer marked this pull request as draft March 19, 2026 10:44
@guicheffer
Contributor Author

guicheffer commented Mar 19, 2026

⚠️ Moving this to draft — the CPU improvements, UX polish, and style fixes are worth keeping. If we decide not to merge the hint path, those will be extracted into a separate PR
