Add scribe stream command for live microphone transcription #1
Draft
javiertoledo wants to merge 7 commits into main
New command: scribe stream, which captures microphone audio and transcribes it in real time using FluidAudio's SlidingWindowAsrManager (Parakeet).

Features:
- Live transcription from microphone with timestamps
- Text and JSONL output formats
- Save to file with --output
- Ctrl+C to stop cleanly
- Uses streaming ASR config (11s chunks, 1s hypothesis updates)

Usage:
    scribe stream                        # listen and transcribe
    scribe stream --format jsonl         # JSONL output
    scribe stream --output meeting.txt   # save to file

System audio capture (--source) will be added in a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reduce chunk size from 11s to 3s for ~3-4s latency (was ~13s)
- Lower confirmation threshold from 0.8 to 0.5 for faster output
- Reduce right context from 2s to 0.5s
- Fix speaker label: remove "Others" tag for mic input
- Add text dedup to avoid repeating same hypothesis
- Remove --mic flag (mic is default and only source for now)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pdates

The 3s chunk config was too short for Parakeet; the model needs ~10s context.
Reverted to the library's .streaming preset (11s chunks, 1s hypothesis).

Now shows two types of updates:
- Volatile (hypothesis): shown as ephemeral line on stderr with \r overwrite.
  Gives immediate ~1-2s feedback while speaking.
- Confirmed: printed as permanent line to stdout.
  Stable, final text after sufficient context.

Also fixes:
- Stream getting stuck on longer utterances (was breaking model state)
- Text format shows live preview on stderr, final on stdout
- JSONL emits both volatile and confirmed (with "confirmed" field)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
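The split between volatile and confirmed updates can be illustrated with a small, language-agnostic sketch. It is written in Python for brevity and is not the PR's Swift code. The helper name `emit_update`, the `"text"` key in the JSONL record, and the use of the ANSI erase-line sequence are assumptions; only the `"confirmed"` field, the `\r` overwrite on stderr, and the confirmed-to-stdout rule come from the commit message.

```python
import json
import sys


def emit_update(text, confirmed, fmt="text", out=sys.stdout, err=sys.stderr):
    """Route one transcription update to the appropriate stream.

    Hypothetical helper: the record shape beyond the "confirmed" field is
    an assumption made for illustration.
    """
    if fmt == "jsonl":
        # JSONL carries both kinds of update; consumers filter on "confirmed".
        out.write(json.dumps({"text": text, "confirmed": confirmed}) + "\n")
    elif confirmed:
        # Clear the ephemeral preview line, then print the permanent line.
        err.write("\r\033[K")
        out.write(text + "\n")
    else:
        # Volatile hypothesis: overwrite the same terminal line on stderr.
        err.write("\r\033[K" + text)
    out.flush()
    err.flush()
```

Keeping previews on stderr means that redirecting stdout to a file or pipe captures only the confirmed text, while the live preview stays visible in the terminal.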
Replace SlidingWindowAsrManager (batch TDT in sliding windows, ~11s latency) with StreamingAsrEngine protocol using Nemotron 560ms:
- True cache-aware streaming: each 560ms chunk inherits full context
- 2.12% WER (better than TDT v3's 2.5% on LibriSpeech)
- Includes punctuation and capitalization
- ~560ms to first text (was ~11s)
- Partial transcript callback for live preview on stderr
- Confirmed text printed to stdout

Architecture:
- Mic audio → appendAudio() → processBufferedAudio() → getPartialTranscript()
- Partial callback fires on every chunk for live preview (\r overwrite on stderr)
- Main loop polls at 20Hz, emits new confirmed text to stdout
- Actor-based state management for thread safety (Swift 6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
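The buffering step in the pipeline above (append incoming audio, then process it in fixed 560ms chunks) might look roughly like the following sketch. It is Python rather than the PR's Swift, and `ChunkBuffer`, `append_audio`, and `pop_chunks` are hypothetical stand-ins, not FluidAudio's API. The 16 kHz sample rate is also an assumption; the commit does not state it.

```python
class ChunkBuffer:
    """Accumulate samples and release fixed-size chunks.

    Hypothetical stand-in for the appendAudio() / processBufferedAudio()
    steps; it does not mirror FluidAudio's actual types.
    """

    def __init__(self, chunk_samples):
        if chunk_samples <= 0:
            raise ValueError("chunk_samples must be positive")
        self.chunk_samples = chunk_samples
        self._pending = []

    def append_audio(self, samples):
        # Called from the capture callback with however many samples arrived.
        self._pending.extend(samples)

    def pop_chunks(self):
        # Return every complete chunk; keep the remainder for the next call.
        chunks = []
        while len(self._pending) >= self.chunk_samples:
            chunks.append(self._pending[: self.chunk_samples])
            del self._pending[: self.chunk_samples]
        return chunks


# Assumption: 16 kHz mono input, so a 560 ms chunk is 8960 samples.
SAMPLE_RATE_HZ = 16_000
CHUNK_SAMPLES = SAMPLE_RATE_HZ * 560 // 1000
```

Because microphone callbacks rarely deliver audio in multiples of the chunk size, the leftover samples stay in the buffer until the next callback completes a chunk.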
- Default: Parakeet TDT v3 via SlidingWindow (25 languages, higher latency)
- --engine nemotron: Nemotron 560ms (English-only, ~560ms latency, punctuation)

Usage:
    scribe stream                     # multilingual (default)
    scribe stream --engine nemotron   # English-only, low latency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Nemotron: retry with cache cleanup on failed model load (fixes partial download)
- Both engines: show download progress messages (not just --verbose)
- README: add streaming section with engine comparison and trade-offs
- README: update performance table with streaming latencies

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Nemotron engine's partial callback returns the full accumulated transcript each time, which grows and revises. The previous code tried to diff via getPartialTranscript() polling, causing repeated/mixed output.

Fix: Track printed length in StreamState actor. The partial callback fires after each 560ms chunk; we diff to find only the new portion and emit that. Live preview shows the tail of the transcript on stderr (ephemeral, overwritten). New confirmed text goes to stdout.

Also simplified SlidingWindow engine to only emit to stdout on confirmed text (volatile goes to stderr preview only).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
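The printed-length approach described in this commit can be sketched as follows. This is a Python illustration of the idea, not the Swift actor from the PR; the method and function names are hypothetical.

```python
class StreamState:
    """Track how much of the accumulated transcript has been emitted."""

    def __init__(self):
        self.printed_length = 0

    def new_portion(self, full_transcript):
        # The callback delivers the whole transcript so far; return only
        # the characters past what was already emitted.
        if len(full_transcript) <= self.printed_length:
            # Nothing new, or the engine revised to a shorter text.
            return ""
        delta = full_transcript[self.printed_length:]
        self.printed_length = len(full_transcript)
        return delta


def preview_tail(full_transcript, width):
    # Last `width` characters, suitable for a one-line stderr preview.
    return full_transcript[-width:] if width > 0 else ""
```

One caveat of a length-only diff: if the engine revises text that was already emitted, the earlier output cannot be retracted, and the next delta is cut at the old character offset. That may be one source of the unreliability the PR status mentions, though the source does not say so.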
Summary
- scribe stream command for live microphone transcription

Status
Draft: streaming not working reliably yet. Known issues:

What works
- scribe stream starts and captures microphone audio
- scribe stream --engine nemotron downloads and loads the Nemotron model

Architecture decisions
- StreamingAsrEngine protocol (true cache-aware streaming)
- SlidingWindowAsrManager (batch model in sliding windows)

Test plan
- --format jsonl produces valid JSON per line
- --output file.txt saves to file

🤖 Generated with Claude Code
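For the JSONL item in the test plan, a check along these lines could be run against captured output. It is a minimal Python sketch; the function name is hypothetical, and the requirement that each record be a JSON object (rather than any JSON value) is an assumption.

```python
import json


def invalid_jsonl_lines(text):
    """Return 1-based numbers of the lines that are not valid JSON objects."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        if not isinstance(record, dict):
            # Assumption: every record should be an object, not a bare value.
            bad.append(number)
    return bad
```

An empty result means every line parsed; blank lines are reported as invalid, which catches stray newlines in the stream.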