Motivation
PR #1230 adds streaming text input via WebSocket — text arrives incrementally, audio is generated per sentence. However, each sentence's audio is returned as a single binary frame only after full synthesis completes.
For voice agents and real-time applications, users need streaming audio output — chunked PCM frames sent progressively as the model generates, enabling playback to start before synthesis finishes.
Use Cases
- Voice assistants: User hears the first audio chunk within ~200ms of sentence completion, rather than waiting 1-5s for full sentence synthesis
- Contact centers: Reduces perceived latency for callers, where every 100ms of silence increases drop-off rates
- Live translation/dubbing: Audio must stay synchronized with video — batch-per-sentence adds unacceptable delay for long sentences
Design Decisions
- Format: Raw PCM (16-bit signed, 24kHz mono). No WAV — headers require the total file size upfront, which is incompatible with streaming. This matches the OpenAI Realtime API and the approach in #1438 ([Qwen3TTS][Feat] Streaming output).
- Chunk size: Follows the model codec — one chunk per Code2Wav decode window (25 frames by default, configurable via #1423). No arbitrary byte/duration splitting needed.
- Scope: Small follow-up to #1230 ([Feature][TTS] Streaming Text Input for Qwen3-TTS via WebSocket). Wire the streaming generation path from #1438/#1189 into the WebSocket handler, replacing the blocking `_generate_audio_bytes()` call.
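The PCM format decision above implies simple latency arithmetic. A minimal sketch, assuming the 16-bit/24kHz/mono parameters stated here; the example chunk size is hypothetical, since the actual size depends on the Code2Wav decode window:

```python
# Latency arithmetic for raw PCM streaming (16-bit signed, 24kHz mono).
SAMPLE_RATE = 24_000      # Hz, per the format decision above
BYTES_PER_SAMPLE = 2      # 16-bit signed PCM
CHANNELS = 1              # mono

BYTES_PER_SECOND = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS  # 48,000 B/s

def chunk_duration_ms(chunk_bytes: int) -> float:
    """Playback duration of one binary WebSocket frame."""
    return chunk_bytes / BYTES_PER_SECOND * 1000

# A hypothetical 4800-byte chunk buys the client 100 ms of playback:
assert chunk_duration_ms(4800) == 100.0
```

Because raw PCM is headerless, each frame is playable the moment it arrives; the client only needs the three fixed parameters above, agreed out of band.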
Protocol
`stream_audio` defaults to `false` for backward compatibility. When `stream_audio: true`, `format` must be `pcm`.
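The protocol rule above can be sketched as a config message plus its validation. Only `stream_audio` and `format` are named in this issue; the message `type` and overall shape are assumptions for illustration:

```python
import json

# Hypothetical session-config message; only `stream_audio` and `format`
# are specified by the protocol section, the rest is illustrative.
session_config = {
    "type": "session.update",  # assumed message type
    "stream_audio": True,      # defaults to false for backward compatibility
    "format": "pcm",           # required when stream_audio is true
}

def validate(cfg: dict) -> None:
    """Enforce the stated constraint: streaming output requires raw PCM."""
    if cfg.get("stream_audio") and cfg.get("format") != "pcm":
        raise ValueError("stream_audio: true requires format: pcm")

validate(session_config)           # passes
frame = json.dumps(session_config) # sent as a text WebSocket frame
```

Rejecting `stream_audio: true` with any non-PCM format at config time keeps the failure mode explicit instead of surfacing mid-synthesis.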
Implementation
- In `OmniStreamingSpeechHandler` (#1230), replace `_generate_audio_bytes()` with the streaming generation path
- Each time Code2Wav produces a chunk, send it as a binary WebSocket frame
- The `audio.start`/`audio.done` framing stays the same — just multiple binary frames between them instead of one
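The steps above amount to a small send loop. A sketch under stated assumptions: `stream_chunks` stands in for the streaming generation path from #1438/#1189, and `ws` follows a generic async WebSocket API (`send_json`/`send_bytes`); all names here are illustrative, not the handler's actual interface:

```python
import asyncio

async def handle_sentence(ws, stream_chunks):
    """Send one sentence's audio as progressive binary frames."""
    await ws.send_json({"type": "audio.start"})  # framing stays the same
    async for chunk in stream_chunks():          # one frame per decode window
        await ws.send_bytes(chunk)               # sent as soon as it is produced
    await ws.send_json({"type": "audio.done"})

# Minimal stub demonstrating the resulting frame sequence:
class FakeWS:
    def __init__(self):
        self.frames = []
    async def send_json(self, msg):
        self.frames.append(msg["type"])
    async def send_bytes(self, data):
        self.frames.append(len(data))

async def fake_chunks():
    for n in (4800, 4800, 2400):  # hypothetical chunk sizes in bytes
        yield n * b"\x00"

ws = FakeWS()
asyncio.run(handle_sentence(ws, fake_chunks))
# ws.frames: ["audio.start", 4800, 4800, 2400, "audio.done"]
```

A client that already handles `audio.start`/`audio.done` only needs to accept multiple binary frames between them, which is why the framing change is backward compatible.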
Dependencies
- Blocked on #1438 (REST streaming output) or #1189 (model-level streaming) merging first
- Builds on #1230 (WebSocket endpoint)
- Chunk-size configurability from #1423
Related
- PR #1230 — Streaming text input via WebSocket (foundation)
- PR #1189 — Token-level streaming in Talker (`AsyncDecodingPipeline`)
- PR #1438 — REST streaming output (`stream=true` + PCM via SSE)
- PR #1423 — Configurable `chunk_size`/`left_context_size` for Code2Wav