[RFC] Streaming Audio Output for WebSocket TTS #1479

@lishunyang12

Motivation

PR #1230 adds streaming text input via WebSocket — text arrives incrementally, audio is generated per sentence. However, each sentence's audio is returned as a single binary frame only after full synthesis completes.

For voice agents and real-time applications, users need streaming audio output — chunked PCM frames sent progressively as the model generates, enabling playback to start before synthesis finishes.

Use Cases

  • Voice assistants: User hears the first audio chunk within ~200ms of sentence completion, rather than waiting 1-5s for full sentence synthesis
  • Contact centers: Reduces perceived latency for callers, where every 100ms of silence increases drop-off rates
  • Live translation/dubbing: Audio must stay synchronized with video — batch-per-sentence adds unacceptable delay for long sentences

Design Decisions

Protocol

// Client sends:
{"type": "session.config", "voice": "Vivian", "stream_audio": true}

// Server sends (per sentence):
{"type": "audio.start", "sentence_index": 0, "format": "pcm", "sample_rate": 24000}
<binary: audio chunk 1>   // one Code2Wav decode window
<binary: audio chunk 2>
...
<binary: audio chunk N>
{"type": "audio.done", "sentence_index": 0, "total_bytes": 96000}

stream_audio defaults to false for backward compatibility. When stream_audio: true, format must be pcm.
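To illustrate the framing from the client side, here is a minimal sketch of how a consumer might group the binary chunks by sentence. It is not part of the proposal: `collect_sentences` is a hypothetical helper, and it assumes that `total_bytes` equals the sum of the binary frame lengths between `audio.start` and `audio.done`, which the RFC does not state explicitly.

```python
import json
from typing import Dict, Iterable, Optional, Union

# Text frames carry JSON control messages, binary frames carry raw PCM.
Frame = Union[str, bytes]


def collect_sentences(frames: Iterable[Frame]) -> Dict[int, bytes]:
    """Group PCM chunks by the sentence_index of the enclosing audio.start.

    Hypothetical helper. Assumes total_bytes in audio.done is the sum of
    the binary frame sizes for that sentence (not confirmed by the RFC).
    """
    sentences: Dict[int, bytes] = {}
    current: Optional[int] = None  # sentence_index of the open audio.start
    buffer = bytearray()

    for frame in frames:
        if isinstance(frame, (bytes, bytearray)):
            if current is None:
                raise ValueError("binary frame outside audio.start/audio.done")
            buffer.extend(frame)
            continue

        msg = json.loads(frame)
        if msg.get("type") == "audio.start":
            current = msg["sentence_index"]
            buffer = bytearray()
        elif msg.get("type") == "audio.done":
            if msg["sentence_index"] != current:
                raise ValueError("audio.done does not match open audio.start")
            if msg["total_bytes"] != len(buffer):
                raise ValueError("total_bytes does not match received audio")
            sentences[msg["sentence_index"]] = bytes(buffer)
            current = None

    return sentences
```

A real-time client would hand each binary frame to the audio device as it arrives rather than buffering the whole sentence; the buffering here only makes the `total_bytes` check easy to see.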

Implementation

  1. In OmniStreamingSpeechHandler (introduced in PR #1230, "[Feature][TTS] Streaming Text Input for Qwen3-TTS via WebSocket"), replace _generate_audio_bytes() with the streaming generation path
  2. Each time Code2Wav produces a chunk, send it as a binary WebSocket frame
  3. audio.start / audio.done framing stays the same — just multiple binary frames between them instead of one
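The three steps above could be sketched roughly as follows. This is an assumption-laden illustration, not the actual handler code: `stream_sentence`, `fake_code2wav_chunks`, and `RecordingSocket` are hypothetical names, the chunk source stands in for whatever the streaming generation path yields, and the socket is a test double that mimics the `send_text` / `send_bytes` methods of a Starlette-style WebSocket.

```python
import asyncio
import json
from typing import AsyncIterator, List, Union


class RecordingSocket:
    """Test double with send_text/send_bytes, recording frames in order."""

    def __init__(self) -> None:
        self.frames: List[Union[str, bytes]] = []

    async def send_text(self, data: str) -> None:
        self.frames.append(data)

    async def send_bytes(self, data: bytes) -> None:
        self.frames.append(data)


async def fake_code2wav_chunks() -> AsyncIterator[bytes]:
    """Hypothetical stand-in for the streaming generation path."""
    for _ in range(3):
        await asyncio.sleep(0)  # yield control, as real decoding would
        # 1200 samples of silence, assuming 16-bit PCM (2 bytes per sample)
        yield b"\x00\x00" * 1200


async def stream_sentence(ws, sentence_index: int,
                          chunks: AsyncIterator[bytes],
                          sample_rate: int = 24000) -> int:
    """Send audio.start, one binary frame per chunk, then audio.done."""
    await ws.send_text(json.dumps({
        "type": "audio.start",
        "sentence_index": sentence_index,
        "format": "pcm",
        "sample_rate": sample_rate,
    }))

    total_bytes = 0
    async for chunk in chunks:
        # Each chunk is forwarded as soon as it is produced.
        await ws.send_bytes(chunk)
        total_bytes += len(chunk)

    await ws.send_text(json.dumps({
        "type": "audio.done",
        "sentence_index": sentence_index,
        "total_bytes": total_bytes,
    }))
    return total_bytes
```

Running `asyncio.run(stream_sentence(RecordingSocket(), 0, fake_code2wav_chunks()))` exercises the framing without a network connection. Because `total_bytes` is accumulated while streaming, the server does not need to hold the full sentence in memory before sending `audio.done`.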

Dependencies

Related
