Motivation
PR #1230 adds streaming text input via WebSocket — text arrives incrementally, audio is generated per sentence. However, each sentence's audio is returned as a single binary frame only after full synthesis completes.
For voice agents and real-time applications, users need streaming audio output — chunked PCM frames sent progressively as the model generates, enabling playback to start before synthesis finishes.
Use Cases
- Voice assistants: User hears the first audio chunk within ~200ms of sentence completion, rather than waiting 1-5s for full sentence synthesis
- Contact centers: Reduces perceived latency for callers, where every 100ms of silence increases drop-off rates
- Live translation/dubbing: Audio must stay synchronized with video — batch-per-sentence adds unacceptable delay for long sentences
Design Decisions
- Format: Raw PCM (16-bit signed, 24kHz mono). No WAV — headers require the total file size upfront, which is incompatible with streaming. This matches the OpenAI Realtime API and the approach in #1438 ([Qwen3TTS][Feat] Streaming output).
- Chunk size: Follows the model codec — one chunk per Code2Wav decode window (25 frames by default, configurable via #1423). No arbitrary byte/duration splitting needed.
- Scope: Small follow-up to #1230 ([Feature][TTS] Streaming Text Input for Qwen3-TTS via WebSocket). Wire the streaming generation path from #1438/#1189 into the WebSocket handler, replacing the blocking `_generate_audio_bytes()` call.
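The PCM format decision above implies simple latency arithmetic. A minimal sketch, assuming the 16-bit/24kHz/mono parameters stated here; the example chunk size is hypothetical, since the actual size depends on the Code2Wav decode window:

```python
# Latency arithmetic for raw PCM streaming (16-bit signed, 24kHz mono).
SAMPLE_RATE = 24_000      # Hz, per the format decision above
BYTES_PER_SAMPLE = 2      # 16-bit signed PCM
CHANNELS = 1              # mono

BYTES_PER_SECOND = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS  # 48,000 B/s

def chunk_duration_ms(chunk_bytes: int) -> float:
    """Playback duration of one binary WebSocket frame."""
    return chunk_bytes / BYTES_PER_SECOND * 1000

# A hypothetical 4800-byte chunk buys the client 100 ms of playback:
assert chunk_duration_ms(4800) == 100.0
```

Because raw PCM is headerless, each frame is playable the moment it arrives; the client only needs the three fixed parameters above, agreed out of band.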
Protocol
`stream_audio` defaults to `false` for backward compatibility. When `stream_audio: true`, `format` must be `pcm`.
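The protocol rule above can be sketched as a config message plus its validation. Only `stream_audio` and `format` are named in this issue; the message `type` and overall shape are assumptions for illustration:

```python
import json

# Hypothetical session-config message; only `stream_audio` and `format`
# are specified by the protocol section, the rest is illustrative.
session_config = {
    "type": "session.update",  # assumed message type
    "stream_audio": True,      # defaults to false for backward compatibility
    "format": "pcm",           # required when stream_audio is true
}

def validate(cfg: dict) -> None:
    """Enforce the stated constraint: streaming output requires raw PCM."""
    if cfg.get("stream_audio") and cfg.get("format") != "pcm":
        raise ValueError("stream_audio: true requires format: pcm")

validate(session_config)           # passes
frame = json.dumps(session_config) # sent as a text WebSocket frame
```

Rejecting `stream_audio: true` with any non-PCM format at config time keeps the failure mode explicit instead of surfacing mid-synthesis.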
Implementation
- In `OmniStreamingSpeechHandler` (#1230), replace `_generate_audio_bytes()` with the streaming generation path
- Each time Code2Wav produces a chunk, send it as a binary WebSocket frame
- The `audio.start`/`audio.done` framing stays the same — just multiple binary frames between them instead of one
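The steps above amount to a small send loop. A sketch under stated assumptions: `stream_chunks` stands in for the streaming generation path from #1438/#1189, and `ws` follows a generic async WebSocket API (`send_json`/`send_bytes`); all names here are illustrative, not the handler's actual interface:

```python
import asyncio

async def handle_sentence(ws, stream_chunks):
    """Send one sentence's audio as progressive binary frames."""
    await ws.send_json({"type": "audio.start"})  # framing stays the same
    async for chunk in stream_chunks():          # one frame per decode window
        await ws.send_bytes(chunk)               # sent as soon as it is produced
    await ws.send_json({"type": "audio.done"})

# Minimal stub demonstrating the resulting frame sequence:
class FakeWS:
    def __init__(self):
        self.frames = []
    async def send_json(self, msg):
        self.frames.append(msg["type"])
    async def send_bytes(self, data):
        self.frames.append(len(data))

async def fake_chunks():
    for n in (4800, 4800, 2400):  # hypothetical chunk sizes in bytes
        yield n * b"\x00"

ws = FakeWS()
asyncio.run(handle_sentence(ws, fake_chunks))
# ws.frames: ["audio.start", 4800, 4800, 2400, "audio.done"]
```

A client that already handles `audio.start`/`audio.done` only needs to accept multiple binary frames between them, which is why the framing change is backward compatible.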
Dependencies
- Blocked on #1438 (REST streaming output) or #1189 (model-level streaming) merging first
- Builds on #1230 (WebSocket endpoint)
- Chunk-size configurability from #1423
Related
- PR #1230 — Streaming text input via WebSocket (foundation)
- PR #1189 — Token-level streaming in Talker (`AsyncDecodingPipeline`)
- PR #1438 — REST streaming output (`stream=true` + PCM via SSE)
- PR #1423 — Configurable `chunk_size`/`left_context_size` for Code2Wav