Closed
Description
Motivation
In voice agent scenarios, users frequently interrupt ("barge in") while the agent is still speaking. The current streaming TTS WebSocket (PR #1230) has no mechanism for the client to cancel in-progress audio generation mid-sentence.
Without interruption support, the server continues generating audio for the remaining sentences even after the user has started speaking again, wasting GPU resources and adding latency to the next response.
Use Cases
- Voice assistants: User says "stop" or starts a new question while the agent is mid-sentence — the agent should immediately stop speaking and listen
- Contact center bots: Caller interrupts with "actually, never mind" — bot should stop current TTS and process the new input
- Interactive gaming: Player takes an action mid-dialogue — NPC should stop current line and react
Proposed Approach
Add an `input.cancel` message type to the WebSocket protocol:
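A minimal sketch of what such a message might look like on the wire. It assumes the protocol frames messages as JSON objects with a `type` field; that framing and the helper name `make_cancel_message` are illustrative assumptions, not taken from PR #1230.

```python
import json

def make_cancel_message():
    """Build a hypothetical input.cancel frame as a JSON string."""
    # Assumption: messages are JSON objects keyed by a "type" field.
    return json.dumps({"type": "input.cancel"})

print(make_cancel_message())  # {"type": "input.cancel"}
```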
Behavior
- `input.cancel` immediately aborts generation for the current sentence
- Any buffered sentences that haven't started generating are discarded
- Server sends `audio.cancelled` and returns to the text-receiving state
- Client can then send new text or `input.done`
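The behavior above could be modeled roughly as follows. This is a toy sketch with no real audio or engine: the class `TtsSession`, its attributes, and the state names are hypothetical, and it assumes the server still replies with `audio.cancelled` when nothing is generating, which the proposal does not specify.

```python
from collections import deque

class TtsSession:
    """Toy model of the proposed cancel behavior (hypothetical names)."""

    def __init__(self):
        self.pending = deque()    # buffered sentences not yet generating
        self.current = None       # sentence being generated, if any
        self.state = "receiving"
        self.sent = []            # events the server would send to the client

    def on_text(self, sentence):
        self.pending.append(sentence)
        if self.current is None:
            self._start_next()

    def _start_next(self):
        if self.pending:
            self.current = self.pending.popleft()
            self.state = "generating"
            self.sent.append("audio.start")

    def on_cancel(self):
        self.current = None       # abort the in-progress sentence
        self.pending.clear()      # discard sentences that never started
        self.state = "receiving"  # back to the text-receiving state
        self.sent.append("audio.cancelled")

session = TtsSession()
session.on_text("Hello there.")
session.on_text("How can I help?")   # buffered behind the first sentence
session.on_cancel()
print(session.sent)  # ['audio.start', 'audio.cancelled']
```

After the cancel, a new `on_text` call starts generation again from a clean queue.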
Session Lifecycle with Cancel
config → text → text → cancel → text → done → session.done
          ↓              ↓       ↓
     audio.start  audio.cancelled audio.start → audio → audio.done
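One way to check that a client message sequence fits this lifecycle is a small transition table. The state names (`await_config`, `receiving`, `closed`) are assumptions made for this sketch; only the message names come from the diagram.

```python
# Allowed (state, client message) -> next state. State names are hypothetical.
TRANSITIONS = {
    ("await_config", "config"): "receiving",
    ("receiving", "text"): "receiving",
    ("receiving", "cancel"): "receiving",   # cancel returns to receiving text
    ("receiving", "done"): "closed",
}

def replay(messages):
    """Replay client messages and return the final session state."""
    state = "await_config"
    for message in messages:
        key = (state, message)
        if key not in TRANSITIONS:
            raise ValueError(f"{message!r} not allowed in state {state!r}")
        state = TRANSITIONS[key]
    return state

print(replay(["config", "text", "text", "cancel", "text", "done"]))  # closed
```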
Implementation Considerations
- Need to propagate cancellation to the engine (`abort_request`)
- Race condition: cancel arrives while audio is already being sent; server should stop sending remaining chunks
- Should `input.cancel` cancel only the current sentence or all pending sentences?
- Consider adding a `cancel_scope` field: `"current"` (default) vs `"all"`
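A rough sketch of how the race condition and the two `cancel_scope` values might be handled: the server checks a cancel flag before every chunk it sends, so chunks already produced but not yet sent are dropped. `fake_chunks`, `speak`, and `demo` are hypothetical stand-ins; a real server would also propagate the cancel to the engine (the `abort_request` mentioned above), which is not modeled here. The sketch only checks between chunks, so a cancel that lands after a sentence's last chunk would carry over to the next sentence.

```python
import asyncio

def fake_chunks(sentence, n_chunks=3):
    """Stand-in for engine output: a few labelled chunks per sentence."""
    return [f"{sentence}#{i}" for i in range(n_chunks)]

async def speak(sentences, send, cancel_event, scope="current"):
    """Stream sentences chunk by chunk, honoring a cancel between chunks."""
    queue = list(sentences)
    while queue:
        sentence = queue.pop(0)
        for chunk in fake_chunks(sentence):
            await asyncio.sleep(0)          # give a pending cancel a chance to run
            if cancel_event.is_set():
                cancel_event.clear()
                if scope == "all":
                    queue.clear()           # also drop sentences not yet started
                await send("audio.cancelled")
                break                       # remaining chunks are never sent
            await send(chunk)

async def demo(scope):
    out = []
    cancel = asyncio.Event()

    async def send(item):
        out.append(item)
        if item == "a#0":
            cancel.set()                    # simulate a barge-in after the first chunk

    await speak(["a", "b"], send, cancel, scope)
    return out

print(asyncio.run(demo("current")))  # ['a#0', 'audio.cancelled', 'b#0', 'b#1', 'b#2']
print(asyncio.run(demo("all")))      # ['a#0', 'audio.cancelled']
```

With `"current"`, the pending sentence still plays after the cancel; with `"all"`, the session goes quiet until new text arrives.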
Related
- PR [Feature][TTS] Streaming Text Input for Qwen3-TTS via WebSocket #1230 — Streaming text input (foundation)
- OpenAI Realtime API supports `response.cancel` for a similar purpose