[RFC] Multi-Stage Abort / Barge-in for Omni Models

## Motivation

In voice agent scenarios, users frequently interrupt ("barge in") while the agent is still speaking. The current streaming TTS WebSocket (PR #1230) has no mechanism for the client to cancel in-progress audio generation mid-sentence.

Without interruption support, the server continues generating audio for the remaining sentences even after the user has started speaking again, wasting GPU resources and adding latency to the next response.

## Use Cases

- **Voice assistants**: User says "stop" or starts a new question while the agent is mid-sentence — the agent should immediately stop speaking and listen
- **Contact center bots**: Caller interrupts with "actually, never mind" — bot should stop current TTS and process the new input
- **Interactive gaming**: Player takes an action mid-dialogue — NPC should stop current line and react

## Proposed Approach

Add an `input.cancel` message type to the WebSocket protocol:

```jsonc
// Client sends to cancel current generation:
{"type": "input.cancel"}

// Server responds:
{"type": "audio.cancelled", "sentence_index": 2, "reason": "client_cancel"}
// Then resumes listening for new input.text or input.done
```
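A minimal client-side sketch of the two messages above, using only the standard library. The helper names `make_cancel_message` and `parse_cancelled` are hypothetical and not part of any existing client; the field names are taken from the example messages in this RFC.

```python
import json
from typing import Optional, Tuple


def make_cancel_message() -> str:
    """Serialize the proposed client-side cancel message."""
    return json.dumps({"type": "input.cancel"})


def parse_cancelled(raw: str) -> Optional[Tuple[int, str]]:
    """Return (sentence_index, reason) for an audio.cancelled event, else None."""
    event = json.loads(raw)
    if event.get("type") != "audio.cancelled":
        return None
    return event["sentence_index"], event["reason"]
```

The client would send the result of `make_cancel_message()` over the open WebSocket and treat a non-`None` result from `parse_cancelled` as the signal that it can send new text.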

### Behavior

1. `input.cancel` immediately aborts generation for the current sentence
2. Any buffered sentences that haven't started generating are discarded
3. Server sends `audio.cancelled` and returns to the text-receiving state
4. Client can then send new text or `input.done`
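The four steps could be sketched as per-connection state like the following. This is an illustration under assumptions, not the server's actual code: `CancelableSession`, its state names, and the `abort_current` callback are hypothetical, and the RFC does not say what `sentence_index` should be when nothing is generating (here it is `None`) or whether indices keep counting after a cancel (here they do).

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional


class CancelableSession:
    """Hypothetical per-connection state; not the actual server class."""

    def __init__(self, abort_current: Callable[[int], None]) -> None:
        self._abort_current = abort_current        # assumed hook that reaches the engine
        self.pending: Deque[str] = deque()         # buffered sentences, not yet generating
        self.current_index: Optional[int] = None   # sentence currently generating, if any
        self._next_index = 0
        self.state = "receiving_text"

    def on_text(self, sentence: str) -> None:
        self.pending.append(sentence)

    def start_next(self) -> Optional[str]:
        """Move the next buffered sentence into generation, if there is one."""
        if not self.pending:
            return None
        self.current_index = self._next_index
        self._next_index += 1
        self.state = "generating"
        return self.pending.popleft()

    def on_cancel(self) -> Dict[str, object]:
        """Apply steps 1-3 and return the event to send to the client."""
        cancelled = self.current_index
        if cancelled is not None:
            self._abort_current(cancelled)         # step 1: abort the in-flight sentence
        self.pending.clear()                       # step 2: discard buffered sentences
        self.current_index = None
        self.state = "receiving_text"              # step 3: back to receiving text
        return {
            "type": "audio.cancelled",
            "sentence_index": cancelled,           # None if nothing was generating (assumption)
            "reason": "client_cancel",
        }
```

Step 4 needs no extra code in this sketch: after `on_cancel()` the session accepts `on_text()` again exactly as before.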

### Session Lifecycle with Cancel

```
config → text → text → cancel → text → done → session.done
                  ↓        ↓        ↓
             audio.start  audio.cancelled  audio.start → audio → audio.done
```
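The client-message order in the diagram can be written down as a small transition table, which may be handy for protocol tests. This is a sketch: the table uses the diagram's short labels rather than full message types, and rejecting every transition the diagram does not show (for example `cancel` directly after `config`) is an assumption, not something the RFC specifies.

```python
# Allowed next client messages, read off the lifecycle diagram and the Behavior steps.
TRANSITIONS = {
    "start": {"config"},
    "config": {"text"},
    "text": {"text", "cancel", "done"},
    "cancel": {"text", "done"},
    "done": set(),
}


def is_valid_sequence(messages):
    """Return True if the client messages follow the table and end with "done"."""
    state = "start"
    for message in messages:
        if message not in TRANSITIONS[state]:
            return False
        state = message
    return state == "done"
```

For instance, `is_valid_sequence(["config", "text", "text", "cancel", "text", "done"])` accepts the exact sequence drawn above.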

## Implementation Considerations

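- Cancellation must propagate to the engine (`abort_request`) so that generation actually stops; halting only the WebSocket stream would leave the GPU busy.
- Race condition: `input.cancel` can arrive while audio chunks for the current sentence are already being sent. The server should stop sending the remaining chunks.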
- Open question: should `input.cancel` cancel only the current sentence or all pending sentences? Step 2 under Behavior currently discards all buffered sentences.
- One option is a `cancel_scope` field: `"current"` (default) vs `"all"`.
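One way to handle the race-condition and `cancel_scope` points, sketched with `asyncio`. Everything here is hypothetical: `stream_chunks`, `remaining_after_cancel`, and the `send` callback are illustrative names, and the real engine call (`abort_request`) is left out because its signature is not given in this RFC. The idea is that the sender re-checks a shared cancel flag before every chunk, so a cancel that lands mid-stream stops the next send.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def stream_chunks(
    chunks: Iterable[bytes],
    send: Callable[[bytes], Awaitable[None]],
    cancelled: asyncio.Event,
) -> int:
    """Send chunks until exhausted or until `cancelled` is set; return the count sent."""
    sent = 0
    for chunk in chunks:
        if cancelled.is_set():   # re-check before every send to narrow the race window
            break
        await send(chunk)
        sent += 1
    return sent


def remaining_after_cancel(pending: List[str], cancel_scope: str = "current") -> List[str]:
    """Return the buffered sentences that survive a cancel with the given scope."""
    if cancel_scope == "current":
        return list(pending)     # only the in-flight sentence is aborted
    if cancel_scope == "all":
        return []                # buffered sentences are dropped as well
    raise ValueError(f"unknown cancel_scope: {cancel_scope!r}")
```

A chunk whose `send` is already in progress when the flag is set will still go out, so the client should be prepared to drop audio that arrives shortly after it sends `input.cancel`.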

## Related

- PR #1230 — Streaming text input (foundation)
- The OpenAI Realtime API supports `response.cancel` for a similar purpose
