[RFC] Multi-Stage Abort / Barge-in for Omni Models #1480

@lishunyang12

Motivation

In voice agent scenarios, users frequently interrupt ("barge in") while the agent is still speaking. The current streaming TTS WebSocket (PR #1230) has no mechanism for the client to cancel in-progress audio generation mid-sentence.

Without interruption support, the server continues generating audio for the remaining sentences even after the user has started speaking again, wasting GPU resources and adding latency to the next response.

Use Cases

  • Voice assistants: User says "stop" or starts a new question while the agent is mid-sentence — the agent should immediately stop speaking and listen
  • Contact center bots: Caller interrupts with "actually, never mind" — bot should stop current TTS and process the new input
  • Interactive gaming: Player takes an action mid-dialogue — NPC should stop current line and react

Proposed Approach

Add an input.cancel message type to the WebSocket protocol:

// Client sends to cancel current generation:
{"type": "input.cancel"}

// Server responds:
{"type": "audio.cancelled", "sentence_index": 2, "reason": "client_cancel"}
// Then resumes listening for new input.text or input.done
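
On the client side, the exchange above could be handled roughly as follows. This is a minimal sketch, not part of the proposal: the `BargeInClient` class and its method names are hypothetical, and it assumes audio chunks arrive as JSON events with type "audio" (the actual protocol from PR #1230 may use binary frames instead).

```python
import json


class BargeInClient:
    """Hypothetical client-side helper for the proposed input.cancel message."""

    def __init__(self):
        # True between sending input.cancel and receiving audio.cancelled.
        self.awaiting_ack = False

    def cancel_message(self) -> str:
        """Build the input.cancel message and start discarding stale audio."""
        self.awaiting_ack = True
        return json.dumps({"type": "input.cancel"})

    def should_play(self, raw: str) -> bool:
        """Return True if this server event is audio that should be played."""
        msg = json.loads(raw)
        if msg.get("type") == "audio.cancelled":
            # Server acknowledged the cancel; later audio belongs to new input.
            self.awaiting_ack = False
            return False
        # Assumption: audio chunks are JSON events with type "audio".
        return msg.get("type") == "audio" and not self.awaiting_ack
```

Dropping audio until audio.cancelled arrives keeps chunks that were already in flight from being played after the user has interrupted.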

Behavior

  1. input.cancel immediately aborts generation for the current sentence
  2. Any buffered sentences that haven't started generating are discarded
  3. Server sends audio.cancelled and returns to the text-receiving state
  4. Client can then send new text or input.done
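
The four steps above could be sketched on the server with asyncio task cancellation. Everything here is illustrative: `TTSSession`, `synthesize`, and `send` are hypothetical names standing in for whatever the handler in PR #1230 uses, and the choice to still send audio.cancelled when nothing was generating is an assumption the RFC does not settle.

```python
import asyncio


class TTSSession:
    """Hypothetical per-connection state for the streaming TTS WebSocket."""

    def __init__(self, synthesize, send):
        self._synthesize = synthesize  # assumed: async generator yielding audio chunks
        self._send = send              # assumed: coroutine delivering one event dict
        self.pending = []              # buffered sentences that have not started
        self.sentence_index = 0        # index of the sentence being generated
        self._task = None

    async def on_text(self, sentence):
        """Buffer a sentence and start the generation loop if it is idle."""
        self.pending.append(sentence)
        if self._task is None or self._task.done():
            self._task = asyncio.create_task(self._run())

    async def _run(self):
        while self.pending:
            sentence = self.pending.pop(0)
            idx = self.sentence_index
            await self._send({"type": "audio.start", "sentence_index": idx})
            async for chunk in self._synthesize(sentence):
                await self._send({"type": "audio", "data": chunk})
            await self._send({"type": "audio.done", "sentence_index": idx})
            self.sentence_index += 1

    async def on_cancel(self):
        """Handle input.cancel following steps 1 to 3 of the Behavior list."""
        self.pending.clear()  # step 2: discard sentences that have not started
        interrupted = self._task is not None and not self._task.done()
        if interrupted:
            self._task.cancel()  # step 1: abort the current sentence
            try:
                await self._task
            except asyncio.CancelledError:
                pass
        # Step 3: acknowledge; the session is then ready for new text (step 4).
        await self._send({
            "type": "audio.cancelled",
            "sentence_index": self.sentence_index,
            "reason": "client_cancel",
        })
        if interrupted:
            self.sentence_index += 1  # assumption: the cancelled index is not reused
```

Cancelling the task only stops the Python-side loop; a real implementation would also need to abort the request inside the engine.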

Session Lifecycle with Cancel

config → text → text → cancel → text → done → session.done
                  ↓        ↓        ↓
             audio.start  audio.cancelled  audio.start → audio → audio.done

Implementation Considerations

  • Need to propagate cancellation to the engine (abort_request)
  • Race condition: cancel arrives while audio is already being sent — server should stop sending remaining chunks
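  • Open question: should input.cancel cancel only the current sentence or all pending sentences? Step 2 under Behavior currently assumes all buffered sentences are discarded.
  • Consider adding a cancel_scope field: "current" (default) vs "all". With "current" as the default, step 2 under Behavior would apply only when cancel_scope is "all".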
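
One possible way to combine these considerations is an epoch counter that invalidates chunks produced before a cancel, plus a scope switch for the pending queue. The sketch below is an assumption-laden illustration: `CancelGate` is a hypothetical name, and `abort_request` is passed in as a plain callback because the RFC does not specify the engine's actual abort signature.

```python
from collections import deque
from typing import Callable, Deque, Optional


class CancelGate:
    """Hypothetical bookkeeping for input.cancel with a cancel_scope field."""

    def __init__(self, abort_request: Callable[[str], None]):
        self._abort_request = abort_request  # assumed hook into the engine
        self.epoch = 0                       # bumped on every cancel
        self.pending: Deque[str] = deque()   # request ids not yet started
        self.current: Optional[str] = None   # request id being generated

    def start_next(self) -> Optional[str]:
        """Move the next pending request into the 'current' slot."""
        self.current = self.pending.popleft() if self.pending else None
        return self.current

    def cancel(self, scope: str = "current") -> None:
        if scope not in ("current", "all"):
            raise ValueError(f"unknown cancel_scope: {scope!r}")
        self.epoch += 1  # chunks tagged with an older epoch are now stale
        if self.current is not None:
            self._abort_request(self.current)  # propagate to the engine
            self.current = None
        if scope == "all":
            self.pending.clear()

    def should_send(self, chunk_epoch: int) -> bool:
        """Drop chunks that were produced before the most recent cancel."""
        return chunk_epoch == self.epoch
```

The sender would tag each chunk with the epoch read when generation started and call `should_send` just before writing to the socket, which covers the case where a cancel arrives while chunks are already queued for sending.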
