[Refactor] Decouple TTS usage metrics from OpenAI-compatible `/v1/audio/speech` response #174

## Background

In PR #173 we added `X-Prompt-Tokens`, `X-Completion-Tokens`, and `X-Engine-Time` as custom HTTP response headers on the `/v1/audio/speech` endpoint to expose per-request token usage and engine timing for benchmarking.

While this works, it **does not conform to the OpenAI `/v1/audio/speech` API specification**. The OpenAI speech endpoint returns a raw audio body with no usage metadata — there is no standard way to carry token counts or timing info in the response.

Embedding custom headers is a pragmatic short-term solution, but it diverges from the API contract we claim to be compatible with, and may confuse clients that expect strict OpenAI compatibility.

## Problem

- Custom `X-*` headers on a compatibility endpoint break the "drop-in replacement" promise.
- There is no OpenAI-standard mechanism to return usage info from the speech endpoint.
- As we add more models and metrics, stuffing everything into headers does not scale.

## Possible Directions (open for discussion)

1. **Sidecar endpoint or query parameter**: return usage from a separate endpoint such as `/v1/audio/speech/usage`, or accept `?include_usage=true` and wrap the audio and usage together in a JSON response.
2. **HTTP trailers (chunked transfer encoding)**: send the audio as a chunked body and append usage as trailer fields. Requires client support, and some HTTP clients and intermediaries ignore or drop trailers.
3. **Keep the headers behind an opt-in flag**: emit the `X-*` headers only when the client sends a specific request header (e.g., `X-Include-Usage: true`), so the default behavior stays OpenAI-compatible.
4. **SSE / streaming mode with structured events**: similar to chat completions streaming, emit audio chunks followed by a final `usage` event.
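
To make direction 3 concrete, here is a minimal sketch of the gating logic in plain Python, independent of any web framework. The function name `usage_headers` and the usage keys `prompt_tokens` and `completion_tokens` are assumptions for illustration (only `engine_time_s` appears in the codebase references below); the real handler in `_register_speech` may be structured differently.

```python
from __future__ import annotations

from typing import Mapping

# Hypothetical opt-in request header, taken from the example in direction 3.
OPT_IN_HEADER = "x-include-usage"


def usage_headers(
    request_headers: Mapping[str, str], usage: Mapping[str, float]
) -> dict[str, str]:
    """Return the X-* usage headers only when the client opted in."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in request_headers.items()}
    if normalized.get(OPT_IN_HEADER, "").strip().lower() != "true":
        # Default path: emit nothing non-standard.
        return {}
    return {
        # The usage key names here are assumed, not taken from the codebase.
        "X-Prompt-Tokens": str(usage["prompt_tokens"]),
        "X-Completion-Tokens": str(usage["completion_tokens"]),
        "X-Engine-Time": f"{usage['engine_time_s']:.3f}",
    }
```

With this shape, a client that sends no opt-in header gets a response with only standard headers, while the benchmark script could keep its current header-based parsing by adding one request header.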

None of these is clearly superior. Community input is welcome.
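
As one illustration of the trade-offs, the JSON wrapper from direction 1 could look roughly like the sketch below. The field names (`audio`, `media_type`, `usage`) and both helper functions are hypothetical; nothing here is an OpenAI-defined format.

```python
from __future__ import annotations

import base64
import json


def wrap_speech_response(
    audio: bytes, usage: dict, media_type: str = "audio/wav"
) -> str:
    """Build a hypothetical JSON body for `?include_usage=true`."""
    return json.dumps(
        {
            # Raw bytes cannot go into JSON directly, so base64-encode them.
            "audio": base64.b64encode(audio).decode("ascii"),
            "media_type": media_type,
            "usage": usage,
        }
    )


def unwrap_speech_response(body: str) -> tuple[bytes, dict]:
    """Client-side inverse: recover the audio bytes and the usage dict."""
    payload = json.loads(body)
    return base64.b64decode(payload["audio"]), payload["usage"]
```

The cost is visible in the code: base64 inflates the audio payload by about one third, and the client has to decode the JSON before it can play or save anything, so this mode would only make sense as an explicit opt-in.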

## Current Behavior (status quo)

The `/v1/audio/speech` endpoint returns:
- **Body**: raw WAV/MP3 audio bytes
- **Headers**: `X-Prompt-Tokens`, `X-Completion-Tokens`, `X-Engine-Time` (non-standard)
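
For reference, a client that depends on the current behavior has to do something like the following. This is a sketch, not the code in `benchmarks/benchmark_tts_speed.py`: the function name `read_usage_headers` is hypothetical, and it assumes every header value is a plain number.

```python
from __future__ import annotations

from typing import Mapping, Optional


def read_usage_headers(headers: Mapping[str, str]) -> dict[str, Optional[float]]:
    """Extract usage metrics from the non-standard X-* response headers."""
    # Header names are case-insensitive; HTTP libraries differ in how they
    # expose them, so normalize to lowercase first.
    normalized = {name.lower(): value for name, value in headers.items()}

    def _number(name: str) -> Optional[float]:
        raw = normalized.get(name.lower())
        if raw is None:
            return None  # header absent, e.g. when talking to a stock OpenAI server
        try:
            return float(raw)
        except ValueError:
            return None  # assumed numeric format did not hold

    return {
        "prompt_tokens": _number("X-Prompt-Tokens"),
        "completion_tokens": _number("X-Completion-Tokens"),
        "engine_time": _number("X-Engine-Time"),
    }
```

Any replacement mechanism needs a migration path for callers written this way, since they silently get `None` for every metric once the headers disappear.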

## Relevant Code

- `sglang_omni/serve/openai_api.py` — header injection in `_register_speech`
- `sglang_omni/models/fishaudio_s2_pro/pipeline/stages.py` — usage dict construction in vocoder stage
- `sglang_omni/client/types.py` — `UsageInfo.engine_time_s` field
- `benchmarks/benchmark_tts_speed.py` — reads `X-*` headers for metrics

