Description
Background
In PR #173 we added X-Prompt-Tokens, X-Completion-Tokens, and X-Engine-Time as custom HTTP response headers on the /v1/audio/speech endpoint to expose per-request token usage and engine timing for benchmarking.
While this works, it does not conform to the OpenAI /v1/audio/speech API specification. The OpenAI speech endpoint returns a raw audio body with no usage metadata — there is no standard way to carry token counts or timing info in the response.
Embedding custom headers is a pragmatic short-term solution, but it diverges from the API contract we claim to be compatible with, and may confuse clients that expect strict OpenAI compatibility.
Problem
- Custom `X-*` headers on a compatibility endpoint break the "drop-in replacement" promise.
- There is no OpenAI-standard mechanism to return usage info from the speech endpoint.
- As we add more models and metrics, stuffing everything into headers does not scale.
Possible Directions (open for discussion)
- Separate `/v1/audio/speech/usage` or query param: return usage in a sidecar endpoint or via `?include_usage=true` that wraps the response in JSON.
- Trailing headers (HTTP chunked): send audio as chunked body, append usage as trailing headers. Requires client support.
- Keep headers but behind an opt-in flag: only emit `X-*` headers when the client sends a specific request header (e.g., `X-Include-Usage: true`), so default behavior stays OpenAI-compatible.
- SSE / streaming mode with structured events: similar to chat completions streaming, emit audio chunks + a final `usage` event.
None of these is clearly superior. Community input is welcome.
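To make the opt-in flag direction concrete, here is a minimal sketch of the gating logic, independent of any web framework. The function name `usage_response_headers` and the usage dict keys (`prompt_tokens`, `completion_tokens`, `engine_time_s`) are assumptions for illustration and may not match the actual code in `_register_speech`; the `X-Include-Usage` header is only the example name proposed above.

```python
from typing import Mapping

# Hypothetical opt-in request header, taken from the example in this issue.
OPT_IN_HEADER = "x-include-usage"


def usage_response_headers(
    request_headers: Mapping[str, str],
    usage: Mapping[str, float],
) -> dict[str, str]:
    """Return the X-* usage headers only when the client opted in."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {k.lower(): v for k, v in request_headers.items()}
    if normalised.get(OPT_IN_HEADER, "").strip().lower() != "true":
        # Default path: emit nothing non-standard.
        return {}
    # Assumed usage keys; adjust to whatever the vocoder stage really produces.
    return {
        "X-Prompt-Tokens": str(usage["prompt_tokens"]),
        "X-Completion-Tokens": str(usage["completion_tokens"]),
        "X-Engine-Time": str(usage["engine_time_s"]),
    }
```

With this shape, a strict OpenAI client that never sends the opt-in header would see no extra headers, while the benchmark script could request them explicitly.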
Current Behavior (status quo)
The /v1/audio/speech endpoint returns:
- Body: raw WAV/MP3 audio bytes
- Headers: `X-Prompt-Tokens`, `X-Completion-Tokens`, `X-Engine-Time` (non-standard)
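For reference, a client that wants these values today has to pull them out of the response headers itself. The sketch below shows one way to do that from any header mapping; the helper name `parse_usage_headers` is hypothetical, and it assumes the token headers carry integers and `X-Engine-Time` carries a decimal number (the unit is not stated here). It is not the parsing code used in `benchmarks/benchmark_tts_speed.py`.

```python
from typing import Callable, Mapping, Optional


def parse_usage_headers(headers: Mapping[str, str]) -> dict[str, Optional[float]]:
    """Extract the non-standard usage headers, tolerating their absence."""
    # Header names are case-insensitive, so compare in lower case.
    lowered = {k.lower(): v for k, v in headers.items()}

    def read(name: str, cast: Callable[[str], float]) -> Optional[float]:
        raw = lowered.get(name)
        if raw is None:
            return None  # header not sent (e.g. a strict OpenAI server)
        try:
            return cast(raw)
        except ValueError:
            return None  # malformed value; treat as missing

    return {
        "prompt_tokens": read("x-prompt-tokens", int),
        "completion_tokens": read("x-completion-tokens", int),
        "engine_time": read("x-engine-time", float),
    }
```

Returning `None` for missing values keeps the same client usable against servers that do not send the headers at all.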
Relevant Code
- `sglang_omni/serve/openai_api.py`: header injection in `_register_speech`
- `sglang_omni/models/fishaudio_s2_pro/pipeline/stages.py`: usage dict construction in vocoder stage
- `sglang_omni/client/types.py`: `UsageInfo.engine_time_s` field
- `benchmarks/benchmark_tts_speed.py`: reads `X-*` headers for metrics