
[Bug]: /v1/audio/speech with stream_format=sse returns raw audio for OpenAI-compatible TTS backend instead of text/event-stream #24301

@tg-bomze

Description


Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

The LiteLLM proxy handles normal /v1/audio/speech requests correctly, but requests with stream_format="sse" are not proxied as an SSE stream.

I verified that a simple non-streaming TTS request through LiteLLM works:

  • model="gpt-4o-mini-tts"
  • no explicit stream_format
  • response: 200 OK
  • x-litellm-model-api-base: https://api.openai.com

However, when testing /v1/audio/speech with stream_format="sse" against an OpenAI-compatible TTS backend behind LiteLLM, the proxy does not preserve SSE behavior.

Instead of returning Content-Type: text/event-stream and data: {...} events, the proxy returns a binary audio response.

Expected behavior:

  • stream_format="sse" should return an SSE stream of audio events.

Actual behavior:

  • LiteLLM returns a normal binary audio response instead of SSE.
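The difference between the two behaviors is visible in the Content-Type header alone. A minimal sketch of the check (the helper name is hypothetical, not part of LiteLLM):

```python
def is_sse_response(content_type: str) -> bool:
    """Return True if a Content-Type header indicates an SSE stream."""
    # SSE responses use text/event-stream, possibly with parameters
    # such as "; charset=utf-8" appended.
    return content_type.split(";")[0].strip().lower() == "text/event-stream"

# With stream_format="sse" the proxy should return an SSE content type,
# but per this report it returns raw audio instead:
assert is_sse_response("text/event-stream; charset=utf-8")
assert not is_sse_response("audio/mpeg")
```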

Steps to Reproduce

  1. Verify that normal TTS works through LiteLLM:
curl -i -sS "https://<litellm-host>/audio/speech" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <litellm-key>" \
  -d '{
    "input":"Hello from gpt-4o-mini-tts.",
    "voice":"alloy",
    "model":"gpt-4o-mini-tts",
    "response_format":"pcm"
  }'
  2. Verify that the same endpoint behaves incorrectly when SSE is requested:
curl -i -sS "https://<litellm-host>/audio/speech" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <litellm-key>" \
  -d '{
    "input":"Hello from gpt-4o-mini-tts.",
    "voice":"alloy",
    "model":"gpt-4o-mini-tts",
    "response_format":"pcm",
    "stream_format":"sse"
  }'
  3. Observe that the response does not behave like OpenAI speech SSE. In my setup, LiteLLM returns HTTP 500 Internal Server Error for this request instead of an SSE stream.

  4. Compare this with an OpenAI-compatible upstream TTS backend that supports SSE directly: when called without LiteLLM, the same /audio/speech request shape returns Content-Type: text/event-stream and data: {"type":"speech.audio.delta", ...} events.
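For reference, step 2 can also be reproduced from Python. This is a hedged sketch, not LiteLLM code: `<litellm-host>`/`<litellm-key>` remain placeholders, the helper names are hypothetical, and the `requests` package is assumed to be installed.

```python
def build_sse_speech_request(text: str) -> dict:
    """Build the /v1/audio/speech body used in the curl repro above."""
    return {
        "input": text,
        "voice": "alloy",
        "model": "gpt-4o-mini-tts",
        "response_format": "pcm",
        "stream_format": "sse",
    }

def reproduce(host: str, key: str) -> None:
    import requests  # assumed installed; not part of the stdlib
    resp = requests.post(
        f"https://{host}/audio/speech",
        headers={"Authorization": f"Bearer {key}"},
        json=build_sse_speech_request("Hello from gpt-4o-mini-tts."),
        stream=True,
    )
    # Expected: 200 with text/event-stream.
    # Observed through LiteLLM: raw audio/mpeg bytes (or HTTP 500).
    print(resp.status_code, resp.headers.get("content-type"))
```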

Relevant log output

Working non-streaming request through LiteLLM:

HTTP/2 200
content-type: audio/mpeg
x-litellm-model-api-base: https://api.openai.com
x-litellm-response-cost: 2.75e-05
x-litellm-version: 1.81.0


Unexpected response for stream_format="sse" through LiteLLM:

HTTP/2 200
content-type: audio/mpeg
x-litellm-version: 1.81.0
<raw audio bytes...>


Expected SSE shape:

HTTP/1.1 200 OK
content-type: text/event-stream; charset=utf-8

data: {"type":"speech.audio.delta","audio":"..."}
data: {"type":"speech.audio.done","usage":{...}}
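A client consuming the expected SSE shape above would parse each `data:` line as JSON and base64-decode the audio chunks from `speech.audio.delta` events. A minimal sketch, assuming the event shapes shown in this report (the function name is hypothetical):

```python
import base64
import json

def collect_audio(sse_lines):
    """Concatenate audio bytes from speech.audio.delta SSE events."""
    audio = bytearray()
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip comments, blank separators, other fields
        event = json.loads(line[len("data:"):].strip())
        if event.get("type") == "speech.audio.delta":
            # Each delta carries a base64-encoded audio chunk.
            audio.extend(base64.b64decode(event["audio"]))
        elif event.get("type") == "speech.audio.done":
            break
    return bytes(audio)
```

With the buggy behavior described here, a client like this never sees any `data:` lines, because the proxy returns raw audio bytes instead of an event stream.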

What part of LiteLLM is this about?

SDK (litellm Python package)

What LiteLLM version are you on?

v1.81.0

Twitter / LinkedIn details

https://x.com/tg_bomze
