
[Bug]: /v1/audio/speech with stream_format=sse returns raw audio for OpenAI-compatible TTS backend instead of text/event-stream #24301

@tg-bomze

Description


Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

The LiteLLM proxy handles normal /v1/audio/speech requests correctly, but requests with stream_format="sse" are not proxied as an SSE stream.

I verified that a simple non-streaming TTS request through LiteLLM works:

  • model="gpt-4o-mini-tts"
  • no explicit stream_format
  • response: 200 OK
  • x-litellm-model-api-base: https://api.openai.com

However, when testing /v1/audio/speech with stream_format="sse" against an OpenAI-compatible TTS backend behind LiteLLM, the proxy does not preserve SSE behavior.

Instead of returning Content-Type: text/event-stream and data: {...} events, the proxy returns a binary audio response.

Expected behavior:

  • stream_format="sse" should return an SSE stream of audio events.

Actual behavior:

  • LiteLLM returns a normal binary audio response instead of SSE.
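The difference between the two behaviors is visible in the Content-Type header alone. A minimal sketch of the check (the helper name is hypothetical, not part of LiteLLM):

```python
def is_sse_response(content_type: str) -> bool:
    """Return True if a Content-Type header indicates an SSE stream."""
    # SSE responses use text/event-stream, possibly with parameters
    # such as "; charset=utf-8" appended.
    return content_type.split(";")[0].strip().lower() == "text/event-stream"

# With stream_format="sse" the proxy should return an SSE content type,
# but per this report it returns raw audio instead:
assert is_sse_response("text/event-stream; charset=utf-8")
assert not is_sse_response("audio/mpeg")
```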

Steps to Reproduce

  1. Verify that normal TTS works through LiteLLM:
curl -i -sS "https://<litellm-host>/audio/speech" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <litellm-key>" \
  -d '{
    "input":"Hello from gpt-4o-mini-tts.",
    "voice":"alloy",
    "model":"gpt-4o-mini-tts",
    "response_format":"pcm"
  }'
  2. Verify that the same endpoint behaves incorrectly when SSE is requested:
curl -i -sS "https://<litellm-host>/audio/speech" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <litellm-key>" \
  -d '{
    "input":"Hello from gpt-4o-mini-tts.",
    "voice":"alloy",
    "model":"gpt-4o-mini-tts",
    "response_format":"pcm",
    "stream_format":"sse"
  }'
  3. Observe that the response does not behave like OpenAI speech SSE. In my setup, LiteLLM returns HTTP 500 Internal Server Error for this request instead of an SSE stream.

  4. Compare this with an OpenAI-compatible upstream TTS backend that supports SSE directly: when called without LiteLLM, the same /audio/speech request shape returns Content-Type: text/event-stream and data: {"type":"speech.audio.delta", ...} events.
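For reference, step 2 can also be reproduced from Python. This is a hedged sketch, not LiteLLM code: `<litellm-host>`/`<litellm-key>` remain placeholders, the helper names are hypothetical, and the `requests` package is assumed to be installed.

```python
def build_sse_speech_request(text: str) -> dict:
    """Build the /v1/audio/speech body used in the curl repro above."""
    return {
        "input": text,
        "voice": "alloy",
        "model": "gpt-4o-mini-tts",
        "response_format": "pcm",
        "stream_format": "sse",
    }

def reproduce(host: str, key: str) -> None:
    import requests  # assumed installed; not part of the stdlib
    resp = requests.post(
        f"https://{host}/audio/speech",
        headers={"Authorization": f"Bearer {key}"},
        json=build_sse_speech_request("Hello from gpt-4o-mini-tts."),
        stream=True,
    )
    # Expected: 200 with text/event-stream.
    # Observed through LiteLLM: raw audio/mpeg bytes (or HTTP 500).
    print(resp.status_code, resp.headers.get("content-type"))
```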

Relevant log output

Working non-streaming request through LiteLLM:

HTTP/2 200
content-type: audio/mpeg
x-litellm-model-api-base: https://api.openai.com
x-litellm-response-cost: 2.75e-05
x-litellm-version: 1.81.0


Unexpected response for stream_format="sse" through LiteLLM:

HTTP/2 200
content-type: audio/mpeg
x-litellm-version: 1.81.0
<raw audio bytes...>


Expected SSE shape:

HTTP/1.1 200 OK
content-type: text/event-stream; charset=utf-8

data: {"type":"speech.audio.delta","audio":"..."}
data: {"type":"speech.audio.done","usage":{...}}
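A client consuming the expected SSE shape above would parse each `data:` line as JSON and base64-decode the audio chunks from `speech.audio.delta` events. A minimal sketch, assuming the event shapes shown in this report (the function name is hypothetical):

```python
import base64
import json

def collect_audio(sse_lines):
    """Concatenate audio bytes from speech.audio.delta SSE events."""
    audio = bytearray()
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip comments, blank separators, other fields
        event = json.loads(line[len("data:"):].strip())
        if event.get("type") == "speech.audio.delta":
            # Each delta carries a base64-encoded audio chunk.
            audio.extend(base64.b64decode(event["audio"]))
        elif event.get("type") == "speech.audio.done":
            break
    return bytes(audio)
```

With the buggy behavior described here, a client like this never sees any `data:` lines, because the proxy returns raw audio bytes instead of an event stream.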

What part of LiteLLM is this about?

SDK (litellm Python package)

What LiteLLM version are you on?

v1.81.0

Twitter / LinkedIn details

https://x.com/tg_bomze
