Commit 66d4b16

[Frontend] Add OpenAI API support for input_audio (#11027)
Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>
1 parent 0064f69 commit 66d4b16

File tree

5 files changed: +301 -23 lines changed

docs/source/serving/openai_compatible_server.md

Lines changed: 5 additions & 5 deletions

@@ -34,11 +34,6 @@ We currently support the following OpenAI APIs:
   - *Note: `suffix` parameter is not supported.*
 - [Chat Completions API](#chat-api) (`/v1/chat/completions`)
   - Only applicable to [text generation models](../models/generative_models.rst) (`--task generate`) with a [chat template](#chat-template).
-  - [Vision](https://platform.openai.com/docs/guides/vision)-related parameters are supported; see [Multimodal Inputs](../usage/multimodal_inputs.rst).
-    - *Note: `image_url.detail` parameter is not supported.*
-  - We also support `audio_url` content type for audio files.
-    - Refer to [vllm.entrypoints.chat_utils](https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/chat_utils.py) for the exact schema.
-    - *TODO: Support `input_audio` content type as defined [here](https://github.com/openai/openai-python/blob/v1.52.2/src/openai/types/chat/chat_completion_content_part_input_audio_param.py).*
   - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
 - [Embeddings API](#embeddings-api) (`/v1/embeddings`)
   - Only applicable to [embedding models](../models/pooling_models.rst) (`--task embed`).

@@ -209,6 +204,11 @@ The following extra parameters are supported:
 
 Refer to [OpenAI's API reference](https://platform.openai.com/docs/api-reference/chat) for more details.
 
+We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
+[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
+see our [Multimodal Inputs](../usage/multimodal_inputs.rst) guide for more information.
+- *Note: `image_url.detail` parameter is not supported.*
+
 #### Extra parameters
 
 The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.
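
The note restored above covers both vision and audio parameters. As a quick illustration (not part of this diff; the model name and image URL below are placeholders), a vision request against a vLLM OpenAI-compatible server might look like the following sketch; `detail` is simply omitted because the note says it is not supported.

# Hypothetical illustration (not part of this commit): an `image_url` chat
# request against a locally running vLLM OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

chat_completion = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",  # placeholder vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                # `detail` is omitted since it is documented as unsupported.
                "image_url": {"url": "https://example.com/duck.jpg"},
            },
        ],
    }],
    max_completion_tokens=64,
)
print(chat_completion.choices[0].message.content)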

docs/source/usage/multimodal_inputs.rst

Lines changed: 89 additions & 1 deletion

@@ -315,7 +315,95 @@ You can use `these tests <https://github.com/vllm-project/vllm/blob/main/tests/e
 Audio
 ^^^^^
 
-Instead of :code:`image_url`, you can pass an audio file via :code:`audio_url`.
+Audio input is supported according to `OpenAI Audio API <https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in>`_.
+Here is a simple example using Ultravox-v0.3.
+
+First, launch the OpenAI-compatible server:
+
+.. code-block:: bash
+
+    vllm serve fixie-ai/ultravox-v0_3
+
+Then, you can use the OpenAI client as follows:
+
+.. code-block:: python
+
+    import base64
+    import requests
+    from openai import OpenAI
+    from vllm.assets.audio import AudioAsset
+
+    def encode_base64_content_from_url(content_url: str) -> str:
+        """Encode a content retrieved from a remote url to base64 format."""
+
+        with requests.get(content_url) as response:
+            response.raise_for_status()
+            result = base64.b64encode(response.content).decode('utf-8')
+
+        return result
+
+    openai_api_key = "EMPTY"
+    openai_api_base = "http://localhost:8000/v1"
+
+    client = OpenAI(
+        api_key=openai_api_key,
+        base_url=openai_api_base,
+    )
+
+    # Any format supported by librosa is supported
+    audio_url = AudioAsset("winning_call").url
+    audio_base64 = encode_base64_content_from_url(audio_url)
+
+    chat_completion_from_base64 = client.chat.completions.create(
+        messages=[{
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "What's in this audio?"
+                },
+                {
+                    "type": "input_audio",
+                    "input_audio": {
+                        "data": audio_base64,
+                        "format": "wav"
+                    },
+                },
+            ],
+        }],
+        model=model,
+        max_completion_tokens=64,
+    )
+
+    result = chat_completion_from_base64.choices[0].message.content
+    print("Chat completion output from input audio:", result)
+
+Alternatively, you can pass :code:`audio_url`, which is the audio counterpart of :code:`image_url` for image input:
+
+.. code-block:: python
+
+    chat_completion_from_url = client.chat.completions.create(
+        messages=[{
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "What's in this audio?"
+                },
+                {
+                    "type": "audio_url",
+                    "audio_url": {
+                        "url": audio_url
+                    },
+                },
+            ],
+        }],
+        model=model,
+        max_completion_tokens=64,
+    )
+
+    result = chat_completion_from_url.choices[0].message.content
+    print("Chat completion output from audio url:", result)
 
 A full code example can be found in `examples/openai_chat_completion_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py>`_.

examples/openai_chat_completion_client_for_multimodal.py

Lines changed: 31 additions & 3 deletions

@@ -153,10 +153,37 @@ def run_multi_image() -> None:
 
 # Audio input inference
 def run_audio() -> None:
-    # Any format supported by librosa is supported
     audio_url = AudioAsset("winning_call").url
+    audio_base64 = encode_base64_content_from_url(audio_url)
+
+    # OpenAI-compatible schema (`input_audio`)
+    chat_completion_from_base64 = client.chat.completions.create(
+        messages=[{
+            "role":
+            "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "What's in this audio?"
+                },
+                {
+                    "type": "input_audio",
+                    "input_audio": {
+                        # Any format supported by librosa is supported
+                        "data": audio_base64,
+                        "format": "wav"
+                    },
+                },
+            ],
+        }],
+        model=model,
+        max_completion_tokens=64,
+    )
+
+    result = chat_completion_from_base64.choices[0].message.content
+    print("Chat completion output from input audio:", result)
 
-    # Use audio url in the payload
+    # HTTP URL
     chat_completion_from_url = client.chat.completions.create(
         messages=[{
             "role":

@@ -169,6 +196,7 @@ def run_audio() -> None:
             {
                 "type": "audio_url",
                 "audio_url": {
+                    # Any format supported by librosa is supported
                     "url": audio_url
                 },
             },

@@ -181,7 +209,7 @@ def run_audio() -> None:
     result = chat_completion_from_url.choices[0].message.content
     print("Chat completion output from audio url:", result)
 
-    audio_base64 = encode_base64_content_from_url(audio_url)
+    # base64 URL
     chat_completion_from_base64 = client.chat.completions.create(
         messages=[{
             "role":
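The `# base64 URL` variant at the end of this hunk is cut off by the diff view above. Based on the `audio_url` schema shown earlier, a hedged reconstruction might pass the base64 payload as a `data:` URL, roughly as sketched below; it reuses the `client`, `model`, and `audio_base64` names from this example file, and the MIME type is an assumption.

    # Hedged sketch of the truncated "# base64 URL" variant: the same audio
    # bytes, sent through the `audio_url` content part as a data: URL.
    chat_completion_from_base64 = client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this audio?"},
                {
                    "type": "audio_url",
                    "audio_url": {
                        # MIME type is an assumption; any librosa-readable
                        # format should work.
                        "url": f"data:audio/ogg;base64,{audio_base64}"
                    },
                },
            ],
        }],
        model=model,
        max_completion_tokens=64,
    )
    print("Chat completion output from base64 encoded audio:",
          chat_completion_from_base64.choices[0].message.content)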
tests/entrypoints/openai/test_audio.py

Lines changed: 121 additions & 4 deletions

@@ -155,6 +155,61 @@ async def test_single_chat_session_audio_base64encoded(
     assert message.content is not None and len(message.content) >= 0
 
 
+@pytest.mark.asyncio
+@pytest.mark.parametrize("model_name", [MODEL_NAME])
+@pytest.mark.parametrize("audio_url", TEST_AUDIO_URLS)
+async def test_single_chat_session_input_audio(
+        client: openai.AsyncOpenAI, model_name: str, audio_url: str,
+        base64_encoded_audio: Dict[str, str]):
+    messages = [{
+        "role":
+        "user",
+        "content": [
+            {
+                "type": "input_audio",
+                "input_audio": {
+                    "data": base64_encoded_audio[audio_url],
+                    "format": "wav"
+                }
+            },
+            {
+                "type": "text",
+                "text": "What's happening in this audio?"
+            },
+        ],
+    }]
+
+    # test single completion
+    chat_completion = await client.chat.completions.create(
+        model=model_name,
+        messages=messages,
+        max_completion_tokens=10,
+        logprobs=True,
+        top_logprobs=5)
+    assert len(chat_completion.choices) == 1
+
+    choice = chat_completion.choices[0]
+    assert choice.finish_reason == "length"
+    assert chat_completion.usage == openai.types.CompletionUsage(
+        completion_tokens=10, prompt_tokens=202, total_tokens=212)
+
+    message = choice.message
+    message = chat_completion.choices[0].message
+    assert message.content is not None and len(message.content) >= 10
+    assert message.role == "assistant"
+    messages.append({"role": "assistant", "content": message.content})
+
+    # test multi-turn dialogue
+    messages.append({"role": "user", "content": "express your result in json"})
+    chat_completion = await client.chat.completions.create(
+        model=model_name,
+        messages=messages,
+        max_completion_tokens=10,
+    )
+    message = chat_completion.choices[0].message
+    assert message.content is not None and len(message.content) >= 0
+
+
 @pytest.mark.asyncio
 @pytest.mark.parametrize("model_name", [MODEL_NAME])
 @pytest.mark.parametrize("audio_url", TEST_AUDIO_URLS)

@@ -212,11 +267,72 @@ async def test_chat_streaming_audio(client: openai.AsyncOpenAI,
     assert "".join(chunks) == output
 
 
+@pytest.mark.asyncio
+@pytest.mark.parametrize("model_name", [MODEL_NAME])
+@pytest.mark.parametrize("audio_url", TEST_AUDIO_URLS)
+async def test_chat_streaming_input_audio(client: openai.AsyncOpenAI,
+                                          model_name: str, audio_url: str,
+                                          base64_encoded_audio: Dict[str,
+                                                                     str]):
+    messages = [{
+        "role":
+        "user",
+        "content": [
+            {
+                "type": "input_audio",
+                "input_audio": {
+                    "data": base64_encoded_audio[audio_url],
+                    "format": "wav"
+                }
+            },
+            {
+                "type": "text",
+                "text": "What's happening in this audio?"
+            },
+        ],
+    }]
+
+    # test single completion
+    chat_completion = await client.chat.completions.create(
+        model=model_name,
+        messages=messages,
+        max_completion_tokens=10,
+        temperature=0.0,
+    )
+    output = chat_completion.choices[0].message.content
+    stop_reason = chat_completion.choices[0].finish_reason
+
+    # test streaming
+    stream = await client.chat.completions.create(
+        model=model_name,
+        messages=messages,
+        max_completion_tokens=10,
+        temperature=0.0,
+        stream=True,
+    )
+    chunks: List[str] = []
+    finish_reason_count = 0
+    async for chunk in stream:
+        delta = chunk.choices[0].delta
+        if delta.role:
+            assert delta.role == "assistant"
+        if delta.content:
+            chunks.append(delta.content)
+        if chunk.choices[0].finish_reason is not None:
+            finish_reason_count += 1
+    # finish reason should only return in last block
+    assert finish_reason_count == 1
+    assert chunk.choices[0].finish_reason == stop_reason
+    assert delta.content
+    assert "".join(chunks) == output
+
+
 @pytest.mark.asyncio
 @pytest.mark.parametrize("model_name", [MODEL_NAME])
 @pytest.mark.parametrize("audio_url", TEST_AUDIO_URLS)
 async def test_multi_audio_input(client: openai.AsyncOpenAI, model_name: str,
-                                 audio_url: str):
+                                 audio_url: str,
+                                 base64_encoded_audio: Dict[str, str]):
 
     messages = [{
         "role":

@@ -229,9 +345,10 @@ async def test_multi_audio_input(client: openai.AsyncOpenAI, model_name: str,
             }
         },
         {
-            "type": "audio_url",
-            "audio_url": {
-                "url": audio_url
+            "type": "input_audio",
+            "input_audio": {
+                "data": base64_encoded_audio[audio_url],
+                "format": "wav"
             }
         },
         {
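The new tests depend on a `base64_encoded_audio` fixture whose definition is not shown in this diff. A plausible sketch, assuming it maps each entry of `TEST_AUDIO_URLS` to the base64-encoded bytes fetched from that URL, is given below; the placeholder URL stands in for the real list defined in the test module.

# Hypothetical sketch of the `base64_encoded_audio` fixture the new tests use.
import base64
from typing import Dict

import pytest
import requests

TEST_AUDIO_URLS = ["https://example.com/winning_call.ogg"]  # placeholder


@pytest.fixture()
def base64_encoded_audio() -> Dict[str, str]:
    """Map each test audio URL to its base64-encoded raw bytes."""
    encoded: Dict[str, str] = {}
    for audio_url in TEST_AUDIO_URLS:
        response = requests.get(audio_url)
        response.raise_for_status()
        encoded[audio_url] = base64.b64encode(response.content).decode("utf-8")
    return encoded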