Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint #1255

ekagra-ranjan wants to merge 22 commits into vllm-project:main from …er-stable-audio-online

Conversation

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
I just saw this comment. Let me know which examples I should delete in this PR.

I would not call it TTS, since it is an audio generation model, not capable of generating speech.
Tested locally with a local checkpoint path. The Stable Audio specific params (
Should detect the model type from the config/architecture instead of a name substring. Separately, since Stable Audio is an audio generation model (not speech/TTS), should we serve it under a different endpoint?
```python
elif self._is_stable_audio_model():
    # Handle Stable Audio models
    # Stable Audio uses diffusion, needs different parameters
    default_sr = 44100  # Default sample rate for Stable Audio
```
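For context, a quick sketch (assumed arithmetic, not code from this PR) of how the 44.1 kHz default and the audio window parameters translate into an output buffer size:

```python
# Hypothetical illustration of the diffusion branch's defaults.
default_sr = 44100        # default sample rate for Stable Audio (Hz)
audio_start_in_s = 0.0    # start of the requested audio window
audio_end_in_s = 5.0      # end of the requested audio window

# Samples per channel for the requested window.
num_samples = int(default_sr * (audio_end_in_s - audio_start_in_s))
print(num_samples)  # 5 s of 44.1 kHz audio -> 220500 samples per channel
```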
I am not 100% sure how to merge this block with is_tts_model(). As of now, the is_tts_model() block is very Qwen3-specific, with its prompt template and "additional_information", so I think there would be some model-specific if/else. I don't know whether there is an existing standardization across the parameters, or whether that can be done later when standardization happens in vllm-omni.
I'd suggest moving the Stable Audio logic to a separate code path for now, like a diffusion-specific branch, rather than mixing it into the TTS flow. They're fundamentally different (autoregressive TTS vs diffusion gen) and trying to unify them now will be forced. We can revisit standardization later when we have a clearer picture.
I didn't get this part. Are you suggesting that we keep the current PR code as is, i.e., not try to merge this block with is_tts_model()?
Got it - makes sense, I see what you are observing.
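A minimal sketch of the separate code path suggested above, assuming a resolved model-type string (the names `select_generation_path`, `stable_audio`, and `qwen3_tts` are illustrative, not the PR's actual identifiers):

```python
def select_generation_path(model_type: str) -> str:
    # Keep diffusion text-to-audio and autoregressive TTS on
    # independent branches instead of merging them prematurely.
    if model_type == "stable_audio":
        return "diffusion"
    if model_type in ("qwen3_omni_moe", "qwen3_tts"):
        return "tts"
    raise ValueError(f"unsupported model type: {model_type}")
```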
…_type code across stage config loading. Avoid in-place change in default sampling arg.

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
vllm_omni/entrypoints/omni.py (outdated)

```python
tokenizer = kwargs.get("tokenizer", None)

base_engine_args = {"tokenizer": tokenizer} if tokenizer is not None else None
self.model_type = resolve_model_type(model)
```
I added this to get an identifier that can be relied on when a local path to the model is used. After adding this, I realised that resolve_model_config_path() and load_stage_configs_from_model() share some operations, so I refactored them to reuse the intermediate variables.

@linyueqian - Please let me know if there was a better way to use an existing identifier, in case I missed it.
For Qwen3-Omni and Qwen3-TTS, we get model type through engine_client.model_config.hf_config which vLLM populates from config.json at init time, so it works with local paths out of the box. Could you check if the same approach works here instead of adding a separate resolution step?
I gave the engine_client.model_config route a shot, but it doesn't work for Stable Audio. I believe this is the reason: a diffusion model may have model_config as None.
This is an interesting point. My understanding is that Stable Audio is similar to any other non-streaming TTS model where the input is text and the output is audio. It is correct that the audio is not speech, but it is still audio. The primary objective of this PR was to support pure diffusion text-to-audio models with the speech endpoint, but I am happy to introduce a new endpoint if you think that is the right approach.
Pull request overview

Adds OpenAI-compatible online serving support for diffusion-based text-to-audio models (specifically Stable Audio), extending the existing /v1/audio/speech endpoint beyond Qwen3-TTS. This includes request schema extensions, serving-path routing for Stable Audio parameters (including 44.1 kHz defaults), and new end-to-end usage examples.

Changes:
- Extend `OpenAICreateSpeechRequest` with Stable Audio/diffusion-specific parameters (negative prompt, guidance scale, inference steps, seed, audio length/start).
- Add Stable Audio handling in `OmniOpenAIServingSpeech.create_speech`, plus diffusion-mode server initialization wiring.
- Refactor stage-config discovery to resolve `model_type` separately, and add Stable Audio online serving docs + client examples.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| vllm_omni/entrypoints/utils.py | Splits model type resolution from stage-config path resolution; adjusts stage config loading signature. |
| vllm_omni/entrypoints/openai/serving_speech.py | Adds Stable Audio diffusion parameter handling and default sample rate selection. |
| vllm_omni/entrypoints/openai/protocol/audio.py | Adds Stable Audio-specific request fields to the OpenAI speech request model. |
| vllm_omni/entrypoints/openai/audio_utils_mixin.py | Adds handling for stereo tensors shaped as [channels, samples]. |
| vllm_omni/entrypoints/openai/api_server.py | Adds diffusion-only openai_serving_models fallback and initializes speech serving for pure diffusion mode. |
| vllm_omni/entrypoints/omni_llm.py | Updates stage-config loading to use resolve_model_type + config path. |
| vllm_omni/entrypoints/omni.py | Updates stage initialization to use resolve_model_type + config path. |
| tests/entrypoints/test_omni_llm.py | Updates mocks to match the new load_stage_configs_from_model(config_path=...) signature. |
| tests/entrypoints/test_omni_diffusion.py | Updates mocks to match the new load_stage_configs_from_model(config_path=...) signature. |
| examples/online_serving/stable_audio/stable_audio_client.py | Adds a Python client example for /v1/audio/speech Stable Audio usage. |
| examples/online_serving/stable_audio/curl_examples.sh | Adds curl examples for Stable Audio online serving. |
| examples/online_serving/stable_audio/README.md | Adds Stable Audio online serving documentation and usage guide. |
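The `audio_utils_mixin.py` row above mentions stereo tensors shaped `[channels, samples]`; a small sketch of the kind of normalization involved (an assumption about the shape convention, using plain lists for illustration):

```python
def to_samples_first(audio: list[list[float]]) -> list[list[float]]:
    # WAV writers usually expect [samples, channels]; stereo model
    # output may arrive channels-first as [channels, samples].
    if len(audio) in (1, 2) and len(audio) < len(audio[0]):
        return [list(frame) for frame in zip(*audio)]  # transpose
    return audio
```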
Agreed.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@Gaohan123 @linyueqian - I've added a new endpoint. I plan to add tests similar to the existing ones.
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Thanks for the updates @ekagra-ranjan, the separation looks much cleaner now. Before merging, could you add tests? Also, the PR title still says "TTS"; worth updating since we agreed Stable Audio is audio generation. And it would be good to document the default behavior.
```python
negative_prompt: str | list[str] | None = None,
audio_end_in_s: float | None = None,
audio_start_in_s: float = 0.0,
num_inference_steps: int = 100,
```
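Put together, the fields above could form a request schema along these lines (a dataclass sketch for illustration; the real protocol model lives in `vllm_omni/entrypoints/openai/protocol/audio.py` and likely uses Pydantic with more fields and validation):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AudioGenerateRequest:
    model: str
    input: str
    # Stable Audio / diffusion-specific parameters with the defaults above.
    negative_prompt: str | list[str] | None = None
    audio_end_in_s: float | None = None
    audio_start_in_s: float = 0.0
    num_inference_steps: int = 100


req = AudioGenerateRequest(model="stable-audio-open-1.0", input="ocean waves")
print(req.num_inference_steps)  # 100
```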
@linyueqian - I've added the tests, an Audio Generate API doc similar to this, and an online serving doc similar to this. Please have a look when you can!
linyueqian left a comment
LGTM, tested locally with stable-audio-open-1.0 and the generated audio sounds reasonable.
@vllm-omni-reviewer
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
I have updated the PR and resolved the merge conflict after recent changes in #939. Please have a look. cc: @hsliuustc0106
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Purpose

- Introduce the `/v1/audio/generate` endpoint as per this discussion.

As of now, only Qwen3 TTS was supported for online serving. This PR adds support for pure diffusion text-to-audio models like Stable Audio to online serving.

1. Added Stable Audio-specific parameters to `OpenAICreateAudioGenerateRequest`
   - Extend the protocol with Stable Audio specific params.
   - file: `vllm_omni/entrypoints/openai/protocol/audio.py`
2. Serving Logic
   - file: `vllm_omni/entrypoints/openai/serving_audio_generate.py` - relevant logic to enable `/v1/audio/generate`
   - file: `vllm_omni/entrypoints/openai/api_server.py` - register `/v1/audio/generate`
   - `_DiffusionServingModels` now mocks missing attributes like `input_processor`, `model_config`, and `renderer` from `OpenAIServingModels`, which is needed since `OmniOpenAIServingSpeech` inherits from `OpenAIServing`. The mock avoids having to update `_DiffusionServingModels` every time we upgrade the vLLM version, which can add new attributes to `OpenAIServingModels` (e.g. `renderer` was newly added in vLLM 0.15).
3. Documentation and Examples

Created complete example suite:

- `examples/online_serving/stable_audio/README.md` - full documentation
- `examples/online_serving/stable_audio/curl_examples.sh` - shell script examples
- `examples/online_serving/stable_audio/stable_audio_client.py` - Python client
- `docs/serving/audio_generate_api.md` - `/v1/audio/generate` API doc
- `docs/user_guide/examples/online_serving/text_to_audio.md` - online serving user guide doc

4. Add test
   - file: `tests/entrypoints/openai_api/test_serving_audio_generate.py`

Test Plan

- Start the server and run `curl_examples.sh`.
- Unit test: `pytest tests/entrypoints/openai_api/test_serving_audio_generate.py` passes.

Test Result

- dog_5s.wav
- ocean.wav
- thunder_rain.wav
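The curl examples can also be exercised from Python; a minimal stdlib-only client sketch (the payload field names follow the parameters added in this PR; the exact request schema may differ):

```python
import json
from urllib import request

payload = {
    "model": "stabilityai/stable-audio-open-1.0",
    "input": "thunder and heavy rain",
    "negative_prompt": "low quality",
    "audio_start_in_s": 0.0,
    "audio_end_in_s": 5.0,
    "num_inference_steps": 100,
}

req = request.Request(
    "http://localhost:8000/v1/audio/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a server running, uncomment to fetch the generated audio bytes:
# with request.urlopen(req) as resp:
#     open("out.wav", "wb").write(resp.read())
```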
Essential Elements of an Effective PR Description Checklist

- Update `supported_models.md` and `examples` for a new model.