
Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint#1255

Open
ekagra-ranjan wants to merge 22 commits into vllm-project:main from ekagra-ranjan:er-stable-audio-online

Conversation

@ekagra-ranjan
Contributor

@ekagra-ranjan ekagra-ranjan commented Feb 6, 2026

Purpose

1. Added Stable Audio-specific parameters to OpenAICreateAudioGenerateRequest

extends the protocol with Stable Audio-specific params
file: vllm_omni/entrypoints/openai/protocol/audio.py

2. Serving Logic

file: vllm_omni/entrypoints/openai/serving_audio_generate.py
relevant logic to enable /v1/audio/generate

file: vllm_omni/entrypoints/openai/api_server.py
register /v1/audio/generate

_DiffusionServingModels now mocks attributes that OpenAIServingModels provides but the diffusion path lacks, such as input_processor, model_config, and renderer. This is needed because OmniOpenAIServingSpeech inherits from OpenAIServing. The mock means _DiffusionServingModels does not have to be updated every time a vllm upgrade adds new attributes to OpenAIServingModels (e.g. renderer was newly added in vllm 0.15).

3. Documentation and Examples

Created complete example suite:

  • examples/online_serving/stable_audio/README.md - Full documentation
  • examples/online_serving/stable_audio/curl_examples.sh - Shell script examples
  • examples/online_serving/stable_audio/stable_audio_client.py - Python client
  • docs/serving/audio_generate_api.md - v1/audio/generate API doc
  • docs/user_guide/examples/online_serving/text_to_audio.md - Online serving user guide doc
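As a taste of the new endpoint, a request could look like the following (hypothetical payload shape based on the parameters discussed in this PR; see curl_examples.sh for the exact invocations):

```shell
# Hypothetical request against a locally running server; field names
# follow the parameters discussed in this PR.
curl -s http://localhost:8000/v1/audio/generate \
  -H "Content-Type: application/json" \
  -d '{
        "model": "stabilityai/stable-audio-open-1.0",
        "prompt": "gentle ocean waves on a beach",
        "audio_end_in_s": 5.0,
        "seed": 42
      }' \
  --output ocean.wav
```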

4. Add test

file: tests/entrypoints/openai_api/test_serving_audio_generate.py

Test Plan

Start the server and run curl_examples.sh:

vllm-omni serve stabilityai/stable-audio-open-1.0 \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --omni

unittest: pytest tests/entrypoints/openai_api/test_serving_audio_generate.py passes

Test Result

dog_5s.wav
ocean.wav
thunder_rain.wav


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ea38d5aeb8

@ekagra-ranjan
Contributor Author

I just saw this comment. Let me know which examples I should delete in this PR.

@linyueqian
Contributor

I would not call it TTS since it is an audio generation model, not capable of generating speech.

@linyueqian
Contributor

Tested locally with a local checkpoint path. The Stable Audio specific params (audio_length, seed, num_inference_steps, etc.) are all silently ignored.

_is_stable_audio_model() hardcodes "stabilityai/stable-audio-open" so it fails for local paths or --served-model-name aliases. The request falls through to the generic branch. Got 87s audio at 24000Hz instead of the requested 5s at 44100Hz.

Should detect the model type from config/architecture instead of a name substring. Also, sampling_params_list is a shared reference that gets mutated in place, which will leak state across concurrent requests.

Separately, since Stable Audio is an audio generation model (not speech/TTS), should we serve it under a different endpoint like /v1/audio/generate instead of /v1/audio/speech? cc @hsliuustc0106 @Gaohan123 thoughts?

Comment on lines 246 to 249
    elif self._is_stable_audio_model():
        # Handle Stable Audio models
        # Stable Audio uses diffusion, needs different parameters
        default_sr = 44100  # Default sample rate for Stable Audio
Contributor Author

I am not 100% sure how to merge this block with is_tts_model(). As of now the is_tts_model() block is very Qwen3-specific, with its prompt template and "additional_information", so I think some model-specific if-else is unavoidable. I don't know of an existing standardization across the parameters — could that be done later, once standardization happens in vllm-omni?

Contributor

I'd suggest moving the Stable Audio logic to a separate code path for now, like a diffusion-specific branch, rather than mixing it into the TTS flow. They're fundamentally different (autoregressive TTS vs diffusion gen) and trying to unify them now will be forced. We can revisit standardization later when we have a clearer picture.

Contributor Author

I didn't get this part. Are you suggesting that we keep the current PR code as is, i.e., not try to merge this block with is_tts_model()?

@ekagra-ranjan
Contributor Author

ekagra-ranjan commented Feb 7, 2026

Should detect model type from config/architecture instead of a name substring.

Got it - that makes sense given what you are observing.

Also sampling_params_list is a shared reference that gets mutated in place, will leak state across concurrent requests.

What is the path forward then, if we cannot change the default sampling params? Okay, I see what you mean: I need to create a new object instead of mutating the default object in place.

…_type code across stage config loading. Avoid inplace change in default sampling arg

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
tokenizer = kwargs.get("tokenizer", None)

base_engine_args = {"tokenizer": tokenizer} if tokenizer is not None else None
self.model_type = resolve_model_type(model)
Contributor Author

I added this to get an identifier which can be relied on when local path to the model is used. After adding this, I realised that resolve_model_config_path() and load_stage_configs_from_model() share some operations so I refactored them to reuse the intermediate variables.

@linyueqian - Please let me know if there is a better way to use an existing identifier, in case I missed it.

Contributor

For Qwen3-Omni and Qwen3-TTS, we get model type through engine_client.model_config.hf_config which vLLM populates from config.json at init time, so it works with local paths out of the box. Could you check if the same approach works here instead of adding a separate resolution step?

Contributor Author

I gave the engine_client.model_config route a shot but it doesn't work for Stable Audio. I believe this is the reason, i.e., a diffusion model may have model_config as None.

@ekagra-ranjan
Contributor Author

ekagra-ranjan commented Feb 8, 2026

Separately, since Stable Audio is an audio generation model (not speech/TTS), should we serve it under a different endpoint like /v1/audio/generate instead of /v1/audio/speech? cc @hsliuustc0106 @Gaohan123 thoughts?

This is an interesting point. My understanding is that Stable Audio is similar to any other non-streaming TTS model: the input is text and the output is audio. It is correct that the audio is not speech, but it's still audio. If we go with the /v1/audio/generate route then we will be introducing another endpoint whose behavior is the same as /v1/audio/speech.

The primary objective of this PR was to support pure diffusion text-to-audio models via the speech endpoint, but I am happy to introduce a new endpoint if you think that is the right approach.

Contributor

Copilot AI left a comment

Pull request overview

Adds OpenAI-compatible online serving support for diffusion-based text-to-audio models (specifically Stable Audio), extending the existing /v1/audio/speech endpoint beyond Qwen3-TTS. This includes request schema extensions, serving-path routing for Stable Audio parameters (including 44.1kHz defaults), and new end-to-end usage examples.

Changes:

  • Extend OpenAICreateSpeechRequest with Stable Audio/diffusion-specific parameters (negative prompt, guidance scale, inference steps, seed, audio length/start).
  • Add Stable Audio handling in OmniOpenAIServingSpeech.create_speech, plus diffusion-mode server initialization wiring.
  • Refactor stage-config discovery to resolve model_type separately, and add Stable Audio online serving docs + client examples.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
vllm_omni/entrypoints/utils.py Splits model type resolution from stage-config path resolution; adjusts stage config loading signature.
vllm_omni/entrypoints/openai/serving_speech.py Adds Stable Audio diffusion parameter handling and default sample rate selection.
vllm_omni/entrypoints/openai/protocol/audio.py Adds Stable Audio-specific request fields to the OpenAI speech request model.
vllm_omni/entrypoints/openai/audio_utils_mixin.py Adds handling for stereo tensors shaped as [channels, samples].
vllm_omni/entrypoints/openai/api_server.py Adds diffusion-only openai_serving_models fallback and initializes speech serving for pure diffusion mode.
vllm_omni/entrypoints/omni_llm.py Updates stage-config loading to use resolve_model_type + config path.
vllm_omni/entrypoints/omni.py Updates stage initialization to use resolve_model_type + config path.
tests/entrypoints/test_omni_llm.py Updates mocks to match the new load_stage_configs_from_model(config_path=...) signature.
tests/entrypoints/test_omni_diffusion.py Updates mocks to match the new load_stage_configs_from_model(config_path=...) signature.
examples/online_serving/stable_audio/stable_audio_client.py Adds a Python client example for /v1/audio/speech Stable Audio usage.
examples/online_serving/stable_audio/curl_examples.sh Adds curl examples for Stable Audio online serving.
examples/online_serving/stable_audio/README.md Adds Stable Audio online serving documentation and usage guide.


@Gaohan123
Collaborator

Tested locally with a local checkpoint path. The Stable Audio specific params (audio_length, seed, num_inference_steps, etc.) are all silently ignored.

_is_stable_audio_model() hardcodes "stabilityai/stable-audio-open" so it fails for local paths or --served-model-name aliases. The request falls through to the generic branch. Got 87s audio at 24000Hz instead of the requested 5s at 44100Hz.

Should detect model type from config/architecture instead of a name substring. Also sampling_params_list is a shared reference that gets mutated in place, will leak state across concurrent requests.

Separately, since Stable Audio is an audio generation model (not speech/TTS), should we serve it under a different endpoint like /v1/audio/generate instead of /v1/audio/speech? cc @hsliuustc0106 @Gaohan123 thoughts?

Agreed, I think /v1/audio/generate is more suitable for Stable Audio.

ekagra-ranjan and others added 7 commits February 15, 2026 23:35
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@ekagra-ranjan
Contributor Author

ekagra-ranjan commented Feb 16, 2026

@Gaohan123 @linyueqian - I've added a new endpoint /v1/audio/generate as requested and tested it with vllm v0.16. Please have a look!

I plan to add tests similar to test_serving_speech.py for /v1/audio/generate once someone confirms that the code changes align with your expectations.

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@linyueqian
Contributor

Thanks for the updates @ekagra-ranjan, the separation looks much cleaner now.

Before merging, could you add tests for /v1/audio/generate similar to test_serving_speech.py? At minimum covering request validation, parameter wiring, and the audio response format.

Also, the PR title still says "TTS"; worth updating since we agreed Stable Audio is audio generation. It would also be good to document the default behavior when audio_length is omitted (the model defaults to its max of ~47 s, which could surprise users).
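The request-validation coverage being asked for could start from checks like these (a hypothetical validator standing in for the serving layer; the real tests belong in tests/entrypoints/openai_api/test_serving_audio_generate.py):

```python
def validate_generate_request(body: dict) -> list[str]:
    """Hypothetical validator mirroring the checks a serving-layer test
    would exercise; not the actual vllm-omni validation code."""
    errors = []
    if not body.get("prompt"):
        errors.append("prompt is required")
    end = body.get("audio_end_in_s")
    if end is not None and end <= body.get("audio_start_in_s", 0.0):
        errors.append("audio_end_in_s must exceed audio_start_in_s")
    return errors


def test_missing_prompt_rejected():
    assert "prompt is required" in validate_generate_request({})


def test_valid_request_passes():
    assert validate_generate_request({"prompt": "rain", "audio_end_in_s": 5.0}) == []
```

Beyond validation, the suite would also want assertions on parameter wiring into the diffusion sampling params and on the audio response format (content type, sample rate, channel layout).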

@ekagra-ranjan ekagra-ranjan changed the title Add online serving to Stable Audio Diffusion TTS Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint Feb 17, 2026
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
negative_prompt: str | list[str] | None = None,
audio_end_in_s: float | None = None,
audio_start_in_s: float = 0.0,
num_inference_steps: int = 100,
Contributor Author

this has been removed because this PR makes it a no-op, since OmniDiffusionSamplingParams already sets it to 50 as per this

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@ekagra-ranjan
Contributor Author

@linyueqian - I've added the tests, plus an Audio Generate API doc similar to this and an online serving doc similar to this.

Please have a look when you can!

Contributor

@linyueqian linyueqian left a comment

LGTM, tested locally with stable-audio-open-1.0 and the generated audio sounds reasonable.

@hsliuustc0106
Collaborator

@vllm-omni-reviewer

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@ekagra-ranjan
Contributor Author

ekagra-ranjan commented Feb 24, 2026

I have updated the PR and resolved the merge conflict after the recent changes in #939. Please have a look. cc: @hsliuustc0106

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Feb 25, 2026
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>