Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint #1255

ekagra-ranjan wants to merge 22 commits into vllm-project:main from …er-stable-audio-online

Conversation

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
I just saw this comment. Let me know which examples I should delete in this PR.

I would not call it TTS, since it is an audio generation model, not capable of generating speech.
Tested locally with a local checkpoint path. The Stable Audio specific params (
Should detect the model type from the config/architecture instead of a name substring. Separately, since Stable Audio is an audio generation model (not speech/TTS), should we serve it under a different endpoint?
```python
elif self._is_stable_audio_model():
    # Handle Stable Audio models
    # Stable Audio uses diffusion, needs different parameters
    default_sr = 44100  # Default sample rate for Stable Audio
```
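For context, a quick sketch (assumed arithmetic, not code from this PR) of how the 44.1 kHz default and the audio window parameters translate into an output buffer size:

```python
# Hypothetical illustration of the diffusion branch's defaults.
default_sr = 44100        # default sample rate for Stable Audio (Hz)
audio_start_in_s = 0.0    # start of the requested audio window
audio_end_in_s = 5.0      # end of the requested audio window

# Samples per channel for the requested window.
num_samples = int(default_sr * (audio_end_in_s - audio_start_in_s))
print(num_samples)  # 5 s of 44.1 kHz audio -> 220500 samples per channel
```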
I am not 100% sure how to merge this block with is_tts_model(). As of now, the is_tts_model() block is very Qwen3-specific, with its prompt template and "additional_information", so I think there would be some model-specific if/else. I don't know whether there is an existing standardization across the parameters, or whether that can be done later when standardization happens in vllm-omni.
I'd suggest moving the Stable Audio logic to a separate code path for now, like a diffusion-specific branch, rather than mixing it into the TTS flow. They're fundamentally different (autoregressive TTS vs diffusion gen) and trying to unify them now will be forced. We can revisit standardization later when we have a clearer picture.
I didn't get this part. Are you suggesting that we keep the current PR code as is, i.e., not try to merge this block with is_tts_model()?
Got it - makes sense, I see what you are observing.
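A minimal sketch of the separate code path suggested above, assuming a resolved model-type string (the names `select_generation_path`, `stable_audio`, and `qwen3_tts` are illustrative, not the PR's actual identifiers):

```python
def select_generation_path(model_type: str) -> str:
    # Keep diffusion text-to-audio and autoregressive TTS on
    # independent branches instead of merging them prematurely.
    if model_type == "stable_audio":
        return "diffusion"
    if model_type in ("qwen3_omni_moe", "qwen3_tts"):
        return "tts"
    raise ValueError(f"unsupported model type: {model_type}")
```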
…_type code across stage config loading. Avoid in-place change in default sampling arg.

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
vllm_omni/entrypoints/omni.py (outdated)

```python
tokenizer = kwargs.get("tokenizer", None)

base_engine_args = {"tokenizer": tokenizer} if tokenizer is not None else None
self.model_type = resolve_model_type(model)
```
I added this to get an identifier that can be relied on when a local path to the model is used. After adding this, I realised that resolve_model_config_path() and load_stage_configs_from_model() share some operations, so I refactored them to reuse the intermediate variables.

@linyueqian - Please let me know if there was a better way to use an existing identifier, in case I missed it.
For Qwen3-Omni and Qwen3-TTS, we get model type through engine_client.model_config.hf_config which vLLM populates from config.json at init time, so it works with local paths out of the box. Could you check if the same approach works here instead of adding a separate resolution step?
I gave the engine_client.model_config route a shot, but it doesn't work for Stable Audio. I believe this is the reason: a diffusion model may have model_config as None.
This is an interesting point. My understanding is that Stable Audio is similar to any other non-streaming TTS model where the input is text and the output is audio. It is correct that the audio is not speech, but it is still audio. The primary objective of this PR was to support pure diffusion text-to-audio models with the speech endpoint, but I am happy to introduce a new endpoint if you think that is the right approach.
Pull request overview

Adds OpenAI-compatible online serving support for diffusion-based text-to-audio models (specifically Stable Audio), extending the existing /v1/audio/speech endpoint beyond Qwen3-TTS. This includes request schema extensions, serving-path routing for Stable Audio parameters (including 44.1 kHz defaults), and new end-to-end usage examples.

Changes:
- Extend `OpenAICreateSpeechRequest` with Stable Audio/diffusion-specific parameters (negative prompt, guidance scale, inference steps, seed, audio length/start).
- Add Stable Audio handling in `OmniOpenAIServingSpeech.create_speech`, plus diffusion-mode server initialization wiring.
- Refactor stage-config discovery to resolve `model_type` separately, and add Stable Audio online serving docs + client examples.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| vllm_omni/entrypoints/utils.py | Splits model type resolution from stage-config path resolution; adjusts stage config loading signature. |
| vllm_omni/entrypoints/openai/serving_speech.py | Adds Stable Audio diffusion parameter handling and default sample rate selection. |
| vllm_omni/entrypoints/openai/protocol/audio.py | Adds Stable Audio-specific request fields to the OpenAI speech request model. |
| vllm_omni/entrypoints/openai/audio_utils_mixin.py | Adds handling for stereo tensors shaped as [channels, samples]. |
| vllm_omni/entrypoints/openai/api_server.py | Adds diffusion-only openai_serving_models fallback and initializes speech serving for pure diffusion mode. |
| vllm_omni/entrypoints/omni_llm.py | Updates stage-config loading to use resolve_model_type + config path. |
| vllm_omni/entrypoints/omni.py | Updates stage initialization to use resolve_model_type + config path. |
| tests/entrypoints/test_omni_llm.py | Updates mocks to match the new load_stage_configs_from_model(config_path=...) signature. |
| tests/entrypoints/test_omni_diffusion.py | Updates mocks to match the new load_stage_configs_from_model(config_path=...) signature. |
| examples/online_serving/stable_audio/stable_audio_client.py | Adds a Python client example for /v1/audio/speech Stable Audio usage. |
| examples/online_serving/stable_audio/curl_examples.sh | Adds curl examples for Stable Audio online serving. |
| examples/online_serving/stable_audio/README.md | Adds Stable Audio online serving documentation and usage guide. |
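The `audio_utils_mixin.py` row above mentions stereo tensors shaped `[channels, samples]`; a small sketch of the kind of normalization involved (an assumption about the shape convention, using plain lists for illustration):

```python
def to_samples_first(audio: list[list[float]]) -> list[list[float]]:
    # WAV writers usually expect [samples, channels]; stereo model
    # output may arrive channels-first as [channels, samples].
    if len(audio) in (1, 2) and len(audio) < len(audio[0]):
        return [list(frame) for frame in zip(*audio)]  # transpose
    return audio
```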
Agreed.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@Gaohan123 @linyueqian - I've added a new endpoint. I plan to add tests similar to the existing ones.
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Thanks for the updates @ekagra-ranjan, the separation looks much cleaner now. Before merging, could you add tests? Also, the PR title still says "TTS"; worth updating since we agreed Stable Audio is audio generation. And it would be good to document the default behavior.
```python
negative_prompt: str | list[str] | None = None,
audio_end_in_s: float | None = None,
audio_start_in_s: float = 0.0,
num_inference_steps: int = 100,
```
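Put together, the fields above could form a request schema along these lines (a dataclass sketch for illustration; the real protocol model lives in `vllm_omni/entrypoints/openai/protocol/audio.py` and likely uses Pydantic with more fields and validation):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AudioGenerateRequest:
    model: str
    input: str
    # Stable Audio / diffusion-specific parameters with the defaults above.
    negative_prompt: str | list[str] | None = None
    audio_end_in_s: float | None = None
    audio_start_in_s: float = 0.0
    num_inference_steps: int = 100


req = AudioGenerateRequest(model="stable-audio-open-1.0", input="ocean waves")
print(req.num_inference_steps)  # 100
```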
@linyueqian - I've added the tests, an Audio Generate API doc similar to this, and an online serving doc similar to this. Please have a look when you can!
linyueqian left a comment
LGTM, tested locally with stable-audio-open-1.0 and the generated audio sounds reasonable.
@vllm-omni-reviewer
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
I have updated the PR and resolved the merge conflict after recent changes in #939. Please have a look. cc: @hsliuustc0106
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Purpose

- Introduce the `/v1/audio/generate` endpoint as per this discussion.

As of now, only Qwen3 TTS was supported for online serving. This PR adds support for pure diffusion text-to-audio models like Stable Audio to online serving.

1. Added Stable Audio-specific parameters to `OpenAICreateAudioGenerateRequest`
   - Extend the protocol with Stable Audio specific params.
   - file: `vllm_omni/entrypoints/openai/protocol/audio.py`
2. Serving Logic
   - file: `vllm_omni/entrypoints/openai/serving_audio_generate.py` - relevant logic to enable `/v1/audio/generate`
   - file: `vllm_omni/entrypoints/openai/api_server.py` - register `/v1/audio/generate`
   - `_DiffusionServingModels` now mocks missing attributes like `input_processor`, `model_config`, and `renderer` from `OpenAIServingModels`, which is needed since `OmniOpenAIServingSpeech` inherits from `OpenAIServing`. The mock avoids having to update `_DiffusionServingModels` every time we upgrade the vLLM version, which can add new attributes to `OpenAIServingModels` (e.g. `renderer` was newly added in vLLM 0.15).
3. Documentation and Examples

Created complete example suite:

- `examples/online_serving/stable_audio/README.md` - full documentation
- `examples/online_serving/stable_audio/curl_examples.sh` - shell script examples
- `examples/online_serving/stable_audio/stable_audio_client.py` - Python client
- `docs/serving/audio_generate_api.md` - `/v1/audio/generate` API doc
- `docs/user_guide/examples/online_serving/text_to_audio.md` - online serving user guide doc

4. Add test
   - file: `tests/entrypoints/openai_api/test_serving_audio_generate.py`

Test Plan

- Start the server and run `curl_examples.sh`.
- Unit test: `pytest tests/entrypoints/openai_api/test_serving_audio_generate.py` passes.

Test Result

- dog_5s.wav
- ocean.wav
- thunder_rain.wav
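The curl examples can also be exercised from Python; a minimal stdlib-only client sketch (the payload field names follow the parameters added in this PR; the exact request schema may differ):

```python
import json
from urllib import request

payload = {
    "model": "stabilityai/stable-audio-open-1.0",
    "input": "thunder and heavy rain",
    "negative_prompt": "low quality",
    "audio_start_in_s": 0.0,
    "audio_end_in_s": 5.0,
    "num_inference_steps": 100,
}

req = request.Request(
    "http://localhost:8000/v1/audio/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a server running, uncomment to fetch the generated audio bytes:
# with request.urlopen(req) as resp:
#     open("out.wav", "wb").write(resp.read())
```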
Essential Elements of an Effective PR Description Checklist

- Update `supported_models.md` and `examples` for a new model.