
Commit 7a56049

Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint (vllm-project#1255)
1 parent 761eff9 commit 7a56049

16 files changed: +1860, -63 lines

docs/serving/audio_generate_api.md

Lines changed: 338 additions & 0 deletions
# Audio Generate API

vLLM-Omni provides an API for text-to-audio generation using diffusion-based models such as Stable Audio.

Unlike the [Speech API](speech_api.md), which targets text-to-speech synthesis, the Audio Generate API is designed for general-purpose audio generation from text descriptions (sound effects, music, ambient soundscapes, etc.).

Each server instance runs a single model (specified at startup via `vllm-omni serve <model> --omni`).

## Quick Start

### Start the Server

```bash
vllm-omni serve stabilityai/stable-audio-open-1.0 \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --enforce-eager \
  --omni
```
### Generate Audio

**Using curl:**

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
  -H "Content-Type: application/json" \
  -d '{
    "input": "The sound of a cat purring",
    "audio_length": 10.0
  }' --output cat.wav
```

**Using Python:**

```python
import httpx

response = httpx.post(
    "http://localhost:8000/v1/audio/generate",
    json={
        "input": "The sound of a cat purring",
        "audio_length": 10.0,
    },
    timeout=300.0,
)
response.raise_for_status()  # surface HTTP errors instead of writing an error body to the file

with open("cat.wav", "wb") as f:
    f.write(response.content)
```
## API Reference

### Endpoint

```
POST /v1/audio/generate
Content-Type: application/json
```

### Request Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input` | string | **required** | Text prompt describing the audio to generate |
| `model` | string | server's model | Model to use (optional; must match the server's model if specified) |
| `response_format` | string | `"wav"` | Audio format: `wav`, `mp3`, `flac`, `pcm`, `aac`, `opus` |
| `speed` | float | 1.0 | Playback speed (0.25 - 4.0) |

#### Diffusion Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `audio_length` | float | null | Audio duration in seconds (defaults to the model maximum, ~47 s for `stable-audio-open-1.0`) |
| `audio_start` | float | 0.0 | Audio start time in seconds |
| `negative_prompt` | string | null | Text describing what to avoid in generation |
| `guidance_scale` | float | model default | Classifier-free guidance scale (higher = closer adherence to the prompt) |
| `num_inference_steps` | int | model default | Number of denoising steps (higher = better quality, slower) |
| `seed` | int | null | Random seed for reproducible generation |

### Response Format

Returns binary audio data with the appropriate `Content-Type` header:

| `response_format` | Content-Type |
|-------------------|--------------|
| `wav` | `audio/wav` |
| `mp3` | `audio/mpeg` |
| `flac` | `audio/flac` |
| `pcm` | `audio/pcm` |
| `aac` | `audio/aac` |
| `opus` | `audio/opus` |
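The mapping above can be mirrored client-side when saving output. A minimal sketch; `CONTENT_TYPES` and `output_filename` are illustrative helpers of our own, not part of the vLLM-Omni API:

```python
# Client-side mirror of the response_format -> Content-Type table above.
# These names are illustrative helpers, not part of the vLLM-Omni API.
CONTENT_TYPES = {
    "wav": "audio/wav",
    "mp3": "audio/mpeg",
    "flac": "audio/flac",
    "pcm": "audio/pcm",
    "aac": "audio/aac",
    "opus": "audio/opus",
}

def output_filename(stem: str, response_format: str) -> str:
    """Choose a file extension matching the requested response_format."""
    if response_format not in CONTENT_TYPES:
        raise ValueError(f"unsupported response_format: {response_format}")
    return f"{stem}.{response_format}"
```

For example, `output_filename("cat", "mp3")` yields `cat.mp3`, and the server's `Content-Type` header should then read `audio/mpeg`.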
## Examples

### Basic Generation

Generate audio with only a text prompt (model defaults for all other parameters):

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
  -H "Content-Type: application/json" \
  -d '{
    "input": "The sound of ocean waves crashing on a beach"
  }' --output ocean.wav
```

### Custom Duration

Specify an explicit audio length in seconds:

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
  -H "Content-Type: application/json" \
  -d '{
    "input": "A dog barking",
    "audio_length": 5.0
  }' --output dog_5s.wav
```

### High Quality with Negative Prompt

Use a negative prompt to steer generation away from undesired characteristics, and increase inference steps for higher quality:

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
  -H "Content-Type: application/json" \
  -d '{
    "input": "A piano playing a gentle melody",
    "audio_length": 10.0,
    "negative_prompt": "Low quality, distorted, noisy",
    "guidance_scale": 8.0,
    "num_inference_steps": 150
  }' --output piano_hq.wav
```

### Reproducible Generation

Set a `seed` to get deterministic results across runs:

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Thunder and rain sounds",
    "audio_length": 15.0,
    "seed": 42
  }' --output thunder.wav
```

### Full Control

Combine all parameters for precise control over generation:

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Thunder and rain sounds",
    "audio_length": 15.0,
    "negative_prompt": "Low quality",
    "guidance_scale": 7.0,
    "num_inference_steps": 100,
    "seed": 42
  }' --output thunder_rain.wav
```

### Quick Generation (Fewer Steps)

For faster generation with slightly lower quality:

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Birds chirping in a forest",
    "audio_length": 8.0,
    "num_inference_steps": 50
  }' --output birds_quick.wav
```

### Python Client

```python
import httpx

response = httpx.post(
    "http://localhost:8000/v1/audio/generate",
    json={
        "input": "Thunder and rain",
        "audio_length": 15.0,
        "negative_prompt": "Low quality",
        "guidance_scale": 7.0,
        "num_inference_steps": 100,
        "seed": 42,
        "response_format": "wav",
    },
    timeout=300.0,
)
response.raise_for_status()  # surface HTTP errors instead of writing an error body to the file

with open("thunder.wav", "wb") as f:
    f.write(response.content)
```
## Parameter Tuning Guide

### `guidance_scale`

Controls how closely the generated audio follows the text prompt.

| Range | Behaviour |
|-------|-----------|
| 3 - 5 | More creative / varied output |
| 7 (default) | Balanced adherence |
| 10+ | Strict adherence to the prompt |
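One way to audition these ranges is to generate the same prompt at several scales with a fixed seed, so the scale is the only variable. A sketch that only builds the request bodies; `sweep_payloads` is a hypothetical helper, and each payload would be POSTed to `/v1/audio/generate` with any HTTP client:

```python
def sweep_payloads(prompt, scales=(3.0, 7.0, 10.0), audio_length=8.0):
    """Build one /v1/audio/generate request body per guidance_scale.

    A fixed seed isolates the effect of the scale across runs.
    """
    return [
        {
            "input": prompt,
            "audio_length": audio_length,
            "guidance_scale": scale,
            "seed": 42,
        }
        for scale in scales
    ]
```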
### `num_inference_steps`

Controls the number of denoising steps in the diffusion process.

| Steps | Quality | Speed | Use Case |
|-------|---------|-------|----------|
| 50 | Good | Fast | Quick previews |
| 100 | Very good | Medium | General purpose |
| 150+ | Excellent | Slow | Final / critical audio |

### `audio_length`

Duration of the generated audio clip. For `stable-audio-open-1.0`, the maximum is approximately 47 seconds. If omitted, the model uses its own default length.

### `negative_prompt`

Describes characteristics to avoid. Common negative prompts include:

- `"Low quality, distorted, noisy"`
- `"Silence, static"`
- `"Music"` (when generating sound effects only)
## Supported Models

| Model | Description |
|-------|-------------|
| `stabilityai/stable-audio-open-1.0` | Open-source audio generation model, up to ~47 seconds, 44.1 kHz stereo |

## Error Responses

### 400 Bad Request

Invalid or missing parameters:

```json
{
  "error": {
    "message": "Audio generation model did not produce audio output.",
    "type": "BadRequestError",
    "param": null,
    "code": 400
  }
}
```

### 404 Not Found

Model mismatch:

```json
{
  "error": {
    "message": "The model `xxx` does not exist.",
    "type": "NotFoundError",
    "param": "model",
    "code": 404
  }
}
```

### 422 Unprocessable Entity

Pydantic validation failure (e.g. invalid `response_format`, or `speed` out of range):

```json
{
  "detail": [
    {
      "type": "literal_error",
      "msg": "Input should be 'wav', 'pcm', 'flac', 'mp3', 'aac' or 'opus'",
      ...
    }
  ]
}
```
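Clients can normalise the two error shapes above into a single message. A minimal sketch, assuming only the response shapes documented here; `extract_error` is our own helper, not part of vLLM-Omni:

```python
def extract_error(status_code: int, body: dict) -> str:
    """Flatten an error response body into one message string.

    400/404 responses nest details under "error"; 422 validation
    responses carry a list of Pydantic errors under "detail".
    """
    if status_code == 422:
        return "; ".join(item.get("msg", "") for item in body.get("detail", []))
    return body.get("error", {}).get("message", "unknown error")
```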
## Troubleshooting

### "Audio generation model did not produce audio output"

The model finished but returned no audio data. Verify that the server started successfully and the model loaded without errors.

### Server Not Responding

```bash
# Check whether the server is healthy
curl http://localhost:8000/health
```
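When scripting against a freshly started server, it can help to poll `/health` until it answers before sending generation requests. A small sketch using only the standard library; `wait_for_server` and its defaults are illustrative, and the `probe` argument exists so the loop can be tested without a live server:

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://localhost:8000/health",
                    timeout=120.0, interval=2.0, probe=None):
    """Poll the health endpoint until it answers 200 or the timeout expires."""
    if probe is None:
        def probe(u):
            try:
                with urllib.request.urlopen(u, timeout=5.0) as resp:
                    return resp.status == 200
            except (urllib.error.URLError, OSError):
                return False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False
```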
### Audio Quality Issues

- Increase `num_inference_steps` (e.g. 150).
- Add a negative prompt such as `"Low quality, distorted, noisy"`.
- Increase `guidance_scale` for stronger prompt adherence.

### Generation Timeout

- Reduce `num_inference_steps`.
- Reduce `audio_length`.
- Check GPU memory with `nvidia-smi`.

### Out of Memory

- Lower `--gpu-memory-utilization` (e.g. 0.8).
- Reduce `audio_length`.

## Development

Enable debug logging:

```bash
vllm-omni serve stabilityai/stable-audio-open-1.0 \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --enforce-eager \
  --omni \
  --uvicorn-log-level debug
```
