| 1 | +# Audio Generate API |
| 2 | + |
| 3 | +vLLM-Omni provides an API for text-to-audio generation using diffusion-based models such as Stable Audio. |
| 4 | + |
| 5 | +Unlike the [Speech API](speech_api.md), which targets text-to-speech synthesis, the Audio Generate API is designed for general-purpose audio generation from text descriptions (sound effects, music, ambient soundscapes, etc.). |
| 6 | + |
| 7 | +Each server instance runs a single model (specified at startup via `vllm-omni serve <model> --omni`). |
| 8 | + |
| 9 | +## Quick Start |
| 10 | + |
| 11 | +### Start the Server |
| 12 | + |
| 13 | +```bash |
| 14 | +vllm-omni serve stabilityai/stable-audio-open-1.0 \ |
| 15 | + --host 0.0.0.0 \ |
| 16 | + --port 8000 \ |
| 17 | + --gpu-memory-utilization 0.9 \ |
| 18 | + --trust-remote-code \ |
| 19 | + --enforce-eager \ |
| 20 | + --omni |
| 21 | +``` |
| 22 | + |
| 23 | +### Generate Audio |
| 24 | + |
| 25 | +**Using curl:** |
| 26 | + |
| 27 | +```bash |
| 28 | +curl -X POST http://localhost:8000/v1/audio/generate \ |
| 29 | + -H "Content-Type: application/json" \ |
| 30 | + -d '{ |
| 31 | + "input": "The sound of a cat purring", |
| 32 | + "audio_length": 10.0 |
| 33 | + }' --output cat.wav |
| 34 | +``` |
| 35 | + |
| 36 | +**Using Python:** |
| 37 | + |
| 38 | +```python |
| 39 | +import httpx |
| 40 | + |
| 41 | +response = httpx.post( |
| 42 | + "http://localhost:8000/v1/audio/generate", |
| 43 | + json={ |
| 44 | + "input": "The sound of a cat purring", |
| 45 | + "audio_length": 10.0, |
| 46 | + }, |
| 47 | + timeout=300.0, |
| 48 | +) |
| 49 | + |
| 50 | +with open("cat.wav", "wb") as f: |
| 51 | + f.write(response.content) |
| 52 | +``` |
| 53 | + |
| 54 | +## API Reference |
| 55 | + |
| 56 | +### Endpoint |
| 57 | + |
| 58 | +``` |
| 59 | +POST /v1/audio/generate |
| 60 | +Content-Type: application/json |
| 61 | +``` |
| 62 | + |
| 63 | +### Request Parameters |
| 64 | + |
| 65 | +| Parameter | Type | Default | Description | |
| 66 | +|-----------|------|---------|-------------| |
| 67 | +| `input` | string | **required** | Text prompt describing the audio to generate | |
| 68 | +| `model` | string | server's model | Model to use (optional, should match server if specified) | |
| 69 | +| `response_format` | string | "wav" | Audio format: wav, mp3, flac, pcm, aac, opus | |
| 70 | +| `speed` | float | 1.0 | Playback speed (0.25 - 4.0) | |
| 71 | + |
| 72 | +#### Diffusion Parameters |
| 73 | + |
| 74 | +| Parameter | Type | Default | Description | |
| 75 | +|-----------|------|---------|-------------| |
| 76 | +| `audio_length` | float | null | Audio duration in seconds (default value is the max ~47s for `stable-audio-open-1.0`) | |
| 77 | +| `audio_start` | float | 0.0 | Audio start time in seconds | |
| 78 | +| `negative_prompt` | string | null | Text describing what to avoid in generation | |
| 79 | +| `guidance_scale` | float | model default | Classifier-free guidance scale (higher = more adherence to prompt) | |
| 80 | +| `num_inference_steps` | int | model default | Number of denoising steps (higher = better quality, slower) | |
| 81 | +| `seed` | int | null | Random seed for reproducible generation | |
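
The two parameter tables above can be turned into a small request builder. The sketch below is illustrative only: `build_payload` is a hypothetical helper, not part of vLLM-Omni, and the local checks simply mirror the ranges listed in the tables (the server remains the authority on validation). Fields left as `None` are omitted so that the server or model defaults apply.

```python
from typing import Any, Optional

# Values copied from the `response_format` row of the table above.
ALLOWED_FORMATS = {"wav", "mp3", "flac", "pcm", "aac", "opus"}


def build_payload(
    prompt: str,
    *,
    response_format: str = "wav",
    speed: float = 1.0,
    audio_length: Optional[float] = None,
    audio_start: Optional[float] = None,
    negative_prompt: Optional[str] = None,
    guidance_scale: Optional[float] = None,
    num_inference_steps: Optional[int] = None,
    seed: Optional[int] = None,
) -> dict[str, Any]:
    """Hypothetical helper: check documented ranges locally, then build the JSON body."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")

    payload: dict[str, Any] = {
        "input": prompt,
        "response_format": response_format,
        "speed": speed,
    }
    optional = {
        "audio_length": audio_length,
        "audio_start": audio_start,
        "negative_prompt": negative_prompt,
        "guidance_scale": guidance_scale,
        "num_inference_steps": num_inference_steps,
        "seed": seed,
    }
    # Drop unset fields so the server falls back to its own defaults.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

The returned dictionary can be passed as the `json=` argument of the `httpx.post` call shown in the Quick Start.
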
| 82 | + |
| 83 | +### Response Format |
| 84 | + |
| 85 | +Returns binary audio data with the appropriate `Content-Type` header: |
| 86 | + |
| 87 | +| `response_format` | Content-Type | |
| 88 | +|--------------------|--------------| |
| 89 | +| `wav` | `audio/wav` | |
| 90 | +| `mp3` | `audio/mpeg` | |
| 91 | +| `flac` | `audio/flac` | |
| 92 | +| `pcm` | `audio/pcm` | |
| 93 | +| `aac` | `audio/aac` | |
| 94 | +| `opus` | `audio/opus` | |
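
A client can use this mapping to confirm that a response really carries the requested audio format before saving it, since an error response would carry JSON instead. The following is a minimal sketch; `matches_expected_type` is a hypothetical helper name, and the comparison ignores case and any header parameters after `;`.

```python
# Mapping copied from the table above.
EXPECTED_CONTENT_TYPES = {
    "wav": "audio/wav",
    "mp3": "audio/mpeg",
    "flac": "audio/flac",
    "pcm": "audio/pcm",
    "aac": "audio/aac",
    "opus": "audio/opus",
}


def matches_expected_type(response_format: str, content_type_header: str) -> bool:
    """Return True if the Content-Type header matches the requested format."""
    # Keep only the media type, e.g. "audio/wav" from "audio/wav; foo=bar".
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return EXPECTED_CONTENT_TYPES.get(response_format) == media_type
```
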
| 95 | + |
| 96 | +## Examples |
| 97 | + |
| 98 | +### Basic Generation |
| 99 | + |
| 100 | +Generate audio with only a text prompt (model defaults for all other parameters): |
| 101 | + |
| 102 | +```bash |
| 103 | +curl -X POST http://localhost:8000/v1/audio/generate \ |
| 104 | + -H "Content-Type: application/json" \ |
| 105 | + -d '{ |
| 106 | + "input": "The sound of ocean waves crashing on a beach" |
| 107 | + }' --output ocean.wav |
| 108 | +``` |
| 109 | + |
| 110 | +### Custom Duration |
| 111 | + |
| 112 | +Specify an explicit audio length in seconds: |
| 113 | + |
| 114 | +```bash |
| 115 | +curl -X POST http://localhost:8000/v1/audio/generate \ |
| 116 | + -H "Content-Type: application/json" \ |
| 117 | + -d '{ |
| 118 | + "input": "A dog barking", |
| 119 | + "audio_length": 5.0 |
| 120 | + }' --output dog_5s.wav |
| 121 | +``` |
| 122 | + |
| 123 | +### High Quality with Negative Prompt |
| 124 | + |
| 125 | +Use a negative prompt to steer generation away from undesired characteristics, and increase inference steps for higher quality: |
| 126 | + |
| 127 | +```bash |
| 128 | +curl -X POST http://localhost:8000/v1/audio/generate \ |
| 129 | + -H "Content-Type: application/json" \ |
| 130 | + -d '{ |
| 131 | + "input": "A piano playing a gentle melody", |
| 132 | + "audio_length": 10.0, |
| 133 | + "negative_prompt": "Low quality, distorted, noisy", |
| 134 | + "guidance_scale": 8.0, |
| 135 | + "num_inference_steps": 150 |
| 136 | + }' --output piano_hq.wav |
| 137 | +``` |
| 138 | + |
| 139 | +### Reproducible Generation |
| 140 | + |
| 141 | +Set a `seed` to make generation reproducible across runs that use otherwise identical parameters: |
| 142 | + |
| 143 | +```bash |
| 144 | +curl -X POST http://localhost:8000/v1/audio/generate \ |
| 145 | + -H "Content-Type: application/json" \ |
| 146 | + -d '{ |
| 147 | + "input": "Thunder and rain sounds", |
| 148 | + "audio_length": 15.0, |
| 149 | + "seed": 42 |
| 150 | + }' --output thunder.wav |
| 151 | +``` |
| 152 | + |
| 153 | +### Full Control |
| 154 | + |
| 155 | +Combine all parameters for precise control over generation: |
| 156 | + |
| 157 | +```bash |
| 158 | +curl -X POST http://localhost:8000/v1/audio/generate \ |
| 159 | + -H "Content-Type: application/json" \ |
| 160 | + -d '{ |
| 161 | + "input": "Thunder and rain sounds", |
| 162 | + "audio_length": 15.0, |
| 163 | + "negative_prompt": "Low quality", |
| 164 | + "guidance_scale": 7.0, |
| 165 | + "num_inference_steps": 100, |
| 166 | + "seed": 42 |
| 167 | + }' --output thunder_rain.wav |
| 168 | +``` |
| 169 | + |
| 170 | +### Quick Generation (Fewer Steps) |
| 171 | + |
| 172 | +For faster generation with slightly lower quality: |
| 173 | + |
| 174 | +```bash |
| 175 | +curl -X POST http://localhost:8000/v1/audio/generate \ |
| 176 | + -H "Content-Type: application/json" \ |
| 177 | + -d '{ |
| 178 | + "input": "Birds chirping in a forest", |
| 179 | + "audio_length": 8.0, |
| 180 | + "num_inference_steps": 50 |
| 181 | + }' --output birds_quick.wav |
| 182 | +``` |
| 183 | + |
| 184 | +### Python Client |
| 185 | + |
| 186 | +```python |
| 187 | +import httpx |
| 188 | + |
| 189 | +response = httpx.post( |
| 190 | + "http://localhost:8000/v1/audio/generate", |
| 191 | + json={ |
| 192 | + "input": "Thunder and rain", |
| 193 | + "audio_length": 15.0, |
| 194 | + "negative_prompt": "Low quality", |
| 195 | + "guidance_scale": 7.0, |
| 196 | + "num_inference_steps": 100, |
| 197 | + "seed": 42, |
| 198 | + "response_format": "wav", |
| 199 | + }, |
| 200 | + timeout=300.0, |
| 201 | +) |
| 202 | + |
| 203 | +with open("thunder.wav", "wb") as f: |
| 204 | + f.write(response.content) |
| 205 | +``` |
| 206 | + |
| 207 | +## Parameter Tuning Guide |
| 208 | + |
| 209 | +### `guidance_scale` |
| 210 | + |
| 211 | +Controls how closely the generated audio follows the text prompt. |
| 212 | + |
| 213 | +| Range | Behaviour | |
| 214 | +|-------|-----------| |
| 215 | +| 3 - 5 | More creative / varied output | |
| 216 | +| 7 (default) | Balanced adherence | |
| 217 | +| 10+ | Strict adherence to the prompt | |
| 218 | + |
| 219 | +### `num_inference_steps` |
| 220 | + |
| 221 | +Controls the number of denoising steps in the diffusion process. |
| 222 | + |
| 223 | +| Steps | Quality | Speed | Use Case | |
| 224 | +|-------|---------|-------|----------| |
| 225 | +| 50 | Good | Fast | Quick previews | |
| 226 | +| 100 | Very Good | Medium | General purpose | |
| 227 | +| 150+ | Excellent | Slow | Final / critical audio | |
| 228 | + |
| 229 | +### `audio_length` |
| 230 | + |
| 231 | +Duration of the generated audio clip in seconds. For `stable-audio-open-1.0`, the maximum is approximately 47 seconds. If omitted, the model's default length is used, which for this model is that ~47-second maximum. |
| 232 | + |
| 233 | +### `negative_prompt` |
| 234 | + |
| 235 | +Describes characteristics to avoid. Common negative prompts include: |
| 236 | + |
| 237 | +- `"Low quality, distorted, noisy"` |
| 238 | +- `"Silence, static"` |
| 239 | +- `"Music"` (when generating sound effects only) |
| 240 | + |
| 241 | +## Supported Models |
| 242 | + |
| 243 | +| Model | Description | |
| 244 | +|-------|-------------| |
| 245 | +| `stabilityai/stable-audio-open-1.0` | Open-source audio generation model, up to ~47 seconds, 44.1 kHz stereo | |
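
To check that a generated file has the expected properties (for example 44.1 kHz stereo and the requested duration), the WAV header can be read with Python's standard `wave` module. The sketch below assumes the server returns integer PCM WAV data; `wave` raises `wave.Error` for encodings it does not support, such as floating-point samples. `describe_wav` is a hypothetical helper name.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read channel count, sample rate, and duration from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


# Usage with a saved file:
# with open("cat.wav", "rb") as f:
#     print(describe_wav(f.read()))
```
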
| 246 | + |
| 247 | +## Error Responses |
| 248 | + |
| 249 | +### 400 Bad Request |
| 250 | + |
| 251 | +The request could not be fulfilled, for example because of invalid parameters or because the model produced no audio: |
| 252 | + |
| 253 | +```json |
| 254 | +{ |
| 255 | + "error": { |
| 256 | + "message": "Audio generation model did not produce audio output.", |
| 257 | + "type": "BadRequestError", |
| 258 | + "param": null, |
| 259 | + "code": 400 |
| 260 | + } |
| 261 | +} |
| 262 | +``` |
| 263 | + |
| 264 | +### 404 Not Found |
| 265 | + |
| 266 | +The `model` field in the request does not match the model the server is running: |
| 267 | + |
| 268 | +```json |
| 269 | +{ |
| 270 | + "error": { |
| 271 | + "message": "The model `xxx` does not exist.", |
| 272 | + "type": "NotFoundError", |
| 273 | + "param": "model", |
| 274 | + "code": 404 |
| 275 | + } |
| 276 | +} |
| 277 | +``` |
| 278 | + |
| 279 | +### 422 Unprocessable Entity |
| 280 | + |
| 281 | +Pydantic validation failure (e.g. invalid `response_format`, `speed` out of range): |
| 282 | + |
| 283 | +```json |
| 284 | +{ |
| 285 | + "detail": [ |
| 286 | + { |
| 287 | + "type": "literal_error", |
| 288 | + "msg": "Input should be 'wav', 'pcm', 'flac', 'mp3', 'aac' or 'opus'", |
| 289 | + ... |
| 290 | + } |
| 291 | + ] |
| 292 | +} |
| 293 | +``` |
| 294 | + |
| 295 | +## Troubleshooting |
| 296 | + |
| 297 | +### "Audio generation model did not produce audio output" |
| 298 | + |
| 299 | +The model finished but returned no audio data. Verify the server started successfully and the model loaded without errors. |
| 300 | + |
| 301 | +### Server Not Responding |
| 302 | + |
| 303 | +```bash |
| 304 | +# Check if the server is healthy |
| 305 | +curl http://localhost:8000/health |
| 306 | +``` |
| 307 | + |
| 308 | +### Audio Quality Issues |
| 309 | + |
| 310 | +- Increase `num_inference_steps` (e.g. 150). |
| 311 | +- Add a negative prompt: `"Low quality, distorted, noisy"`. |
| 312 | +- Increase `guidance_scale` for stronger prompt adherence. |
| 313 | + |
| 314 | +### Generation Timeout |
| 315 | + |
| 316 | +- Reduce `num_inference_steps`. |
| 317 | +- Reduce `audio_length`. |
| 318 | +- Check GPU memory with `nvidia-smi`. |
| 319 | + |
| 320 | +### Out of Memory |
| 321 | + |
| 322 | +- Lower `--gpu-memory-utilization` (e.g. 0.8). |
| 323 | +- Reduce `audio_length`. |
| 324 | + |
| 325 | +## Development |
| 326 | + |
| 327 | +Enable debug logging: |
| 328 | + |
| 329 | +```bash |
| 330 | +vllm-omni serve stabilityai/stable-audio-open-1.0 \ |
| 331 | + --host 0.0.0.0 \ |
| 332 | + --port 8000 \ |
| 333 | + --gpu-memory-utilization 0.9 \ |
| 334 | + --trust-remote-code \ |
| 335 | + --enforce-eager \ |
| 336 | + --omni \ |
| 337 | + --uvicorn-log-level debug |
| 338 | +``` |