diff --git a/docs/.nav.yml b/docs/.nav.yml index 07db1b4651..73edba1f40 100644 --- a/docs/.nav.yml +++ b/docs/.nav.yml @@ -25,11 +25,13 @@ nav: - Online Serving: - BAGEL-7B-MoT: user_guide/examples/online_serving/bagel.md - Image-To-Image: user_guide/examples/online_serving/image_to_image.md + - Image-To-Video: user_guide/examples/online_serving/image_to_video.md - LoRA Inference(Diffusion): user_guide/examples/online_serving/lora_inference.md - Qwen2.5-Omni: user_guide/examples/online_serving/qwen2_5_omni.md - Qwen3-Omni: user_guide/examples/online_serving/qwen3_omni.md - Qwen3-TTS: user_guide/examples/online_serving/qwen3_tts.md - Text-To-Image: user_guide/examples/online_serving/text_to_image.md + - Text-To-Video: user_guide/examples/online_serving/text_to_video.md - General: - usage/* - Configuration: @@ -48,6 +50,7 @@ nav: - FP8: user_guide/diffusion/quantization/fp8.md - Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md - CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md + - Custom Pipeline: features/custom_pipeline.md - ComfyUI: features/comfyui.md - Developer Guide: - General: diff --git a/docs/api/README.md b/docs/api/README.md index b4ec398f24..85beca0260 100644 --- a/docs/api/README.md +++ b/docs/api/README.md @@ -92,8 +92,10 @@ Configuration classes. Worker classes and model runners for distributed inference. - [vllm_omni.diffusion.worker.diffusion_model_runner.DiffusionModelRunner][] +- [vllm_omni.diffusion.worker.diffusion_worker.CustomPipelineWorkerExtension][] - [vllm_omni.diffusion.worker.diffusion_worker.DiffusionWorker][] - [vllm_omni.diffusion.worker.diffusion_worker.WorkerProc][] +- [vllm_omni.diffusion.worker.diffusion_worker.WorkerWrapperBase][] - [vllm_omni.platforms.npu.worker.npu_ar_model_runner.ExecuteModelState][] - [vllm_omni.platforms.npu.worker.npu_ar_model_runner.NPUARModelRunner][] - [vllm_omni.platforms.npu.worker.npu_ar_worker.NPUARWorker][] diff --git a/docs/user_guide/examples/offline_inference/bagel.md b/docs/user_guide/examples/offline_inference/bagel.md index 8bab929779..84e2e703c8 100644 --- a/docs/user_guide/examples/offline_inference/bagel.md +++ b/docs/user_guide/examples/offline_inference/bagel.md @@ -154,6 +154,24 @@ The default yaml configuration deploys Thinker and DiT on the same GPU. You can ------ +#### Tensor Parallelism (TP) + +For larger models or multi-GPU environments, you can enable Tensor Parallelism (TP) by modifying the stage configuration (e.g., [`bagel.yaml`](https://github.com/vllm-project/vllm-omni/tree/main/vllm_omni/model_executor/stage_configs/bagel.yaml)). + +1. **Set `tensor_parallel_size`**: Increase this value (e.g., to `2` or `4`). +2. **Set `devices`**: Specify the comma-separated GPU IDs to be used for the stage (e.g., `"0,1"`). + +Example configuration for TP=2 on GPUs 0 and 1: +```yaml + engine_args: + tensor_parallel_size: 2 + ... 
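+  # TP=2 runs this stage on the two GPUs listed in runtime.devices ("0,1")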
+ runtime: + devices: "0,1" +``` + +------ + #### 🔗 Runtime Configuration | Parameter | Value | Description | diff --git a/docs/user_guide/examples/online_serving/bagel.md b/docs/user_guide/examples/online_serving/bagel.md index f29c095f4f..cc1973e6cf 100644 --- a/docs/user_guide/examples/online_serving/bagel.md +++ b/docs/user_guide/examples/online_serving/bagel.md @@ -25,12 +25,29 @@ cd /workspace/vllm-omni/examples/online_serving/bagel bash run_server.sh ``` -If you have a custom stage configs file, launch the server with the command below: - ```bash vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/stage_configs_file ``` +#### 🚀 Tensor Parallelism (TP) + +For larger models or multi-GPU environments, you can enable Tensor Parallelism (TP) for the server. + +1. **Modify Stage Config**: Create or modify a stage configuration yaml (e.g., [`bagel.yaml`](https://github.com/vllm-project/vllm-omni/tree/main/vllm_omni/model_executor/stage_configs/bagel.yaml)). Set `tensor_parallel_size` to `2` (or more) and update `devices` to include multiple GPU IDs (e.g., `"0,1"`). + +```yaml + engine_args: + tensor_parallel_size: 2 + ... + runtime: + devices: "0,1" +``` + +2. **Launch Server**: +```bash +vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/your/custom_bagel.yaml +``` + ### Send Multi-modal Request Get into the bagel folder: diff --git a/docs/user_guide/examples/online_serving/image_to_video.md b/docs/user_guide/examples/online_serving/image_to_video.md new file mode 100644 index 0000000000..dcbb100671 --- /dev/null +++ b/docs/user_guide/examples/online_serving/image_to_video.md @@ -0,0 +1,97 @@ +# Image-To-Video + +Source . + + +This example demonstrates how to deploy the Wan2.2 image-to-video model for online video generation using vLLM-Omni. 
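+
+The examples below use `curl` with multipart form fields. As a rough Python sketch of the same request (assuming the `requests` package is installed, the server from the next section is running on `localhost:8091`, and the image path is a placeholder), you can post the form and decode the returned base64 video as shown here; the full parameter list appears in the curl examples further down:
+
+```python
+# Sketch only: mirrors the multipart curl examples below using the `requests` package.
+import base64
+
+import requests
+
+with open("/path/to/qwen-bear.png", "rb") as f:
+    resp = requests.post(
+        "http://localhost:8091/v1/videos",
+        # The reference image is sent as a multipart file part named "input_reference".
+        files={"input_reference": ("qwen-bear.png", f, "image/png")},
+        data={
+            "prompt": "A bear playing with yarn, smooth motion",
+            "negative_prompt": "low quality, blurry, static",
+            "width": 832,
+            "height": 480,
+            "num_frames": 33,
+            "fps": 16,
+            "num_inference_steps": 40,
+            "seed": 42,
+        },
+        timeout=3600,
+    )
+resp.raise_for_status()
+
+# The response carries the video as base64 under data[0].b64_json.
+with open("wan22_i2v_output.mp4", "wb") as out:
+    out.write(base64.b64decode(resp.json()["data"][0]["b64_json"]))
+```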
+ +## Start Server + +### Basic Start + +```bash +vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers --omni --port 8091 +``` + +### Start with Parameters + +Or use the startup script: + +```bash +bash run_server.sh +``` + +The script allows overriding: +- `MODEL` (default: `Wan-AI/Wan2.2-I2V-A14B-Diffusers`) +- `PORT` (default: `8091`) +- `BOUNDARY_RATIO` (default: `0.875`) +- `FLOW_SHIFT` (default: `12.0`) +- `CACHE_BACKEND` (default: `none`) +- `ENABLE_CACHE_DIT_SUMMARY` (default: `0`) + +## API Calls + +### Method 1: Using curl + +```bash +# Basic image-to-video generation +bash run_curl_image_to_video.sh + +# Or execute directly (OpenAI-style multipart) +curl -X POST http://localhost:8091/v1/videos \ + -H "Accept: application/json" \ + -F "prompt=A bear playing with yarn, smooth motion" \ + -F "negative_prompt=low quality, blurry, static" \ + -F "input_reference=@/path/to/qwen-bear.png" \ + -F "width=832" \ + -F "height=480" \ + -F "num_frames=33" \ + -F "fps=16" \ + -F "num_inference_steps=40" \ + -F "guidance_scale=1.0" \ + -F "guidance_scale_2=1.0" \ + -F "boundary_ratio=0.875" \ + -F "flow_shift=12.0" \ + -F "seed=42" | jq -r '.data[0].b64_json' | base64 -d > wan22_i2v_output.mp4 +``` + +## Request Format + +### Required Fields + +```bash +curl -X POST http://localhost:8091/v1/videos \ + -F "prompt=A bear playing with yarn, smooth motion" \ + -F "negative_prompt=low quality, blurry, static" \ + -F "input_reference=@/path/to/qwen-bear.png" +``` + +### Generation with Parameters + +```bash +curl -X POST http://localhost:8091/v1/videos \ + -F "prompt=A bear playing with yarn, smooth motion" \ + -F "negative_prompt=low quality, blurry, static" \ + -F "input_reference=@/path/to/qwen-bear.png" \ + -F "width=832" \ + -F "height=480" \ + -F "num_frames=33" \ + -F "fps=16" \ + -F "num_inference_steps=40" \ + -F "guidance_scale=1.0" \ + -F "guidance_scale_2=1.0" \ + -F "boundary_ratio=0.875" \ + -F "flow_shift=12.0" \ + -F "seed=42" +``` + +## Example materials + +??? abstract "run_curl_image_to_video.sh" + ``````sh + --8<-- "examples/online_serving/image_to_video/run_curl_image_to_video.sh" + `````` +??? 
abstract "run_server.sh" + ``````sh + --8<-- "examples/online_serving/image_to_video/run_server.sh" + `````` diff --git a/docs/user_guide/examples/online_serving/qwen2_5_omni.md b/docs/user_guide/examples/online_serving/qwen2_5_omni.md index 361044ed8f..976d39a966 100644 --- a/docs/user_guide/examples/online_serving/qwen2_5_omni.md +++ b/docs/user_guide/examples/online_serving/qwen2_5_omni.md @@ -30,7 +30,7 @@ cd examples/online_serving/qwen2_5_omni #### Send request via python ```bash -python openai_chat_completion_client_for_multimodal_generation.py --query-type mixed_modalities +python openai_chat_completion_client_for_multimodal_generation.py --query-type mixed_modalities --port 8091 --host "localhost" ``` The Python client supports the following command-line arguments: diff --git a/docs/user_guide/examples/online_serving/qwen3_omni.md b/docs/user_guide/examples/online_serving/qwen3_omni.md index d262465339..c903215f2c 100644 --- a/docs/user_guide/examples/online_serving/qwen3_omni.md +++ b/docs/user_guide/examples/online_serving/qwen3_omni.md @@ -36,7 +36,7 @@ cd examples/online_serving/qwen3_omni #### Send request via python ```bash -python openai_chat_completion_client_for_multimodal_generation.py --query-type use_image +python openai_chat_completion_client_for_multimodal_generation.py --query-type use_image --port 8091 --host "localhost" ``` The Python client supports the following command-line arguments: diff --git a/docs/user_guide/examples/online_serving/qwen3_tts.md b/docs/user_guide/examples/online_serving/qwen3_tts.md index f899e362ee..ca4ad6e36f 100644 --- a/docs/user_guide/examples/online_serving/qwen3_tts.md +++ b/docs/user_guide/examples/online_serving/qwen3_tts.md @@ -3,9 +3,7 @@ Source . -## 🛠️ Installation - -Please refer to [README.md](https://github.com/vllm-project/vllm-omni/tree/main/README.md) +This directory contains examples for running Qwen3-TTS models with vLLM-Omni's online serving API. ## Supported Models @@ -17,62 +15,29 @@ Please refer to [README.md](https://github.com/vllm-project/vllm-omni/tree/main/ | `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | CustomVoice | Smaller/faster variant | | `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | Base | Smaller/faster variant for voice cloning | -## Run examples (Qwen3-TTS) +## Quick Start -### Launch the Server +### 1. 
Start the Server ```bash -# CustomVoice model (predefined speakers) -vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ - --omni \ - --port 8091 \ - --trust-remote-code \ - --enforce-eager +# CustomVoice model (default) +./run_server.sh -# VoiceDesign model -vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ - --omni \ - --port 8091 \ - --trust-remote-code \ - --enforce-eager - -# Base model (voice cloning) -vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ - --omni \ - --port 8091 \ - --trust-remote-code \ - --enforce-eager +# Or specify task type +./run_server.sh CustomVoice +./run_server.sh VoiceDesign +./run_server.sh Base ``` -If you have custom stage configs file, launch the server with command below -```bash -vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ - --stage-configs-path /path/to/stage_configs_file \ - --omni \ - --port 8091 \ - --trust-remote-code \ - --enforce-eager -``` +Or launch directly with vllm serve: -Alternatively, use the convenience script: ```bash -./run_server.sh # Default: CustomVoice model -./run_server.sh CustomVoice # CustomVoice model -./run_server.sh VoiceDesign # VoiceDesign model -./run_server.sh Base # Base (voice clone) model -``` - -### Send TTS Request - -Get into the example folder -```bash -cd examples/online_serving/qwen3_tts +vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ + --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --omni --port 8091 --trust-remote-code --enforce-eager ``` -#### Send request via python +### 2. Run the Client ```bash # CustomVoice: Use predefined speaker @@ -103,21 +68,7 @@ python openai_speech_client.py \ --ref-text "Original transcript of the reference audio" ``` -The Python client supports the following command-line arguments: - -- `--api-base`: API base URL (default: `http://localhost:8091`) -- `--model` (or `-m`): Model name/path (default: `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`) -- `--task-type` (or `-t`): TTS task type. Options: `CustomVoice`, `VoiceDesign`, `Base` -- `--text`: Text to synthesize (required) -- `--voice`: Speaker/voice name (default: `vivian`). Options: `vivian`, `ryan`, `aiden`, etc. -- `--language`: Language. Options: `Auto`, `Chinese`, `English`, `Japanese`, `Korean`, `German`, `French`, `Russian`, `Portuguese`, `Spanish`, `Italian` -- `--instructions`: Voice style/emotion instructions -- `--ref-audio`: Reference audio file path or URL for voice cloning (Base task) -- `--ref-text`: Reference audio transcript for voice cloning (Base task) -- `--response-format`: Audio output format (default: `wav`). Options: `wav`, `mp3`, `flac`, `pcm`, `aac`, `opus` -- `--output` (or `-o`): Output audio file path (default: `tts_output.wav`) - -#### Send request via curl +### 3. 
Using curl ```bash # Simple TTS request @@ -142,56 +93,12 @@ curl -X POST http://localhost:8091/v1/audio/speech \ curl http://localhost:8091/v1/audio/voices ``` -### Using OpenAI SDK - -```python -from openai import OpenAI - -client = OpenAI(base_url="http://localhost:8091/v1", api_key="none") - -response = client.audio.speech.create( - model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", - voice="vivian", - input="Hello, how are you?", -) - -response.stream_to_file("output.wav") -``` - -### Using Python httpx - -```python -import httpx - -response = httpx.post( - "http://localhost:8091/v1/audio/speech", - json={ - "input": "Hello, how are you?", - "voice": "vivian", - "language": "English", - }, - timeout=300.0, -) - -with open("output.wav", "wb") as f: - f.write(response.content) -``` - -### FAQ - -If you encounter error about backend of librosa, try to install ffmpeg with command below. -``` -sudo apt update -sudo apt install ffmpeg -``` - ## API Reference ### Endpoint ``` POST /v1/audio/speech -Content-Type: application/json ``` This endpoint follows the [OpenAI Audio Speech API](https://platform.openai.com/docs/api-reference/audio/createSpeech) format with additional Qwen3-TTS parameters. @@ -231,7 +138,7 @@ Lists available voices for the loaded model: ### Response -Returns binary audio data with appropriate `Content-Type` header (e.g., `audio/wav`). +Returns audio data in the requested format (default: WAV). ## Parameters @@ -258,11 +165,48 @@ Returns binary audio data with appropriate `Content-Type` header (e.g., `audio/w ### Voice Clone Parameters (Base task) -| Parameter | Type | Required | Description | +| Parameter | Type | Default | Description | |-----------|------|----------|-------------| -| `ref_audio` | string | **Yes** | Reference audio (URL or base64 data URL) | -| `ref_text` | string | No | Transcript of reference audio (for ICL mode) | -| `x_vector_only_mode` | bool | No | Use speaker embedding only (no ICL) | +| `ref_audio` | string | null | Reference audio (URL or base64 data URL) | +| `ref_text` | string | null | Transcript of reference audio | +| `x_vector_only_mode` | bool | null | Use speaker embedding only (no ICL) | + +## Python Usage + +### Using OpenAI SDK + +```python +from openai import OpenAI + +client = OpenAI(base_url="http://localhost:8091/v1", api_key="none") + +response = client.audio.speech.create( + model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", + voice="vivian", + input="Hello, how are you?", +) + +response.stream_to_file("output.wav") +``` + +### Using httpx + +```python +import httpx + +response = httpx.post( + "http://localhost:8091/v1/audio/speech", + json={ + "input": "Hello, how are you?", + "voice": "vivian", + "language": "English", + }, + timeout=300.0, +) + +with open("output.wav", "wb") as f: + f.write(response.content) +``` ## Limitations @@ -271,7 +215,7 @@ Returns binary audio data with appropriate `Content-Type` header (e.g., `audio/w ## Troubleshooting -1. **TTS model did not produce audio output**: Ensure you're using the correct model variant for your task type (CustomVoice task → CustomVoice model, etc.) +1. **"TTS model did not produce audio output"**: Ensure you're using the correct model variant for your task type (CustomVoice task → CustomVoice model, etc.) 2. **Connection refused**: Make sure the server is running on the correct port 3. **Out of memory**: Use smaller model variant (`Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`) or reduce `--gpu-memory-utilization` 4. 
**Unsupported speaker**: Use `/v1/audio/voices` to list available voices for the loaded model diff --git a/docs/user_guide/examples/online_serving/text_to_video.md b/docs/user_guide/examples/online_serving/text_to_video.md new file mode 100644 index 0000000000..a632c49a73 --- /dev/null +++ b/docs/user_guide/examples/online_serving/text_to_video.md @@ -0,0 +1,130 @@ +# Text-To-Video + +Source . + + +This example demonstrates how to deploy the Wan2.2 text-to-video model for online video generation using vLLM-Omni. + +## Start Server + +### Basic Start + +```bash +vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8091 +``` + +### Start with Parameters + +Or use the startup script: + +```bash +bash run_server.sh +``` + +The script allows overriding: +- `MODEL` (default: `Wan-AI/Wan2.2-T2V-A14B-Diffusers`) +- `PORT` (default: `8091`) +- `BOUNDARY_RATIO` (default: `0.875`) +- `FLOW_SHIFT` (default: `5.0`) +- `CACHE_BACKEND` (default: `none`) +- `ENABLE_CACHE_DIT_SUMMARY` (default: `0`) + +## API Calls + +### Method 1: Using curl + +```bash +# Basic text-to-video generation +bash run_curl_text_to_video.sh + +# Or execute directly (OpenAI-style multipart) +curl -s http://localhost:8091/v1/videos \ + -H "Accept: application/json" \ + -F "prompt=Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \ + -F "width=832" \ + -F "height=480" \ + -F "num_frames=33" \ + -F "negative_prompt=色调艳丽 ,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" \ + -F "fps=16" \ + -F "num_inference_steps=40" \ + -F "guidance_scale=4.0" \ + -F "guidance_scale_2=4.0" \ + -F "boundary_ratio=0.875" \ + -F "seed=42" | jq -r '.data[0].b64_json' | base64 -d > wan22_output.mp4 +``` + +## Request Format + +### Simple Text-to-Video Generation + +```bash +curl -X POST http://localhost:8091/v1/videos \ + -F "prompt=A cinematic view of a futuristic city at sunset" +``` + +### Generation with Parameters + +```bash +curl -X POST http://localhost:8091/v1/videos \ + -F "prompt=A cinematic view of a futuristic city at sunset" \ + -F "width=832" \ + -F "height=480" \ + -F "num_frames=33" \ + -F "negative_prompt=low quality, blurry, static" \ + -F "fps=16" \ + -F "num_inference_steps=40" \ + -F "guidance_scale=4.0" \ + -F "guidance_scale_2=4.0" \ + -F "boundary_ratio=0.875" \ + -F "flow_shift=5.0" \ + -F "seed=42" +``` + +## Generation Parameters + +| Parameter | Type | Default | Description | +| --------------------- | ------ | ------- | ------------------------------------------------ | +| `prompt` | str | - | Text description of the desired video | +| `negative_prompt` | str | None | Negative prompt | +| `n` | int | 1 | Number of videos to generate | +| `width` | int | None | Video width in pixels | +| `height` | int | None | Video height in pixels | +| `num_frames` | int | None | Number of frames to generate | +| `fps` | int | None | Frames per second for output video | +| `num_inference_steps` | int | None | Number of denoising steps | +| `guidance_scale` | float | None | CFG guidance scale (low-noise stage) | +| `guidance_scale_2` | float | None | CFG guidance scale (high-noise stage, Wan2.2) | +| `boundary_ratio` | float | None | Boundary split ratio for low/high DiT (Wan2.2) | +| `flow_shift` | float | None | Scheduler flow shift (Wan2.2) | +| `seed` | int | None | Random seed (reproducible) | +| `lora` | object | None | LoRA configuration | +| `extra_body` | object | None | Model-specific 
extra parameters | + +## Response Format + +```json +{ + "created": 1234567890, + "data": [ + { "b64_json": "" } + ] +} +``` + +## Extract Video + +```bash +# Extract base64 from response and decode to video +cat response.json | jq -r '.data[0].b64_json' | base64 -d > wan22_output.mp4 +``` + +## Example materials + +??? abstract "run_curl_text_to_video.sh" + ``````sh + --8<-- "examples/online_serving/text_to_video/run_curl_text_to_video.sh" + `````` +??? abstract "run_server.sh" + ``````sh + --8<-- "examples/online_serving/text_to_video/run_server.sh" + ``````
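+
+## Python Usage
+
+For scripted use, here is a rough Python sketch of the multipart request from the curl example above (assuming the `requests` package is installed and the server is running on `localhost:8091`); it sends the same form fields and decodes the base64 response into an MP4 file:
+
+```python
+# Sketch only: mirrors the multipart curl example above using the `requests` package.
+import base64
+
+import requests
+
+fields = {
+    "prompt": "A cinematic view of a futuristic city at sunset",
+    "negative_prompt": "low quality, blurry, static",
+    "width": "832",
+    "height": "480",
+    "num_frames": "33",
+    "fps": "16",
+    "num_inference_steps": "40",
+    "guidance_scale": "4.0",
+    "guidance_scale_2": "4.0",
+    "boundary_ratio": "0.875",
+    "flow_shift": "5.0",
+    "seed": "42",
+}
+
+resp = requests.post(
+    "http://localhost:8091/v1/videos",
+    # (None, value) tuples make requests send each field as a multipart form part,
+    # matching the curl -F flags above.
+    files={k: (None, v) for k, v in fields.items()},
+    timeout=3600,
+)
+resp.raise_for_status()
+
+# The response carries the video as base64 under data[0].b64_json.
+with open("wan22_output.mp4", "wb") as out:
+    out.write(base64.b64decode(resp.json()["data"][0]["b64_json"]))
+```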