Commit 64f5ef6

congw729 and linyueqian authored
[Doc] [skip ci] Sync. (#1363)
Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
1 parent bc13f7c commit 64f5ef6

File tree

15 files changed: +1119 −138 lines changed


.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 3 additions & 3 deletions
@@ -12,10 +12,10 @@ PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTT
 <summary> Essential Elements of an Effective PR Description Checklist </summary>

 - [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
-- [ ] The test plan. Please providing the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the [test style doc](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/tests_style/)
-- [ ] The test results. Please pasting the results comparison before and after, or e2e results.
+- [ ] The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the [test style doc](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/tests_style/)
+- [ ] The test results. Please paste the results comparison before and after, or the e2e results.
 - [ ] (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model. **Please run `mkdocs serve` to sync the documentation editions to `./docs`.**
-- [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft.
+- [ ] (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

 </details>

 **BEFORE SUBMITTING, PLEASE READ <https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md>** (anything written below this line will be removed by GitHub Actions)

docs/.nav.yml

Lines changed: 6 additions & 0 deletions
@@ -13,6 +13,7 @@ nav:
   - examples/README.md
   - Offline Inference:
     - BAGEL-7B-MoT: user_guide/examples/offline_inference/bagel.md
+    - GLM-Image Multistage End-to-End Inference: user_guide/examples/offline_inference/glm_image.md
     - Image-To-Image: user_guide/examples/offline_inference/image_to_image.md
     - Image-To-Video: user_guide/examples/offline_inference/image_to_video.md
     - Qwen2.5-Omni: user_guide/examples/offline_inference/qwen2_5_omni.md
@@ -23,6 +24,7 @@ nav:
     - Text-To-Video: user_guide/examples/offline_inference/text_to_video.md
   - Online Serving:
     - BAGEL-7B-MoT: user_guide/examples/online_serving/bagel.md
+    - GLM-Image Online Serving: user_guide/examples/online_serving/glm_image.md
     - Image-To-Image: user_guide/examples/online_serving/image_to_image.md
     - Image-To-Video: user_guide/examples/online_serving/image_to_video.md
     - Qwen2.5-Omni: user_guide/examples/online_serving/qwen2_5_omni.md
@@ -50,10 +52,13 @@ nav:
     - Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
     - CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
     - LoRA: user_guide/diffusion/lora.md
+    - Hybrid Sharded Data Parallel: design/feature/hsdp.md
+    - Custom Pipeline: features/custom_pipeline.md
     - ComfyUI: features/comfyui.md
   - Developer Guide:
     - General:
      - contributing/README.md
+     - pr_reviewer.md
      - glob: contributing/*
        flatten_single_child_sections: true
    - Model Implementation:
@@ -73,6 +78,7 @@
      - design/feature/tensor_parallel.md
      - design/feature/cache_dit.md
      - design/feature/teacache.md
+     - design/feature/async_chunk_design.md
    - Module Design:
      - design/module/ar_module.md
      - design/module/dit_module.md

docs/user_guide/examples/offline_inference/bagel.md

Lines changed: 86 additions & 0 deletions
@@ -154,6 +154,24 @@ The default yaml configuration deploys Thinker and DiT on the same GPU. You can

------

#### Tensor Parallelism (TP)

For larger models or multi-GPU environments, you can enable Tensor Parallelism (TP) by modifying the stage configuration (e.g., [`bagel.yaml`](https://github.com/vllm-project/vllm-omni/tree/main/vllm_omni/model_executor/stage_configs/bagel.yaml)).

1. **Set `tensor_parallel_size`**: Increase this value (e.g., to `2` or `4`).
2. **Set `devices`**: Specify the comma-separated GPU IDs to be used for the stage (e.g., `"0,1"`).

Example configuration for TP=2 on GPUs 0 and 1:

```yaml
engine_args:
  tensor_parallel_size: 2
  ...
runtime:
  devices: "0,1"
```
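When editing the config, the two fields above have to stay consistent with each other. A quick sanity check of that relationship can be sketched as follows (`check_tp_config` is a hypothetical helper for illustration, not part of vLLM-Omni):

```python
# Hypothetical helper (not part of vLLM-Omni): verify that the number of
# GPU IDs in runtime.devices matches engine_args.tensor_parallel_size.
def check_tp_config(tensor_parallel_size: int, devices: str) -> bool:
    gpu_ids = [d.strip() for d in devices.split(",") if d.strip()]
    return len(gpu_ids) == tensor_parallel_size

assert check_tp_config(2, "0,1")      # matches the TP=2 example above
assert not check_tp_config(4, "0,1")  # TP=4 would need four GPU IDs
```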

------

#### 🔗 Runtime Configuration

| Parameter | Value | Description |
| --- | --- | --- |
| `max_inflight` | `1` | Maximum inflight requests |
| `shm_threshold_bytes` | `65536` | Shared memory threshold (64KB) |

@@ -162,6 +180,74 @@ The default yaml configuration deploys Thinker and DiT on the same GPU. You can

## Using Mooncake Connector

[Mooncake](https://github.com/kvcache-ai/Mooncake) is a high-performance distributed KV cache transfer engine that enables efficient cross-node data movement via TCP or RDMA, making it ideal for multi-node disaggregated inference.

By default, BAGEL uses `SharedMemoryConnector` for inter-stage communication. You can switch to the Mooncake connector for better performance on multi-GPU setups and to enable multi-node deployment.

### Prerequisites

Install the Mooncake transfer engine:

```bash
# For CUDA-enabled systems (recommended)
pip install mooncake-transfer-engine

# For non-CUDA systems
pip install mooncake-transfer-engine-non-cuda
```

### Step 1: Start the Mooncake Master

On the **primary node**, start the Mooncake master service (run it in a separate terminal, or in the background with `&`):

```bash
# Optional: enable disk-backed storage by creating a directory and passing --root_fs_dir.
# Without it, Mooncake runs in memory-only mode, which is sufficient for KV cache transfer.
mkdir -p ./mc_storage

mooncake_master \
  --rpc_port=50051 \
  --enable_http_metadata_server=true \
  --http_metadata_server_host=0.0.0.0 \
  --http_metadata_server_port=8080 \
  --metrics_port=9003 \
  --root_fs_dir=./mc_storage/ \
  --cluster_id=mc-local-1 &
```
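Before moving on, you can confirm the master is actually listening by probing the ports passed above (a stdlib-only sketch; `port_open` is an illustrative helper, not a Mooncake tool):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP listener accepts connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# After starting mooncake_master, both services should accept connections:
# port_open("127.0.0.1", 50051)   # RPC port
# port_open("127.0.0.1", 8080)    # HTTP metadata server
```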

### Step 2: Run Offline Inference with Mooncake

Use the provided Mooncake stage config [`bagel_multiconnector.yaml`](https://github.com/vllm-project/vllm-omni/tree/main/vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml). Before launching, update the `metadata_server` and `master` addresses in the YAML to match your Mooncake master node's IP (use `127.0.0.1` for single-node testing).

```bash
cd examples/offline_inference/bagel

# Text to Image with Mooncake
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2img \
    --prompts "A cute cat" \
    --stage-configs-path ../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml

# Image to Text with Mooncake
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality img2text \
    --image-path /path/to/image.jpg \
    --prompts "Describe this image" \
    --stage-configs-path ../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml

# Text to Text with Mooncake
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2text \
    --prompts "What is the capital of France?" \
    --stage-configs-path ../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml
```

For more details on the Mooncake connector and multi-node setup, see the [Mooncake Store Connector documentation](https://github.com/vllm-project/vllm-omni/tree/main/docs/design/feature/omni_connectors/mooncake_store_connector.md).

------

## FAQ

- If you encounter an error about the backend of librosa, try installing ffmpeg with the command below.
docs/user_guide/examples/offline_inference/glm_image.md

Lines changed: 156 additions & 0 deletions

# GLM-Image Multistage End-to-End Inference

Source: <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/glm_image>.

This example demonstrates how to run GLM-Image with the vLLM-Omni multistage architecture.

## Architecture

GLM-Image uses a 2-stage pipeline:

```
┌─────────────────────────────────────────────────────────────┐
│                     GLM-Image Pipeline                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Stage 0 (AR Model)           Stage 1 (Diffusion)           │
│  ┌─────────────────┐          ┌─────────────────────┐       │
│  │ vLLM-optimized  │          │  GlmImagePipeline   │       │
│  │ GlmImageFor     │  prior   │  ┌───────────────┐  │       │
│  │ Conditional     │──tokens─►│  │ DiT Denoiser  │  │       │
│  │ Generation      │          │  └───────────────┘  │       │
│  │ (9B AR model)   │          │         │           │       │
│  └─────────────────┘          │         ▼           │       │
│          ▲                    │  ┌───────────────┐  │       │
│          │                    │  │  VAE Decode   │──┼──► Image
│     Text/Image                │  └───────────────┘  │       │
│     Input                     └─────────────────────┘       │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

## Features

- **vLLM-optimized AR**: Uses PagedAttention and tensor parallelism for faster prior token generation
- **Flexible deployment**: AR and Diffusion stages can run on different GPUs
- **Text-to-Image**: Generate images from text descriptions
- **Image-to-Image**: Edit existing images with text prompts

## Usage

### Text-to-Image

```bash
python end2end.py \
    --model-path /path/to/glm-image \
    --config-path ../../vllm_omni/model_executor/stage_configs/glm_image.yaml \
    --prompt "A beautiful sunset over the ocean with sailing boats" \
    --height 1024 \
    --width 1024 \
    --output output_t2i.png
```

### Image-to-Image (Image Editing)

```bash
python end2end.py \
    --model-path /path/to/glm-image \
    --config-path ../../vllm_omni/model_executor/stage_configs/glm_image.yaml \
    --prompt "Transform this scene into a winter wonderland" \
    --image input.png \
    --output output_i2i.png
```

### With Custom Parameters

```bash
python end2end.py \
    --model-path /path/to/glm-image \
    --config-path ../../vllm_omni/model_executor/stage_configs/glm_image.yaml \
    --prompt "A photorealistic cat sitting on a window sill" \
    --height 1024 \
    --width 1024 \
    --num-inference-steps 50 \
    --guidance-scale 1.5 \
    --seed 42 \
    --output output.png
```

## Shell Scripts

### Run Text-to-Image

```bash
./run_t2i.sh
```

### Run Image-to-Image

```bash
./run_i2i.sh --image /path/to/input.png
```

## Stage Configuration

The stage config (`glm_image.yaml`) defines:

- **Stage 0 (AR)**: Uses `GPUARWorker` with the vLLM engine
    - Model: `GlmImageForConditionalGeneration`
    - Output: `token_ids` (prior tokens)
- **Stage 1 (Diffusion)**: Uses the diffusion engine
    - Model: `GlmImagePipeline`
    - Output: Generated image

See `vllm_omni/model_executor/stage_configs/glm_image.yaml` for the full configuration.
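The two-stage wiring described above can be sketched as plain data (the field names here are illustrative assumptions, not the real `glm_image.yaml` schema):

```python
# Illustrative sketch of the 2-stage layout; field names are assumptions,
# not the actual glm_image.yaml schema.
stages = [
    {
        "name": "ar",
        "worker": "GPUARWorker",
        "model": "GlmImageForConditionalGeneration",
        "output": "token_ids",   # prior tokens handed to the diffusion stage
    },
    {
        "name": "diffusion",
        "model": "GlmImagePipeline",
        "input": "token_ids",    # consumes stage 0's output
        "output": "image",
    },
]

# Each stage's input must be produced by the stage before it.
for prev, nxt in zip(stages, stages[1:]):
    assert nxt["input"] == prev["output"]
```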

## Comparison with Single-Stage

| Aspect      | Single-Stage (transformers) | Multistage (vLLM)   |
| ----------- | --------------------------- | ------------------- |
| AR Model    | transformers native         | vLLM PagedAttention |
| Memory      | Higher (no KV cache opt)    | Lower (optimized)   |
| Throughput  | Lower                       | Higher              |
| Flexibility | Single GPU                  | Multi-GPU support   |

## Troubleshooting

### OOM Error

Try reducing memory usage:

```bash
# In glm_image.yaml, adjust:
gpu_memory_utilization: 0.5  # Reduce from 0.6
```

### Slow Initialization

The first run loads model weights. Subsequent runs are faster:

```bash
--stage-init-timeout 900  # Increase timeout for slow storage
```

## Requirements

- vLLM-Omni with GLM-Image support
- CUDA-capable GPU (recommended: H100/A100 with 80GB)
- GLM-Image model weights

## Example materials

??? abstract "end2end.py"
    ``````py
    --8<-- "examples/offline_inference/glm_image/end2end.py"
    ``````
??? abstract "run_i2i.sh"
    ``````sh
    --8<-- "examples/offline_inference/glm_image/run_i2i.sh"
    ``````
??? abstract "run_t2i.sh"
    ``````sh
    --8<-- "examples/offline_inference/glm_image/run_t2i.sh"
    ``````

docs/user_guide/examples/offline_inference/image_to_video.md

Lines changed: 5 additions & 0 deletions
@@ -69,6 +69,11 @@ Key arguments:
- `--vae-use-tiling`: Enable VAE tiling for memory optimization.
- `--cfg-parallel-size`: Set it to 2 to enable CFG Parallel. See more examples in the [`user_guide`](https://github.com/vllm-project/vllm-omni/tree/main/docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: Enable CPU offloading for diffusion models.
- `--use-hsdp`: Enable Hybrid Sharded Data Parallel (HSDP) to shard model weights across GPUs.
- `--hsdp-shard-size`: Number of GPUs to shard model weights across within each replica group. `-1` (default) auto-calculates it as `world_size / replicate_size`.
- `--hsdp-replicate-size`: Number of replica groups for HSDP. Each replica group holds a full sharded copy. The default of `1` means pure sharding (no replication).
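The `-1` auto-calculation for `--hsdp-shard-size` can be sketched as follows (an illustration of the documented relationship, not vLLM-Omni's actual resolution code):

```python
def resolve_hsdp_shard_size(world_size: int,
                            replicate_size: int,
                            shard_size: int = -1) -> int:
    """Resolve the HSDP shard size; -1 means world_size / replicate_size."""
    if shard_size == -1:
        if world_size % replicate_size != 0:
            raise ValueError("world_size must be divisible by replicate_size")
        shard_size = world_size // replicate_size
    if shard_size * replicate_size != world_size:
        raise ValueError("shard_size * replicate_size must equal world_size")
    return shard_size

# 8 GPUs, 2 replica groups -> each replica shards weights over 4 GPUs
assert resolve_hsdp_shard_size(8, 2) == 4
# default replicate_size of 1 -> pure sharding across all GPUs
assert resolve_hsdp_shard_size(8, 1) == 8
```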
75+
76+
7277

7378
> ℹ️ If you encounter OOM errors, try using `--vae-use-slicing` and `--vae-use-tiling` to reduce memory usage.
7479

docs/user_guide/examples/offline_inference/qwen3_tts.md

Lines changed: 31 additions & 1 deletion
@@ -90,13 +90,43 @@ Examples:

```bash
python end2end.py --query-type Base --mode-tag icl
```

## Streaming Mode

Add `--streaming` to stream audio chunks progressively via `AsyncOmni` (requires `async_chunk: true` in the stage config):

```bash
python end2end.py --query-type CustomVoice --streaming --output-dir /tmp/out_stream
```

Each 25-frame Code2Wav chunk is logged as it arrives. The final WAV file is written once generation completes. This demonstrates that audio data is available progressively rather than only at the end.

> **Note:** Streaming uses `AsyncOmni` internally. The non-streaming path (`Omni`) is unchanged.
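The chunks-first, file-last pattern can be illustrated with a stdlib-only sketch (this stands in for the streaming loop conceptually; it does not reproduce the `AsyncOmni` API, and the 24 kHz mono PCM16 format is an assumption for the example):

```python
import io
import wave

def write_wav(chunks, sample_rate=24000):
    """Collect PCM16 chunks as they arrive, then write one WAV at the end."""
    received = []
    for chunk in chunks:        # in the real example, chunks arrive asynchronously
        received.append(chunk)  # each chunk is already playable on its own
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)       # mono
        w.setsampwidth(2)       # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(b"".join(received))
    return buf.getvalue()

# Three 25-frame chunks of silence, finalized into a single 75-frame WAV.
wav_bytes = write_wav([b"\x00\x00" * 25] * 3)
```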

## Batched Decoding

The Code2Wav stage (stage 1) supports batched decoding, where multiple requests are decoded in a single forward pass through the SpeechTokenizer. To use it, provide a stage config with `max_batch_size > 1` and pass multiple prompts via `--txt-prompts` with a matching `--batch-size`:

```bash
python end2end.py --query-type CustomVoice \
    --txt-prompts benchmark_prompts.txt \
    --batch-size 4 \
    --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml
```

**Important:** `--batch-size` must match a CUDA graph capture size (1, 2, 4, 8, 16, ...) because the Talker's code predictor KV cache is sized to `max_num_seqs`, and CUDA graphs pad the batch to the next capture size. Both stages need `max_batch_size >= batch_size` in the stage config for batching to take effect. If only stage 1 has a higher `max_batch_size`, it won't help: stage 1 can only batch chunks from requests that are in flight simultaneously, which requires stage 0 to also process multiple requests concurrently.
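The padding behavior described above can be checked ahead of time with a small sketch (illustrative only; the capture-size list is the conventional powers-of-two sequence, not read from vLLM-Omni's config):

```python
def next_capture_size(batch_size: int,
                      capture_sizes=(1, 2, 4, 8, 16, 32)) -> int:
    """Smallest CUDA graph capture size that can hold batch_size."""
    for size in capture_sizes:
        if size >= batch_size:
            return size
    raise ValueError(f"batch_size {batch_size} exceeds largest capture size")

assert next_capture_size(4) == 4  # exact match: no padding wasted
assert next_capture_size(3) == 4  # padded up: pick --batch-size 4 instead
```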
118+
93119
## Notes
94120

95121
- The script uses the model paths embedded in `end2end.py`. Update them if your local cache path differs.
96-
- Use `--output-dir` (preferred) or `--output-wav` to change the output folder.
122+
- Use `--output-dir` to change the output folder.
97123

98124
## Example materials
99125

126+
??? abstract "benchmark_prompts.txt"
127+
``````txt
128+
--8<-- "examples/offline_inference/qwen3_tts/benchmark_prompts.txt"
129+
``````
100130
??? abstract "end2end.py"
101131
``````py
102132
--8<-- "examples/offline_inference/qwen3_tts/end2end.py"
