Commit 0e07eb6

[Doc][Bagel] Add BAGEL-7B-MoT documentation and edit the default stage configuration (vllm-project#987)

Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Signed-off-by: jzz <e1583181@u.nus.edu>

# BAGEL-7B-MoT
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/bagel>.
## Set up
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
**Note**: These examples work with the default configuration on an **NVIDIA A100 (80GB)**. We also tested on dual **NVIDIA RTX 5000 Ada (32GB each)**. For dual-GPU setups, please modify the stage configuration to distribute the model across devices.
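
For a dual-GPU setup, one illustrative change is to pin the DiT stage to the second device and give each stage a larger share of its own GPU. This is a sketch only: the key names follow the stage configuration tables later in this document, but the exact layout of `bagel.yaml` in your vllm-omni version may differ, so verify it against the file itself.

```yaml
# Sketch: split the two stages across GPUs 0 and 1.
# Key names mirror the stage configuration tables in this document;
# check them against your local bagel.yaml before use.
- stage_type: llm             # Stage 0 - Thinker
  devices: "0"
  gpu_memory_utilization: 0.8   # can be raised once DiT no longer shares GPU 0
- stage_type: diffusion       # Stage 1 - DiT
  devices: "1"                # move DiT to the second GPU
  gpu_memory_utilization: 0.8
```

Pass the modified file to `end2end.py` via `--stage-configs-path`.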
Navigate to the bagel example folder:
```bash
cd examples/offline_inference/bagel
```
### Modality Control
BAGEL-7B-MoT supports multiple modality modes. You can control the mode using the `--modality` argument:
#### Text to Image (text2img)
- **Pipeline**: Text → Thinker → DiT → VAE Decode → Image
- **Stages Used**: Stage 0 (Thinker) + Stage 1 (DiT)
- **KV Transfer**: Thinker sends its KV cache to DiT for conditioned generation
Generate images from text prompts:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2img \
    --prompts "A cute cat"
```
#### Image to Image (img2img)
- **Pipeline**: Image → VAE Encode → DiT → VAE Decode → New Image
- **Stages Used**: Stage 1 (DiT) only
- **Special**: Bypasses the Thinker stage for a direct image-to-image transformation
Transform images based on text prompts:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality img2img \
    --image-path /path/to/image.jpg \
    --prompts "Let the woman wear a blue dress"
```
#### Image to Text (img2text)
- **Pipeline**: Image → ViT + VAE Encode → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: Uses both VAE latent encoding and ViT semantic encoding for comprehensive image understanding
Generate text descriptions from images:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality img2text \
    --image-path /path/to/image.jpg \
    --prompts "Describe this image in detail"
```
#### Text to Text (text2text)
- **Pipeline**: Text → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: No visual components involved; operates as a pure language model
Pure text generation:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2text \
    --prompts "What is the capital of France?"

# You can load prompts from a text file (one prompt per line):
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2text \
    --txt-prompts /path/to/prompts.txt
```
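A prompts file is just plain text with one prompt per line. For example (an illustrative sketch; `prompts.txt` is an arbitrary file name):

```bash
# Write two prompts, one per line, to prompts.txt
printf '%s\n' \
  "What is the capital of France?" \
  "Summarize the BAGEL-7B-MoT pipeline in one sentence." \
  > prompts.txt
```

The resulting file can then be passed via `--txt-prompts prompts.txt`.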
### Inference Steps
Control the number of inference steps for image generation:
```bash
# Increase --steps (e.g. to 100) to improve image quality
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2img \
    --steps 50 \
    --prompts "A cute cat"
```
### Key arguments
BAGEL-7B-MoT supports **multiple modality modes** for different use cases.
The default YAML configuration deploys the Thinker and DiT stages on the same GPU; the default configuration file is [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml).
#### 📌 Command Line Arguments (end2end.py)
| Argument | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `--model` | string | `ByteDance-Seed/BAGEL-7B-MoT` | Model path or name |
| `--modality` | choice | `text2img` | Modality mode: `text2img`, `img2img`, `img2text`, `text2text` |
| `--prompts` | list | `None` | Input text prompts passed directly |
| `--txt-prompts` | string | `None` | Path to a txt file with one prompt per line |
| `--image-path` | string | `None` | Input image path (for `img2img`/`img2text`) |
| `--steps` | int | `50` | Number of inference steps |
| `--stage-configs-path` | string | `None` | Custom stage config file path |
| `--worker-backend` | choice | `process` | Worker backend: `process` or `ray` |
| `--ray-address` | string | `None` | Ray cluster address |
| `--enable-stats` | flag | `False` | Enable statistics logging |
| `--init-sleep-seconds` | int | `20` | Initialization sleep time in seconds |
| `--batch-timeout` | int | `5` | Batch timeout |
| `--init-timeout` | int | `300` | Initialization timeout |

------

#### ⚙️ Stage Configuration Parameters (bagel.yaml)
**Stage 0 - Thinker (LLM Stage)**
| Parameter | Value | Description |
| :--- | :--- | :--- |
| `stage_type` | `llm` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `thinker` | Model stage identifier |
| `model_arch` | `BagelForConditionalGeneration` | Model architecture |
| `gpu_memory_utilization` | `0.4` | Fraction of GPU memory to use |
| `tensor_parallel_size` | `1` | Tensor parallel size |
| `max_num_batched_tokens` | `32768` | Maximum batched tokens |
| `omni_kv_config.need_send_cache` | `true` | Whether to send the KV cache |

------

**Stage 1 - DiT (Diffusion Stage)**
| Parameter | Value | Description |
| :--- | :--- | :--- |
| `stage_type` | `diffusion` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `dit` | Model stage identifier |
| `gpu_memory_utilization` | `0.4` | Fraction of GPU memory to use |
| `omni_kv_config.need_recv_cache` | `true` | Whether to receive the KV cache |
| `engine_input_source` | `[0]` | Input source from Stage 0 |

------

#### 🔗 Runtime Configuration
| Parameter | Value | Description |
| :--- | :--- | :--- |
| `window_size` | `-1` | Window size (`-1` means unlimited) |
| `max_inflight` | `1` | Maximum in-flight requests |
| `shm_threshold_bytes` | `65536` | Shared memory threshold (64 KiB) |
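
Put together, the runtime settings above would appear in the stage config roughly as follows (a sketch based only on the values in this table; the exact key nesting in your vllm-omni version may differ, so check `bagel.yaml`):

```yaml
# Sketch of the runtime settings; verify key names against your bagel.yaml.
window_size: -1              # -1 means an unlimited window
max_inflight: 1              # at most one request in flight at a time
shm_threshold_bytes: 65536   # 64 KiB shared-memory threshold
```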
## FAQ
- If you encounter an error about the librosa audio backend, install ffmpeg with the commands below:
```bash
sudo apt update
sudo apt install ffmpeg
```
- If you are unsure how much VRAM the model needs, or you hit an out-of-memory (OOM) error, try decreasing `max_model_len`. Approximate per-stage usage:
| Stage | VRAM |
| :--- | :--- |
| Stage-0 (Thinker) | **15.04 GiB** + KV cache |
| Stage-1 (DiT) | **26.50 GiB** |
| Total | **~42 GiB** + KV cache |
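
Following the advice above, an OOM error on Stage 0 can often be addressed by capping the context length in the Thinker stage entry (a sketch; `max_model_len` is the standard vLLM engine argument the FAQ refers to, but its exact placement in the stage config may differ in your version):

```yaml
# Sketch: shrink the Thinker KV cache by capping the context length.
- stage_type: llm
  model_stage: thinker
  gpu_memory_utilization: 0.4
  max_model_len: 8192   # illustrative value; lower it further if OOM persists
```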
