# BAGEL-7B-MoT

Source: <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/bagel>.

## Set up

Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.

## Run examples

**Note**: These examples work with the default configuration on an **NVIDIA A100 (80GB)**. We also tested on dual **NVIDIA RTX 5000 Ada (32GB each)**. For dual-GPU setups, please modify the stage configuration to distribute the model across devices.

Change into the bagel example directory:

```bash
cd examples/offline_inference/bagel
```

### Modality Control

BAGEL-7B-MoT supports multiple modality modes. You can control the mode using the `--modality` argument:

#### Text to Image (text2img)

- **Pipeline**: Text → Thinker → DiT → VAE Decode → Image
- **Stages Used**: Stage 0 (Thinker) + Stage 1 (DiT)
- **KV Transfer**: Thinker sends KV cache to DiT for conditioned generation

Generate images from text prompts:

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2img \
    --prompts "A cute cat"
```

#### Image to Image (img2img)

- **Pipeline**: Image → VAE Encode → DiT → VAE Decode → New Image
- **Stages Used**: Stage 1 (DiT) only
- **Special**: Bypasses the Thinker stage for direct image-to-image transformation

Transform images based on text prompts:

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality img2img \
    --image-path /path/to/image.jpg \
    --prompts "Let the woman wear a blue dress"
```

#### Image to Text (img2text)

- **Pipeline**: Image → ViT + VAE Encode → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: Uses both VAE latent encoding AND ViT semantic encoding for comprehensive image understanding

Generate text descriptions from images:

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality img2text \
    --image-path /path/to/image.jpg \
    --prompts "Describe this image in detail"
```

#### Text to Text (text2text)

- **Pipeline**: Text → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: No visual components involved; operates as a pure language model

Pure text generation:

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2text \
    --prompts "What is the capital of France?"

# You can load prompts from a text file (one prompt per line):
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2text \
    --txt-prompts /path/to/prompts.txt
```
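
The prompts file is plain text with one prompt per line; a hypothetical `prompts.txt` (the filename is just an example) can be created like this:

```bash
# Write two prompts, one per line; each line becomes a separate request
cat > prompts.txt <<'EOF'
What is the capital of France?
Summarize the BAGEL-7B-MoT pipeline in one sentence.
EOF
```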

### Inference Steps

Control the number of inference steps for image generation:

```bash
# Increase --steps (e.g. to 100) for higher image quality at the cost of latency
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2img \
    --steps 50 \
    --prompts "A cute cat"
```

### Key arguments

The default YAML configuration deploys the Thinker and DiT on the same GPU. You can use the default configuration file: [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)

#### 📌 Command Line Arguments (end2end.py)

| Argument | Type | Default | Description |
| :--------------------- | :----- | :---------------------------- | :----------------------------------------------------------- |
| `--model` | string | `ByteDance-Seed/BAGEL-7B-MoT` | Model path or name |
| `--modality` | choice | `text2img` | Modality mode: `text2img`, `img2img`, `img2text`, `text2text` |
| `--prompts` | list | `None` | Input text prompts directly |
| `--txt-prompts` | string | `None` | Path to a txt file with one prompt per line |
| `--image-path` | string | `None` | Input image path (for `img2img`/`img2text`) |
| `--steps` | int | `50` | Number of inference steps |
| `--stage-configs-path` | string | `None` | Custom stage config file path |
| `--worker-backend` | choice | `process` | Worker backend: `process` or `ray` |
| `--ray-address` | string | `None` | Ray cluster address |
| `--enable-stats` | flag | `False` | Enable statistics logging |
| `--init-sleep-seconds` | int | `20` | Sleep time during initialization (seconds) |
| `--batch-timeout` | int | `5` | Batch timeout |
| `--init-timeout` | int | `300` | Initialization timeout |

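These flags compose; for example, to run with a custom stage config on an existing Ray cluster with statistics logging enabled (the config path is a placeholder, and passing `auto` as the Ray address follows Ray's usual convention — check `python end2end.py --help` for the exact accepted values):

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2img \
    --prompts "A cute cat" \
    --stage-configs-path ./my_bagel.yaml \
    --worker-backend ray \
    --ray-address auto \
    --enable-stats
```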
------

#### ⚙️ Stage Configuration Parameters (bagel.yaml)

**Stage 0 - Thinker (LLM Stage)**

| Parameter | Value | Description |
| :------------------------------- | :------------------------------ | :----------------------- |
| `stage_type` | `llm` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `thinker` | Model stage identifier |
| `model_arch` | `BagelForConditionalGeneration` | Model architecture |
| `gpu_memory_utilization` | `0.4` | GPU memory utilization |
| `tensor_parallel_size` | `1` | Tensor parallel size |
| `max_num_batched_tokens` | `32768` | Maximum batched tokens |
| `omni_kv_config.need_send_cache` | `true` | Whether to send KV cache |

------

**Stage 1 - DiT (Diffusion Stage)**

| Parameter | Value | Description |
| :------------------------------- | :---------- | :-------------------------- |
| `stage_type` | `diffusion` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `dit` | Model stage identifier |
| `gpu_memory_utilization` | `0.4` | GPU memory utilization |
| `omni_kv_config.need_recv_cache` | `true` | Whether to receive KV cache |
| `engine_input_source` | `[0]` | Input source from Stage 0 |

------
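
Putting the two tables together, the per-stage portion of the config looks roughly like this. This is only a sketch: the field names come from the tables above, but the surrounding YAML layout is an assumption; see [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml) for the authoritative file.

```yaml
# Sketch only: parameter names from the tables above; structure assumed
- stage_type: llm              # Stage 0 - Thinker
  devices: "0"
  max_batch_size: 1
  model_stage: thinker
  model_arch: BagelForConditionalGeneration
  gpu_memory_utilization: 0.4
  tensor_parallel_size: 1
  max_num_batched_tokens: 32768
  omni_kv_config:
    need_send_cache: true
- stage_type: diffusion        # Stage 1 - DiT
  devices: "0"
  max_batch_size: 1
  model_stage: dit
  gpu_memory_utilization: 0.4
  omni_kv_config:
    need_recv_cache: true
  engine_input_source: [0]
```

For the dual-GPU setup mentioned at the top, changing Stage 1's `devices` to `"1"` would place the DiT on the second card.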

#### 🔗 Runtime Configuration

| Parameter | Value | Description |
| :-------------------- | :------ | :------------------------------- |
| `window_size` | `-1` | Window size (-1 means unlimited) |
| `max_inflight` | `1` | Maximum inflight requests |
| `shm_threshold_bytes` | `65536` | Shared memory threshold (64KB) |

## FAQ

- If you encounter an error about the backend of librosa, install ffmpeg with the command below.

```bash
sudo apt update
sudo apt install ffmpeg
```

- If you hit an out-of-memory (OOM) error, or are unsure how much VRAM the model needs, try decreasing `max_model_len`. Approximate per-stage usage:

| Stage | VRAM |
| :------------------ | :--------------------------- |
| Stage-0 (Thinker) | **15.04 GiB** **+ KV Cache** |
| Stage-1 (DiT) | **26.50 GiB** |
| Total | **~42 GiB + KV Cache** |