# GLM-Image Multistage End-to-End Inference

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/glm_image>.

This example demonstrates how to run GLM-Image with the vLLM-Omni multistage architecture.

## Architecture

GLM-Image uses a 2-stage pipeline:

```
┌─────────────────────────────────────────────────────────────┐
│                     GLM-Image Pipeline                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Stage 0 (AR Model)            Stage 1 (Diffusion)          │
│  ┌─────────────────┐           ┌─────────────────────┐      │
│  │ vLLM-optimized  │           │  GlmImagePipeline   │      │
│  │ GlmImageFor     │   prior   │  ┌───────────────┐  │      │
│  │ Conditional     │──tokens──►│  │ DiT Denoiser  │  │      │
│  │ Generation      │           │  └───────┬───────┘  │      │
│  │ (9B AR model)   │           │          │          │      │
│  └─────────────────┘           │          ▼          │      │
│          ▲                     │  ┌───────────────┐  │      │
│          │                     │  │  VAE Decode   │──┼──► Image
│     Text/Image                 │  └───────────────┘  │      │
│       Input                    └─────────────────────┘      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
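The handoff in the diagram reduces to a simple contract: stage 0 turns a prompt into a sequence of prior tokens, and stage 1 consumes those tokens to produce an image. The sketch below is a hypothetical stand-in for that contract only (fake token ids, shape-only output); the real stages are `GlmImageForConditionalGeneration` under vLLM and `GlmImagePipeline`, with different APIs.

```python
# Toy sketch of the two-stage handoff. Both functions are hypothetical
# stand-ins; they model only the data flow, not the real models.

def ar_stage(prompt: str) -> list[int]:
    """Stage 0 stand-in: map a text prompt to a sequence of prior tokens."""
    # Fake tokenization: derive one bounded token id per word.
    return [abs(hash(word)) % 16384 for word in prompt.split()]

def diffusion_stage(prior_tokens: list[int], height: int, width: int) -> tuple:
    """Stage 1 stand-in: DiT denoising + VAE decode, reduced to an output shape."""
    # The diffusion stage is conditioned on the prior tokens from stage 0.
    assert prior_tokens, "stage 1 requires prior tokens from stage 0"
    return (height, width, 3)  # (H, W, C) of the decoded image

tokens = ar_stage("A beautiful sunset over the ocean")
shape = diffusion_stage(tokens, 1024, 1024)
print(len(tokens), shape)
```

The only coupling between the stages is the prior-token sequence, which is why the two stages can be scheduled on different GPUs.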
## Features

- **vLLM-optimized AR**: uses PagedAttention and tensor parallelism for faster prior-token generation
- **Flexible deployment**: the AR and diffusion stages can run on different GPUs
- **Text-to-Image**: generate images from text descriptions
- **Image-to-Image**: edit existing images with text prompts

## Usage

### Text-to-Image

```bash
python end2end.py \
    --model-path /path/to/glm-image \
    --config-path ../../vllm_omni/model_executor/stage_configs/glm_image.yaml \
    --prompt "A beautiful sunset over the ocean with sailing boats" \
    --height 1024 \
    --width 1024 \
    --output output_t2i.png
```

### Image-to-Image (Image Editing)

```bash
python end2end.py \
    --model-path /path/to/glm-image \
    --config-path ../../vllm_omni/model_executor/stage_configs/glm_image.yaml \
    --prompt "Transform this scene into a winter wonderland" \
    --image input.png \
    --output output_i2i.png
```

### With Custom Parameters

```bash
python end2end.py \
    --model-path /path/to/glm-image \
    --config-path ../../vllm_omni/model_executor/stage_configs/glm_image.yaml \
    --prompt "A photorealistic cat sitting on a window sill" \
    --height 1024 \
    --width 1024 \
    --num-inference-steps 50 \
    --guidance-scale 1.5 \
    --seed 42 \
    --output output.png
```

## Shell Scripts

### Run Text-to-Image

```bash
./run_t2i.sh
```

### Run Image-to-Image

```bash
./run_i2i.sh --image /path/to/input.png
```

## Stage Configuration

The stage config (`glm_image.yaml`) defines:

- **Stage 0 (AR)**: uses `GPUARWorker` with the vLLM engine

    - Model: `GlmImageForConditionalGeneration`
    - Output: `token_ids` (prior tokens)

- **Stage 1 (Diffusion)**: uses the diffusion engine

    - Model: `GlmImagePipeline`
    - Output: the generated image

See `vllm_omni/model_executor/stage_configs/glm_image.yaml` for the full configuration.
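Conceptually, the stage config might be shaped like the sketch below. Every field name here is an assumption made for illustration (only the worker, model, and output names come from the list above); treat the shipped `glm_image.yaml` as the authoritative schema.

```yaml
# Hypothetical two-stage layout — field names are illustrative only.
stages:
  - name: ar                                  # Stage 0
    worker: GPUARWorker
    model: GlmImageForConditionalGeneration
    output: token_ids                         # prior tokens handed to stage 1
  - name: diffusion                           # Stage 1
    model: GlmImagePipeline
    output: image
```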
## Comparison with Single-Stage

| Aspect      | Single-Stage (transformers) | Multistage (vLLM)   |
| ----------- | --------------------------- | ------------------- |
| AR Model    | transformers native         | vLLM PagedAttention |
| Memory      | Higher (no KV-cache optimization) | Lower (optimized KV cache) |
| Throughput  | Lower                       | Higher              |
| Flexibility | Single GPU                  | Multi-GPU support   |

## Troubleshooting

### OOM Error

Try reducing memory usage:

```yaml
# In glm_image.yaml, adjust:
gpu_memory_utilization: 0.5  # Reduce from 0.6
```

### Slow Initialization

The first run loads model weights from disk; subsequent runs are faster. If initialization times out on slow storage, increase the timeout:

```bash
--stage-init-timeout 900  # Increase the timeout for slow storage
```

## Requirements

- vLLM-Omni with GLM-Image support
- CUDA-capable GPU (recommended: H100/A100 with 80 GB)
- GLM-Image model weights

## Example materials

??? abstract "end2end.py"
    ``````py
    --8<-- "examples/offline_inference/glm_image/end2end.py"
    ``````
??? abstract "run_i2i.sh"
    ``````sh
    --8<-- "examples/offline_inference/glm_image/run_i2i.sh"
    ``````
??? abstract "run_t2i.sh"
    ``````sh
    --8<-- "examples/offline_inference/glm_image/run_t2i.sh"
    ``````