
Commit 2c0e9a8

Merge branch 'vllm-project:main' into serialize-multimodal-kwargs
2 parents: 57e1922 + f49e5af

32 files changed: +441 −149 lines

docs/source/models/supported_models.md

Lines changed: 3 additions & 1 deletion
@@ -759,7 +759,7 @@ On the other hand, modalities separated by `/` are mutually exclusive.
 See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model.
 
 :::{important}
-To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
+**To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference)
 or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
 
 Offline inference:
@@ -777,6 +777,8 @@ Online serving:
 vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
 ```
 
+**This is no longer required if you are using vLLM V1.**
+
 :::
 
 :::{note}
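The first hunk refers to an offline-inference example that falls outside the lines shown above. A minimal sketch of that form, assuming the same model as the online-serving command and the dict-valued `limit_mm_per_prompt` engine argument used elsewhere in this commit:

```python
from vllm import LLM

# Allow up to 4 images per text prompt (offline counterpart of the
# `--limit-mm-per-prompt image=4` flag in the online-serving example).
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    limit_mm_per_prompt={"image": 4},
)
```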

docs/source/serving/offline_inference.md

Lines changed: 24 additions & 0 deletions
@@ -110,6 +110,30 @@ If you run out of CPU RAM, try the following options:
 - (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
 - (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
 
+#### Disable unused modalities
+
+You can disable unused modalities (except for text) by setting their limit to zero.
+
+For example, if your application only accepts image input, there is no need to allocate any memory for videos.
+
+```python
+from vllm import LLM
+
+# Accept images but not videos
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          limit_mm_per_prompt={"video": 0})
+```
+
+You can even run a multi-modal model for text-only inference:
+
+```python
+from vllm import LLM
+
+# Don't accept images. Just text.
+llm = LLM(model="google/gemma-3-27b-it",
+          limit_mm_per_prompt={"image": 0})
+```
+
 ### Performance optimization and tuning
 
 You can potentially improve the performance of vLLM by finetuning various options.
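The context lines at the top of this hunk name two environment variables for reducing CPU RAM usage. A minimal sketch of setting them from Python before the engine is constructed (the specific values are assumptions, and it is assumed the variables are read at engine start-up):

```python
import os

# Shrink the multi-modal input cache from the default 4 GiB to 2 GiB,
# and cap the CPU KV cache at 8 GiB (CPU backend only). Values are examples.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "2"
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"

from vllm import LLM  # imported after the variables are set, to be safe

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          limit_mm_per_prompt={"video": 0})
```

Exporting the same variables in the shell before launching the process works just as well.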

examples/offline_inference/audio_language.py

Lines changed: 5 additions & 0 deletions
@@ -196,6 +196,11 @@ def main(args):
     req_data = model_example_map[model](question_per_audio_count[audio_count],
                                         audio_count)
 
+    # Disable other modalities to save memory
+    default_limits = {"image": 0, "video": 0, "audio": 0}
+    req_data.engine_args.limit_mm_per_prompt = default_limits | dict(
+        req_data.engine_args.limit_mm_per_prompt or {})
+
     engine_args = asdict(req_data.engine_args) | {"seed": args.seed}
     llm = LLM(**engine_args)
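In the added lines, the zeroed defaults sit on the left of the `|` dict union, so any limit the example already configures overrides them, and the `or {}` guard keeps the merge working when `limit_mm_per_prompt` was left as `None`. A small standalone illustration of that precedence (Python 3.9+ dict union; the override value is hypothetical):

```python
# Keys in the right operand of `|` win over the left operand's defaults.
default_limits = {"image": 0, "video": 0, "audio": 0}
existing = {"audio": 1}  # hypothetical limit already set by an example

merged = default_limits | existing
print(merged)  # {'image': 0, 'video': 0, 'audio': 1}
```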

examples/offline_inference/encoder_decoder_multimodal.py

Lines changed: 5 additions & 0 deletions
@@ -133,6 +133,11 @@ def main(args):
 
     req_data = model_example_map[model]()
 
+    # Disable other modalities to save memory
+    default_limits = {"image": 0, "video": 0, "audio": 0}
+    req_data.engine_args.limit_mm_per_prompt = default_limits | dict(
+        req_data.engine_args.limit_mm_per_prompt or {})
+
     engine_args = asdict(req_data.engine_args) | {"seed": args.seed}
     llm = LLM(**engine_args)
