
Commit 2c0e9a8

Merge branch 'vllm-project:main' into serialize-multimodal-kwargs
2 parents: 57e1922 + f49e5af

32 files changed: +441 −149 lines

docs/source/models/supported_models.md

Lines changed: 3 additions & 1 deletion
@@ -759,7 +759,7 @@ On the other hand, modalities separated by `/` are mutually exclusive.
 See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model.
 
 :::{important}
-To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
+**To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference)
 or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
 
 Offline inference:
@@ -777,6 +777,8 @@ Online serving:
 vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
 ```
 
+**This is no longer required if you are using vLLM V1.**
+
 :::
 
 :::{note}
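The first hunk refers to an offline-inference example that falls outside the lines shown above. A minimal sketch of that form, assuming the same model as the online-serving command and the dict-valued `limit_mm_per_prompt` engine argument used elsewhere in this commit:

```python
from vllm import LLM

# Allow up to 4 images per text prompt (offline counterpart of the
# `--limit-mm-per-prompt image=4` flag in the online-serving example).
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    limit_mm_per_prompt={"image": 4},
)
```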

docs/source/serving/offline_inference.md

Lines changed: 24 additions & 0 deletions
@@ -110,6 +110,30 @@ If you run out of CPU RAM, try the following options:
 - (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
 - (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
 
+#### Disable unused modalities
+
+You can disable unused modalities (except for text) by setting their limit to zero.
+
+For example, if your application only accepts image input, there is no need to allocate any memory for videos.
+
+```python
+from vllm import LLM
+
+# Accept images but not videos
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          limit_mm_per_prompt={"video": 0})
+```
+
+You can even run a multi-modal model for text-only inference:
+
+```python
+from vllm import LLM
+
+# Don't accept images. Just text.
+llm = LLM(model="google/gemma-3-27b-it",
+          limit_mm_per_prompt={"image": 0})
+```
+
 ### Performance optimization and tuning
 
 You can potentially improve the performance of vLLM by finetuning various options.
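The context lines at the top of this hunk name two environment variables for reducing CPU RAM usage. A minimal sketch of setting them from Python before the engine is constructed (the specific values are assumptions, and it is assumed the variables are read at engine start-up):

```python
import os

# Shrink the multi-modal input cache from the default 4 GiB to 2 GiB,
# and cap the CPU KV cache at 8 GiB (CPU backend only). Values are examples.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "2"
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"

from vllm import LLM  # imported after the variables are set, to be safe

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          limit_mm_per_prompt={"video": 0})
```

Exporting the same variables in the shell before launching the process works just as well.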

examples/offline_inference/audio_language.py

Lines changed: 5 additions & 0 deletions
@@ -196,6 +196,11 @@ def main(args):
     req_data = model_example_map[model](question_per_audio_count[audio_count],
                                         audio_count)
 
+    # Disable other modalities to save memory
+    default_limits = {"image": 0, "video": 0, "audio": 0}
+    req_data.engine_args.limit_mm_per_prompt = default_limits | dict(
+        req_data.engine_args.limit_mm_per_prompt or {})
+
     engine_args = asdict(req_data.engine_args) | {"seed": args.seed}
     llm = LLM(**engine_args)
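In the added lines, the zeroed defaults sit on the left of the `|` dict union, so any limit the example already configures overrides them, and the `or {}` guard keeps the merge working when `limit_mm_per_prompt` was left as `None`. A small standalone illustration of that precedence (Python 3.9+ dict union; the override value is hypothetical):

```python
# Keys in the right operand of `|` win over the left operand's defaults.
default_limits = {"image": 0, "video": 0, "audio": 0}
existing = {"audio": 1}  # hypothetical limit already set by an example

merged = default_limits | existing
print(merged)  # {'image': 0, 'video': 0, 'audio': 1}
```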

examples/offline_inference/encoder_decoder_multimodal.py

Lines changed: 5 additions & 0 deletions
@@ -133,6 +133,11 @@ def main(args):
 
     req_data = model_example_map[model]()
 
+    # Disable other modalities to save memory
+    default_limits = {"image": 0, "video": 0, "audio": 0}
+    req_data.engine_args.limit_mm_per_prompt = default_limits | dict(
+        req_data.engine_args.limit_mm_per_prompt or {})
+
     engine_args = asdict(req_data.engine_args) | {"seed": args.seed}
     llm = LLM(**engine_args)
