
Commit 55222a2

DarkLight1337 authored and lec77 committed
[Core] Store only the keys for multi-modal data in P0 (#22198)
Signed-off-by: DarkLight1337 <[email protected]>
1 parent a10e042 commit 55222a2

17 files changed (+320, -229 lines)

docs/configuration/conserving_memory.md

Lines changed: 15 additions & 17 deletions
@@ -86,7 +86,7 @@ llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
 
 If you run out of CPU RAM, try the following options:
 
-- (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
+- (Multi-modal models only) you can set the size of multi-modal processor cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB per API process + 4 GiB per engine core process)
 - (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
 
 ## Multi-modal input limits
@@ -129,20 +129,18 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.
 
 Here are some examples:
 
-??? code
-
-    ```python
-    from vllm import LLM
+```python
+from vllm import LLM
 
-    # Available for Qwen2-VL series models
-    llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-              mm_processor_kwargs={
-                  "max_pixels": 768 * 768, # Default is 1280 * 28 * 28
-              })
-
-    # Available for InternVL series models
-    llm = LLM(model="OpenGVLab/InternVL2-2B",
-              mm_processor_kwargs={
-                  "max_dynamic_patch": 4, # Default is 12
-              })
-    ```
+# Available for Qwen2-VL series models
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          mm_processor_kwargs={
+              "max_pixels": 768 * 768, # Default is 1280 * 28 * 28
+          })
+
+# Available for InternVL series models
+llm = LLM(model="OpenGVLab/InternVL2-2B",
+          mm_processor_kwargs={
+              "max_dynamic_patch": 4, # Default is 12
+          })
+```
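
The updated tip above comes down to one knob: the processor-cache budget is read from the `VLLM_MM_INPUT_CACHE_GIB` environment variable and is charged per process. Below is a minimal sketch of shrinking it for an offline run; it is not part of this commit, and the model name and the 2 GiB value are arbitrary examples.

```python
import os

# Set the budget before constructing the engine so vLLM picks it up:
# 2 GiB per API process and per engine core process instead of the 4 GiB default.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "2"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")  # any multi-modal model works the same way
```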

docs/configuration/optimization.md

Lines changed: 29 additions & 44 deletions
@@ -2,6 +2,9 @@
 
 This guide covers optimization strategies and performance tuning for vLLM V1.
 
+!!! tip
+    Running out of memory? Consult [this guide](./conserving_memory.md) on how to conserve memory.
+
 ## Preemption
 
 Due to the auto-regressive nature of transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
@@ -126,62 +129,44 @@ Data parallelism replicates the entire model across multiple GPU sets and proces
 Data parallelism can be combined with the other parallelism strategies and is set by `data_parallel_size=N`.
 Note that MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.
 
-## Reducing Memory Usage
-
-If you encounter out-of-memory issues, consider these strategies:
+## Input Processing
 
-### Context Length and Batch Size
+### Parallel Processing
 
-You can reduce memory usage by limiting the context length and batch size:
+You can run input processing in parallel via [API server scale-out](../serving/data_parallel_deployment.md#internal-load-balancing).
+This is useful when input processing (which is run inside the API server)
+becomes a bottleneck compared to model execution (which is run inside engine core)
+and you have excess CPU capacity.
 
-```python
-from vllm import LLM
+```console
+# Run 4 API processes and 1 engine core process
+vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4
 
-llm = LLM(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    max_model_len=2048, # Limit context window
-    max_num_seqs=4 # Limit batch size
-)
+# Run 4 API processes and 2 engine core processes
+vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
 ```
 
-### Adjust CUDA Graph Compilation
+!!! note
+    API server scale-out is only available for online inference.
 
-CUDA graph compilation in V1 uses more memory than in V0. You can reduce memory usage by adjusting the compilation level:
-
-```python
-from vllm import LLM
-from vllm.config import CompilationConfig, CompilationLevel
-
-llm = LLM(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    compilation_config=CompilationConfig(
-        level=CompilationLevel.PIECEWISE,
-        cudagraph_capture_sizes=[1, 2, 4, 8] # Capture fewer batch sizes
-    )
-)
-```
+!!! note
+    [Multi-modal processor cache](#processor-cache) is disabled when API server scale-out is enabled
+    because it requires a one-to-one correspondance between API and engine core processes.
 
-Or, if you are not concerned about latency or overall performance, disable CUDA graph compilation entirely with `enforce_eager=True`:
+## Multi-Modal Caching
 
-```python
-from vllm import LLM
+### Processor Cache
 
-llm = LLM(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    enforce_eager=True # Disable CUDA graph compilation
-)
-```
+By default, the multi-modal processor cache is enabled to avoid repeatedly processing
+the same multi-modal inputs via Hugging Face `AutoProcessor`,
+which commonly occurs in multi-turn conversations.
 
-### Multimodal Models
+You can adjust the size of the cache via `VLLM_MM_INPUT_CACHE_GIB` environment variable
+(default 4 GiB per API process + 4 GiB per engine core process).
 
-For multi-modal models, you can reduce memory usage by limiting the number of images/videos per request:
+If you do not benefit much from the cache, you can disable it completely via `disable_mm_preprocessor_cache`:
 
 ```python
-from vllm import LLM
-
-# Accept up to 2 images per prompt
-llm = LLM(
-    model="Qwen/Qwen2.5-VL-3B-Instruct",
-    limit_mm_per_prompt={"image": 2}
-)
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          disable_mm_preprocessor_cache=True)
 ```
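
The new "Processor Cache" section describes the multi-turn case the cache is meant for: the same image comes back on every follow-up turn, so the Hugging Face `AutoProcessor` work can be reused instead of redone. Below is a rough sketch of that workload using `LLM.chat` with OpenAI-style messages; the model, image URL, and prompts are placeholders, and this is only an illustration, not code from this commit.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")  # placeholder multi-modal model
params = SamplingParams(max_tokens=64)

image_url = "https://example.com/cat.jpg"  # placeholder image

conversation = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": "What is in this image?"},
    ],
}]

# Turn 1: the image is preprocessed by the Hugging Face processor.
first = llm.chat(conversation, params)

# Turn 2 resends the same image as part of the conversation history; with the
# processor cache enabled (the default), that preprocessing is served from cache.
conversation += [
    {"role": "assistant", "content": first[0].outputs[0].text},
    {"role": "user", "content": "Describe its colors."},
]
second = llm.chat(conversation, params)
print(second[0].outputs[0].text)
```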

examples/offline_inference/mistral-small.py

Lines changed: 1 addition & 1 deletion
@@ -166,7 +166,7 @@ def parse_args():
     parser.add_argument(
         "--disable-mm-preprocessor-cache",
         action="store_true",
-        help="If True, disables caching of multi-modal preprocessor/mapper.",
+        help="If True, disables caching of multi-modal processor.",
     )
     return parser.parse_args()

examples/offline_inference/vision_language.py

Lines changed: 1 addition & 1 deletion
@@ -1565,7 +1565,7 @@ def parse_args():
     parser.add_argument(
         "--disable-mm-preprocessor-cache",
         action="store_true",
-        help="If True, disables caching of multi-modal preprocessor/mapper.",
+        help="If True, disables caching of multi-modal processor.",
     )
 
     parser.add_argument(

tests/models/utils.py

Lines changed: 3 additions & 2 deletions
@@ -9,7 +9,7 @@
 import torch.nn.functional as F
 from transformers import PretrainedConfig
 
-from vllm.config import ModelConfig, RunnerOption
+from vllm.config import ModelConfig, ModelDType, RunnerOption
 from vllm.inputs import InputContext
 from vllm.sequence import Logprob, PromptLogprobs, SampleLogprobs
 
@@ -257,7 +257,7 @@ def check_logprobs_close(
 def build_model_context(
     model_id: str,
     runner: RunnerOption = "auto",
-    dtype: Union[str, torch.dtype] = "auto",
+    dtype: ModelDType = "auto",
     model_config_kwargs: Optional[dict[str, Any]] = None,
     mm_processor_kwargs: Optional[dict[str, Any]] = None,
     limit_mm_per_prompt: Optional[dict[str, int]] = None,
@@ -279,6 +279,7 @@ def build_model_context(
     model_info.check_transformers_version(on_fail="skip")
 
     model_config_kwargs = model_config_kwargs or {}
+    limit_mm_per_prompt = limit_mm_per_prompt or {}
     model_config = ModelConfig(
         model_id,
         runner=runner,

tests/multimodal/test_cache.py

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+import pytest
+import torch
+
+from vllm.multimodal.cache import MultiModalCache, MultiModalCacheItemMetadata
+from vllm.multimodal.inputs import (MultiModalFieldElem, MultiModalKwargs,
+                                    MultiModalKwargsItem,
+                                    MultiModalSharedField)
+
+
+def _dummy_elem(modality: str, key: str, size: int):
+    return MultiModalFieldElem(
+        modality=modality,
+        key=key,
+        data=torch.empty((size, ), dtype=torch.int8),
+        field=MultiModalSharedField(1),
+    )
+
+
+def _dummy_item(modality: str, size_by_key: dict[str, int]):
+    return MultiModalKwargsItem.from_elems([
+        _dummy_elem(modality, key, size) for key, size in size_by_key.items()
+    ])
+
+
+def _dummy_kw(size_by_key_modality: dict[str, dict[str, int]]):
+    return MultiModalKwargs.from_items([
+        _dummy_item(modality, size_by_key)
+        for modality, size_by_key in size_by_key_modality.items()
+    ])
+
+
+# yapf: disable
+@pytest.mark.parametrize(
+    ("item", "expected_size"),
+    [
+        (_dummy_item("a", {"a1": 100}), 100),
+        (_dummy_item("a", {"a1": 100, "a2": 110}), 210),
+        (_dummy_kw({"a": {"a1": 100, "a2": 110}, "b": {"b1": 120, "b2": 130}}), 460), # noqa: E501
+    ],
+)
+# yapf: enable
+def test_cache_item_size(item, expected_size):
+    cache = MultiModalCache.get_lru_cache(2048, type(item))
+
+    cache[""] = item
+    assert cache.currsize == expected_size
+
+    cache[""] = MultiModalCacheItemMetadata.wraps(item)
+    assert cache.currsize == expected_size
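
The size accounting pinned down by this new test is what makes the commit's P0/P1 split work: `MultiModalCacheItemMetadata.wraps(item)` is charged the same number of bytes as the full item, so a metadata-only cache on the API side can mirror the eviction behaviour of the tensor-holding cache in the engine core. Below is a small standalone sketch of that idea, reusing the classes the test imports and assuming `get_lru_cache` uses its type argument only for typing and size accounting; the hash key, modality, and sizes are arbitrary.

```python
import torch

from vllm.multimodal.cache import MultiModalCache, MultiModalCacheItemMetadata
from vllm.multimodal.inputs import (MultiModalFieldElem, MultiModalKwargsItem,
                                    MultiModalSharedField)

elem = MultiModalFieldElem(
    modality="image",
    key="pixel_values",
    data=torch.empty((1024, ), dtype=torch.int8),  # stand-in for processed data
    field=MultiModalSharedField(1),
)
item = MultiModalKwargsItem.from_elems([elem])

# Engine-core-style (P1) cache: holds the actual item.
p1_cache = MultiModalCache.get_lru_cache(2048, MultiModalKwargsItem)
p1_cache["mm_hash_0"] = item

# API-process-style (P0) cache: holds only lightweight metadata, yet the entry
# is charged the same size, keeping eviction in lockstep with P1.
p0_cache = MultiModalCache.get_lru_cache(2048, MultiModalCacheItemMetadata)
p0_cache["mm_hash_0"] = MultiModalCacheItemMetadata.wraps(item)

assert p0_cache.currsize == p1_cache.currsize
```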

tests/multimodal/test_processing.py

Lines changed: 2 additions & 46 deletions
@@ -6,20 +6,15 @@
 
 import numpy as np
 import pytest
-import torch
 
 from vllm.config import ModelConfig
 from vllm.inputs import InputProcessingContext
 from vllm.multimodal import MULTIMODAL_REGISTRY
-from vllm.multimodal.inputs import (MultiModalFieldElem, MultiModalKwargs,
-                                    MultiModalKwargsItem,
-                                    MultiModalSharedField)
 # yapf conflicts with isort for this block
 # yapf: disable
 from vllm.multimodal.processing import (PlaceholderFeaturesInfo,
-                                        ProcessingCache, PromptIndexTargets,
-                                        PromptInsertion, PromptReplacement,
-                                        apply_text_matches,
+                                        PromptIndexTargets, PromptInsertion,
+                                        PromptReplacement, apply_text_matches,
                                         apply_token_matches,
                                         find_mm_placeholders,
                                         find_text_matches, find_token_matches,
@@ -902,45 +897,6 @@ def test_find_mm_placeholders(
     assert result == expected
 
 
-def _dummy_elem(modality: str, key: str, size: int):
-    return MultiModalFieldElem(
-        modality=modality,
-        key=key,
-        data=torch.empty((size, ), dtype=torch.int8),
-        field=MultiModalSharedField(1),
-    )
-
-
-def _dummy_item(modality: str, size_by_key: dict[str, int]):
-    return MultiModalKwargsItem.from_elems([
-        _dummy_elem(modality, key, size) for key, size in size_by_key.items()
-    ])
-
-
-def _dummy_kw(size_by_key_modality: dict[str, dict[str, int]]):
-    return MultiModalKwargs.from_items([
-        _dummy_item(modality, size_by_key)
-        for modality, size_by_key in size_by_key_modality.items()
-    ])
-
-
-# yapf: disable
-@pytest.mark.parametrize(
-    ("item", "expected_size"),
-    [
-        (_dummy_item("a", {"a1": 100}), 100),
-        (_dummy_item("a", {"a1": 100, "a2": 110}), 210),
-        (_dummy_kw({"a": {"a1": 100, "a2": 110}, "b": {"b1": 120, "b2": 130}}), 460), # noqa: E501
-    ],
-)
-# yapf: enable
-def test_cache_item_size(item, expected_size):
-    cache = ProcessingCache.get_lru_cache(2048, type(item))
-    cache[""] = item
-
-    assert cache.currsize == expected_size
-
-
 @pytest.mark.parametrize("model_id", ["llava-hf/llava-v1.6-mistral-7b-hf"])
 @pytest.mark.parametrize(
     ("limit", "num_supported", "is_valid"),

vllm/config.py

Lines changed: 27 additions & 3 deletions
@@ -444,8 +444,7 @@ class ModelConfig:
     model that is being run. For example, for Phi-3-Vision: `{"num_crops": 4}`.
     """
     disable_mm_preprocessor_cache: bool = False
-    """If `True`, disable caching of the multi-modal preprocessor/mapper (not
-    recommended)."""
+    """If `True`, disable caching of the multi-modal processor."""
     override_neuron_config: dict[str, Any] = field(default_factory=dict)
     """Initialize non-default neuron config or override default neuron config
    that are specific to Neuron devices, this argument will be used to
@@ -1692,6 +1691,31 @@ def uses_mrope(self) -> bool:
     def is_multimodal_model(self) -> bool:
         return self.multimodal_config is not None
 
+    @property
+    def processor_return_mm_hashes(self) -> bool:
+        """Whether the multi-modal processor should output hashes."""
+        mm_config = self.multimodal_config
+        if mm_config is None:
+            return False
+
+        return not mm_config.disable_mm_preprocessor_cache
+
+    @property
+    def enable_mm_input_cache(self) -> bool:
+        """Whether the multi-modal input cache should be enabled."""
+        mm_config = self.multimodal_config
+        if mm_config is None:
+            return False
+
+        return not mm_config.disable_mm_preprocessor_cache
+
+    def get_mm_input_cache_gb(self) -> int:
+        mm_config = self.multimodal_config
+        if mm_config is None:
+            return 0
+
+        return envs.VLLM_MM_INPUT_CACHE_GIB
+
     @property
     def is_cross_encoder(self) -> bool:
         return (self._model_info.supports_cross_encoding
@@ -3363,7 +3387,7 @@ class MultiModalConfig:
 
     disable_mm_preprocessor_cache: bool = False
     """
-    If `True`, disable caching of the processed multi-modal inputs.
+    If `True`, disable caching of the multi-modal processor.
     """
 
     interleave_mm_strings: bool = False
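
The three additions to `ModelConfig` above give front-end (P0) and engine-core (P1) code a single place to ask whether multi-modal caching is in play and how large it may grow. Below is a hypothetical consumer, not from this commit, that only touches the properties shown in the diff; the function name and return strings are made up for illustration.

```python
from vllm.config import ModelConfig


def plan_mm_caching(model_config: ModelConfig) -> str:
    """Summarize the caching behaviour implied by the new ModelConfig helpers."""
    if not model_config.is_multimodal_model:
        return "text-only model: no multi-modal caching"

    # Per the diff, processor_return_mm_hashes and enable_mm_input_cache both
    # track `not disable_mm_preprocessor_cache`: P0 only emits hashes when P1
    # keeps an input cache that can resolve them.
    if model_config.enable_mm_input_cache:
        budget = model_config.get_mm_input_cache_gb()
        return f"P0 sends hashes; P1 caches inputs ({budget} GiB per process)"

    return "multi-modal caching disabled"
```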

vllm/engine/arg_utils.py

Lines changed: 11 additions & 11 deletions
@@ -1230,17 +1230,17 @@ def create_engine_config(
             enable_multimodal_encoder_data_parallel,
         )
 
-        supports_mm_preprocessor_cache = (self.data_parallel_size == 1
-                                          or data_parallel_external_lb)
-        if (not supports_mm_preprocessor_cache
-                and model_config.is_multimodal_model
-                and not model_config.disable_mm_preprocessor_cache):
-            logger.warning(
-                "Multi-modal preprocessor cache is not compatible "
-                "with data parallelism when there does not exist a "
-                "one-to-one correspondance between API process and "
-                "EngineCore process, so the cache will be disabled.")
-            model_config.set_disable_mm_preprocessor_cache(True)
+        if model_config.is_multimodal_model:
+            dp_supports_mm_processor_cache = (self.data_parallel_size == 1
+                                              or data_parallel_external_lb)
+            if (not dp_supports_mm_processor_cache
+                    and not model_config.disable_mm_preprocessor_cache):
+                logger.warning(
+                    "Multi-modal processor cache is disabled because "
+                    "it is not compatible with data parallelism when "
+                    "there does not exist a one-to-one correspondance "
+                    "between API and engine core processes.")
+                model_config.set_disable_mm_preprocessor_cache(True)
 
         speculative_config = self.create_speculative_config(
             target_model_config=model_config,
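
The reshuffled check above keeps the processor cache only for data-parallel setups that appear to preserve a one-to-one pairing between API and engine core processes. Below is a standalone restatement of that condition as a plain function, illustrative only and not a vLLM API.

```python
def dp_supports_mm_processor_cache(data_parallel_size: int,
                                   data_parallel_external_lb: bool) -> bool:
    """Mirror of the condition above: the cache can stay on when DP is off,
    or when external load balancing pairs each API process with its own engine."""
    return data_parallel_size == 1 or data_parallel_external_lb


# No DP, internal-LB DP, external-LB DP:
assert dp_supports_mm_processor_cache(1, False)
assert not dp_supports_mm_processor_cache(2, False)
assert dp_supports_mm_processor_cache(2, True)
```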

vllm/entrypoints/cli/serve.py

Lines changed: 2 additions & 3 deletions
@@ -163,9 +163,8 @@ def run_multi_api_server(args: argparse.Namespace):
 
     if model_config.is_multimodal_model and not (
             orig_disable_mm_preprocessor_cache):
-        logger.warning(
-            "Multi-modal preprocessor cache is not compatible "
-            "with api_server_count > 1, so the cache will be disabled.")
+        logger.warning("Multi-modal processor cache is disabled because "
+                       "it is not compatible with `api_server_count > 1`.")
 
     executor_class = Executor.get_class(vllm_config)
     log_stats = not engine_args.disable_log_stats
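
Putting this check together with the data-parallel one in `arg_utils.py`, the effective rule appears to be that the processor cache survives only when every API process is paired with exactly one engine core process. Below is a hypothetical helper, not a vLLM API, that an operator could use to predict whether a launch configuration will trigger these warnings; it assumes `run_multi_api_server` handles the `api_server_count > 1` path.

```python
def mm_processor_cache_stays_enabled(is_multimodal_model: bool,
                                     disable_mm_preprocessor_cache: bool,
                                     api_server_count: int,
                                     data_parallel_size: int,
                                     data_parallel_external_lb: bool) -> bool:
    """True only when the one-to-one API/engine-core pairing is preserved."""
    if not is_multimodal_model or disable_mm_preprocessor_cache:
        return False
    if api_server_count > 1:  # serve.py check above
        return False
    return data_parallel_size == 1 or data_parallel_external_lb  # arg_utils.py check
```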
