
Commit d9e98f4

xwjiang2010 and ywang96 authored
[vlm] Remove vision language config. (#6089)
Signed-off-by: Xiaowei Jiang <[email protected]> Co-authored-by: Roger Wang <[email protected]>
1 parent 3c6325f commit d9e98f4

43 files changed: +372 additions, -466 deletions (only a subset of the changed files is shown below).

docs/source/dev/multimodal/multimodal_index.rst
Lines changed: 5 additions & 0 deletions

@@ -10,8 +10,13 @@ vLLM provides experimental support for multi-modal models through the :mod:`vllm
 :class:`vllm.inputs.PromptStrictInputs` accepts an additional attribute ``multi_modal_data``
 which allows you to pass in multi-modal input alongside text and token prompts.
 
+.. note::
+    ``multi_modal_data`` can accept keys and values beyond the builtin ones, as long as a customized plugin is registered through
+    :class:`vllm.multimodal.MULTIMODAL_REGISTRY`.
+
 By default, vLLM models do not support multi-modal inputs. To enable multi-modal support for a model, please follow :ref:`the guide for adding a new multimodal model. <adding_a_new_multimodal_model>`.
 
+
 # TODO: Add more instructions on how to do that once embeddings is in.
 
 Guides
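
For orientation, the builtin ``image`` key that this note contrasts with custom plugin keys is consumed as in the updated ``vlm.rst`` example later in this commit. A minimal sketch of that usage (model name, prompt format, and output access are copied from the documentation snippets in this diff; the image path is illustrative):

from PIL import Image

from vllm import LLM

# Model and prompt format are the ones used in the documentation example.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

image = Image.open("example.jpg")  # illustrative local image path
outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
    "multi_modal_data": {"image": image},  # builtin "image" key
})
print(outputs[0].outputs[0].text)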

docs/source/models/vlm.rst
Lines changed: 39 additions & 39 deletions

@@ -8,18 +8,6 @@ vLLM provides experimental support for Vision Language Models (VLMs). This docum
 .. important::
     We are actively iterating on VLM support. Expect breaking changes to VLM usage and development in upcoming releases without prior deprecation.
 
-Engine Arguments
-----------------
-
-The following :ref:`engine arguments <engine_args>` are specific to VLMs:
-
-.. argparse::
-    :module: vllm.engine.arg_utils
-    :func: _vlm_engine_args_parser
-    :prog: -m vllm.entrypoints.openai.api_server
-    :nodefaultconst:
-
-.. important::
     Currently, the support for vision language models on vLLM has the following limitations:
 
     * Only single image input is supported per text prompt.

@@ -33,40 +21,33 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM``
 
 .. code-block:: python
 
-    llm = LLM(
-        model="llava-hf/llava-1.5-7b-hf",
-        image_token_id=32000,
-        image_input_shape="1,3,336,336",
-        image_feature_size=576,
-    )
+    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
 
 .. important::
-    Currently, you have to specify ``image_feature_size`` to support memory profiling.
-    To avoid OOM during runtime, you should set this to the maximum value supported by the model.
-    The calculation of feature size is specific to the model. For more details, please refer to
-    the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.
+    We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
+    the above snippet. Specifically, ``image_feature_size`` is no longer required to be specified, and internally we will construct data structures for
+    every model to perform profiling with.
 
-    We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.
+    This work is still ongoing. In the meantime, we internally hardcode ``image_feature_size = 3000`` through
+    :meth:`MULTIMODAL_REGISTRY.get_num_input_tokens <vllm.multimodal.MultiModalRegistry.get_num_input_tokens>`
+    for every model to be conservative in terms of GPU memory consumption. This hardcoded value will be replaced
+    with a more accurate profiling strategy in the future.
 
 
 To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
 
 * ``prompt``: The prompt should follow the format that is documented on HuggingFace.
 * ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
 
-.. note::
-
-    ``multi_modal_data`` can accept keys and values beyond the builtin ones, as long as a customized plugin is registered through
-    :class:`vllm.multimodal.MULTIMODAL_REGISTRY`.
-
 .. code-block:: python
 
     # Refer to the HuggingFace repo for the correct format to use
     prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
 
     # Load the image using PIL.Image
-    image = ...
-
+    image = PIL.Image.open(...)
+
+    # Single prompt inference
     outputs = llm.generate({
         "prompt": prompt,
         "multi_modal_data": {"image": image},

@@ -75,6 +56,26 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptS
     for o in outputs:
         generated_text = o.outputs[0].text
         print(generated_text)
+
+    # Batch inference
+    image_1 = PIL.Image.open(...)
+    image_2 = PIL.Image.open(...)
+    outputs = llm.generate(
+        [
+            {
+                "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
+                "multi_modal_data": {"image": image_1},
+            },
+            {
+                "prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
+                "multi_modal_data": {"image": image_2},
+            }
+        ]
+    )
+
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
 
 A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.
 

@@ -99,18 +100,17 @@ Below is an example on how to launch the same ``llava-hf/llava-1.5-7b-hf`` with
 
     python -m vllm.entrypoints.openai.api_server \
         --model llava-hf/llava-1.5-7b-hf \
-        --image-token-id 32000 \
-        --image-input-shape 1,3,336,336 \
-        --image-feature-size 576 \
        --chat-template template_llava.jinja
 
 .. important::
-    Currently, you have to specify ``image_feature_size`` to support memory profiling.
-    To avoid OOM during runtime, you should set this to the maximum value supported by the model.
-    The calculation of feature size is specific to the model. For more details, please refer to
-    the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.
-
-    We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.
+    We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
+    the above snippet. Specifically, ``image_feature_size`` is no longer required to be specified, and internally we will construct data structures for
+    every model to perform profiling with.
+
+    This work is still ongoing. In the meantime, we internally hardcode ``image_feature_size = 3000`` through
+    :meth:`MULTIMODAL_REGISTRY.get_num_input_tokens <vllm.multimodal.MultiModalRegistry.get_num_input_tokens>`
+    for every model to be conservative in terms of GPU memory consumption. This hardcoded value will be replaced
+    with a more accurate profiling strategy in the future.
 
 To consume the server, you can use the OpenAI client like in the example below:
 
examples/llava_example.py
Lines changed: 1 addition & 6 deletions

@@ -10,12 +10,7 @@
 
 
 def run_llava():
-    llm = LLM(
-        model="llava-hf/llava-1.5-7b-hf",
-        image_token_id=32000,
-        image_input_shape="1,3,336,336",
-        image_feature_size=576,
-    )
+    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
 
     prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

examples/llava_next_example.py
Lines changed: 1 addition & 7 deletions

@@ -7,13 +7,7 @@
 
 
 def run_llava_next():
-    llm = LLM(
-        model="llava-hf/llava-v1.6-mistral-7b-hf",
-        image_token_id=32000,
-        image_input_shape="1,3,336,336",
-        # Use the maximum possible value for memory profiling
-        image_feature_size=2928,
-    )
+    llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", max_model_len=4096)
 
     prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
     url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"

examples/openai_vision_api_client.py
Lines changed: 0 additions & 3 deletions

@@ -3,9 +3,6 @@
 Launch the vLLM server with the following command:
 python -m vllm.entrypoints.openai.api_server \
     --model llava-hf/llava-1.5-7b-hf \
-    --image-token-id 32000 \
-    --image-input-shape 1,3,336,336 \
-    --image-feature-size 576 \
     --chat-template template_llava.jinja
 """
 import base64
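
For context, the client side of this example talks to the OpenAI-compatible endpoint started by the command above. A minimal sketch of such a call (the base URL, empty API key, and image URL are illustrative assumptions; the example file's actual client code is not shown in this hunk):

from openai import OpenAI

# Assumed local endpoint of the server launched with the command above.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

chat_response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            # Illustrative placeholder URL; substitute any reachable image.
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)
print(chat_response.choices[0].message.content)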

examples/phi3v_example.py
Lines changed: 2 additions & 4 deletions

@@ -14,15 +14,13 @@ def run_phi3v():
 
     # Note: The default setting of max_num_seqs (256) and
     # max_model_len (128k) for this model may cause OOM.
+    # You may lower either to run this example on lower-end GPUs.
+
     # In this example, we override max_num_seqs to 5 while
     # keeping the original context length of 128k.
     llm = LLM(
         model=model_path,
         trust_remote_code=True,
-        image_token_id=32044,
-        image_input_shape="1,3,1008,1344",
-        # Use the maximum possible value for memory profiling
-        image_feature_size=2653,
         max_num_seqs=5,
     )
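
The added comment mentions lowering either ``max_num_seqs`` or ``max_model_len`` for lower-end GPUs. A hypothetical variant of the call above illustrating the second option (the model id and the 4096 value are illustrative assumptions, not part of this commit):

from vllm import LLM

# Assumed model id; the example file defines model_path outside this hunk.
model_path = "microsoft/Phi-3-vision-128k-instruct"

# Hypothetical variant: also cap the context length to reduce memory usage.
llm = LLM(
    model=model_path,
    trust_remote_code=True,
    max_num_seqs=5,
    max_model_len=4096,  # illustrative value, not from the commit
)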

tests/distributed/test_multimodal_broadcast.py
Lines changed: 3 additions & 3 deletions

@@ -20,9 +20,9 @@
 model = os.environ["TEST_DIST_MODEL"]
 
 if model.startswith("llava-hf/llava"):
-    from ..models.test_llava import model_and_vl_config, run_test
+    from ..models.test_llava import models, run_test
 elif model.startswith("microsoft/Phi-3-vision"):
-    from ..models.test_phi3v import model_and_vl_config, run_test
+    from ..models.test_phi3v import models, run_test
 else:
     raise NotImplementedError(f"Unsupported model: {model}")
 

@@ -44,7 +44,7 @@ def test_models(hf_runner, vllm_runner, image_assets,
         hf_runner,
         vllm_runner,
         image_assets,
-        model_and_config=model_and_vl_config[0],
+        model=models[0],
         size_factors=[1.0],
         dtype=dtype,
         max_tokens=max_tokens,

tests/entrypoints/openai/test_vision.py
Lines changed: 0 additions & 6 deletions

@@ -39,12 +39,6 @@ def server(ray_ctx):
         "--max-model-len",
         "4096",
         "--enforce-eager",
-        "--image-token-id",
-        "32000",
-        "--image-input-shape",
-        "1,3,336,336",
-        "--image-feature-size",
-        "576",
         "--chat-template",
         str(LLAVA_CHAT_TEMPLATE),
     ])

tests/models/test_llava.py
Lines changed: 17 additions & 43 deletions

@@ -3,7 +3,6 @@
 import pytest
 from transformers import AutoTokenizer
 
-from vllm.config import VisionLanguageConfig
 from vllm.multimodal.utils import rescale_image_size
 from vllm.sequence import SampleLogprobs
 

@@ -21,49 +20,27 @@
     "USER: <image>\nWhat's in this image?\nASSISTANT:",
 })
 
+IMAGE_TOKEN_ID = 32000
 
-def iter_llava_configs(model_name: str):
-    image_hw_to_feature_size = {
-        (336, 336): 576,
-    }
-
-    for (h, w), f in image_hw_to_feature_size.items():
-        input_shape = (1, 3, h, w)
-        yield (model_name,
-               VisionLanguageConfig(image_feature_size=f,
-                                    image_token_id=32000,
-                                    image_input_shape=input_shape))
-
-
-model_and_vl_config = [
-    *iter_llava_configs("llava-hf/llava-1.5-7b-hf"),
-]
+models = ["llava-hf/llava-1.5-7b-hf"]
 
 
 def vllm_to_hf_output(vllm_output: Tuple[List[int], str,
                                          Optional[SampleLogprobs]],
-                      vlm_config: VisionLanguageConfig, model_id: str):
-    """Sanitize vllm output to be comparable with hf output.
-    The function reduces `input_ids` from 1, 32000, 32000, ..., 32000,
-    x1, x2, x3 ... to 1, 32000, x1, x2, x3 ...
-    It also reduces `output_str` from "<image><image>bla" to "bla".
-    """
+                      model: str):
+    """Sanitize vllm output to be comparable with hf output."""
     output_ids, output_str, out_logprobs = vllm_output
-    image_token_id = vlm_config.image_token_id
 
-    tokenizer = AutoTokenizer.from_pretrained(model_id)
-    image_token_str = tokenizer.decode(image_token_id)
+    tokenizer = AutoTokenizer.from_pretrained(model)
     eos_token_id = tokenizer.eos_token_id
 
     hf_output_ids = [
         token_id for idx, token_id in enumerate(output_ids)
-        if token_id != image_token_id or output_ids[idx - 1] != image_token_id
+        if token_id != IMAGE_TOKEN_ID or output_ids[idx - 1] != IMAGE_TOKEN_ID
     ]
 
-    hf_output_str = output_str \
-        .replace(image_token_str * vlm_config.image_feature_size, "")
-    assert hf_output_str[0] == " "
-    hf_output_str = hf_output_str[1:]
+    assert output_str[0] == " "
+    hf_output_str = output_str[1:]
     if hf_output_ids[-1] == eos_token_id:
         hf_output_str = hf_output_str + tokenizer.decode(eos_token_id)
 

@@ -74,7 +51,7 @@ def run_test(
     hf_runner: Type[HfRunner],
     vllm_runner: Type[VllmRunner],
     image_assets: _ImageAssets,
-    model_and_config: Tuple[str, VisionLanguageConfig],
+    model: str,
     *,
     size_factors: List[float],
     dtype: str,

@@ -92,7 +69,6 @@ def run_test(
     Note, the text input is also adjusted to abide by vllm contract.
     The text output is sanitized to be able to compare with hf.
     """
-    model_id, vlm_config = model_and_config
     images = [asset.pil_image for asset in image_assets]
 
     inputs_per_image = [(

@@ -106,12 +82,11 @@ def run_test(
     # will hurt multiprocessing backend with fork method (the default method).
 
     # max_model_len should be greater than image_feature_size
-    with vllm_runner(model_id,
+    with vllm_runner(model,
                      dtype=dtype,
                      tensor_parallel_size=tensor_parallel_size,
                      distributed_executor_backend=distributed_executor_backend,
-                     enforce_eager=True,
-                     **vlm_config.as_cli_args_dict()) as vllm_model:
+                     enforce_eager=True) as vllm_model:
         vllm_outputs_per_image = [
             vllm_model.generate_greedy_logprobs(prompts,
                                                 max_tokens,

@@ -120,7 +95,7 @@ def run_test(
             for prompts, images in inputs_per_image
         ]
 
-    with hf_runner(model_id, dtype=dtype, is_vision_model=True) as hf_model:
+    with hf_runner(model, dtype=dtype, is_vision_model=True) as hf_model:
         hf_outputs_per_image = [
             hf_model.generate_greedy_logprobs_limit(prompts,
                                                     max_tokens,

@@ -136,15 +111,15 @@ def run_test(
     check_logprobs_close(
         outputs_0_lst=hf_outputs,
         outputs_1_lst=[
-            vllm_to_hf_output(vllm_output, vlm_config, model_id)
+            vllm_to_hf_output(vllm_output, model)
             for vllm_output in vllm_outputs
         ],
         name_0="hf",
         name_1="vllm",
     )
 
 
-@pytest.mark.parametrize("model_and_config", model_and_vl_config)
+@pytest.mark.parametrize("model", models)
 @pytest.mark.parametrize(
     "size_factors",
     [

@@ -161,14 +136,13 @@ def run_test(
 @pytest.mark.parametrize("dtype", ["half"])
 @pytest.mark.parametrize("max_tokens", [128])
 @pytest.mark.parametrize("num_logprobs", [5])
-def test_models(hf_runner, vllm_runner, image_assets, model_and_config,
-                size_factors, dtype: str, max_tokens: int,
-                num_logprobs: int) -> None:
+def test_models(hf_runner, vllm_runner, image_assets, model, size_factors,
+                dtype: str, max_tokens: int, num_logprobs: int) -> None:
     run_test(
         hf_runner,
         vllm_runner,
         image_assets,
-        model_and_config,
+        model,
         size_factors=size_factors,
        dtype=dtype,
        max_tokens=max_tokens,
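
As a quick reference for the de-duplication that ``vllm_to_hf_output`` still performs (the behavior spelled out in the docstring removed above), here is a standalone sketch of the same list comprehension; the helper name is mine, not from the commit:

IMAGE_TOKEN_ID = 32000

def collapse_image_tokens(output_ids):
    # Reduce runs of the image token: 1, 32000, 32000, ..., x1, x2 -> 1, 32000, x1, x2
    return [
        token_id for idx, token_id in enumerate(output_ids)
        if token_id != IMAGE_TOKEN_ID or output_ids[idx - 1] != IMAGE_TOKEN_ID
    ]

# Worked example of the reduction described in the removed docstring.
assert collapse_image_tokens([1, 32000, 32000, 32000, 7, 8]) == [1, 32000, 7, 8]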
