Commit 2f652e6

[Doc] Improve MM Pooling model documentation (#25966)
Signed-off-by: DarkLight1337 <[email protected]>
1 parent e6a226e commit 2f652e6

9 files changed, +290 -98 lines changed

docs/features/multimodal_inputs.md

Lines changed: 1 addition & 1 deletion
@@ -428,7 +428,7 @@ Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions
 If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
 
 For certain models, we provide alternative chat templates inside <gh-dir:examples>.
-For example, VLM2Vec uses <gh-file:examples/template_vlm2vec.jinja> which is different from the default one for Phi-3-Vision.
+For example, VLM2Vec uses <gh-file:examples/template_vlm2vec_phi3v.jinja> which is different from the default one for Phi-3-Vision.
 
 ### Image Inputs
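As an editorial aside, here is a minimal sketch (not part of this commit) of inspecting such an alternative chat template before pointing the server at it with `--chat-template`; the relative path assumes a vLLM repository checkout.

```python
# Editorial sketch, not part of the diff: load the alternative VLM2Vec template
# the same way the server would when it is passed via `--chat-template`.
# The relative path assumes the script is run from a vLLM repository checkout.
from vllm.entrypoints.chat_utils import load_chat_template

template = load_chat_template("examples/template_vlm2vec_phi3v.jinja")
print(template)
```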

docs/models/supported_models.md

Lines changed: 25 additions & 3 deletions
@@ -626,7 +626,29 @@ See [this page](../features/multimodal_inputs.md) on how to pass multi-modal inp
 For hybrid-only models such as Llama-4, Step3 and Mistral-3, a text-only mode can be enabled by setting all supported multimodal modalities to 0 (e.g, `--limit-mm-per-prompt '{"image":0}`) so that their multimodal modules will not be loaded to free up more GPU memory for KV cache.
 
 !!! note
-    vLLM currently only supports adding LoRA to the language backbone of multimodal models.
+    vLLM currently only supports dynamic LoRA adapters on the language backbone of multimodal models.
+    If you wish to use a model with LoRA in the multi-modal encoder,
+    please merge the weights into the base model first before running it in vLLM like a regular model.
+
+    ```python
+    from peft import PeftConfig, PeftModel
+    from transformers import AutoModelForImageTextToText, AutoProcessor
+
+    def merge_and_save(model_id: str, output_dir: str):
+        base_model = AutoModelForImageTextToText.from_pretrained(model_id)
+        lora_model = PeftModel.from_pretrained(
+            base_model,
+            model_id,
+            config=PeftConfig.from_pretrained(model_id),
+        )
+        model = lora_model.merge_and_unload().to(dtype=base_model.dtype)
+        model._hf_peft_config_loaded = False  # Needed to save the merged model
+
+        processor = AutoProcessor.from_pretrained(model_id)
+
+        model.save_pretrained(output_dir)
+        processor.save_pretrained(output_dir)
+    ```
 
 ### Generative Models
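As an editorial follow-up to the note above, here is a minimal sketch (not from this commit) of what running the merged checkpoint "like a regular model" could look like offline; the output directory, engine settings, and prompt are illustrative assumptions.

```python
# Editorial sketch, not part of the diff: load the directory written by
# merge_and_save() like any regular checkpoint. The path, pooling settings,
# and prompt below are illustrative assumptions.
from vllm import LLM

llm = LLM(
    model="./vlm2vec-merged",  # the output_dir passed to merge_and_save()
    runner="pooling",          # run as an embedding model rather than a generator
    max_model_len=4096,
)
(output,) = llm.embed("Find me an everyday image that matches the given caption: a cat")
print("Embedding dimensions:", len(output.outputs.embedding))
```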

@@ -805,8 +827,8 @@ The following table lists those that are tested in vLLM.
 
 | Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) | [V1](gh-issue:8779) |
 |--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
-| `LlavaNextForConditionalGeneration`<sup>C</sup> | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
-| `Phi3VForCausalLM`<sup>C</sup> | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
+| `LlavaNextForConditionalGeneration`<sup>C</sup> | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | ✅︎ | ✅︎ |
+| `Phi3VForCausalLM`<sup>C</sup> | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | | ✅︎ | ✅︎ |
 | `*ForConditionalGeneration`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | \* | N/A | \* | \* | \* |
 
 <sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))

docs/serving/openai_compatible_server.md

Lines changed: 41 additions & 17 deletions
@@ -236,10 +236,32 @@ The following extra parameters are supported:
 Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
 you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
 
+Code example: <gh-file:examples/online_serving/pooling/openai_embedding_client.py>
+
 If the model has a [chat template][chat-template], you can replace `inputs` with a list of `messages` (same schema as [Chat API][chat-api])
-which will be treated as a single prompt to the model.
+which will be treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations:
 
-Code example: <gh-file:examples/online_serving/pooling/openai_embedding_client.py>
+??? code
+
+    ```python
+    from typing import Literal, Union
+
+    from openai import OpenAI
+    from openai._types import NOT_GIVEN, NotGiven
+    from openai.types.chat import ChatCompletionMessageParam
+    from openai.types.create_embedding_response import CreateEmbeddingResponse
+
+    def create_chat_embeddings(
+        client: OpenAI,
+        *,
+        messages: list[ChatCompletionMessageParam],
+        model: str,
+        encoding_format: Union[Literal["base64", "float"], NotGiven] = NOT_GIVEN,
+    ) -> CreateEmbeddingResponse:
+        return client.post(
+            "/embeddings",
+            cast_to=CreateEmbeddingResponse,
+            body={"messages": messages, "model": model, "encoding_format": encoding_format},
+        )
+    ```
 
 #### Multi-modal inputs
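Purely as an editorial illustration (not part of the diff), the `create_chat_embeddings` helper above could be exercised with a plain text message once a compatible server is running; the model name and server address below are assumptions.

```python
# Editorial usage sketch for the helper above, not part of the diff.
# Assumes a local `vllm serve` instance and that create_chat_embeddings is
# defined as shown; the model name and URL are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = create_chat_embeddings(
    client,
    model="TIGER-Lab/VLM2Vec-Full",
    messages=[{"role": "user", "content": "A cat sitting on a windowsill."}],
    encoding_format="float",
)
print("Embedding dimensions:", len(response.data[0].embedding))
```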

@@ -254,42 +276,44 @@ and passing a list of `messages` in the request. Refer to the examples below for
     vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
       --trust-remote-code \
       --max-model-len 4096 \
-      --chat-template examples/template_vlm2vec.jinja
+      --chat-template examples/template_vlm2vec_phi3v.jinja
     ```
 
     !!! important
         Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--runner pooling`
         to run this model in embedding mode instead of text generation mode.
 
         The custom chat template is completely different from the original one for this model,
-        and can be found here: <gh-file:examples/template_vlm2vec.jinja>
+        and can be found here: <gh-file:examples/template_vlm2vec_phi3v.jinja>
 
     Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
 
     ??? code
 
         ```python
-        import requests
-
+        from openai import OpenAI
+        client = OpenAI(
+            base_url="http://localhost:8000/v1",
+            api_key="EMPTY",
+        )
         image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
 
-        response = requests.post(
-            "http://localhost:8000/v1/embeddings",
-            json={
-                "model": "TIGER-Lab/VLM2Vec-Full",
-                "messages": [{
+        response = create_chat_embeddings(
+            client,
+            model="TIGER-Lab/VLM2Vec-Full",
+            messages=[
+                {
                     "role": "user",
                     "content": [
                         {"type": "image_url", "image_url": {"url": image_url}},
                         {"type": "text", "text": "Represent the given image."},
                     ],
-                }],
-            "encoding_format": "float",
-        },
+                }
+            ],
+            encoding_format="float",
         )
-        response.raise_for_status()
-        response_json = response.json()
-        print("Embedding output:", response_json["data"][0]["embedding"])
+
+        print("Image embedding output:", response.data[0].embedding)
         ```
 
 === "DSE-Qwen2-MRL"

examples/offline_inference/vision_language_pooling.py

Lines changed: 77 additions & 8 deletions
@@ -10,6 +10,7 @@
 
 from argparse import Namespace
 from dataclasses import asdict
+from pathlib import Path
 from typing import Literal, NamedTuple, Optional, TypedDict, Union, get_args
 
 from PIL.Image import Image
@@ -19,6 +20,9 @@
 from vllm.multimodal.utils import fetch_image
 from vllm.utils import FlexibleArgumentParser
 
+ROOT_DIR = Path(__file__).parent.parent.parent
+EXAMPLES_DIR = ROOT_DIR / "examples"
+
 
 class TextQuery(TypedDict):
     modality: Literal["text"]
@@ -82,23 +86,27 @@ def run_e5_v(query: Query) -> ModelRequestData:
     )
 
 
-def run_vlm2vec(query: Query) -> ModelRequestData:
+def _get_vlm2vec_prompt_image(query: Query, image_token: str):
     if query["modality"] == "text":
         text = query["text"]
         prompt = f"Find me an everyday image that matches the given caption: {text}"  # noqa: E501
         image = None
     elif query["modality"] == "image":
-        prompt = "<|image_1|> Find a day-to-day image that looks similar to the provided image."  # noqa: E501
+        prompt = f"{image_token} Find a day-to-day image that looks similar to the provided image."  # noqa: E501
         image = query["image"]
     elif query["modality"] == "text+image":
         text = query["text"]
-        prompt = (
-            f"<|image_1|> Represent the given image with the following question: {text}"  # noqa: E501
-        )
+        prompt = f"{image_token} Represent the given image with the following question: {text}"  # noqa: E501
         image = query["image"]
     else:
         modality = query["modality"]
-        raise ValueError(f"Unsupported query modality: '{modality}'")
+        raise ValueError(f"Unsupported query modality: {modality!r}")
+
+    return prompt, image
+
+
+def run_vlm2vec_phi3v(query: Query) -> ModelRequestData:
+    prompt, image = _get_vlm2vec_prompt_image(query, "<|image_1|>")
 
     engine_args = EngineArgs(
         model="TIGER-Lab/VLM2Vec-Full",
@@ -116,6 +124,66 @@ def run_vlm2vec(query: Query) -> ModelRequestData:
     )
 
 
+def run_vlm2vec_qwen2vl(query: Query) -> ModelRequestData:
+    # vLLM does not support LoRA adapters on multi-modal encoder,
+    # so we merge the weights first
+    from huggingface_hub.constants import HF_HUB_CACHE
+    from peft import PeftConfig, PeftModel
+    from transformers import AutoModelForImageTextToText, AutoProcessor
+
+    from vllm.entrypoints.chat_utils import load_chat_template
+
+    model_id = "TIGER-Lab/VLM2Vec-Qwen2VL-2B"
+
+    base_model = AutoModelForImageTextToText.from_pretrained(model_id)
+    lora_model = PeftModel.from_pretrained(
+        base_model,
+        model_id,
+        config=PeftConfig.from_pretrained(model_id),
+    )
+    model = lora_model.merge_and_unload().to(dtype=base_model.dtype)
+    model._hf_peft_config_loaded = False  # Needed to save the merged model
+
+    processor = AutoProcessor.from_pretrained(
+        model_id,
+        # `min_pixels` and `max_pixels` are deprecated
+        size={"shortest_edge": 3136, "longest_edge": 12845056},
+    )
+    processor.chat_template = load_chat_template(
+        # The original chat template is not correct
+        EXAMPLES_DIR / "template_vlm2vec_qwen2vl.jinja",
+    )
+
+    merged_path = str(
+        Path(HF_HUB_CACHE) / ("models--" + model_id.replace("/", "--") + "-vllm")
+    )
+    print(f"Saving merged model to {merged_path}...")
+    print(
+        "NOTE: This directory is not tracked by `huggingface_hub` "
+        "so you have to delete this manually if you don't want it anymore."
+    )
+    model.save_pretrained(merged_path)
+    processor.save_pretrained(merged_path)
+    print("Done!")
+
+    prompt, image = _get_vlm2vec_prompt_image(query, "<|image_pad|>")
+
+    engine_args = EngineArgs(
+        model=merged_path,
+        runner="pooling",
+        max_model_len=4096,
+        trust_remote_code=True,
+        mm_processor_kwargs={"num_crops": 4},
+        limit_mm_per_prompt={"image": 1},
+    )
+
+    return ModelRequestData(
+        engine_args=engine_args,
+        prompt=prompt,
+        image=image,
+    )
+
+
 def run_jinavl_reranker(query: Query) -> ModelRequestData:
     if query["modality"] != "text+images":
         raise ValueError(f"Unsupported query modality: '{query['modality']}'")
@@ -232,7 +300,8 @@ def run_score(model: str, modality: QueryModality, seed: Optional[int]):
 
 model_example_map = {
     "e5_v": run_e5_v,
-    "vlm2vec": run_vlm2vec,
+    "vlm2vec_phi3v": run_vlm2vec_phi3v,
+    "vlm2vec_qwen2vl": run_vlm2vec_qwen2vl,
     "jinavl_reranker": run_jinavl_reranker,
 }
 
@@ -246,7 +315,7 @@ def parse_args():
         "--model-name",
         "-m",
         type=str,
-        default="vlm2vec",
+        default="vlm2vec_phi3v",
        choices=model_example_map.keys(),
        help="The name of the embedding model.",
     )
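For context, here is an editorial sketch (not part of the diff) of how a `ModelRequestData` returned by the renamed helper could be consumed: build an engine from the returned `EngineArgs` and embed the prompt. The text query and driver details are assumptions; the real script's `main()` does more.

```python
# Editorial sketch, not part of the diff: consume the ModelRequestData returned
# by run_vlm2vec_phi3v(). The query text and engine construction below are
# illustrative assumptions.
from dataclasses import asdict

from vllm import LLM

req = run_vlm2vec_phi3v({"modality": "text", "text": "A photo of a golden retriever."})
llm = LLM(**asdict(req.engine_args))

(output,) = llm.embed(req.prompt)
print("Embedding dimensions:", len(output.outputs.embedding))
```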
