Commit 2f652e6

[Doc] Improve MM Pooling model documentation (#25966)
Signed-off-by: DarkLight1337 <[email protected]>
1 parent e6a226e commit 2f652e6

9 files changed, +290 -98 lines changed

docs/features/multimodal_inputs.md

Lines changed: 1 addition & 1 deletion
@@ -428,7 +428,7 @@ Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions
 If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
 
 For certain models, we provide alternative chat templates inside <gh-dir:examples>.
-For example, VLM2Vec uses <gh-file:examples/template_vlm2vec.jinja> which is different from the default one for Phi-3-Vision.
+For example, VLM2Vec uses <gh-file:examples/template_vlm2vec_phi3v.jinja> which is different from the default one for Phi-3-Vision.
 
 ### Image Inputs
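As an editorial aside, here is a minimal sketch (not part of this commit) of inspecting such an alternative chat template before pointing the server at it with `--chat-template`; the relative path assumes a vLLM repository checkout.

```python
# Editorial sketch, not part of the diff: load the alternative VLM2Vec template
# the same way the server would when it is passed via `--chat-template`.
# The relative path assumes the script is run from a vLLM repository checkout.
from vllm.entrypoints.chat_utils import load_chat_template

template = load_chat_template("examples/template_vlm2vec_phi3v.jinja")
print(template)
```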

docs/models/supported_models.md

Lines changed: 25 additions & 3 deletions
@@ -626,7 +626,29 @@ See [this page](../features/multimodal_inputs.md) on how to pass multi-modal inp
 For hybrid-only models such as Llama-4, Step3 and Mistral-3, a text-only mode can be enabled by setting all supported multimodal modalities to 0 (e.g, `--limit-mm-per-prompt '{"image":0}`) so that their multimodal modules will not be loaded to free up more GPU memory for KV cache.
 
 !!! note
-    vLLM currently only supports adding LoRA to the language backbone of multimodal models.
+    vLLM currently only supports dynamic LoRA adapters on the language backbone of multimodal models.
+    If you wish to use a model with LoRA in the multi-modal encoder,
+    please merge the weights into the base model first before running it in vLLM like a regular model.
+
+    ```python
+    from peft import PeftConfig, PeftModel
+    from transformers import AutoModelForImageTextToText, AutoProcessor
+
+    def merge_and_save(model_id: str, output_dir: str):
+        base_model = AutoModelForImageTextToText.from_pretrained(model_id)
+        lora_model = PeftModel.from_pretrained(
+            base_model,
+            model_id,
+            config=PeftConfig.from_pretrained(model_id),
+        )
+        model = lora_model.merge_and_unload().to(dtype=base_model.dtype)
+        model._hf_peft_config_loaded = False  # Needed to save the merged model
+
+        processor = AutoProcessor.from_pretrained(model_id)
+
+        model.save_pretrained(output_dir)
+        processor.save_pretrained(output_dir)
+    ```
 
 ### Generative Models
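As an editorial follow-up to the note above, here is a minimal sketch (not from this commit) of what running the merged checkpoint "like a regular model" could look like offline; the output directory, engine settings, and prompt are illustrative assumptions.

```python
# Editorial sketch, not part of the diff: load the directory written by
# merge_and_save() like any regular checkpoint. The path, pooling settings,
# and prompt below are illustrative assumptions.
from vllm import LLM

llm = LLM(
    model="./vlm2vec-merged",  # the output_dir passed to merge_and_save()
    runner="pooling",          # run as an embedding model rather than a generator
    max_model_len=4096,
)
(output,) = llm.embed("Find me an everyday image that matches the given caption: a cat")
print("Embedding dimensions:", len(output.outputs.embedding))
```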

@@ -805,8 +827,8 @@ The following table lists those that are tested in vLLM.
 
 | Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) | [V1](gh-issue:8779) |
 |--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
-| `LlavaNextForConditionalGeneration`<sup>C</sup> | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
-| `Phi3VForCausalLM`<sup>C</sup> | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
+| `LlavaNextForConditionalGeneration`<sup>C</sup> | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | ✅︎ | ✅︎ |
+| `Phi3VForCausalLM`<sup>C</sup> | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | | ✅︎ | ✅︎ |
 | `*ForConditionalGeneration`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | \* | N/A | \* | \* | \* |
 
 <sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))

docs/serving/openai_compatible_server.md

Lines changed: 41 additions & 17 deletions
@@ -236,10 +236,32 @@ The following extra parameters are supported:
 Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
 you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
 
+Code example: <gh-file:examples/online_serving/pooling/openai_embedding_client.py>
+
 If the model has a [chat template][chat-template], you can replace `inputs` with a list of `messages` (same schema as [Chat API][chat-api])
-which will be treated as a single prompt to the model.
+which will be treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations:
 
-Code example: <gh-file:examples/online_serving/pooling/openai_embedding_client.py>
+??? code
+
+    ```python
+    from typing import Literal, Union
+
+    from openai import OpenAI
+    from openai._types import NOT_GIVEN, NotGiven
+    from openai.types.chat import ChatCompletionMessageParam
+    from openai.types.create_embedding_response import CreateEmbeddingResponse
+
+    def create_chat_embeddings(
+        client: OpenAI,
+        *,
+        messages: list[ChatCompletionMessageParam],
+        model: str,
+        encoding_format: Union[Literal["base64", "float"], NotGiven] = NOT_GIVEN,
+    ) -> CreateEmbeddingResponse:
+        return client.post(
+            "/embeddings",
+            cast_to=CreateEmbeddingResponse,
+            body={"messages": messages, "model": model, "encoding_format": encoding_format},
+        )
+    ```
 
 #### Multi-modal inputs
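Purely as an editorial illustration (not part of the diff), the `create_chat_embeddings` helper above could be exercised with a plain text message once a compatible server is running; the model name and server address below are assumptions.

```python
# Editorial usage sketch for the helper above, not part of the diff.
# Assumes a local `vllm serve` instance and that create_chat_embeddings is
# defined as shown; the model name and URL are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = create_chat_embeddings(
    client,
    model="TIGER-Lab/VLM2Vec-Full",
    messages=[{"role": "user", "content": "A cat sitting on a windowsill."}],
    encoding_format="float",
)
print("Embedding dimensions:", len(response.data[0].embedding))
```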

@@ -254,42 +276,44 @@ and passing a list of `messages` in the request. Refer to the examples below for
     vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
       --trust-remote-code \
       --max-model-len 4096 \
-      --chat-template examples/template_vlm2vec.jinja
+      --chat-template examples/template_vlm2vec_phi3v.jinja
     ```
 
     !!! important
         Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--runner pooling`
         to run this model in embedding mode instead of text generation mode.
 
         The custom chat template is completely different from the original one for this model,
-        and can be found here: <gh-file:examples/template_vlm2vec.jinja>
+        and can be found here: <gh-file:examples/template_vlm2vec_phi3v.jinja>
 
     Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
 
     ??? code
 
         ```python
-        import requests
-
+        from openai import OpenAI
+        client = OpenAI(
+            base_url="http://localhost:8000/v1",
+            api_key="EMPTY",
+        )
         image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
 
-        response = requests.post(
-            "http://localhost:8000/v1/embeddings",
-            json={
-                "model": "TIGER-Lab/VLM2Vec-Full",
-                "messages": [{
+        response = create_chat_embeddings(
+            client,
+            model="TIGER-Lab/VLM2Vec-Full",
+            messages=[
+                {
                     "role": "user",
                     "content": [
                         {"type": "image_url", "image_url": {"url": image_url}},
                         {"type": "text", "text": "Represent the given image."},
                     ],
-                }],
-            "encoding_format": "float",
-        },
+                }
+            ],
+            encoding_format="float",
         )
-        response.raise_for_status()
-        response_json = response.json()
-        print("Embedding output:", response_json["data"][0]["embedding"])
+
+        print("Image embedding output:", response.data[0].embedding)
         ```
 
 === "DSE-Qwen2-MRL"

examples/offline_inference/vision_language_pooling.py

Lines changed: 77 additions & 8 deletions
@@ -10,6 +10,7 @@
 
 from argparse import Namespace
 from dataclasses import asdict
+from pathlib import Path
 from typing import Literal, NamedTuple, Optional, TypedDict, Union, get_args
 
 from PIL.Image import Image
@@ -19,6 +20,9 @@
 from vllm.multimodal.utils import fetch_image
 from vllm.utils import FlexibleArgumentParser
 
+ROOT_DIR = Path(__file__).parent.parent.parent
+EXAMPLES_DIR = ROOT_DIR / "examples"
+
 
 class TextQuery(TypedDict):
     modality: Literal["text"]
@@ -82,23 +86,27 @@ def run_e5_v(query: Query) -> ModelRequestData:
     )
 
 
-def run_vlm2vec(query: Query) -> ModelRequestData:
+def _get_vlm2vec_prompt_image(query: Query, image_token: str):
     if query["modality"] == "text":
         text = query["text"]
         prompt = f"Find me an everyday image that matches the given caption: {text}"  # noqa: E501
         image = None
     elif query["modality"] == "image":
-        prompt = "<|image_1|> Find a day-to-day image that looks similar to the provided image."  # noqa: E501
+        prompt = f"{image_token} Find a day-to-day image that looks similar to the provided image."  # noqa: E501
         image = query["image"]
     elif query["modality"] == "text+image":
         text = query["text"]
-        prompt = (
-            f"<|image_1|> Represent the given image with the following question: {text}"  # noqa: E501
-        )
+        prompt = f"{image_token} Represent the given image with the following question: {text}"  # noqa: E501
         image = query["image"]
     else:
         modality = query["modality"]
-        raise ValueError(f"Unsupported query modality: '{modality}'")
+        raise ValueError(f"Unsupported query modality: {modality!r}")
+
+    return prompt, image
+
+
+def run_vlm2vec_phi3v(query: Query) -> ModelRequestData:
+    prompt, image = _get_vlm2vec_prompt_image(query, "<|image_1|>")
 
     engine_args = EngineArgs(
         model="TIGER-Lab/VLM2Vec-Full",
@@ -116,6 +124,66 @@ def run_vlm2vec(query: Query) -> ModelRequestData:
     )
 
 
+def run_vlm2vec_qwen2vl(query: Query) -> ModelRequestData:
+    # vLLM does not support LoRA adapters on multi-modal encoder,
+    # so we merge the weights first
+    from huggingface_hub.constants import HF_HUB_CACHE
+    from peft import PeftConfig, PeftModel
+    from transformers import AutoModelForImageTextToText, AutoProcessor
+
+    from vllm.entrypoints.chat_utils import load_chat_template
+
+    model_id = "TIGER-Lab/VLM2Vec-Qwen2VL-2B"
+
+    base_model = AutoModelForImageTextToText.from_pretrained(model_id)
+    lora_model = PeftModel.from_pretrained(
+        base_model,
+        model_id,
+        config=PeftConfig.from_pretrained(model_id),
+    )
+    model = lora_model.merge_and_unload().to(dtype=base_model.dtype)
+    model._hf_peft_config_loaded = False  # Needed to save the merged model
+
+    processor = AutoProcessor.from_pretrained(
+        model_id,
+        # `min_pixels` and `max_pixels` are deprecated
+        size={"shortest_edge": 3136, "longest_edge": 12845056},
+    )
+    processor.chat_template = load_chat_template(
+        # The original chat template is not correct
+        EXAMPLES_DIR / "template_vlm2vec_qwen2vl.jinja",
+    )
+
+    merged_path = str(
+        Path(HF_HUB_CACHE) / ("models--" + model_id.replace("/", "--") + "-vllm")
+    )
+    print(f"Saving merged model to {merged_path}...")
+    print(
+        "NOTE: This directory is not tracked by `huggingface_hub` "
+        "so you have to delete this manually if you don't want it anymore."
+    )
+    model.save_pretrained(merged_path)
+    processor.save_pretrained(merged_path)
+    print("Done!")
+
+    prompt, image = _get_vlm2vec_prompt_image(query, "<|image_pad|>")
+
+    engine_args = EngineArgs(
+        model=merged_path,
+        runner="pooling",
+        max_model_len=4096,
+        trust_remote_code=True,
+        mm_processor_kwargs={"num_crops": 4},
+        limit_mm_per_prompt={"image": 1},
+    )
+
+    return ModelRequestData(
+        engine_args=engine_args,
+        prompt=prompt,
+        image=image,
+    )
+
+
 def run_jinavl_reranker(query: Query) -> ModelRequestData:
     if query["modality"] != "text+images":
         raise ValueError(f"Unsupported query modality: '{query['modality']}'")
@@ -232,7 +300,8 @@ def run_score(model: str, modality: QueryModality, seed: Optional[int]):
 
 model_example_map = {
     "e5_v": run_e5_v,
-    "vlm2vec": run_vlm2vec,
+    "vlm2vec_phi3v": run_vlm2vec_phi3v,
+    "vlm2vec_qwen2vl": run_vlm2vec_qwen2vl,
     "jinavl_reranker": run_jinavl_reranker,
 }
 
@@ -246,7 +315,7 @@ def parse_args():
         "--model-name",
         "-m",
         type=str,
-        default="vlm2vec",
+        default="vlm2vec_phi3v",
        choices=model_example_map.keys(),
        help="The name of the embedding model.",
     )
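For context, here is an editorial sketch (not part of the diff) of how a `ModelRequestData` returned by the renamed helper could be consumed: build an engine from the returned `EngineArgs` and embed the prompt. The text query and driver details are assumptions; the real script's `main()` does more.

```python
# Editorial sketch, not part of the diff: consume the ModelRequestData returned
# by run_vlm2vec_phi3v(). The query text and engine construction below are
# illustrative assumptions.
from dataclasses import asdict

from vllm import LLM

req = run_vlm2vec_phi3v({"modality": "text", "text": "A photo of a golden retriever."})
llm = LLM(**asdict(req.engine_args))

(output,) = llm.embed(req.prompt)
print("Embedding dimensions:", len(output.outputs.embedding))
```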
