
Commit 0ce7b95

[Doc] Update LLaVA docs (#5437)
Co-authored-by: Roger Wang <[email protected]>
1 parent 3987347 commit 0ce7b95

3 files changed: +29 -38 lines changed


docs/source/models/vlm.rst

Lines changed: 2 additions & 2 deletions

@@ -20,9 +20,9 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:
 Currently, the support for vision language models on vLLM has the following limitations:
 
 * Only single image input is supported per text prompt.
-* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means model output might not exactly match the HuggingFace implementation.
+* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means our LLaVA-NeXT output may not exactly match the huggingface implementation.
 
-We are continuously improving user & developer experience for VLMs. Please raise an issue on GitHub if you have any feedback or feature requests.
+We are continuously improving user & developer experience for VLMs. Please `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
 
 Offline Batched Inference
 -------------------------
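
For context on the ``image_input_shape`` limitation above: the engine arguments vlm.rst refers to are passed when constructing the ``LLM``. The following is a minimal sketch for LLaVA-1.5 in the style the docs of this era describe; the exact argument values (token id 32000, 576 image features) are illustrative assumptions, not part of this commit:

    from vllm import LLM

    # Hypothetical example: every input image is resized to the static
    # image_input_shape (1, 3, 336, 336) before reaching the vision tower,
    # which is why outputs may drift slightly from the HuggingFace reference.
    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        image_input_type="pixel_values",  # or "image_features"
        image_token_id=32000,             # assumed <image> token id for LLaVA-1.5
        image_input_shape="1,3,336,336",  # static shape; dynamic shapes unsupported
        image_feature_size=576,           # (336 / 14)^2 patch tokens
    )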

vllm/model_executor/models/llava.py

Lines changed: 16 additions & 13 deletions

@@ -227,7 +227,7 @@ def forward(
         attn_metadata: AttentionMetadata,
         **kwargs: object,
     ) -> SamplerOutput:
-        """Run forward pass for Llava 1.5.
+        """Run forward pass for LLaVA-1.5.
 
         One key thing to understand is the `input_ids` already accounts for the
         positions of the to-be-inserted image embeddings.
@@ -247,22 +247,25 @@ def forward(
         This way, the `positions` and `attn_metadata` are consistent
         with the `input_ids`.
 
-        The model takes two types of image inputs:
-        PIXEL_VALUES and IMAGE_FEATURES.
-        The following shows how each maps to huggingface implementation.
-        PIXEL_VALUES:
-        - https://github.com/huggingface/transformers/blob/07bdbeb/src/transformers/models/llava/modeling_llava.py#L353
-        IMAGE_FEATURES:
-        - https://github.com/huggingface/transformers/blob/07bdbeb/src/transformers/models/llava/modeling_llava.py#L430
-        before going through the multi modal projector.
+        This model has two modes of image inputs:
+        `PIXEL_VALUES` and `IMAGE_FEATURES`.
 
         Args:
             input_ids: Flattened (concatenated) input_ids corresponding to a
                 batch.
-            pixel_values: For PIXEL_VALUES, expects a batch with shape
-                [1, 3, 336, 336].
-            image_features: For IMAGE_FEATURES, expects a batch with shape
-                [1, 576, 1024].
+            pixel_values: The pixels in each input image.
+                Expects a batch with shape `[1, 3, 336, 336]`.
+                (Only applicable to `PIXEL_VALUES` mode)
+            image_features: The image features for each input image outputted by
+                the vision tower before passing to the multi-modal projector.
+                Expects a batch with shape `[1, 576, 1024]`.
+                (Only applicable to `IMAGE_FEATURES` mode)
+
+        See also:
+            Each input maps to huggingface implementation, as follows:
+
+            - `pixel_values`: https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/models/llava/modeling_llava.py#L360
+            - `image_features`: https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/models/llava/modeling_llava.py#L437
         """
         image_input = self._parse_and_validate_image_input(**kwargs)
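
To make the shape contract in the updated docstring concrete, here is a small, hypothetical sketch (not part of the commit) of the two image-input modes for a single image; the names mirror the keyword arguments that `forward()` receives via `**kwargs`:

    import torch

    # PIXEL_VALUES mode: raw pixels preprocessed to the model's fixed
    # 336x336 resolution, as a batch containing one image.
    pixel_values = torch.randn(1, 3, 336, 336)

    # IMAGE_FEATURES mode: features already produced by the vision tower
    # (576 patch tokens of width 1024 for CLIP ViT-L/14 at 336px), which
    # skip the vision tower and go straight to the multi-modal projector.
    image_features = torch.randn(1, 576, 1024)

    assert pixel_values.shape == (1, 3, 336, 336)
    assert image_features.shape == (1, 576, 1024)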
vllm/model_executor/models/llava_next.py

Lines changed: 11 additions & 23 deletions

@@ -108,15 +108,6 @@ def _image_pixel_processor(
 @MULTIMODAL_REGISTRY.register_image_pixel_input(_image_pixel_processor)
 @MULTIMODAL_REGISTRY.register_dummy_data(_get_dummy_image_data)
 class LlavaNextForConditionalGeneration(VisionLanguageModelBase):
-    """
-    Args to `forward()`:
-        input_ids: Flattened (concatenated) input_ids corresponding to a
-            batch.
-        pixel_values: For PIXEL_VALUES, expects a batch with shape
-            [1, num_patches, 3, 336, 336].
-        image_features: For IMAGE_FEATURES, expects a batch with shape
-            [1, num_patches, 1176, 1024].
-    """
 
     def __init__(self,
                  config: LlavaNextConfig,
@@ -355,7 +346,7 @@ def forward(
         attn_metadata: AttentionMetadata,
         **kwargs: object,
     ) -> SamplerOutput:
-        """Run forward pass for Llava 1.5.
+        """Run forward pass for LlaVA-NeXT.
 
         One key thing to understand is the `input_ids` already accounts for the
         positions of the to-be-inserted image embeddings.
@@ -375,22 +366,19 @@ def forward(
         This way, the `positions` and `attn_metadata` are consistent
        with the `input_ids`.
 
-        The model takes two types of image inputs:
-        PIXEL_VALUES and IMAGE_FEATURES.
-        The following shows how each maps to huggingface implementation.
-        PIXEL_VALUES:
-        - https://github.com/huggingface/transformers/blob/07bdbeb/src/transformers/models/llava/modeling_llava.py#L353
-        IMAGE_FEATURES:
-        - https://github.com/huggingface/transformers/blob/07bdbeb/src/transformers/models/llava/modeling_llava.py#L430
-        before going through the multi modal projector.
-
         Args:
             input_ids: Flattened (concatenated) input_ids corresponding to a
                 batch.
-            pixel_values: For PIXEL_VALUES, expects a batch with shape
-                [1, 3, 336, 336].
-            image_features: For IMAGE_FEATURES, expects a batch with shape
-                [1, 576, 1024].
+            pixel_values: The pixels in each grid patch for each input image.
+                Expects a batch with shape `[1, num_patches, 3, 336, 336]`.
+            image_sizes: The original `(width, height)` for each input image.
+                Expects a batch with shape `[1, 2]`.
+
+        See also:
+            Each input maps to huggingface implementation, as follows:
+
+            - `pixel_values`: https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/models/llava_next/modeling_llava_next.py#L690
+            - `image_sizes`: https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/models/llava_next/modeling_llava_next.py#L691
         """
         image_input = self._parse_and_validate_image_input(**kwargs)
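
As an illustration of the LLaVA-NeXT shape contract described above (a hypothetical sketch, not part of the commit): a single input image is split into `num_patches` grid patches, and its original size is passed separately so the model can un-pad the patch features.

    import torch

    # Hypothetical example: one image tiled into num_patches 336x336 patches
    # (the base view plus the tiles chosen from the model's grid pinpoints).
    num_patches = 5

    pixel_values = torch.randn(1, num_patches, 3, 336, 336)
    image_sizes = torch.tensor([[1024, 768]])  # original (width, height) per image

    assert pixel_values.shape == (1, num_patches, 3, 336, 336)
    assert image_sizes.shape == (1, 2)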
