Merged
Changes from 5 commits
10 changes: 10 additions & 0 deletions examples/models/llava_onevision1_5.sh
@@ -0,0 +1,10 @@
export HF_HOME="~/.cache/huggingface"

# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
Contributor
is this link correct?

The link should be fixed.

# pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git

accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
--model=llava_onevision1_5 \
--model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
--tasks=mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench \
--batch_size=1
Comment on lines +5 to +9
Contributor

⚠️ Potential issue

🧩 Analysis chain

Sanity check: ensure flash_attention_2 is available on the target machines.

If FA2 isn’t installed or supported for the GPU/driver stack, runs will fail. Verify ahead of time.


🏁 Script executed:

#!/bin/bash
# Verify flash-attn import and CUDA compatibility.
python - <<'PY'
try:
    import flash_attn # type: ignore
    print("flash_attn import: OK")
except Exception as e:
    print("flash_attn import failed:", e)
PY

Length of output: 54


Critical: FlashAttention2 is missing; install it or guard the attn_implementation usage (examples/models/llava_onevision1_5.sh, lines 6–10).

The local check failed with the import error "No module named 'flash_attn'". Runs with attn_implementation=flash_attention_2 will therefore fail.

  • Fix A: Install FlashAttention2 on the target machines and document the required CUDA/PyTorch/driver versions.
  • Fix B: Detect availability at runtime and fall back to a supported default attention implementation, or remove the flag (see the sketch below).
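
A minimal sketch of Fix B, assuming the guard lives in the model constructor rather than the shell script; the importlib check and the sdpa fallback are illustrative choices, not part of this PR:

import importlib.util
from typing import Optional

from loguru import logger as eval_logger


def resolve_attn_implementation(requested: Optional[str]) -> Optional[str]:
    """Return the requested attention backend, falling back when flash_attn is unavailable."""
    if requested != "flash_attention_2":
        return requested
    # flash_attention_2 only works when the flash_attn package imports cleanly.
    if importlib.util.find_spec("flash_attn") is None:
        eval_logger.warning("flash_attn is not installed; falling back to sdpa attention.")
        return "sdpa"  # PyTorch scaled_dot_product_attention backend
    return requested

The resolved value would then replace the raw attn_implementation argument when building model_kwargs, so the launch command no longer crashes on machines without flash_attn.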
🤖 Prompt for AI Agents
In examples/models/llava_onevision1_5.sh around lines 6 to 10, the script forces
attn_implementation=flash_attention_2 which will fail if FlashAttention2 is not
installed (python import error: No module named 'flash_attn'); either document
and instruct installing FlashAttention2 with required CUDA/PyTorch/driver
versions, or modify the script to detect FlashAttention2 availability at runtime
and if absent remove or replace the attn_implementation flag with a supported
default (e.g., None or a safe implementation) so the launch command won’t crash
on machines without flash_attn.

1 change: 1 addition & 0 deletions lmms_eval/models/__init__.py
@@ -36,6 +36,7 @@
"llava": "Llava",
"llava_hf": "LlavaHf",
"llava_onevision": "Llava_OneVision",
"llava_onevision1_5": "Llava_OneVision1_5",
"llava_onevision_moviechat": "Llava_OneVision_MovieChat",
"llava_sglang": "LlavaSglang",
"llava_vid": "LlavaVid",
344 changes: 344 additions & 0 deletions lmms_eval/models/simple/llava_onevision1_5.py
@@ -0,0 +1,344 @@
import base64
import re
from io import BytesIO
from typing import List, Optional, Tuple, Union

import decord
import numpy as np
import torch
from accelerate import Accelerator, DistributedType
from loguru import logger as eval_logger
from PIL import Image
from tqdm import tqdm
from transformers import (
    AutoProcessor,
    AutoTokenizer,
    AutoModelForCausalLM
)

from lmms_eval import utils
from lmms_eval.api.instance import Instance
from lmms_eval.api.model import lmms
from lmms_eval.api.registry import register_model

try:
    from qwen_vl_utils import process_vision_info
except ImportError:
    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")

Comment on lines +20 to +24
Contributor

⚠️ Potential issue

Fix hard dependency on qwen_vl_utils; add safe fallback for images and guard videos.

As written, ImportError leaves process_vision_info undefined, causing a NameError at runtime. Provide a guarded import and a fallback path that collects PIL images; explicitly raise for videos when qwen_vl_utils is unavailable.

@@
-try:
-    from qwen_vl_utils import process_vision_info
-except ImportError:
-    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")
+try:
+    from qwen_vl_utils import process_vision_info  # type: ignore
+except ImportError:
+    process_vision_info = None  # fallback in generate_until
+    eval_logger.warning(
+        "qwen_vl_utils not installed; falling back to processor-only image handling."
+    )
@@
-            texts = [self.processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batched_messages]
-            image_inputs, video_inputs = process_vision_info(batched_messages)
+            texts = [
+                self.processor.apply_chat_template(
+                    msg, tokenize=False, add_generation_prompt=True
+                )
+                for msg in batched_messages
+            ]
+            if process_vision_info is not None:
+                image_inputs, video_inputs = process_vision_info(batched_messages)
+            else:
+                # Fallback: collect PIL images only; videos unsupported without qwen_vl_utils
+                image_inputs, video_inputs = [], None
+                for msg in batched_messages:
+                    user = next(p for p in msg if p["role"] == "user")["content"]
+                    imgs = [
+                        part["image"]
+                        for part in user
+                        if isinstance(part, dict)
+                        and part.get("type") == "image"
+                        and isinstance(part.get("image"), Image.Image)
+                    ]
+                    if any(isinstance(part, dict) and part.get("type") == "video" for part in user):
+                        raise NotImplementedError(
+                            "Video inputs require qwen_vl_utils or a native LLaVA video path."
+                        )
+                    image_inputs.append(imgs or None)

Also applies to: 279-289


Contributor

🛠️ Refactor suggestion

Add required type hints; avoid shadowing builtins in flatten.

Properties lack return annotations; flatten has no type hints and shadows the builtin input as its parameter name.

@@
+T = TypeVar("T")
@@
-    def config(self):
+    def config(self) -> Any:
@@
-    def tokenizer(self):
+    def tokenizer(self) -> Any:
@@
-    def model(self):
+    def model(self) -> torch.nn.Module:
@@
-    def eot_token_id(self):
+    def eot_token_id(self) -> int:
@@
-    def max_length(self):
+    def max_length(self) -> int:
@@
-    def batch_size(self):
+    def batch_size(self) -> int:
@@
-    def device(self):
+    def device(self) -> torch.device:
@@
-    def rank(self):
+    def rank(self) -> int:
@@
-    def world_size(self):
+    def world_size(self) -> int:
@@
-    def flatten(self, input):
-        new_list = []
-        for i in input:
-            for j in i:
-                new_list.append(j)
-        return new_list
+    def flatten(self, items: Iterable[Iterable[T]]) -> List[T]:
+        return [j for i in items for j in i]

Also applies to: 130-169, 174-179

🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around line 29 (and also for
the functions spanning lines 130-169 and 174-179), add explicit type hints for
property return types and function signatures and stop shadowing builtins:
change the flatten function parameter name from input to inputs (or similar),
add parameter and return types (e.g., inputs: Iterable[Any] -> List[Any] or
Sequence[Any] -> List[Any]), and annotate all properties with their concrete
return types (for example -> torch.Tensor or -> Sequence[torch.Tensor] as
appropriate) and add typing imports (from typing import Any, Iterable, Sequence,
List, Optional) at the top. Ensure signature types for the functions at 130-169
and 174-179 are updated consistently (parameters typed and return types
declared) to satisfy static type checkers.

@register_model("llava_onevision1_5")
class Llava_OneVision1_5(lmms):
"""
Llava_OneVision1_5 Model
"https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"
"""

def __init__(
self,
pretrained: str = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct",
device: Optional[str] = "cuda",
device_map: Optional[str] = "auto",
batch_size: Optional[Union[int, str]] = 1,
use_cache=True,
attn_implementation: Optional[str] = None,
min_pixels: int = 256 * 28 * 28,
max_pixels: int = 1605632,
max_num_frames: int = 32,
use_custom_video_loader: Optional[bool] = False,
fps: Optional[float] = None, # Only applicable if use_custom_video_loader is True
max_image_size: Optional[int] = None, # Only applicable if use_custom_video_loader is True
system_prompt: Optional[str] = "You are a helpful assistant.",
interleave_visuals: Optional[bool] = False,
reasoning_prompt: Optional[str] = None,
max_length: int = 2048,
**kwargs,
) -> None:
super().__init__()
if kwargs:
eval_logger.warning(f"Ignoring unexpected kwargs: {list(kwargs.keys())}")

# Validate attention implementation
valid_attn_implementations = [None, "flash_attention_2", "sdpa", "eager"]
if attn_implementation not in valid_attn_implementations:
raise ValueError(f"attn_implementation must be one of {valid_attn_implementations}, got {attn_implementation}")

self.use_custom_video_loader = use_custom_video_loader
self.fps = fps
# if self.fps and not self.use_custom_video_loader:
# raise ValueError("FPS is only applicable if use_custom_video_loader is True")
self.max_image_size = max_image_size
if self.max_image_size and not self.use_custom_video_loader:
raise ValueError("max_image_size is only applicable if use_custom_video_loader is True")

Comment on lines +62 to +69
Contributor

⚠️ Potential issue

Enforce fps/max_image_size coupling with use_custom_video_loader.

The fps check is commented out; enforce both constraints.

         self.use_custom_video_loader = use_custom_video_loader
         self.fps = fps
-        # if self.fps and not self.use_custom_video_loader:
-        #     raise ValueError("FPS is only applicable if use_custom_video_loader is True")
+        if self.fps is not None and not self.use_custom_video_loader:
+            raise ValueError(
+                "fps is only applicable if use_custom_video_loader is True"
+            )
         self.max_image_size = max_image_size
         if self.max_image_size and not self.use_custom_video_loader:
             raise ValueError("max_image_size is only applicable if use_custom_video_loader is True")
🧰 Tools
🪛 Ruff (0.12.2)

72-72: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around lines 66 to 73, the fps
validation is currently commented out while max_image_size is enforced;
re-enable and enforce the same coupling for fps: if self.fps is set and not
self.use_custom_video_loader, raise a ValueError with a clear message like "fps
is only applicable if use_custom_video_loader is True"; keep the existing
max_image_size check as-is so both parameters require use_custom_video_loader to
be True.

        accelerator = Accelerator()
        if accelerator.num_processes > 1:
            self._device = torch.device(f"cuda:{accelerator.local_process_index}")
            self.device_map = f"cuda:{accelerator.local_process_index}"
        else:
            self._device = torch.device(device)
            self.device_map = device_map if device_map else device

        # Prepare model loading arguments
        model_kwargs = {
            "torch_dtype": "auto",
            "device_map": self.device_map,
            "trust_remote_code": True
        }

        # Add attention implementation if specified
        if attn_implementation is not None:
            model_kwargs["attn_implementation"] = attn_implementation

        self._model = AutoModelForCausalLM.from_pretrained(pretrained, **model_kwargs).eval()
        self.max_pixels = max_pixels
        self.min_pixels = min_pixels
        self.max_num_frames = max_num_frames

        if reasoning_prompt:
            self.reasoning_prompt = reasoning_prompt.replace("\\n", "\n")
        else:
            self.reasoning_prompt = None
        self.processor = AutoProcessor.from_pretrained(pretrained, max_pixels=max_pixels, min_pixels=min_pixels, trust_remote_code=True)
        self._tokenizer = AutoTokenizer.from_pretrained(pretrained, trust_remote_code=True)
        self.system_prompt = system_prompt
        self.interleave_visuals = interleave_visuals

        self._config = self.model.config
        self._max_length = int(max_length)
        self.batch_size_per_gpu = int(batch_size)
        self.use_cache = use_cache

        if accelerator.num_processes > 1:
            assert accelerator.distributed_type in [
                DistributedType.FSDP,
                DistributedType.MULTI_GPU,
            ], "Unsupported distributed type provided. Only DDP and FSDP are supported."
            if accelerator.distributed_type == DistributedType.FSDP:
                self._model = accelerator.prepare(self.model)
            else:
                self._model = accelerator.prepare_model(self.model, evaluation_mode=True)
            self.accelerator = accelerator
            if self.accelerator.is_local_main_process:
                eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism")
            self._rank = self.accelerator.local_process_index
            self._world_size = self.accelerator.num_processes
        else:
            self._rank = 0
            self._world_size = 1

    @property
    def config(self):
        # return the associated transformers.AutoConfig for the given pretrained model.
        return self._config

    @property
    def tokenizer(self):
        return self._tokenizer

    @property
    def model(self):
        # returns the model, unwrapping it if using Accelerate
        if hasattr(self, "accelerator"):
            return self.accelerator.unwrap_model(self._model)
        else:
            return self._model

    @property
    def eot_token_id(self):
        return self.tokenizer.eos_token_id

    @property
    def max_length(self):
        return self._max_length

    @property
    def batch_size(self):
        return self.batch_size_per_gpu

    @property
    def device(self):
        return self._device

    @property
    def rank(self):
        return self._rank

    @property
    def world_size(self):
        return self._world_size

Comment on lines +122 to +162
Contributor

🛠️ Refactor suggestion

🧩 Analysis chain

Add return type annotations to public properties.

config/tokenizer/model/eot_token_id/max_length/batch_size/device/rank/world_size should declare return types to satisfy “Type hints required for all Python code.”

Run this to list untyped defs/props:


🏁 Script executed:

#!/usr/bin/env bash
rg -nP '^\s*(def|@property)\s+\w+\([^)]*\)(?!\s*->)' -n lmms_eval/models/simple/llava_onevision1_5.py

Length of output: 83


🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail

python3 - <<'PY'
import ast, sys, os

path = "lmms_eval/models/simple/llava_onevision1_5.py"
if not os.path.exists(path):
    print(f"ERROR: {path} not found", file=sys.stderr)
    sys.exit(2)

with open(path, "r", encoding="utf-8") as f:
    src = f.read()

tree = ast.parse(src, filename=path)

def ann_to_str(n):
    if n is None:
        return None
    try:
        return ast.unparse(n)
    except Exception:
        try:
            import astunparse
            return astunparse.unparse(n).strip()
        except Exception:
            return repr(n)

entries = []
for node in tree.body:
    if isinstance(node, ast.FunctionDef):
        entries.append((node.name, node.lineno, any(isinstance(d, ast.Name) and d.id == "property" for d in node.decorator_list), ann_to_str(node.returns)))
    if isinstance(node, ast.ClassDef):
        for n in node.body:
            if isinstance(n, ast.FunctionDef):
                has_prop = any(
                    (isinstance(d, ast.Name) and d.id == "property")
                    or (isinstance(d, ast.Attribute) and getattr(d, "attr", None) == "property")
                    for d in n.decorator_list
                )
                entries.append((f"{node.name}.{n.name}", n.lineno, has_prop, ann_to_str(n.returns)))

entries.sort(key=lambda x: x[1])
for name, lineno, is_prop, ann in entries:
    print(f"{name} (line {lineno}) {'@property' if is_prop else 'def'} -> return: {ann}")

missing = [e for e in entries if e[3] is None]
print()
print(f"Total functions/properties: {len(entries)}")
print(f"Without return annotations: {len(missing)}")
for name, lineno, is_prop, ann in missing:
    print(f"  - {name} (line {lineno}) {'@property' if is_prop else 'def'}")
PY

Length of output: 1522


Add return type annotations to public properties (required)

These @property methods and the flatten function in lmms_eval/models/simple/llava_onevision1_5.py lack return type annotations; add explicit return types to satisfy the repo rule "Type hints required for all Python code." A sketch of the concrete annotations follows the list.

  • Llava_OneVision1_5.config — line 130 — e.g. transformers.PretrainedConfig / transformers.AutoConfig
  • Llava_OneVision1_5.tokenizer — line 135 — e.g. transformers.PreTrainedTokenizerBase
  • Llava_OneVision1_5.model — line 139 — e.g. torch.nn.Module | transformers.PreTrainedModel
  • Llava_OneVision1_5.eot_token_id — line 147 — int
  • Llava_OneVision1_5.max_length — line 151 — int
  • Llava_OneVision1_5.batch_size — line 155 — int
  • Llava_OneVision1_5.device — line 159 — torch.device | str
  • Llava_OneVision1_5.rank — line 163 — int
  • Llava_OneVision1_5.world_size — line 167 — int
  • Llava_OneVision1_5.flatten — line 173 — add explicit return annotation (pick the precise type from implementation)
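
A minimal sketch of the concrete annotations suggested above, assuming the transformers base classes are acceptable return types; only a representative subset of the properties is shown:

from typing import Union

import torch
from transformers import PretrainedConfig, PreTrainedTokenizerBase

from lmms_eval.api.model import lmms


class Llava_OneVision1_5(lmms):
    @property
    def config(self) -> PretrainedConfig:
        # AutoConfig resolves to a PretrainedConfig subclass at load time.
        return self._config

    @property
    def tokenizer(self) -> PreTrainedTokenizerBase:
        return self._tokenizer

    @property
    def model(self) -> torch.nn.Module:
        # Accelerate may wrap the model, so torch.nn.Module is the safe upper bound.
        if hasattr(self, "accelerator"):
            return self.accelerator.unwrap_model(self._model)
        return self._model

    @property
    def device(self) -> Union[torch.device, str]:
        return self._device

    @property
    def rank(self) -> int:
        return self._rank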

    def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
        raise NotImplementedError("Loglikelihood is not implemented for Qwen2.5_VL")

    def flatten(self, input):
        new_list = []
        for i in input:
            for j in i:
                new_list.append(j)
        return new_list

    def generate_until(self, requests: List[Instance]) -> List[str]:
Contributor

🛠️ Refactor suggestion

Add docstrings for public APIs.

Repo rule: public APIs must have docstrings.

 def generate_until(self, requests: List[Instance]) -> List[str]:
+        """
+        Generate responses until stop conditions for a batch of requests.
+        Expects each Instance to provide (context, gen_kwargs, doc_to_visual, doc_id, task, split).
+        """
@@
 def generate_until_multi_round(self, requests) -> List[str]:
-        raise NotImplementedError("TODO: Implement multi-round generation")
+        """
+        Multi-round chat generation for LLaVA-One-Vision.
+        """
+        raise NotImplementedError("TODO: Implement multi-round generation")

Also applies to: 343-345

🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around line 181 (method
generate_until) and lines 343-345 (other public APIs), add proper docstrings for
these public methods: include a one-line summary, parameter descriptions (types
and meanings for requests and any other args), the return value (type and
meaning), and any raised exceptions or side effects; follow the repo's docstring
style (e.g., Google or NumPy) used elsewhere in the project and keep it concise
and informative.

        res = []

        def _collate(x):
            # the negative sign on len(toks) sorts descending - this has a few advantages:
            # - time estimates will always be over not underestimates, which is more useful for planning
            # - to know the size of a batch when going through the list, you know the first one is always the batch
            #   padded context length. this is useful to simplify the batching logic and more importantly to make
            #   automatic adaptive batches much much easier to implement
            # - any OOMs will happen right away rather than near the end
            toks = self.tokenizer.encode(x[0])
            return -len(toks), x[0]

        pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding")
        # we group requests by their generation_kwargs,
        # so that we don't try to execute e.g. greedy sampling and temp=0.8 sampling
        # in the same batch.
        re_ords = utils.Collator([reg.args for reg in requests], _collate, grouping=True)
        chunks = re_ords.get_batched(n=self.batch_size, batch_fn=None)
        for chunk in chunks:
            contexts, all_gen_kwargs, doc_to_visual, doc_id, task, split = zip(*chunk)
            task = task[0]
            split = split[0]
            visual_list = [doc_to_visual[0](self.task_dict[task][split][ids]) for ids in doc_id]
            gen_kwargs = all_gen_kwargs[0]

            # Set default until or update values from gen_kwargs if present
            until = gen_kwargs.get("until", [self.tokenizer.decode(self.eot_token_id)])

            if isinstance(until, str):
                until = [until]
            elif not isinstance(until, list):
                raise ValueError(f"Expected `gen_kwargs['until']` to be of type Union[str, list], but got {type(until)}")

            # Avoid using '\n\n' as a stopper for Qwen2.5VL to prevent truncation, which can lead to incorrect results
            until = [item for item in until if item != "\n\n"]

            if isinstance(contexts, tuple):
                contexts = list(contexts)

            for i in range(len(contexts)):
                if "<image>" in contexts[i]:
                    contexts[i] = contexts[i].replace("<image>", "")

            batched_messages = []
            for i, context in enumerate(contexts):
                if "<image>" in context:
                    context = context.replace("<image>", "")

                message = [{"role": "system", "content": self.system_prompt}]
                if self.reasoning_prompt:
                    context = context.strip() + self.reasoning_prompt
                    contexts[i] = context

                processed_visuals = []
                for visual in visual_list[i]:
                    if isinstance(visual, str) and visual.endswith((".mp4", ".avi", ".mov")):  # Video file
                        vr = decord.VideoReader(visual)
                        first_frame = vr[0].asnumpy()
                        height, width = first_frame.shape[:2]
                        # max_pixels = height * width
                        processed_visuals.append({"type": "video", "video": visual, "max_pixels": self.max_pixels, "min_pixels": self.min_pixels})
                    elif isinstance(visual, Image.Image):
                        processed_visuals.append(
                            {"type": "image", "image": visual.convert("RGB")}
                        )

                if self.interleave_visuals is False:
                    message.append(
                        {
                            "role": "user",
                            "content": processed_visuals + [{"type": "text", "text": context}],
                        }
                    )
                else:  # currently support find <image x> in the context
                    image_placeholders = re.findall(r"<image \d+>", context)
                    content_parts = []
                    text_parts = re.split(r"<image \d+>", context)
                    if text_parts[0]:
                        content_parts.append({"type": "text", "text": text_parts[0]})

                    for i, placeholder in enumerate(image_placeholders):
                        img_idx = int(re.search(r"<image (\d+)>", placeholder).group(1)) - 1
                        image_idx = min(img_idx, len(processed_visuals) - 1) if processed_visuals else 0
                        if processed_visuals and image_idx < len(processed_visuals):
                            content_parts.append(processed_visuals[image_idx])
                        if i + 1 < len(text_parts) and text_parts[i + 1]:
                            content_parts.append({"type": "text", "text": text_parts[i + 1]})

                    message.append(
                        {
                            "role": "user",
                            "content": content_parts,
                        }
                    )

                batched_messages.append(message)

            texts = [self.processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batched_messages]
            image_inputs, video_inputs = process_vision_info(batched_messages)
            if video_inputs is not None:
                total_frames = video_inputs[0].shape[0]
                indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
                # Append the last frame index if not already included
                if total_frames - 1 not in indices:
                    indices = np.append(indices, total_frames - 1)
                video_inputs[0] = video_inputs[0][indices]
Comment on lines +271 to +277
Contributor

⚠️ Potential issue

Process all videos, not just index 0.

Only the first video is subsampled; others are ignored.

-            if video_inputs is not None:
-                total_frames = video_inputs[0].shape[0]
-                indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
-                # Append the last frame index if not already included
-                if total_frames - 1 not in indices:
-                    indices = np.append(indices, total_frames - 1)
-                video_inputs[0] = video_inputs[0][indices]
+            if video_inputs is not None:
+                for vi in range(len(video_inputs)):
+                    total_frames = video_inputs[vi].shape[0]
+                    indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
+                    if total_frames - 1 not in indices:
+                        indices = np.append(indices, total_frames - 1)
+                    video_inputs[vi] = video_inputs[vi][indices]
🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around lines 281 to 287, the
current logic only subsamples video_inputs[0], ignoring other videos; update it
to iterate over all entries in video_inputs (e.g., for i in
range(len(video_inputs)) or for idx, vid in enumerate(video_inputs)) and for
each video compute total_frames = vid.shape[0], create indices = np.linspace(0,
total_frames - 1, self.max_num_frames, dtype=int), append total_frames - 1 if
not present, and then assign video_inputs[i] = vid[indices] so every video is
subsampled consistently. Ensure you handle None and empty videos the same way as
before.

            inputs = self.processor(text=texts, images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")

            if self.device_map == "auto":
                inputs = inputs.to("cuda")
            else:
                inputs = inputs.to(self.device)
Comment on lines +281 to +283
Contributor

⚠️ Potential issue

🧩 Analysis chain

Verify CPU path when device_map='auto'.

inputs.to("cuda") on the auto path may break CPU-only runs. Consider always using inputs.to(self.device) or detect the first model device.


🏁 Script executed:

#!/usr/bin/env bash
# Grep for other generate_until implementations to confirm device handling patterns.
rg -n -C2 'inputs\.to\(' lmms_eval/models | sed -n '1,120p'

Length of output: 5931


Do not unconditionally use "cuda" when device_map == 'auto'.

inputs.to("cuda") breaks CPU-only runs. Use inputs = inputs.to(self.device), or determine the model device (e.g. self.model.device or next(self.model.parameters()).device) and preserve dtype with .to(self.model.dtype) if needed; a sketch follows the prompt below. Affected: lmms_eval/models/simple/llava_onevision1_5.py:281-283; the same pattern appears in whisper.py:180, qwen2_vl.py:353, qwen2_audio.py:242, qwen2_5_vl_interleave.py:355, qwen2_5_vl.py:300, chat/thyme.py:135, chat/qwen2_5_vl.py:81, chat/huggingface.py:242.

🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around lines 281-283, the code
unconditionally moves inputs to "cuda" when device_map == 'auto', which breaks
CPU-only runs; change this to move inputs to the actual model/device (e.g. use
inputs = inputs.to(self.device) or determine the model device via
self.model.device or next(self.model.parameters()).device) and preserve dtype if
needed by using .to(device, dtype=self.model.dtype) so CPU-only environments
work and tensor dtype is consistent with the model.
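
A minimal sketch of the device-resolution idea; the helper name and its placement are illustrative, not part of this PR:

import torch


def resolve_input_device(model: torch.nn.Module, device_map: str, fallback: torch.device) -> torch.device:
    """Pick the device that batched inputs should be moved to, including on CPU-only hosts."""
    if device_map == "auto":
        # With device_map="auto" the model may be sharded across devices; the first
        # parameter's device is where the input embeddings live.
        return next(model.parameters()).device
    return fallback


# Usage inside generate_until (sketch):
# inputs = inputs.to(resolve_input_device(self.model, self.device_map, self.device))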


            # Set default generation kwargs
            default_gen_kwargs = {
                "max_new_tokens": 128,
                "temperature": 0.0,  # Set to 0 for greedy default
                "top_p": None,
                "num_beams": 1,
            }
            # Update with provided kwargs
            current_gen_kwargs = {**default_gen_kwargs, **gen_kwargs}
            pad_token_id = self.tokenizer.pad_token_id

            if current_gen_kwargs["temperature"] > 0:
                current_gen_kwargs["do_sample"] = True
            else:
                current_gen_kwargs["do_sample"] = False
                current_gen_kwargs["temperature"] = None
                current_gen_kwargs["top_p"] = None

            cont = self.model.generate(
                **inputs,
                eos_token_id=self.tokenizer.eos_token_id,
                pad_token_id=pad_token_id,
                do_sample=current_gen_kwargs["do_sample"],
                temperature=current_gen_kwargs["temperature"],
                top_p=current_gen_kwargs["top_p"],
                num_beams=current_gen_kwargs["num_beams"],
                max_new_tokens=current_gen_kwargs["max_new_tokens"],
                use_cache=self.use_cache,
            )

            generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, cont)]
            answers = self.processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
            for i, ans in enumerate(answers):
Comment on lines +314 to +315
Contributor

⚠️ Potential issue

Decode with tokenizer, not processor (avoids AttributeError on some processors).

AutoProcessor may not expose batch_decode; tokenizer does.

-            answers = self.processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+            answers = self.tokenizer.batch_decode(
+                generated_ids_trimmed,
+                skip_special_tokens=True,
+                clean_up_tokenization_spaces=False,
+            )
🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around lines 314-315, the code
calls self.processor.batch_decode which can raise AttributeError because
AutoProcessor may not implement batch_decode; replace this call to use the
tokenizer's batch_decode (e.g.,
self.tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=True,
clean_up_tokenization_spaces=False)) or, if tokenizer might be missing, guard
with a fallback to processor by checking hasattr(self, "tokenizer") and
hasattr(self.tokenizer, "batch_decode") and using processor only as fallback.

                for term in until:
                    if len(term) > 0:
                        ans = ans.split(term)[0]
                answers[i] = ans

            for ans, context in zip(answers, contexts):
                res.append(ans)
                self.cache_hook.add_partial("generate_until", (context, gen_kwargs), ans)
                pbar.update(1)
        # reorder this group of results back to original unsorted form
        res = re_ords.get_original(res)

        pbar.close()
        return res

    def generate_until_multi_round(self, requests) -> List[str]:
        raise NotImplementedError("TODO: Implement multi-round generation")