
Conversation

@mathCrazyy mathCrazyy (Contributor) commented Sep 17, 2025

Before you open a pull-request, please check if a similar issue already exists or has been closed before.

When you open a pull-request, please be sure to include the following

  • A descriptive title: [xxx] XXXX
  • A detailed description

If you run into lint warnings, you can use the following commands to reformat the code.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Thank you for your contributions!

Summary by CodeRabbit

  • New Features

    • Added support for LLaVA-One-Vision 1.5 (8B Instruct) for multimodal generation: image & video inputs, interleaved or non‑interleaved visuals, configurable system/reasoning prompts, batching, distributed acceleration, attention implementation options, and large-image handling; includes evaluation-ready generation on common vision benchmarks.
  • Chores

    • Added an example script to run Accelerate-based evaluations across standard vision tasks.
    • Removed a CI workflow step invoking an external automated code action.

@coderabbitai coderabbitai bot commented Sep 17, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds a new Llava_OneVision1_5 simple model wrapper (multi-modal generation, image/video handling, generate_until), registers it in the model registry, adds an example Accelerate launch script exporting HF_HOME for running lmms_eval with that model, and removes the Claude Code Action step from .github/workflows/claude.yml.

Changes

  • Example script (examples/models/llava_onevision1_5.sh): New shell script that exports HF_HOME="~/.cache/huggingface", includes a commented lmms-eval install line, and runs accelerate launch (8 procs, main_process_port=12399) to invoke lmms_eval --model=llava_onevision1_5 with model_args (pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct, attn_implementation=flash_attention_2, max_pixels=3240000) plus tasks and --batch_size=1.
  • Model registry (lmms_eval/models/__init__.py): Adds the mapping "llava_onevision1_5": "Llava_OneVision1_5" to AVAILABLE_SIMPLE_MODELS (see the wiring sketch after this list).
  • LLaVA-One-Vision 1.5 wrapper (lmms_eval/models/simple/llava_onevision1_5.py): New Llava_OneVision1_5 class registered as llava_onevision1_5. Loads the pretrained model via from_pretrained (device_map/Accelerate/trust_remote_code), builds the processor/tokenizer, validates attn_implementation and max_image_size usage, handles Accelerator device mapping/rank/world_size, exposes properties (config, tokenizer, model, eot_token_id, max_length, batch_size, device, rank, world_size), and implements generate_until (batched multi-request multi-modal generation with image/video processing, frame sampling, and interleaved vs non-interleaved visuals). Includes a flatten helper and NotImplemented placeholders for loglikelihood and generate_until_multi_round.
  • CI workflow (.github/workflows/claude.yml): Removes the workflow step named "Claude Code Action Official" that used anthropics/claude-code-action@beta, eliminating that CI action invocation.
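
For readers unfamiliar with how a simple model plugs into lmms_eval, the wiring above boils down to a registry entry plus a decorated wrapper class. Below is a heavily abridged sketch based on the modules referenced later in this review (lmms_eval.api.model.lmms, lmms_eval.api.registry.register_model, lmms_eval.api.instance.Instance); it is not the actual PR code.

# lmms_eval/models/__init__.py: map the CLI name to the wrapper class
AVAILABLE_SIMPLE_MODELS = {
    # ... existing entries ...
    "llava_onevision1_5": "Llava_OneVision1_5",
}

# lmms_eval/models/simple/llava_onevision1_5.py (abridged)
from typing import List

from lmms_eval.api.instance import Instance
from lmms_eval.api.model import lmms
from lmms_eval.api.registry import register_model


@register_model("llava_onevision1_5")
class Llava_OneVision1_5(lmms):
    def generate_until(self, requests: List[Instance]) -> List[str]:
        # Group requests into batches, build chat messages with visuals,
        # call model.generate, decode, and trim at the stop strings.
        ...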

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Shell as examples/.../llava_onevision1_5.sh
  participant Accelerate
  participant CLI as lmms_eval CLI
  participant Registry as Model Registry
  participant Model as Llava_OneVision1_5
  participant HF as HuggingFace Hub
  participant Proc as AutoProcessor/Tokenizer

  User->>Shell: run script
  Shell->>Accelerate: launch (8 procs, port=12399)
  Accelerate->>CLI: invoke lmms_eval with model & tasks
  CLI->>Registry: resolve "llava_onevision1_5"
  Registry-->>CLI: Llava_OneVision1_5
  CLI->>Model: construct with model_args
  Model->>HF: from_pretrained (weights/config & processor/tokenizer)
  HF-->>Model: model + processor/tokenizer assets
  CLI->>Model: generate_until(requests)
  Model->>Proc: preprocess messages + visuals
  Model->>Model: model.generate(...)
  Model-->>CLI: decoded outputs
  CLI-->>User: evaluation results
sequenceDiagram
  autonumber
  participant Evaluator
  participant Model as Llava_OneVision1_5
  participant Collator
  participant Vision
  participant Proc as Processor
  participant HFModel as CausalLM

  Evaluator->>Model: generate_until(requests)
  Model->>Collator: group/sort by token length
  loop per batch
    Model->>Vision: resolve images/videos & sample frames
    Model->>Proc: build inputs (messages + visuals)
    Model->>HFModel: generate (max_new_tokens, sampling/greedy)
    HFModel-->>Model: token ids
    Model->>Proc: decode + trim until-stop
    Model-->>Evaluator: batch outputs
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • Luodian

Poem

I nibble code and hop with glee,
Eight processes hum — frames dance free.
Prompts and pixels stitched in tune,
Tokens sprout beneath the moon.
A rabbit cheers — inference blooms. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description Check ⚠️ Warning: The PR description is identical to the repository template and was not filled out with a summary of the actual changes, motivation, test instructions, or completed checklist items, so it does not provide the required detailed description. Because the template sections remain blank and no PR-specific information is given, the description fails the repository's description requirements. Resolution: replace the template text with a concise summary of what changed (files modified and the high-level intent), why the change is needed, testing performed or reproduction steps, and any breaking/migration notes, then mark the checklist items as completed and run pre-commit before updating the PR description. Including brief usage or verification commands and noting the removed workflow step will help reviewers validate the change quickly.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title Check ✅ Passed: The title "feat: add llava_onevision1_5" is concise and directly describes the primary change in the changeset: adding support for the llava_onevision1_5 model (new model wrapper, registry entry, and example script). It follows conventional commit style, avoids noise, and is clear enough for teammates scanning history, so it meets the repository's title guidance.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7058fb and cef402c.

📒 Files selected for processing (1)
  • examples/models/llava_onevision1_5.sh (1 hunks)

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

class Llava_OneVision1_5(lmms):
    """
    Llava_OneVision1_5 Model
    "https://huggingface.co/Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct"

This link is broken; the repo id appears to be wrong.


    def __init__(
        self,
        pretrained: str = "Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct",

Same as above: this default pretrained id has the same problem.

from transformers import (
    AutoProcessor,
    AutoTokenizer,
    Qwen2_5_VLForConditionalGeneration,

This class is unused; drop it from the import.


accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args=pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,,attn_implementation=flash_attention_2,interleave_visuals=False,max_pixels=3240000 \

interleave_visuals already defaults to False; there is no need to set it again on this line.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (8)
lmms_eval/models/simple/llava_onevision1_5.py (7)

238-243: Avoid decoding the first video frame just to read size.

This allocates unnecessarily and isn’t used. Remove to reduce I/O and memory pressure.

-                        vr = decord.VideoReader(visual)
-                        first_frame = vr[0].asnumpy()
-                        height, width = first_frame.shape[:2]
-                        # max_pixels = height * width
                         processed_visuals.append({"type": "video", "video": visual, "max_pixels": self.max_pixels, "min_pixels": self.min_pixels})

210-214: Raise TypeError for invalid 'until' type and trim message (Ruff TRY004/TRY003).

-            elif not isinstance(until, list):
-                raise ValueError(f"Expected `gen_kwargs['until']` to be of type Union[str, list], but got {type(until)}")
+            elif not isinstance(until, list):
+                raise TypeError("gen_kwargs['until'] must be str or list[str]")

251-257: Prefer iterable unpacking over concatenation (RUF005).

-                            "content": processed_visuals + [{"type": "text", "text": context}],
+                            "content": [*processed_visuals, {"type": "text", "text": context}],

221-229: Remove duplicate placeholder stripping.

You strip "<image>" twice. Keep the per-item cleanup, drop the pre-loop.

-            for i in range(len(contexts)):
-                if "<image>" in contexts[i]:
-                    contexts[i] = contexts[i].replace("<image>", "")
-
             batched_messages = []
             for i, context in enumerate(contexts):
                 if "<image>" in context:
                     context = context.replace("<image>", "")

231-234: Safer reasoning prompt join.

Preserve whitespace with an explicit newline.

-                if self.reasoning_prompt:
-                    context = context.strip() + self.reasoning_prompt
-                    contexts[i] = context
+                if self.reasoning_prompt:
+                    context = f"{context.strip()}\n{self.reasoning_prompt}"
+                    contexts[i] = context

4-4: Type hints for public APIs (Pyright).

Add Any import; type the flatten param/return and requests type for multi-round.

-from typing import List, Optional, Tuple, Union
+from typing import Any, List, Optional, Tuple, Union
@@
-    def flatten(self, input):
+    def flatten(self, input: List[List[Any]]) -> List[Any]:
@@
-    def generate_until_multi_round(self, requests) -> List[str]:
+    def generate_until_multi_round(self, requests: List[Instance]) -> List[str]:

Also consider short docstrings for public methods per guidelines.

Also applies to: 174-180, 346-347


61-65: Exception text length (TRY003).

Messages are long; consider shorter text or custom exception types if you want verbose context from their __str__.
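
If the custom-exception route is taken, a minimal sketch could look like the following (the class name is illustrative, not from the PR); the long message lives inside the exception, which is what TRY003 asks for. The call site then shrinks to raise UnsupportedAttnImplementationError(attn_implementation, valid_attn_implementations).

from typing import Optional, Tuple


class UnsupportedAttnImplementationError(ValueError):
    """Raised when an unsupported attn_implementation is requested."""

    def __init__(self, requested: Optional[str], valid: Tuple[Optional[str], ...]) -> None:
        # Keep the verbose message here rather than at every raise site.
        super().__init__(f"attn_implementation must be one of {valid}, got {requested}")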

examples/models/llava_onevision1_5.sh (1)

1-1: Add shebang and fix HF_HOME (SC2148, SC2088).

+#!/usr/bin/env bash
+set -euo pipefail
-export HF_HOME="~/.cache/huggingface"
+export HF_HOME="$HOME/.cache/huggingface"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e8b44ff and 80097a2.

📒 Files selected for processing (3)
  • examples/models/llava_onevision1_5.sh (1 hunks)
  • lmms_eval/models/__init__.py (1 hunks)
  • lmms_eval/models/simple/llava_onevision1_5.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/models/__init__.py
  • lmms_eval/models/simple/llava_onevision1_5.py
🧬 Code graph analysis (1)
lmms_eval/models/simple/llava_onevision1_5.py (4)
lmms_eval/api/instance.py (2)
  • Instance (6-29)
  • args (25-29)
lmms_eval/api/model.py (1)
  • lmms (26-331)
lmms_eval/api/registry.py (1)
  • register_model (11-24)
lmms_eval/utils.py (3)
  • Collator (865-1020)
  • chunks (136-171)
  • get_batched (891-913)
🪛 Shellcheck (0.10.0)
examples/models/llava_onevision1_5.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 1-1: Tilde does not expand in quotes. Use $HOME.

(SC2088)

🪛 Ruff (0.12.2)
lmms_eval/models/simple/llava_onevision1_5.py

64-64: Avoid specifying long messages outside the exception class

(TRY003)


72-72: Avoid specifying long messages outside the exception class

(TRY003)


213-213: Prefer TypeError exception for invalid type

(TRY004)


213-213: Avoid specifying long messages outside the exception class

(TRY003)


255-255: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🔇 Additional comments (2)
lmms_eval/models/simple/llava_onevision1_5.py (1)

93-101: Accelerate/device-map usage looks sound.

Model init, device selection, and distributed prepare flows are consistent. Good use of unwrap in property.

If you plan to support CPU fallback, add a guard for device="cpu" when device_map="auto".

Also applies to: 102-106, 112-121
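
For that CPU guard, one possible shape is sketched below; the helper name resolve_device_map is hypothetical, and calling it from __init__ before from_pretrained is an assumption about structure rather than how the PR is written.

from typing import Tuple

import torch


def resolve_device_map(device_map: str, device: str) -> Tuple[str, str]:
    """Fall back to CPU when device_map='auto' but no CUDA device is available."""
    if device_map == "auto" and not torch.cuda.is_available():
        return "cpu", "cpu"
    return device_map, device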

lmms_eval/models/__init__.py (1)

39-40: Approve — registry mapping verified

Found @register_model("llava_onevision1_5") and class Llava_OneVision1_5 in lmms_eval/models/simple/llava_onevision1_5.py, and lmms_eval/models/__init__.py contains "llava_onevision1_5": "Llava_OneVision1_5". No changes required.

Comment on lines 6 to 10
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args=pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,,attn_implementation=flash_attention_2,interleave_visuals=False,max_pixels=3240000 \
    --tasks=mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench \
    --batch_size=1

⚠️ Potential issue

Fix double comma and quote model_args to prevent shell splitting.

There’s an extra comma after pretrained that will break parsing.

-    --model_args=pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,,attn_implementation=flash_attention_2,interleave_visuals=False,max_pixels=3240000 \
+    --model_args="pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,attn_implementation=flash_attention_2,interleave_visuals=False,max_pixels=3240000" \

Optional: make tasks a var and pass via --tasks "$TASKS" for readability.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args=pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,,attn_implementation=flash_attention_2,interleave_visuals=False,max_pixels=3240000 \
    --tasks=mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench \
    --batch_size=1
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args="pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,attn_implementation=flash_attention_2,interleave_visuals=False,max_pixels=3240000" \
    --tasks=mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench \
    --batch_size=1
🤖 Prompt for AI Agents
In examples/models/llava_onevision1_5.sh around lines 6 to 10, the --model_args
string contains a double comma after pretrained which will break parsing and the
whole argument should be quoted to avoid shell splitting; remove the extra comma
so the list is valid (e.g., change
"pretrained=Deep-VLM/...,,attn_implementation" to a single comma) and wrap the
entire --model_args value in quotes, and optionally define
TASKS="mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench"
and pass it as --tasks "$TASKS" for readability and safe expansion.

Comment on lines 16 to 18
    Qwen2_5_VLForConditionalGeneration,
    AutoModelForCausalLM
)

⚠️ Potential issue

Fix Qwen-specific leftovers and vision pipeline mismatch (will break LLaVA).

  • Qwen class import is unused, and qwen_vl_utils is optional but unguarded at callsite.
  • Error message/comments reference Qwen2.5 VL.
  • More importantly, process_vision_info and the base64 data-URL image schema are Qwen-specific. LLaVA-One-Vision expects raw PIL images (and its own chat/processor semantics). As-is, this will likely fail at runtime or silently degrade.

Action:

  • Remove Qwen import, guard qwen_vl_utils usage, and provide a processor-only fallback collecting PIL images. Update messages/comments to LLaVA.

Apply focused fixes:

@@
-from transformers import (
-    AutoProcessor,
-    AutoTokenizer,
-    Qwen2_5_VLForConditionalGeneration,
-    AutoModelForCausalLM
-)
+from transformers import (
+    AutoProcessor,
+    AutoTokenizer,
+    AutoModelForCausalLM,
+)
@@
-try:
-    from qwen_vl_utils import process_vision_info
-except ImportError:
-    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")
+try:
+    from qwen_vl_utils import process_vision_info  # type: ignore
+except ImportError:
+    process_vision_info = None
+    eval_logger.warning(
+        "qwen_vl_utils not installed; falling back to processor-only vision handling"
+    )
@@
-    def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
-        raise NotImplementedError("Loglikelihood is not implemented for Qwen2.5_VL")
+    def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
+        raise NotImplementedError(
+            "Loglikelihood is not implemented for LLaVA-One-Vision yet"
+        )
@@
-            # Avoid using '\n\n' as a stopper for Qwen2.5VL to prevent truncation, which can lead to incorrect results
+            # Avoid overly generic newline stoppers; they can truncate valuable content
+            # in some vision LMs. Prefer explicit task-specific stoppers.
             until = [item for item in until if item != "\n\n"]

Then replace the hard dependency on qwen_vl_utils during batching with a safe fallback (images only):

@@
-            texts = [self.processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batched_messages]
-            image_inputs, video_inputs = process_vision_info(batched_messages)
+            texts = [
+                self.processor.apply_chat_template(
+                    msg, tokenize=False, add_generation_prompt=True
+                )
+                for msg in batched_messages
+            ]
+            if process_vision_info is not None:
+                image_inputs, video_inputs = process_vision_info(batched_messages)
+            else:
+                # Fallback: collect PIL images from the assembled messages for LLaVA
+                image_inputs, video_inputs = [], None
+                for msg in batched_messages:
+                    user = next(p for p in msg if p["role"] == "user")["content"]
+                    imgs = [
+                        part["image"]
+                        for part in user
+                        if isinstance(part, dict)
+                        and part.get("type") == "image"
+                        and isinstance(part.get("image"), Image.Image)
+                    ]
+                    image_inputs.append(imgs or None)

Note: If video is required without qwen_vl_utils, wire a LLaVA-native video path (or raise for now). I can provide that if you want it in this PR.

Also applies to: 25-29, 171-173, 215-217
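
On the note above about a LLaVA-native video path: one possible fallback is to sample frames with decord (already imported in this module) and hand PIL frames to the processor. The helper name below and the assumption that the processor accepts lists of PIL frames are illustrative, not verified against the LLaVA processor API.

from typing import List

import decord
import numpy as np
from PIL import Image


def sample_video_frames(path: str, num_frames: int) -> List[Image.Image]:
    """Uniformly sample up to num_frames RGB frames from a video file."""
    vr = decord.VideoReader(path)
    indices = np.linspace(0, len(vr) - 1, num_frames, dtype=int)
    return [Image.fromarray(vr[int(i)].asnumpy()) for i in indices]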

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (6)
examples/models/llava_onevision1_5.sh (1)

6-10: Fix double comma and quote model_args; optionally factor TASKS.

Unquoted --model_args will split on spaces, and there’s a stray “,,” after pretrained.

 accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
     --model=llava_onevision1_5 \
-    --model_args=pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,,attn_implementation=flash_attention_2,max_pixels=3240000 \
-    --tasks=mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench \
+    --model_args="pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000" \
+    --tasks="${TASKS:-mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench}" \
     --batch_size=1
lmms_eval/models/simple/llava_onevision1_5.py (5)

170-172: Fix model name in error; it references Qwen.

Update to LLaVA to avoid confusion.

-        raise NotImplementedError("Loglikelihood is not implemented for Qwen2.5_VL")
+        raise NotImplementedError(
+            "Loglikelihood is not implemented for LLaVA-One-Vision yet"
+        )

24-28: Import fallback bug: process_vision_info may be undefined (NameError).

If qwen_vl_utils isn’t installed, later use at Line 282 will crash.

 try:
     from qwen_vl_utils import process_vision_info
 except ImportError:
-    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")
+    process_vision_info = None  # type: ignore[assignment]
+    eval_logger.warning(
+        "qwen_vl_utils not installed; falling back to processor-only vision handling"
+    )

281-291: Guard Qwen helper and add LLaVA-compatible fallback (PIL images).

Unconditionally calling process_vision_info will fail when absent and mismatches LLaVA’s expected PIL images.

-            texts = [self.processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batched_messages]
-            image_inputs, video_inputs = process_vision_info(batched_messages)
+            texts = [
+                self.processor.apply_chat_template(
+                    msg, tokenize=False, add_generation_prompt=True
+                )
+                for msg in batched_messages
+            ]
+            if process_vision_info is not None:
+                image_inputs, video_inputs = process_vision_info(batched_messages)
+            else:
+                # Fallback: collect PIL images from message content for LLaVA
+                image_inputs, video_inputs = [], None
+                for msg in batched_messages:
+                    user = next(p for p in msg if p["role"] == "user")["content"]
+                    imgs = [
+                        part["image"]
+                        for part in user
+                        if isinstance(part, dict)
+                        and part.get("type") == "image"
+                        and isinstance(part.get("image"), Image.Image)
+                    ]
+                    image_inputs.append(imgs or None)

234-249: Use raw PIL images; drop base64 data-URL path.

Base64 is Qwen-style and wastes memory; LLaVA processors accept PIL directly.

                 for visual in visual_list[i]:
@@
-                    elif isinstance(visual, Image.Image):  # Handle both single and multiple images
-                        base64_image = visual.convert("RGB")
-                        buffer = BytesIO()
-                        base64_image.save(buffer, format="JPEG")
-                        base64_bytes = base64.b64encode(buffer.getvalue())
-                        base64_string = base64_bytes.decode("utf-8")
-                        processed_visuals.append({"type": "image", "image": f"data:image/jpeg;base64,{base64_string}", "max_pixels": self.max_pixels, "min_pixels": self.min_pixels})
+                    elif isinstance(visual, Image.Image):
+                        processed_visuals.append(
+                            {"type": "image", "image": visual.convert("RGB")}
+                        )

297-325: Never pass None into generate; add inference_mode and pad fallback.

Passing temperature/top_p=None can break transformers; also ensure pad_token_id fallback and use inference mode.

-            pad_token_id = self.tokenizer.pad_token_id
-
-            if current_gen_kwargs["temperature"] > 0:
-                current_gen_kwargs["do_sample"] = True
-            else:
-                current_gen_kwargs["do_sample"] = False
-                current_gen_kwargs["temperature"] = None
-                current_gen_kwargs["top_p"] = None
-
-            cont = self.model.generate(
-                **inputs,
-                eos_token_id=self.tokenizer.eos_token_id,
-                pad_token_id=pad_token_id,
-                do_sample=current_gen_kwargs["do_sample"],
-                temperature=current_gen_kwargs["temperature"],
-                top_p=current_gen_kwargs["top_p"],
-                num_beams=current_gen_kwargs["num_beams"],
-                max_new_tokens=current_gen_kwargs["max_new_tokens"],
-                use_cache=self.use_cache,
-            )
+            pad_token_id = self.tokenizer.pad_token_id or self.tokenizer.eos_token_id
+            do_sample = bool(
+                current_gen_kwargs.get("temperature", 0)
+                and current_gen_kwargs["temperature"] > 0
+            )
+            gen_args = {
+                **inputs,
+                "eos_token_id": self.tokenizer.eos_token_id,
+                "pad_token_id": pad_token_id,
+                "num_beams": int(current_gen_kwargs["num_beams"]),
+                "max_new_tokens": int(current_gen_kwargs["max_new_tokens"]),
+                "use_cache": self.use_cache,
+            }
+            if do_sample:
+                gen_args.update(
+                    do_sample=True,
+                    temperature=float(current_gen_kwargs.get("temperature", 1.0)),
+                    top_p=float(current_gen_kwargs.get("top_p", 1.0)),
+                )
+            with torch.inference_mode():
+                cont = self.model.generate(**gen_args)
🧹 Nitpick comments (6)
examples/models/llava_onevision1_5.sh (1)

1-1: Add shebang; fix HF_HOME per shellcheck.

Declare bash, enable strict mode, and avoid quoted tilde which doesn’t expand.

+#!/usr/bin/env bash
+set -euo pipefail
-export HF_HOME="~/.cache/huggingface"
+export HF_HOME="${HF_HOME:-$HOME/.cache/huggingface}"
lmms_eval/models/simple/llava_onevision1_5.py (5)

173-179: Type hints + small cleanup for flatten; avoid shadowing builtins.

Rename param and add typing; list comprehension is simpler.

-    def flatten(self, input):
-        new_list = []
-        for i in input:
-            for j in i:
-                new_list.append(j)
-        return new_list
+    def flatten(self, items: List[List[object]]) -> List[object]:
+        return [j for i in items for j in i]

214-216: Generalize the stopper comment (remove Qwen-specific wording).

Keep guidance model-agnostic.

-            # Avoid using '\n\n' as a stopper for Qwen2.5VL to prevent truncation, which can lead to incorrect results
+            # Avoid overly generic newline stoppers; they can truncate useful content

283-289: Only the first video stream is sampled.

If batch contains multiple videos, others aren’t frame-sampled. Iterate over all video_inputs.

-            if video_inputs is not None:
-                total_frames = video_inputs[0].shape[0]
-                indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
-                if total_frames - 1 not in indices:
-                    indices = np.append(indices, total_frames - 1)
-                video_inputs[0] = video_inputs[0][indices]
+            if video_inputs is not None:
+                for vi in range(len(video_inputs)):
+                    total_frames = video_inputs[vi].shape[0]
+                    indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
+                    if total_frames - 1 not in indices:
+                        indices = np.append(indices, total_frames - 1)
+                    video_inputs[vi] = video_inputs[vi][indices]

229-233: Add a separator before reasoning_prompt.

Avoid concatenating words when context doesn’t end with whitespace.

-                if self.reasoning_prompt:
-                    context = context.strip() + self.reasoning_prompt
+                if self.reasoning_prompt:
+                    context = context.rstrip() + "\n" + self.reasoning_prompt.lstrip()

1-7: Drop now-unused imports if you switch to PIL-only images.

base64 and BytesIO become dead after the suggested change.

-import base64
@@
-from io import BytesIO
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80097a2 and 9cb96f8.

📒 Files selected for processing (2)
  • examples/models/llava_onevision1_5.sh (1 hunks)
  • lmms_eval/models/simple/llava_onevision1_5.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/models/simple/llava_onevision1_5.py
🧬 Code graph analysis (1)
lmms_eval/models/simple/llava_onevision1_5.py (4)
lmms_eval/api/instance.py (2)
  • Instance (6-29)
  • args (25-29)
lmms_eval/api/model.py (2)
  • lmms (26-331)
  • add_partial (348-352)
lmms_eval/api/registry.py (1)
  • register_model (11-24)
lmms_eval/utils.py (3)
  • Collator (865-1020)
  • chunks (136-171)
  • get_batched (891-913)
🪛 Shellcheck (0.10.0)
examples/models/llava_onevision1_5.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 1-1: Tilde does not expand in quotes. Use $HOME.

(SC2088)

🪛 Ruff (0.12.2)
lmms_eval/models/simple/llava_onevision1_5.py

63-63: Avoid specifying long messages outside the exception class

(TRY003)


71-71: Avoid specifying long messages outside the exception class

(TRY003)


212-212: Prefer TypeError exception for invalid type

(TRY004)


212-212: Avoid specifying long messages outside the exception class

(TRY003)


254-254: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

Comment on lines +129 to +169
    @property
    def config(self):
        # return the associated transformers.AutoConfig for the given pretrained model.
        return self._config

    @property
    def tokenizer(self):
        return self._tokenizer

    @property
    def model(self):
        # returns the model, unwrapping it if using Accelerate
        if hasattr(self, "accelerator"):
            return self.accelerator.unwrap_model(self._model)
        else:
            return self._model

    @property
    def eot_token_id(self):
        return self.tokenizer.eos_token_id

    @property
    def max_length(self):
        return self._max_length

    @property
    def batch_size(self):
        return self.batch_size_per_gpu

    @property
    def device(self):
        return self._device

    @property
    def rank(self):
        return self._rank

    @property
    def world_size(self):
        return self._world_size


🛠️ Refactor suggestion

🧩 Analysis chain

Add return type annotations to public properties.

config/tokenizer/model/eot_token_id/max_length/batch_size/device/rank/world_size should declare return types to satisfy “Type hints required for all Python code.”

Run this to list untyped defs/props:


🏁 Script executed:

#!/usr/bin/env bash
rg -nP '^\s*(def|@property)\s+\w+\([^)]*\)(?!\s*->)' -n lmms_eval/models/simple/llava_onevision1_5.py

Length of output: 83


🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail

python3 - <<'PY'
import ast, sys, os

path = "lmms_eval/models/simple/llava_onevision1_5.py"
if not os.path.exists(path):
    print(f"ERROR: {path} not found", file=sys.stderr)
    sys.exit(2)

with open(path, "r", encoding="utf-8") as f:
    src = f.read()

tree = ast.parse(src, filename=path)

def ann_to_str(n):
    if n is None:
        return None
    try:
        return ast.unparse(n)
    except Exception:
        try:
            import astunparse
            return astunparse.unparse(n).strip()
        except Exception:
            return repr(n)

entries = []
for node in tree.body:
    if isinstance(node, ast.FunctionDef):
        entries.append((node.name, node.lineno, any(isinstance(d, ast.Name) and d.id == "property" for d in node.decorator_list), ann_to_str(node.returns)))
    if isinstance(node, ast.ClassDef):
        for n in node.body:
            if isinstance(n, ast.FunctionDef):
                has_prop = any(
                    (isinstance(d, ast.Name) and d.id == "property")
                    or (isinstance(d, ast.Attribute) and getattr(d, "attr", None) == "property")
                    for d in n.decorator_list
                )
                entries.append((f"{node.name}.{n.name}", n.lineno, has_prop, ann_to_str(n.returns)))

entries.sort(key=lambda x: x[1])
for name, lineno, is_prop, ann in entries:
    print(f"{name} (line {lineno}) {'@property' if is_prop else 'def'} -> return: {ann}")

missing = [e for e in entries if e[3] is None]
print()
print(f"Total functions/properties: {len(entries)}")
print(f"Without return annotations: {len(missing)}")
for name, lineno, is_prop, ann in missing:
    print(f"  - {name} (line {lineno}) {'@property' if is_prop else 'def'}")
PY

Length of output: 1522


Add return type annotations to public properties (required)

These @property methods and the flatten function in lmms_eval/models/simple/llava_onevision1_5.py lack return type annotations; add explicit return types to satisfy the repo rule "Type hints required for all Python code." A brief sketch follows the list below.

  • Llava_OneVision1_5.config — line 130 — e.g. transformers.PretrainedConfig / transformers.AutoConfig
  • Llava_OneVision1_5.tokenizer — line 135 — e.g. transformers.PreTrainedTokenizerBase
  • Llava_OneVision1_5.model — line 139 — e.g. torch.nn.Module | transformers.PreTrainedModel
  • Llava_OneVision1_5.eot_token_id — line 147 — int
  • Llava_OneVision1_5.max_length — line 151 — int
  • Llava_OneVision1_5.batch_size — line 155 — int
  • Llava_OneVision1_5.device — line 159 — torch.device | str
  • Llava_OneVision1_5.rank — line 163 — int
  • Llava_OneVision1_5.world_size — line 167 — int
  • Llava_OneVision1_5.flatten — line 173 — add explicit return annotation (pick the precise type from implementation)
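
In code, the annotated subset would look roughly like this; the types are the suggestions listed above and should be checked against the actual runtime objects rather than taken as verified.

from typing import Any

import torch


class _AnnotatedPropertiesSketch:
    """Illustration only: a few of the properties with return annotations."""

    _config: Any
    _device: torch.device
    _rank: int

    @property
    def config(self) -> Any:
        return self._config

    @property
    def device(self) -> torch.device:
        return self._device

    @property
    def rank(self) -> int:
        return self._rank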

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
examples/models/llava_onevision1_5.sh (2)

8-8: Quote --model_args to avoid accidental word splitting; content otherwise looks good.

Quoting is safer and mirrors earlier feedback; your previous double-comma issue is resolved.

-    --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
+    --model_args="pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000" \

8-8: Resolved: removed redundant interleave_visuals override noted earlier.

No duplicate interleave_visuals flag anymore. Looks good.

🧹 Nitpick comments (3)
examples/models/llava_onevision1_5.sh (3)

1-1: Add a shebang and strict mode for predictable execution.

Without a shebang, the target shell is unknown (ShellCheck SC2148). Add bash shebang and strict flags.

+#!/usr/bin/env bash
+set -euo pipefail
 export HF_HOME="~/.cache/huggingface"

1-1: Fix tilde expansion in HF_HOME.

Quoted ~ does not expand (ShellCheck SC2088). Use $HOME.

-export HF_HOME="~/.cache/huggingface"
+export HF_HOME="$HOME/.cache/huggingface"

9-10: Optional: factor tasks into a variable and quote for readability and safe expansion.

Keeps the command cleaner and prevents surprises if a task name ever includes spaces.

-    --tasks=mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench \
+TASKS="mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench"
+accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
+    --model=llava_onevision1_5 \
+    --model_args="pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000" \
+    --tasks="$TASKS" \
     --batch_size=1
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9cb96f8 and d82f95d.

📒 Files selected for processing (1)
  • examples/models/llava_onevision1_5.sh (1 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
examples/models/llava_onevision1_5.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 1-1: Tilde does not expand in quotes. Use $HOME.

(SC2088)

Comment on lines +6 to +10
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
    --tasks=mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench \
    --batch_size=1

⚠️ Potential issue

🧩 Analysis chain

Sanity check: ensure flash_attention_2 is available on the target machines.

If FA2 isn’t installed or supported for the GPU/driver stack, runs will fail. Verify ahead of time.


🏁 Script executed:

#!/bin/bash
# Verify flash-attn import and CUDA compatibility.
python - <<'PY'
try:
    import flash_attn # type: ignore
    print("flash_attn import: OK")
except Exception as e:
    print("flash_attn import failed:", e)
PY

Length of output: 54


Critical: FlashAttention2 missing — install or guard attn_implementation usage (examples/models/llava_onevision1_5.sh: lines 6–10)

Local check failed: python import error "No module named 'flash_attn'". Runs with attn_implementation=flash_attention_2 will fail.

  • Fix A: Install FlashAttention2 on target machines and document required CUDA/PyTorch/driver versions.
  • Fix B: Detect availability at runtime in the script and fall back to a supported/default attention implementation or remove the flag.
🤖 Prompt for AI Agents
In examples/models/llava_onevision1_5.sh around lines 6 to 10, the script forces
attn_implementation=flash_attention_2 which will fail if FlashAttention2 is not
installed (python import error: No module named 'flash_attn'); either document
and instruct installing FlashAttention2 with required CUDA/PyTorch/driver
versions, or modify the script to detect FlashAttention2 availability at runtime
and if absent remove or replace the attn_implementation flag with a supported
default (e.g., None or a safe implementation) so the launch command won’t crash
on machines without flash_attn.
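
For Fix B, the availability probe does not need to import flash_attn itself; a small Python helper along these lines (the function name is illustrative) could pick the backend before model_args are assembled:

import importlib.util


def pick_attn_implementation(preferred: str = "flash_attention_2") -> str:
    """Fall back to PyTorch SDPA when the flash_attn package is not installed."""
    if preferred == "flash_attention_2" and importlib.util.find_spec("flash_attn") is None:
        return "sdpa"
    return preferred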

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (3)
lmms_eval/models/simple/llava_onevision1_5.py (3)

215-217: Generalize Qwen-specific stopper comment.

Keep guidance but remove model-specific wording.

-            # Avoid using '\n\n' as a stopper for Qwen2.5VL to prevent truncation, which can lead to incorrect results
+            # Avoid overly generic newline stoppers; they can truncate useful content.

24-28: Remove Qwen dependency and add safe vision fallback (prevents NameError and LLaVA mismatch).

qwen_vl_utils is optional but later called unguarded; this will crash if not installed and is Qwen‑specific. Guard the import and provide a processor‑only PIL image fallback aligned with LLaVA.

Apply:

@@
 try:
     from qwen_vl_utils import process_vision_info
 except ImportError:
-    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")
+    process_vision_info = None  # type: ignore[assignment]
+    eval_logger.warning(
+        "qwen_vl_utils not installed; falling back to processor-only vision handling"
+    )
@@
-            texts = [self.processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batched_messages]
-            image_inputs, video_inputs = process_vision_info(batched_messages)
+            texts = [
+                self.processor.apply_chat_template(
+                    msg, tokenize=False, add_generation_prompt=True
+                )
+                for msg in batched_messages
+            ]
+            if process_vision_info is not None:
+                image_inputs, video_inputs = process_vision_info(batched_messages)
+            else:
+                # LLaVA: collect PIL images directly; no video support without qwen_vl_utils
+                image_inputs, video_inputs = [], None
+                for msg in batched_messages:
+                    user = next(p for p in msg if p["role"] == "user")["content"]
+                    imgs = [
+                        part["image"]
+                        for part in user
+                        if isinstance(part, dict)
+                        and part.get("type") == "image"
+                        and isinstance(part.get("image"), Image.Image)
+                    ]
+                    image_inputs.append(imgs or None)

Also applies to: 279-289


296-323: Never pass None to generate; add pad_token fallback and inference_mode.

Avoid None for temperature/top_p, ensure pad_token fallback, and wrap in inference mode.

@@
-            pad_token_id = self.tokenizer.pad_token_id
-
-            if current_gen_kwargs["temperature"] > 0:
-                current_gen_kwargs["do_sample"] = True
-            else:
-                current_gen_kwargs["do_sample"] = False
-                current_gen_kwargs["temperature"] = None
-                current_gen_kwargs["top_p"] = None
-
-            cont = self.model.generate(
-                **inputs,
-                eos_token_id=self.tokenizer.eos_token_id,
-                pad_token_id=pad_token_id,
-                do_sample=current_gen_kwargs["do_sample"],
-                temperature=current_gen_kwargs["temperature"],
-                top_p=current_gen_kwargs["top_p"],
-                num_beams=current_gen_kwargs["num_beams"],
-                max_new_tokens=current_gen_kwargs["max_new_tokens"],
-                use_cache=self.use_cache,
-            )
+            pad_token_id = self.tokenizer.pad_token_id or self.tokenizer.eos_token_id
+            do_sample = bool(current_gen_kwargs.get("temperature", 0) and current_gen_kwargs["temperature"] > 0)
+            gen_args = {
+                **inputs,
+                "eos_token_id": self.tokenizer.eos_token_id,
+                "pad_token_id": pad_token_id,
+                "num_beams": current_gen_kwargs["num_beams"],
+                "max_new_tokens": current_gen_kwargs["max_new_tokens"],
+                "use_cache": self.use_cache,
+            }
+            if do_sample:
+                gen_args.update(
+                    do_sample=True,
+                    temperature=float(current_gen_kwargs.get("temperature", 1.0)),
+                    top_p=float(current_gen_kwargs.get("top_p", 1.0)),
+                )
+            with torch.inference_mode():
+                cont = self.model.generate(**gen_args)
🧹 Nitpick comments (6)
lmms_eval/models/simple/llava_onevision1_5.py (6)

1-4: Remove unused imports.

base64 and BytesIO are unused.

-import base64
 import re
-from io import BytesIO
-from typing import List, Optional, Tuple, Union
+from typing import Any, Iterable, List, Optional, Tuple, Union, TypeVar

171-173: Fix stale Qwen reference in error.

Message should reference LLaVA.

-        raise NotImplementedError("Loglikelihood is not implemented for Qwen2.5_VL")
+        raise NotImplementedError(
+            "Loglikelihood is not implemented for LLaVA-One-Vision yet"
+        )

221-224: Remove duplicate '<image>' placeholder stripping.

The context cleanup happens twice.

-            for i in range(len(contexts)):
-                if "<image>" in contexts[i]:
-                    contexts[i] = contexts[i].replace("<image>", "")
-
             batched_messages = []
             for i, context in enumerate(contexts):
-                if "<image>" in context:
-                    context = context.replace("<image>", "")
-
                 message = [{"role": "system", "content": self.system_prompt}]

Also applies to: 226-230


237-246: Avoid unnecessary video I/O on hot path.

Reading the first frame only to compute dims is wasted; you don’t use it.

-                    if isinstance(visual, str) and visual.endswith((".mp4", ".avi", ".mov")):  # Video file
-                        vr = decord.VideoReader(visual)
-                        first_frame = vr[0].asnumpy()
-                        height, width = first_frame.shape[:2]
-                        # max_pixels = height * width
-                        processed_visuals.append({"type": "video", "video": visual, "max_pixels": self.max_pixels, "min_pixels": self.min_pixels})
+                    if isinstance(visual, str) and visual.endswith((".mp4", ".avi", ".mov")):  # Video file
+                        processed_visuals.append({"type": "video", "video": visual})

252-253: Ruff RUF005: prefer iterable unpacking over list concat.

-                            "content": processed_visuals + [{"type": "text", "text": context}],
+                            "content": [*processed_visuals, {"type": "text", "text": context}],

30-35: Optional: rename class to PascalCase.

PEP 8: class names shouldn’t use underscores. Suggested: LlavaOneVision15. The registry alias can stay "llava_onevision1_5".

-@register_model("llava_onevision1_5")
-class Llava_OneVision1_5(lmms):
+@register_model("llava_onevision1_5")
+class LlavaOneVision15(lmms):
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d82f95d and 46f0ff0.

📒 Files selected for processing (1)
  • lmms_eval/models/simple/llava_onevision1_5.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/models/simple/llava_onevision1_5.py
🧬 Code graph analysis (1)
lmms_eval/models/simple/llava_onevision1_5.py (4)
lmms_eval/api/instance.py (2)
  • Instance (6-29)
  • args (25-29)
lmms_eval/api/model.py (2)
  • lmms (26-331)
  • add_partial (348-352)
lmms_eval/api/registry.py (1)
  • register_model (11-24)
lmms_eval/utils.py (3)
  • Collator (865-1020)
  • chunks (136-171)
  • get_batched (891-913)
🪛 Ruff (0.12.2)
lmms_eval/models/simple/llava_onevision1_5.py

64-64: Avoid specifying long messages outside the exception class

(TRY003)


72-72: Avoid specifying long messages outside the exception class

(TRY003)


213-213: Prefer TypeError exception for invalid type

(TRY004)


213-213: Avoid specifying long messages outside the exception class

(TRY003)


252-252: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

except ImportError:
    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")



🛠️ Refactor suggestion

Add required type hints; avoid shadowing builtins in flatten.

Properties lack return annotations; flatten lacks type hints and shadows the builtin name input.

@@
+T = TypeVar("T")
@@
-    def config(self):
+    def config(self) -> Any:
@@
-    def tokenizer(self):
+    def tokenizer(self) -> Any:
@@
-    def model(self):
+    def model(self) -> torch.nn.Module:
@@
-    def eot_token_id(self):
+    def eot_token_id(self) -> int:
@@
-    def max_length(self):
+    def max_length(self) -> int:
@@
-    def batch_size(self):
+    def batch_size(self) -> int:
@@
-    def device(self):
+    def device(self) -> torch.device:
@@
-    def rank(self):
+    def rank(self) -> int:
@@
-    def world_size(self):
+    def world_size(self) -> int:
@@
-    def flatten(self, input):
-        new_list = []
-        for i in input:
-            for j in i:
-                new_list.append(j)
-        return new_list
+    def flatten(self, items: Iterable[Iterable[T]]) -> List[T]:
+        return [j for i in items for j in i]

Also applies to: 130-169, 174-179

🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around line 29 (and also for
the functions spanning lines 130-169 and 174-179), add explicit type hints for
property return types and function signatures and stop shadowing builtins:
change the flatten function parameter name from input to inputs (or similar),
add parameter and return types (e.g., inputs: Iterable[Any] -> List[Any] or
Sequence[Any] -> List[Any]), and annotate all properties with their concrete
return types (for example -> torch.Tensor or -> Sequence[torch.Tensor] as
appropriate) and add typing imports (from typing import Any, Iterable, Sequence,
List, Optional) at the top. Ensure signature types for the functions at 130-169
and 174-179 are updated consistently (parameters typed and return types
declared) to satisfy static type checkers.

                new_list.append(j)
        return new_list

    def generate_until(self, requests: List[Instance]) -> List[str]:

🛠️ Refactor suggestion

Add docstrings for public APIs.

Repo rule: public APIs must have docstrings.

 def generate_until(self, requests: List[Instance]) -> List[str]:
+        """
+        Generate responses until stop conditions for a batch of requests.
+        Expects each Instance to provide (context, gen_kwargs, doc_to_visual, doc_id, task, split).
+        """
@@
 def generate_until_multi_round(self, requests) -> List[str]:
-        raise NotImplementedError("TODO: Implement multi-round generation")
+        """
+        Multi-round chat generation for LLaVA-One-Vision.
+        """
+        raise NotImplementedError("TODO: Implement multi-round generation")

Also applies to: 343-345

🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around line 181 (method
generate_until) and lines 343-345 (other public APIs), add proper docstrings for
these public methods: include a one-line summary, parameter descriptions (types
and meanings for requests and any other args), the return value (type and
meaning), and any raised exceptions or side effects; follow the repo's docstring
style (e.g., Google or NumPy) used elsewhere in the project and keep it concise
and informative.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
lmms_eval/models/simple/llava_onevision1_5.py (2)

171-173: Fix model name in NotImplemented message.

-        raise NotImplementedError("Loglikelihood is not implemented for Qwen2.5_VL")
+        raise NotImplementedError(
+            "Loglikelihood is not implemented for LLaVA-One-Vision yet"
+        )

207-217: Generalize stopper comment; remove Qwen-specific wording.

-            # Avoid using '\n\n' as a stopper for Qwen2.5VL to prevent truncation, which can lead to incorrect results
+            # Avoid overly generic newline stoppers; they can truncate valuable content
+            # in some vision LMs. Prefer explicit, task-specific stoppers.
             until = [item for item in until if item != "\n\n"]
🧹 Nitpick comments (14)
lmms_eval/models/simple/llava_onevision1_5.py (14)

1-5: Remove unused imports (ruff F401).

base64 and BytesIO are unused.

-import base64
 import re
-from io import BytesIO
-from typing import List, Optional, Tuple, Union
+from typing import List, Optional, Tuple, Union

61-65: Promote valid attn implementations to a constant; wrap long message.

Improves readability and respects the 88-character line limit.

+VALID_ATTN_IMPLEMENTATIONS = (None, "flash_attention_2", "sdpa", "eager")
@@
-        valid_attn_implementations = [None, "flash_attention_2", "sdpa", "eager"]
-        if attn_implementation not in valid_attn_implementations:
-            raise ValueError(f"attn_implementation must be one of {valid_attn_implementations}, got {attn_implementation}")
+        if attn_implementation not in VALID_ATTN_IMPLEMENTATIONS:
+            raise ValueError(
+                f"attn_implementation must be one of "
+                f"{VALID_ATTN_IMPLEMENTATIONS}, got {attn_implementation}"
+            )

74-81: Add CUDA availability fallback.

Avoids device errors on CPU-only hosts.

-        if accelerator.num_processes > 1:
+        if accelerator.num_processes > 1:
             self._device = torch.device(f"cuda:{accelerator.local_process_index}")
             self.device_map = f"cuda:{accelerator.local_process_index}"
         else:
-            self._device = torch.device(device)
+            if device == "cuda" and not torch.cuda.is_available():
+                eval_logger.warning("CUDA not available; falling back to CPU.")
+                device = "cpu"
+            self._device = torch.device(device)
             self.device_map = device_map if device_map else device

98-106: Avoid Optional for booleans; keep types tight.

interleave_visuals and use_custom_video_loader need not be Optional.

-        use_custom_video_loader: Optional[bool] = False,
+        use_custom_video_loader: bool = False,
@@
-        interleave_visuals: Optional[bool] = False,
+        interleave_visuals: bool = False,

130-170: Add required return type annotations for public properties.

Repo rule: type hints required. Add concrete return types.

-from transformers import (
+from transformers import (
     AutoProcessor,
     AutoTokenizer,
-    AutoModelForCausalLM
+    AutoModelForCausalLM,
+    PreTrainedModel,
+    PretrainedConfig,
+    PreTrainedTokenizerBase,
 )
     @property
-    def config(self):
+    def config(self) -> PretrainedConfig:
@@
     @property
-    def tokenizer(self):
+    def tokenizer(self) -> PreTrainedTokenizerBase:
@@
     @property
-    def model(self):
+    def model(self) -> PreTrainedModel:
@@
     @property
-    def eot_token_id(self):
+    def eot_token_id(self) -> int:
@@
     @property
-    def max_length(self):
+    def max_length(self) -> int:
@@
     @property
-    def batch_size(self):
+    def batch_size(self) -> int:
@@
     @property
-    def device(self):
+    def device(self) -> torch.device:
@@
     @property
-    def rank(self):
+    def rank(self) -> int:
@@
     @property
-    def world_size(self):
+    def world_size(self) -> int:

174-180: Avoid shadowing builtins; type the helper; simplify.

-    def flatten(self, input):
-        new_list = []
-        for i in input:
-            for j in i:
-                new_list.append(j)
-        return new_list
+    from typing import Iterable, TypeVar, List
+    T = TypeVar("T")
+
+    def flatten(self, items: Iterable[Iterable[T]]) -> List[T]:
+        return [j for i in items for j in i]

181-183: Add docstring for public API.

-    def generate_until(self, requests: List[Instance]) -> List[str]:
+    def generate_until(self, requests: List[Instance]) -> List[str]:
+        """
+        Generate responses for a batch of Instances until stop tokens.
+        Args:
+            requests: List of Instances with (context, gen_kwargs, doc_to_visual,
+                doc_id, task, split).
+        Returns:
+            List[str]: Generated texts in the original request order.
+        """
         res = []

218-224: Remove redundant placeholder stripping loop.

Context replacement is already handled below.

-            if isinstance(contexts, tuple):
-                contexts = list(contexts)
-
-            for i in range(len(contexts)):
-                if "<image>" in contexts[i]:
-                    contexts[i] = contexts[i].replace("<image>", "")
+            if isinstance(contexts, tuple):
+                contexts = list(contexts)

248-255: Prefer iterable unpacking over concatenation (ruff RUF005).

-                    message.append(
-                        {
-                            "role": "user",
-                            "content": processed_visuals + [{"type": "text", "text": context}],
-                        }
-                    )
+                    message.append(
+                        {
+                            "role": "user",
+                            "content": [*processed_visuals, {"type": "text", "text": context}],
+                        }
+                    )

323-330: Prefer tokenizer for decoding.

Processor may proxy, but tokenizer is clearer.

-            answers = self.processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+            answers = self.tokenizer.batch_decode(
+                generated_ids_trimmed,
+                skip_special_tokens=True,
+                clean_up_tokenization_spaces=False,
+            )

341-343: Add docstring and type hints for generate_until_multi_round.

-    def generate_until_multi_round(self, requests) -> List[str]:
-        raise NotImplementedError("TODO: Implement multi-round generation")
+    def generate_until_multi_round(self, requests: List[Instance]) -> List[str]:
+        """
+        Multi-round chat generation for LLaVA-One-Vision.
+        """
+        raise NotImplementedError("TODO: Implement multi-round generation")

31-35: Class name style vs. repo ergonomics.

Guideline says PascalCase; current name includes underscores. Consider adding a PascalCase alias to avoid churn.

 @register_model("llava_onevision1_5")
 class Llava_OneVision1_5(lmms):
@@
     """

And optionally:

class LlavaOneVision15(Llava_OneVision1_5):
    """PascalCase alias for style compliance."""
    pass

210-214: Use TypeError for invalid type (ruff TRY004).

-            elif not isinstance(until, list):
-                raise ValueError(f"Expected `gen_kwargs['until']` to be of type Union[str, list], but got {type(until)}")
+            elif not isinstance(until, list):
+                raise TypeError(
+                    f"Expected gen_kwargs['until'] to be Union[str, list], "
+                    f"got {type(until)}"
+                )

288-294: Device move: use self.device consistently

self._device is set earlier (torch.device(...)); moving inputs to the literal "cuda" when device_map == "auto" can mismatch model placement. Use inputs = inputs.to(self.device) (or inputs.to(next(self.model.parameters()).device) for absolute alignment).

File: lmms_eval/models/simple/llava_onevision1_5.py lines 288-294
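For readers skimming the thread, a standalone sketch of the suggested device handling (the helper name and dict-based signature are illustrative only, not code from the PR):

```python
import torch


def move_to_model_device(inputs: dict, model: torch.nn.Module) -> dict:
    """Move processor outputs to wherever the model's weights actually live.

    Hypothetical helper mirroring the suggestion above: it asks the model for
    its device instead of assuming a literal "cuda", so CPU-only runs and
    device_map="auto" sharding both work.
    """
    device = next(model.parameters()).device
    return {k: v.to(device) if torch.is_tensor(v) else v for k, v in inputs.items()}
```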

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 46f0ff0 and 6d189cc.

📒 Files selected for processing (1)
  • lmms_eval/models/simple/llava_onevision1_5.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/models/simple/llava_onevision1_5.py
🧬 Code graph analysis (1)
lmms_eval/models/simple/llava_onevision1_5.py (3)
lmms_eval/api/instance.py (2)
  • Instance (6-29)
  • args (25-29)
lmms_eval/api/model.py (2)
  • lmms (26-331)
  • add_partial (348-352)
lmms_eval/utils.py (3)
  • Collator (865-1020)
  • chunks (136-171)
  • get_batched (891-913)
🪛 Ruff (0.12.2)
lmms_eval/models/simple/llava_onevision1_5.py

64-64: Avoid specifying long messages outside the exception class

(TRY003)


72-72: Avoid specifying long messages outside the exception class

(TRY003)


213-213: Prefer TypeError exception for invalid type

(TRY004)


213-213: Avoid specifying long messages outside the exception class

(TRY003)


252-252: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🔇 Additional comments (4)
lmms_eval/models/simple/llava_onevision1_5.py (4)

107-121: Accelerate wrapping looks correct.

Unwrap in property is handled; logging is helpful.
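For context, a minimal sketch of the wrap/unwrap arrangement being approved here (attribute names such as `_model` are assumptions based on the review, not verbatim from the file):

```python
import torch
from accelerate import Accelerator


class WrappedModel:
    """Illustrative wrapper showing the Accelerate pattern referenced above."""

    def __init__(self, model: torch.nn.Module) -> None:
        self.accelerator = Accelerator()
        # prepare_model wraps the module (e.g. in DDP) for multi-process eval.
        self._model = self.accelerator.prepare_model(model, evaluation_mode=True)

    @property
    def model(self) -> torch.nn.Module:
        # Callers always get the bare module, even when Accelerate wrapped it.
        return self.accelerator.unwrap_model(self._model)
```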


237-247: Drop unused video shape inspection; keep minimal video descriptor.

height/width are unused; opening the video just to read frame 0 is wasteful.

-                    if isinstance(visual, str) and visual.endswith((".mp4", ".avi", ".mov")):  # Video file
-                        vr = decord.VideoReader(visual)
-                        first_frame = vr[0].asnumpy()
-                        height, width = first_frame.shape[:2]
-                        # max_pixels = height * width
-                        processed_visuals.append({"type": "video", "video": visual, "max_pixels": self.max_pixels, "min_pixels": self.min_pixels})
+                    if isinstance(visual, str) and visual.endswith((".mp4", ".avi", ".mov")):
+                        processed_visuals.append(
+                            {
+                                "type": "video",
+                                "video": visual,
+                                "max_pixels": self.max_pixels,
+                                "min_pixels": self.min_pixels,
+                            }
+                        )

295-321: Never pass None into generate; inference_mode added.

Good fix: do_sample gating, pad_token fallback, inference_mode use.
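Condensed into a standalone helper, the pattern being praised looks roughly like this (the function and its defaults are hypothetical; only the Hugging Face `generate` kwargs are real API):

```python
import torch
from transformers import PreTrainedModel, PreTrainedTokenizerBase


def safe_generate(
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizerBase,
    inputs: dict,
    gen_kwargs: dict,
) -> torch.Tensor:
    """Illustrative generate() call: eos fallback for padding, sampling knobs
    only when do_sample is True, and no autograd state kept."""
    pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id
    generate_kwargs = {
        "max_new_tokens": gen_kwargs.get("max_new_tokens", 128),
        "pad_token_id": pad_token_id,
        "do_sample": gen_kwargs.get("do_sample", False),
    }
    if generate_kwargs["do_sample"]:
        # Temperature/top_p are meaningless (and produce warnings) without sampling.
        generate_kwargs["temperature"] = gen_kwargs.get("temperature", 1.0)
        generate_kwargs["top_p"] = gen_kwargs.get("top_p", 1.0)
    with torch.inference_mode():
        return model.generate(**inputs, **generate_kwargs)
```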


14-17: Sort/expand imports to include types used in type hints.

Import any types referenced by property/type hints (e.g. PreTrainedModel, PretrainedConfig, PreTrainedTokenizerBase, TypeVar, Iterable). The supplied search returned no matches in lmms_eval/models/simple/llava_onevision1_5.py — verify and add missing imports if you changed signatures.

Location: lmms_eval/models/simple/llava_onevision1_5.py (lines 14–17).
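A sketch of what the expanded import block could look like if the hinted signatures are adopted (exact names depend on which annotations actually land):

```python
# Hypothetical import block matching the type hints suggested in this review.
from typing import Any, Iterable, List, Optional, Tuple, TypeVar, Union

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    AutoTokenizer,
    PretrainedConfig,
    PreTrainedModel,
    PreTrainedTokenizerBase,
)

T = TypeVar("T")
```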

Comment on lines +24 to +28
try:
    from qwen_vl_utils import process_vision_info
except ImportError:
    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")


⚠️ Potential issue

Fix hard dependency on qwen_vl_utils; add safe fallback for images and guard videos.

As written, ImportError leaves process_vision_info undefined, causing a NameError at runtime. Provide a guarded import and a fallback path that collects PIL images; explicitly raise for videos when qwen_vl_utils is unavailable.

@@
-try:
-    from qwen_vl_utils import process_vision_info
-except ImportError:
-    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")
+try:
+    from qwen_vl_utils import process_vision_info  # type: ignore
+except ImportError:
+    process_vision_info = None  # fallback in generate_until
+    eval_logger.warning(
+        "qwen_vl_utils not installed; falling back to processor-only image handling."
+    )
@@
-            texts = [self.processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batched_messages]
-            image_inputs, video_inputs = process_vision_info(batched_messages)
+            texts = [
+                self.processor.apply_chat_template(
+                    msg, tokenize=False, add_generation_prompt=True
+                )
+                for msg in batched_messages
+            ]
+            if process_vision_info is not None:
+                image_inputs, video_inputs = process_vision_info(batched_messages)
+            else:
+                # Fallback: collect PIL images only; videos unsupported without qwen_vl_utils
+                image_inputs, video_inputs = [], None
+                for msg in batched_messages:
+                    user = next(p for p in msg if p["role"] == "user")["content"]
+                    imgs = [
+                        part["image"]
+                        for part in user
+                        if isinstance(part, dict)
+                        and part.get("type") == "image"
+                        and isinstance(part.get("image"), Image.Image)
+                    ]
+                    if any(isinstance(part, dict) and part.get("type") == "video" for part in user):
+                        raise NotImplementedError(
+                            "Video inputs require qwen_vl_utils or a native LLaVA video path."
+                        )
+                    image_inputs.append(imgs or None)

Also applies to: 279-289

Comment on lines +66 to +73
        self.use_custom_video_loader = use_custom_video_loader
        self.fps = fps
        # if self.fps and not self.use_custom_video_loader:
        #     raise ValueError("FPS is only applicable if use_custom_video_loader is True")
        self.max_image_size = max_image_size
        if self.max_image_size and not self.use_custom_video_loader:
            raise ValueError("max_image_size is only applicable if use_custom_video_loader is True")


⚠️ Potential issue

Enforce fps/max_image_size coupling with use_custom_video_loader.

The fps check is commented out; enforce both constraints.

         self.use_custom_video_loader = use_custom_video_loader
         self.fps = fps
-        # if self.fps and not self.use_custom_video_loader:
-        #     raise ValueError("FPS is only applicable if use_custom_video_loader is True")
+        if self.fps is not None and not self.use_custom_video_loader:
+            raise ValueError(
+                "fps is only applicable if use_custom_video_loader is True"
+            )
         self.max_image_size = max_image_size
         if self.max_image_size and not self.use_custom_video_loader:
             raise ValueError("max_image_size is only applicable if use_custom_video_loader is True")
🧰 Tools
🪛 Ruff (0.12.2)

72-72: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around lines 66 to 73, the fps
validation is currently commented out while max_image_size is enforced;
re-enable and enforce the same coupling for fps: if self.fps is set and not
self.use_custom_video_loader, raise a ValueError with a clear message like "fps
is only applicable if use_custom_video_loader is True"; keep the existing
max_image_size check as-is so both parameters require use_custom_video_loader to
be True.

Comment on lines +281 to +287
            if video_inputs is not None:
                total_frames = video_inputs[0].shape[0]
                indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
                # Append the last frame index if not already included
                if total_frames - 1 not in indices:
                    indices = np.append(indices, total_frames - 1)
                video_inputs[0] = video_inputs[0][indices]

⚠️ Potential issue

Process all videos, not just index 0.

Only the first video is subsampled; others are ignored.

-            if video_inputs is not None:
-                total_frames = video_inputs[0].shape[0]
-                indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
-                # Append the last frame index if not already included
-                if total_frames - 1 not in indices:
-                    indices = np.append(indices, total_frames - 1)
-                video_inputs[0] = video_inputs[0][indices]
+            if video_inputs is not None:
+                for vi in range(len(video_inputs)):
+                    total_frames = video_inputs[vi].shape[0]
+                    indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
+                    if total_frames - 1 not in indices:
+                        indices = np.append(indices, total_frames - 1)
+                    video_inputs[vi] = video_inputs[vi][indices]
🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around lines 281 to 287, the
current logic only subsamples video_inputs[0], ignoring other videos; update it
to iterate over all entries in video_inputs (e.g., for i in
range(len(video_inputs)) or for idx, vid in enumerate(video_inputs)) and for
each video compute total_frames = vid.shape[0], create indices = np.linspace(0,
total_frames - 1, self.max_num_frames, dtype=int), append total_frames - 1 if
not present, and then assign video_inputs[i] = vid[indices] so every video is
subsampled consistently. Ensure you handle None and empty videos the same way as
before.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (5)
lmms_eval/models/simple/llava_onevision1_5.py (5)

62-69: Enforce fps/max_image_size coupling to custom loader.

Validate both only when use_custom_video_loader=True; use explicit None checks.

         self.use_custom_video_loader = use_custom_video_loader
         self.fps = fps
-        # if self.fps and not self.use_custom_video_loader:
-        #     raise ValueError("FPS is only applicable if use_custom_video_loader is True")
+        if self.fps is not None and not self.use_custom_video_loader:
+            raise ValueError("fps is only applicable if use_custom_video_loader is True")
         self.max_image_size = max_image_size
-        if self.max_image_size and not self.use_custom_video_loader:
+        if self.max_image_size is not None and not self.use_custom_video_loader:
             raise ValueError("max_image_size is only applicable if use_custom_video_loader is True")

285-312: LGTM: safer generate() args and inference_mode.

Pad fallback, conditional sampling args, and inference context resolve prior pitfalls.


20-24: Guard qwen_vl_utils and add image-only fallback (prevents NameError and mismatched vision path).

process_vision_info is unbound if qwen_vl_utils isn’t installed, and LLaVA can operate on raw PIL images without qwen helpers. Add a guarded import and a safe fallback that collects PIL images; explicitly raise for videos without a native path. Also keep processors’ video subsampling consistent.

Apply:

@@
-try:
-    from qwen_vl_utils import process_vision_info
-except ImportError:
-    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")
+try:
+    from qwen_vl_utils import process_vision_info  # type: ignore
+except ImportError:
+    process_vision_info = None  # fallback in generate_until
+    eval_logger.warning(
+        "qwen_vl_utils not installed; falling back to processor-only image handling."
+    )
@@
-            texts = [self.processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batched_messages]
-            image_inputs, video_inputs = process_vision_info(batched_messages)
-            if video_inputs is not None:
-                total_frames = video_inputs[0].shape[0]
-                indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
-                # Append the last frame index if not already included
-                if total_frames - 1 not in indices:
-                    indices = np.append(indices, total_frames - 1)
-                video_inputs[0] = video_inputs[0][indices]
-            inputs = self.processor(text=texts, images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
+            texts = [
+                self.processor.apply_chat_template(
+                    msg, tokenize=False, add_generation_prompt=True
+                )
+                for msg in batched_messages
+            ]
+            if process_vision_info is not None:
+                image_inputs, video_inputs = process_vision_info(batched_messages)
+            else:
+                # Fallback: collect PIL images only; videos unsupported without qwen_vl_utils
+                image_inputs, video_inputs = [], None
+                for msg in batched_messages:
+                    user = next(p for p in msg if p["role"] == "user")["content"]
+                    imgs = [
+                        part["image"]
+                        for part in user
+                        if isinstance(part, dict)
+                        and part.get("type") == "image"
+                        and isinstance(part.get("image"), Image.Image)
+                    ]
+                    if any(
+                        isinstance(part, dict) and part.get("type") == "video"
+                        for part in user
+                    ):
+                        raise NotImplementedError(
+                            "Video inputs require qwen_vl_utils or a native LLaVA video path."
+                        )
+                    image_inputs.append(imgs or None)
+            if video_inputs is not None:
+                for vi in range(len(video_inputs)):
+                    total_frames = video_inputs[vi].shape[0]
+                    indices = np.linspace(
+                        0, total_frames - 1, self.max_num_frames, dtype=int
+                    )
+                    if total_frames - 1 not in indices:
+                        indices = np.append(indices, total_frames - 1)
+                    video_inputs[vi] = video_inputs[vi][indices]
+            inputs = self.processor(
+                text=texts,
+                images=image_inputs,
+                videos=video_inputs,
+                padding=True,
+                return_tensors="pt",
+            )

Also applies to: 269-279


122-162: Add required type hints for properties and fix flatten’s signature/name.

Repo rule: type hints required; avoid shadowing builtins with parameter name input.

     @property
-    def config(self):
+    def config(self) -> Any:
@@
     @property
-    def tokenizer(self):
+    def tokenizer(self) -> Any:
@@
     @property
-    def model(self):
+    def model(self) -> torch.nn.Module:
@@
     @property
-    def eot_token_id(self):
+    def eot_token_id(self) -> int:
@@
     @property
-    def max_length(self):
+    def max_length(self) -> int:
@@
     @property
-    def batch_size(self):
+    def batch_size(self) -> int:
@@
     @property
-    def device(self):
+    def device(self) -> torch.device:
@@
     @property
-    def rank(self):
+    def rank(self) -> int:
@@
     @property
-    def world_size(self):
+    def world_size(self) -> int:
         return self._world_size
@@
-    def flatten(self, input):
-        new_list = []
-        for i in input:
-            for j in i:
-                new_list.append(j)
-        return new_list
+    def flatten(self, items: Iterable[Iterable[T]]) -> List[T]:
+        return [j for i in items for j in i]

Also applies to: 166-171


173-175: Add docstrings for public APIs (repo requirement).

Brief summaries and param/return docs are required.

 def generate_until(self, requests: List[Instance]) -> List[str]:
-        res = []
+        """
+        Generate text for a batch of Instances until stop conditions.
+
+        Args:
+            requests: Batch providing (context, gen_kwargs, doc_to_visual,
+                doc_id, task, split).
+        Returns:
+            Generated strings in the original request order.
+        """
+        res = []
@@
-    def generate_until_multi_round(self, requests) -> List[str]:
-        raise NotImplementedError("TODO: Implement multi-round generation")
+    def generate_until_multi_round(self, requests) -> List[str]:
+        """
+        Multi-round chat generation for LLaVA-One-Vision.
+        """
+        raise NotImplementedError("TODO: Implement multi-round generation")

Also applies to: 331-333

🧹 Nitpick comments (6)
lmms_eval/models/simple/llava_onevision1_5.py (6)

1-4: Remove unused imports; expand typing for upcoming hints.

base64 and BytesIO are unused; add Any/Iterable/TypeVar for annotations.

-import base64
-import re
-from io import BytesIO
-from typing import List, Optional, Tuple, Union
+import re
+from typing import Any, Iterable, List, Optional, Tuple, TypeVar, Union

Add near the imports:

+T = TypeVar("T")

163-165: Fix copy/paste in NotImplemented message (Qwen → LLaVA).

Message should reference this model.

-        raise NotImplementedError("Loglikelihood is not implemented for Qwen2.5_VL")
+        raise NotImplementedError(
+            "Loglikelihood is not implemented for LLaVA-One-Vision yet"
+        )

199-206: Raise TypeError for invalid ‘until’ type, keep message concise.

Aligns with Ruff TRY004 and clearer API erroring.

             if isinstance(until, str):
                 until = [until]
             elif not isinstance(until, list):
-                raise ValueError(f"Expected `gen_kwargs['until']` to be of type Union[str, list], but got {type(until)}")
+                raise TypeError(
+                    "gen_kwargs['until'] must be a str or List[str]; "
+                    f"got {type(until).__name__}"
+                )

238-244: Use iterable unpacking instead of list concatenation.

Cleaner and satisfies Ruff RUF005.

-                    message.append(
-                        {
-                            "role": "user",
-                            "content": processed_visuals + [{"type": "text", "text": context}],
-                        }
-                    )
+                    message.append(
+                        {
+                            "role": "user",
+                            "content": [*processed_visuals, {"type": "text", "text": context}],
+                        }
+                    )

210-217: Remove duplicated placeholder cleanup (done again per-message).

Avoid double replace; keep per-message cleanup only.

-            for i in range(len(contexts)):
-                if "<image>" in contexts[i]:
-                    contexts[i] = contexts[i].replace("<image>", "")
+            # Defer <image> placeholder cleanup to the per-message loop below.

26-31: Class name style nit: consider PascalCase (no underscores).

Repo guideline says class names use PascalCase. If rename would break registry/imports, consider adding a PascalCase alias.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6d189cc and 8ca3df2.

📒 Files selected for processing (1)
  • lmms_eval/models/simple/llava_onevision1_5.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/models/simple/llava_onevision1_5.py
🧬 Code graph analysis (1)
lmms_eval/models/simple/llava_onevision1_5.py (4)
lmms_eval/api/instance.py (2)
  • Instance (6-29)
  • args (25-29)
lmms_eval/api/model.py (2)
  • lmms (26-331)
  • add_partial (348-352)
lmms_eval/api/registry.py (1)
  • register_model (11-24)
lmms_eval/utils.py (3)
  • Collator (865-1020)
  • chunks (136-171)
  • get_batched (891-913)
🪛 Ruff (0.12.2)
lmms_eval/models/simple/llava_onevision1_5.py

60-60: Avoid specifying long messages outside the exception class

(TRY003)


68-68: Avoid specifying long messages outside the exception class

(TRY003)


205-205: Prefer TypeError exception for invalid type

(TRY004)


205-205: Avoid specifying long messages outside the exception class

(TRY003)


242-242: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

Comment on lines +281 to +283
                inputs = inputs.to("cuda")
            else:
                inputs = inputs.to(self.device)

⚠️ Potential issue

🧩 Analysis chain

Verify CPU path when device_map='auto'.

inputs.to("cuda") on the auto path may break CPU-only runs. Consider always using inputs.to(self.device) or detect the first model device.


🏁 Script executed:

#!/usr/bin/env bash
# Grep for other generate_until implementations to confirm device handling patterns.
rg -n -C2 'inputs\.to\(' lmms_eval/models | sed -n '1,120p'

Length of output: 5931


Do not unconditionally use "cuda" when device_map == 'auto'.

inputs.to("cuda") breaks CPU-only runs — use inputs = inputs.to(self.device) or determine the model device (e.g. self.model.device or next(self.model.parameters()).device) and preserve dtype with .to(self.model.dtype) if needed. Affected: lmms_eval/models/simple/llava_onevision1_5.py:281-283 (same pattern found in whisper.py:180, qwen2_vl.py:353, qwen2_audio.py:242, qwen2_5_vl_interleave.py:355, qwen2_5_vl.py:300, chat/thyme.py:135, chat/qwen2_5_vl.py:81, chat/huggingface.py:242).

🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around lines 281-283, the code
unconditionally moves inputs to "cuda" when device_map == 'auto', which breaks
CPU-only runs; change this to move inputs to the actual model/device (e.g. use
inputs = inputs.to(self.device) or determine the model device via
self.model.device or next(self.model.parameters()).device) and preserve dtype if
needed by using .to(device, dtype=self.model.dtype) so CPU-only environments
work and tensor dtype is consistent with the model.

Comment on lines +314 to +315
answers = self.processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
for i, ans in enumerate(answers):

⚠️ Potential issue

Decode with tokenizer, not processor (avoids AttributeError on some processors).

AutoProcessor may not expose batch_decode; tokenizer does.

-            answers = self.processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+            answers = self.tokenizer.batch_decode(
+                generated_ids_trimmed,
+                skip_special_tokens=True,
+                clean_up_tokenization_spaces=False,
+            )
🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around lines 314-315, the code
calls self.processor.batch_decode which can raise AttributeError because
AutoProcessor may not implement batch_decode; replace this call to use the
tokenizer's batch_decode (e.g.,
self.tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=True,
clean_up_tokenization_spaces=False)) or, if tokenizer might be missing, guard
with a fallback to processor by checking hasattr(self, "tokenizer") and
hasattr(self.tokenizer, "batch_decode") and using processor only as fallback.

@@ -0,0 +1,10 @@
export HF_HOME="~/.cache/huggingface"

# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git

Is this link correct?


The link should be fixed.

@Luodian Luodian merged commit 1676c57 into EvolvingLMMs-Lab:main Sep 18, 2025
1 of 2 checks passed
@KevinZeng08

Dear authors, I notice that Hugging Face transformers does not currently have the Llava-OneVision-1.5 model. Is this PR ready to use for evaluation?
Thanks for your wonderful models!

@yiyexy

yiyexy commented Sep 24, 2025

Dear authors, I notice that Hugging Face transformers does not currently have the Llava-OneVision-1.5 model. Is this PR ready to use for evaluation? Thanks for your wonderful models!

Sure! We are working hard on it.
