
Conversation

@mathCrazyy mathCrazyy (Contributor) commented Sep 17, 2025

Before you open a pull-request, please check if a similar issue already exists or has been closed before.

When you open a pull-request, please be sure to include the following

  • A descriptive title: [xxx] XXXX
  • A detailed description

If you run into lint warnings, you can use the following commands to reformat the code.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Thank you for your contributions!

Summary by CodeRabbit

  • New Features

    • Added support for LLaVA-One-Vision 1.5 (8B Instruct) for multimodal generation: image & video inputs, interleaved or non‑interleaved visuals, configurable system/reasoning prompts, batching, distributed acceleration, attention implementation options, and large-image handling; includes evaluation-ready generation on common vision benchmarks.
  • Chores

    • Added an example script to run Accelerate-based evaluations across standard vision tasks.
    • Removed a CI workflow step invoking an external automated code action.

@coderabbitai coderabbitai bot commented Sep 17, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds a new Llava_OneVision1_5 simple model wrapper (multi-modal generation, image/video handling, generate_until), registers it in the model registry, adds an example Accelerate launch script exporting HF_HOME for running lmms_eval with that model, and removes the Claude Code Action step from .github/workflows/claude.yml.

Changes

  • Example script (examples/models/llava_onevision1_5.sh): New shell script that exports HF_HOME="~/.cache/huggingface", includes a commented lmms-eval install line, and runs accelerate launch (8 procs, main_process_port=12399) to invoke lmms_eval --model=llava_onevision1_5 with model_args (pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct, attn_implementation=flash_attention_2, max_pixels=3240000) plus tasks and --batch_size=1.
  • Model registry (lmms_eval/models/__init__.py): Adds the mapping "llava_onevision1_5": "Llava_OneVision1_5" to AVAILABLE_SIMPLE_MODELS (see the wiring sketch after this list).
  • LLaVA-One-Vision 1.5 wrapper (lmms_eval/models/simple/llava_onevision1_5.py): New Llava_OneVision1_5 class registered as llava_onevision1_5. Loads the pretrained model via from_pretrained (device_map/Accelerate/trust_remote_code), builds the processor/tokenizer, validates attn_implementation and max_image_size usage, handles Accelerator device mapping/rank/world_size, exposes properties (config, tokenizer, model, eot_token_id, max_length, batch_size, device, rank, world_size), and implements generate_until (batched multi-request multi-modal generation with image/video processing, frame sampling, and interleaved vs non-interleaved visuals). Includes a flatten helper and NotImplemented placeholders for loglikelihood and generate_until_multi_round.
  • CI workflow (.github/workflows/claude.yml): Removes the workflow step named "Claude Code Action Official" that used anthropics/claude-code-action@beta, eliminating that CI action invocation.
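
For readers unfamiliar with how a simple model plugs into lmms_eval, the wiring above boils down to a registry entry plus a decorated wrapper class. Below is a heavily abridged sketch based on the modules referenced later in this review (lmms_eval.api.model.lmms, lmms_eval.api.registry.register_model, lmms_eval.api.instance.Instance); it is not the actual PR code.

# lmms_eval/models/__init__.py: map the CLI name to the wrapper class
AVAILABLE_SIMPLE_MODELS = {
    # ... existing entries ...
    "llava_onevision1_5": "Llava_OneVision1_5",
}

# lmms_eval/models/simple/llava_onevision1_5.py (abridged)
from typing import List

from lmms_eval.api.instance import Instance
from lmms_eval.api.model import lmms
from lmms_eval.api.registry import register_model


@register_model("llava_onevision1_5")
class Llava_OneVision1_5(lmms):
    def generate_until(self, requests: List[Instance]) -> List[str]:
        # Group requests into batches, build chat messages with visuals,
        # call model.generate, decode, and trim at the stop strings.
        ...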

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Shell as examples/.../llava_onevision1_5.sh
  participant Accelerate
  participant CLI as lmms_eval CLI
  participant Registry as Model Registry
  participant Model as Llava_OneVision1_5
  participant HF as HuggingFace Hub
  participant Proc as AutoProcessor/Tokenizer

  User->>Shell: run script
  Shell->>Accelerate: launch (8 procs, port=12399)
  Accelerate->>CLI: invoke lmms_eval with model & tasks
  CLI->>Registry: resolve "llava_onevision1_5"
  Registry-->>CLI: Llava_OneVision1_5
  CLI->>Model: construct with model_args
  Model->>HF: from_pretrained (weights/config & processor/tokenizer)
  HF-->>Model: model + processor/tokenizer assets
  CLI->>Model: generate_until(requests)
  Model->>Proc: preprocess messages + visuals
  Model->>Model: model.generate(...)
  Model-->>CLI: decoded outputs
  CLI-->>User: evaluation results
sequenceDiagram
  autonumber
  participant Evaluator
  participant Model as Llava_OneVision1_5
  participant Collator
  participant Vision
  participant Proc as Processor
  participant HFModel as CausalLM

  Evaluator->>Model: generate_until(requests)
  Model->>Collator: group/sort by token length
  loop per batch
    Model->>Vision: resolve images/videos & sample frames
    Model->>Proc: build inputs (messages + visuals)
    Model->>HFModel: generate (max_new_tokens, sampling/greedy)
    HFModel-->>Model: token ids
    Model->>Proc: decode + trim until-stop
    Model-->>Evaluator: batch outputs
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • Luodian

Poem

I nibble code and hop with glee,
Eight processes hum — frames dance free.
Prompts and pixels stitched in tune,
Tokens sprout beneath the moon.
A rabbit cheers — inference blooms. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description Check ⚠️ Warning: The PR description is identical to the repository template and was not filled out with a summary of the actual changes, motivation, test instructions, or completed checklist items, so it does not provide the required detailed description. Because the template sections remain blank and no PR-specific information is given, the description fails the repository's description requirements. Resolution: replace the template text with a concise summary of what changed (files modified and the high-level intent), why the change is needed, testing performed or reproduction steps, and any breaking/migration notes, then mark the checklist items as completed and run pre-commit before updating the PR description. Including brief usage or verification commands and noting the removed workflow step will help reviewers validate the change quickly.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title Check ✅ Passed: The title "feat: add llava_onevision1_5" is concise and directly describes the primary change in the changeset: adding support for the llava_onevision1_5 model (new model wrapper, registry entry, and example script). It follows conventional commit style, avoids noise, and is clear enough for teammates scanning history, so it meets the repository's title guidance.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7058fb and cef402c.

📒 Files selected for processing (1)
  • examples/models/llava_onevision1_5.sh (1 hunks)

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

class Llava_OneVision1_5(lmms):
    """
    Llava_OneVision1_5 Model
    "https://huggingface.co/Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct"

This link is broken; the repo id appears to be wrong.


    def __init__(
        self,
        pretrained: str = "Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct",

Same as above: this default pretrained id has the same problem.

from transformers import (
    AutoProcessor,
    AutoTokenizer,
    Qwen2_5_VLForConditionalGeneration,

This class is unused; drop it from the import.


accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args=pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,,attn_implementation=flash_attention_2,interleave_visuals=False,max_pixels=3240000 \

interleave_visuals already defaults to False; there is no need to set it again on this line.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (8)
lmms_eval/models/simple/llava_onevision1_5.py (7)

238-243: Avoid decoding the first video frame just to read size.

This allocates unnecessarily and isn’t used. Remove to reduce I/O and memory pressure.

-                        vr = decord.VideoReader(visual)
-                        first_frame = vr[0].asnumpy()
-                        height, width = first_frame.shape[:2]
-                        # max_pixels = height * width
                         processed_visuals.append({"type": "video", "video": visual, "max_pixels": self.max_pixels, "min_pixels": self.min_pixels})

210-214: Raise TypeError for invalid 'until' type and trim message (Ruff TRY004/TRY003).

-            elif not isinstance(until, list):
-                raise ValueError(f"Expected `gen_kwargs['until']` to be of type Union[str, list], but got {type(until)}")
+            elif not isinstance(until, list):
+                raise TypeError("gen_kwargs['until'] must be str or list[str]")

251-257: Prefer iterable unpacking over concatenation (RUF005).

-                            "content": processed_visuals + [{"type": "text", "text": context}],
+                            "content": [*processed_visuals, {"type": "text", "text": context}],

221-229: Remove duplicate placeholder stripping.

You strip "<image>" twice. Keep the per-item cleanup, drop the pre-loop.

-            for i in range(len(contexts)):
-                if "<image>" in contexts[i]:
-                    contexts[i] = contexts[i].replace("<image>", "")
-
             batched_messages = []
             for i, context in enumerate(contexts):
                 if "<image>" in context:
                     context = context.replace("<image>", "")

231-234: Safer reasoning prompt join.

Preserve whitespace with an explicit newline.

-                if self.reasoning_prompt:
-                    context = context.strip() + self.reasoning_prompt
-                    contexts[i] = context
+                if self.reasoning_prompt:
+                    context = f"{context.strip()}\n{self.reasoning_prompt}"
+                    contexts[i] = context

4-4: Type hints for public APIs (Pyright).

Add Any import; type the flatten param/return and requests type for multi-round.

-from typing import List, Optional, Tuple, Union
+from typing import Any, List, Optional, Tuple, Union
@@
-    def flatten(self, input):
+    def flatten(self, input: List[List[Any]]) -> List[Any]:
@@
-    def generate_until_multi_round(self, requests) -> List[str]:
+    def generate_until_multi_round(self, requests: List[Instance]) -> List[str]:

Also consider short docstrings for public methods per guidelines.

Also applies to: 174-180, 346-347


61-65: Exception text length (TRY003).

Messages are long; consider shorter text or custom exception types if you want verbose context from their __str__.
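
If the custom-exception route is taken, a minimal sketch could look like the following (the class name is illustrative, not from the PR); the long message lives inside the exception, which is what TRY003 asks for. The call site then shrinks to raise UnsupportedAttnImplementationError(attn_implementation, valid_attn_implementations).

from typing import Optional, Tuple


class UnsupportedAttnImplementationError(ValueError):
    """Raised when an unsupported attn_implementation is requested."""

    def __init__(self, requested: Optional[str], valid: Tuple[Optional[str], ...]) -> None:
        # Keep the verbose message here rather than at every raise site.
        super().__init__(f"attn_implementation must be one of {valid}, got {requested}")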

examples/models/llava_onevision1_5.sh (1)

1-1: Add shebang and fix HF_HOME (SC2148, SC2088).

+#!/usr/bin/env bash
+set -euo pipefail
-export HF_HOME="~/.cache/huggingface"
+export HF_HOME="$HOME/.cache/huggingface"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e8b44ff and 80097a2.

📒 Files selected for processing (3)
  • examples/models/llava_onevision1_5.sh (1 hunks)
  • lmms_eval/models/__init__.py (1 hunks)
  • lmms_eval/models/simple/llava_onevision1_5.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/models/__init__.py
  • lmms_eval/models/simple/llava_onevision1_5.py
🧬 Code graph analysis (1)
lmms_eval/models/simple/llava_onevision1_5.py (4)
lmms_eval/api/instance.py (2)
  • Instance (6-29)
  • args (25-29)
lmms_eval/api/model.py (1)
  • lmms (26-331)
lmms_eval/api/registry.py (1)
  • register_model (11-24)
lmms_eval/utils.py (3)
  • Collator (865-1020)
  • chunks (136-171)
  • get_batched (891-913)
🪛 Shellcheck (0.10.0)
examples/models/llava_onevision1_5.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 1-1: Tilde does not expand in quotes. Use $HOME.

(SC2088)

🪛 Ruff (0.12.2)
lmms_eval/models/simple/llava_onevision1_5.py

64-64: Avoid specifying long messages outside the exception class

(TRY003)


72-72: Avoid specifying long messages outside the exception class

(TRY003)


213-213: Prefer TypeError exception for invalid type

(TRY004)


213-213: Avoid specifying long messages outside the exception class

(TRY003)


255-255: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🔇 Additional comments (2)
lmms_eval/models/simple/llava_onevision1_5.py (1)

93-101: Accelerate/device-map usage looks sound.

Model init, device selection, and distributed prepare flows are consistent. Good use of unwrap in property.

If you plan to support CPU fallback, add a guard for device="cpu" when device_map="auto".

Also applies to: 102-106, 112-121
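
For that CPU guard, one possible shape is sketched below; the helper name resolve_device_map is hypothetical, and calling it from __init__ before from_pretrained is an assumption about structure rather than how the PR is written.

from typing import Tuple

import torch


def resolve_device_map(device_map: str, device: str) -> Tuple[str, str]:
    """Fall back to CPU when device_map='auto' but no CUDA device is available."""
    if device_map == "auto" and not torch.cuda.is_available():
        return "cpu", "cpu"
    return device_map, device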

lmms_eval/models/__init__.py (1)

39-40: Approve — registry mapping verified

Found @register_model("llava_onevision1_5") and class Llava_OneVision1_5 in lmms_eval/models/simple/llava_onevision1_5.py, and lmms_eval/models/__init__.py contains "llava_onevision1_5": "Llava_OneVision1_5". No changes required.

Comment on lines 6 to 10
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args=pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,,attn_implementation=flash_attention_2,interleave_visuals=False,max_pixels=3240000 \
    --tasks=mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench \
    --batch_size=1

⚠️ Potential issue

Fix double comma and quote model_args to prevent shell splitting.

There’s an extra comma after pretrained that will break parsing.

-    --model_args=pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,,attn_implementation=flash_attention_2,interleave_visuals=False,max_pixels=3240000 \
+    --model_args="pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,attn_implementation=flash_attention_2,interleave_visuals=False,max_pixels=3240000" \

Optional: make tasks a var and pass via --tasks "$TASKS" for readability.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args=pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,,attn_implementation=flash_attention_2,interleave_visuals=False,max_pixels=3240000 \
    --tasks=mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench \
    --batch_size=1
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args="pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,attn_implementation=flash_attention_2,interleave_visuals=False,max_pixels=3240000" \
    --tasks=mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench \
    --batch_size=1
🤖 Prompt for AI Agents
In examples/models/llava_onevision1_5.sh around lines 6 to 10, the --model_args
string contains a double comma after pretrained which will break parsing and the
whole argument should be quoted to avoid shell splitting; remove the extra comma
so the list is valid (e.g., change
"pretrained=Deep-VLM/...,,attn_implementation" to a single comma) and wrap the
entire --model_args value in quotes, and optionally define
TASKS="mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench"
and pass it as --tasks "$TASKS" for readability and safe expansion.

Comment on lines 16 to 18
    Qwen2_5_VLForConditionalGeneration,
    AutoModelForCausalLM
)

⚠️ Potential issue

Fix Qwen-specific leftovers and vision pipeline mismatch (will break LLaVA).

  • Qwen class import is unused, and qwen_vl_utils is optional but unguarded at callsite.
  • Error message/comments reference Qwen2.5 VL.
  • More importantly, process_vision_info and the base64 data-URL image schema are Qwen-specific. LLaVA-One-Vision expects raw PIL images (and its own chat/processor semantics). As-is, this will likely fail at runtime or silently degrade.

Action:

  • Remove Qwen import, guard qwen_vl_utils usage, and provide a processor-only fallback collecting PIL images. Update messages/comments to LLaVA.

Apply focused fixes:

@@
-from transformers import (
-    AutoProcessor,
-    AutoTokenizer,
-    Qwen2_5_VLForConditionalGeneration,
-    AutoModelForCausalLM
-)
+from transformers import (
+    AutoProcessor,
+    AutoTokenizer,
+    AutoModelForCausalLM,
+)
@@
-try:
-    from qwen_vl_utils import process_vision_info
-except ImportError:
-    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")
+try:
+    from qwen_vl_utils import process_vision_info  # type: ignore
+except ImportError:
+    process_vision_info = None
+    eval_logger.warning(
+        "qwen_vl_utils not installed; falling back to processor-only vision handling"
+    )
@@
-    def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
-        raise NotImplementedError("Loglikelihood is not implemented for Qwen2.5_VL")
+    def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
+        raise NotImplementedError(
+            "Loglikelihood is not implemented for LLaVA-One-Vision yet"
+        )
@@
-            # Avoid using '\n\n' as a stopper for Qwen2.5VL to prevent truncation, which can lead to incorrect results
+            # Avoid overly generic newline stoppers; they can truncate valuable content
+            # in some vision LMs. Prefer explicit task-specific stoppers.
             until = [item for item in until if item != "\n\n"]

Then replace the hard dependency on qwen_vl_utils during batching with a safe fallback (images only):

@@
-            texts = [self.processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batched_messages]
-            image_inputs, video_inputs = process_vision_info(batched_messages)
+            texts = [
+                self.processor.apply_chat_template(
+                    msg, tokenize=False, add_generation_prompt=True
+                )
+                for msg in batched_messages
+            ]
+            if process_vision_info is not None:
+                image_inputs, video_inputs = process_vision_info(batched_messages)
+            else:
+                # Fallback: collect PIL images from the assembled messages for LLaVA
+                image_inputs, video_inputs = [], None
+                for msg in batched_messages:
+                    user = next(p for p in msg if p["role"] == "user")["content"]
+                    imgs = [
+                        part["image"]
+                        for part in user
+                        if isinstance(part, dict)
+                        and part.get("type") == "image"
+                        and isinstance(part.get("image"), Image.Image)
+                    ]
+                    image_inputs.append(imgs or None)

Note: If video is required without qwen_vl_utils, wire a LLaVA-native video path (or raise for now). I can provide that if you want it in this PR.

Also applies to: 25-29, 171-173, 215-217
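
On the note above about a LLaVA-native video path: one possible fallback is to sample frames with decord (already imported in this module) and hand PIL frames to the processor. The helper name below and the assumption that the processor accepts lists of PIL frames are illustrative, not verified against the LLaVA processor API.

from typing import List

import decord
import numpy as np
from PIL import Image


def sample_video_frames(path: str, num_frames: int) -> List[Image.Image]:
    """Uniformly sample up to num_frames RGB frames from a video file."""
    vr = decord.VideoReader(path)
    indices = np.linspace(0, len(vr) - 1, num_frames, dtype=int)
    return [Image.fromarray(vr[int(i)].asnumpy()) for i in indices]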

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (6)
examples/models/llava_onevision1_5.sh (1)

6-10: Fix double comma and quote model_args; optionally factor TASKS.

Unquoted --model_args will split on spaces, and there’s a stray “,,” after pretrained.

 accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
     --model=llava_onevision1_5 \
-    --model_args=pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,,attn_implementation=flash_attention_2,max_pixels=3240000 \
-    --tasks=mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench \
+    --model_args="pretrained=Deep-VLM/LLaVA-One-Vision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000" \
+    --tasks="${TASKS:-mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench}" \
     --batch_size=1
lmms_eval/models/simple/llava_onevision1_5.py (5)

170-172: Fix model name in error; it references Qwen.

Update to LLaVA to avoid confusion.

-        raise NotImplementedError("Loglikelihood is not implemented for Qwen2.5_VL")
+        raise NotImplementedError(
+            "Loglikelihood is not implemented for LLaVA-One-Vision yet"
+        )

24-28: Import fallback bug: process_vision_info may be undefined (NameError).

If qwen_vl_utils isn’t installed, later use at Line 282 will crash.

 try:
     from qwen_vl_utils import process_vision_info
 except ImportError:
-    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")
+    process_vision_info = None  # type: ignore[assignment]
+    eval_logger.warning(
+        "qwen_vl_utils not installed; falling back to processor-only vision handling"
+    )

281-291: Guard Qwen helper and add LLaVA-compatible fallback (PIL images).

Unconditionally calling process_vision_info will fail when absent and mismatches LLaVA’s expected PIL images.

-            texts = [self.processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batched_messages]
-            image_inputs, video_inputs = process_vision_info(batched_messages)
+            texts = [
+                self.processor.apply_chat_template(
+                    msg, tokenize=False, add_generation_prompt=True
+                )
+                for msg in batched_messages
+            ]
+            if process_vision_info is not None:
+                image_inputs, video_inputs = process_vision_info(batched_messages)
+            else:
+                # Fallback: collect PIL images from message content for LLaVA
+                image_inputs, video_inputs = [], None
+                for msg in batched_messages:
+                    user = next(p for p in msg if p["role"] == "user")["content"]
+                    imgs = [
+                        part["image"]
+                        for part in user
+                        if isinstance(part, dict)
+                        and part.get("type") == "image"
+                        and isinstance(part.get("image"), Image.Image)
+                    ]
+                    image_inputs.append(imgs or None)

234-249: Use raw PIL images; drop base64 data-URL path.

Base64 is Qwen-style and wastes memory; LLaVA processors accept PIL directly.

                 for visual in visual_list[i]:
@@
-                    elif isinstance(visual, Image.Image):  # Handle both single and multiple images
-                        base64_image = visual.convert("RGB")
-                        buffer = BytesIO()
-                        base64_image.save(buffer, format="JPEG")
-                        base64_bytes = base64.b64encode(buffer.getvalue())
-                        base64_string = base64_bytes.decode("utf-8")
-                        processed_visuals.append({"type": "image", "image": f"data:image/jpeg;base64,{base64_string}", "max_pixels": self.max_pixels, "min_pixels": self.min_pixels})
+                    elif isinstance(visual, Image.Image):
+                        processed_visuals.append(
+                            {"type": "image", "image": visual.convert("RGB")}
+                        )

297-325: Never pass None into generate; add inference_mode and pad fallback.

Passing temperature/top_p=None can break transformers; also ensure pad_token_id fallback and use inference mode.

-            pad_token_id = self.tokenizer.pad_token_id
-
-            if current_gen_kwargs["temperature"] > 0:
-                current_gen_kwargs["do_sample"] = True
-            else:
-                current_gen_kwargs["do_sample"] = False
-                current_gen_kwargs["temperature"] = None
-                current_gen_kwargs["top_p"] = None
-
-            cont = self.model.generate(
-                **inputs,
-                eos_token_id=self.tokenizer.eos_token_id,
-                pad_token_id=pad_token_id,
-                do_sample=current_gen_kwargs["do_sample"],
-                temperature=current_gen_kwargs["temperature"],
-                top_p=current_gen_kwargs["top_p"],
-                num_beams=current_gen_kwargs["num_beams"],
-                max_new_tokens=current_gen_kwargs["max_new_tokens"],
-                use_cache=self.use_cache,
-            )
+            pad_token_id = self.tokenizer.pad_token_id or self.tokenizer.eos_token_id
+            do_sample = bool(
+                current_gen_kwargs.get("temperature", 0)
+                and current_gen_kwargs["temperature"] > 0
+            )
+            gen_args = {
+                **inputs,
+                "eos_token_id": self.tokenizer.eos_token_id,
+                "pad_token_id": pad_token_id,
+                "num_beams": int(current_gen_kwargs["num_beams"]),
+                "max_new_tokens": int(current_gen_kwargs["max_new_tokens"]),
+                "use_cache": self.use_cache,
+            }
+            if do_sample:
+                gen_args.update(
+                    do_sample=True,
+                    temperature=float(current_gen_kwargs.get("temperature", 1.0)),
+                    top_p=float(current_gen_kwargs.get("top_p", 1.0)),
+                )
+            with torch.inference_mode():
+                cont = self.model.generate(**gen_args)
🧹 Nitpick comments (6)
examples/models/llava_onevision1_5.sh (1)

1-1: Add shebang; fix HF_HOME per shellcheck.

Declare bash, enable strict mode, and avoid quoted tilde which doesn’t expand.

+#!/usr/bin/env bash
+set -euo pipefail
-export HF_HOME="~/.cache/huggingface"
+export HF_HOME="${HF_HOME:-$HOME/.cache/huggingface}"
lmms_eval/models/simple/llava_onevision1_5.py (5)

173-179: Type hints + small cleanup for flatten; avoid shadowing builtins.

Rename param and add typing; list comprehension is simpler.

-    def flatten(self, input):
-        new_list = []
-        for i in input:
-            for j in i:
-                new_list.append(j)
-        return new_list
+    def flatten(self, items: List[List[object]]) -> List[object]:
+        return [j for i in items for j in i]

214-216: Generalize the stopper comment (remove Qwen-specific wording).

Keep guidance model-agnostic.

-            # Avoid using '\n\n' as a stopper for Qwen2.5VL to prevent truncation, which can lead to incorrect results
+            # Avoid overly generic newline stoppers; they can truncate useful content

283-289: Only the first video stream is sampled.

If batch contains multiple videos, others aren’t frame-sampled. Iterate over all video_inputs.

-            if video_inputs is not None:
-                total_frames = video_inputs[0].shape[0]
-                indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
-                if total_frames - 1 not in indices:
-                    indices = np.append(indices, total_frames - 1)
-                video_inputs[0] = video_inputs[0][indices]
+            if video_inputs is not None:
+                for vi in range(len(video_inputs)):
+                    total_frames = video_inputs[vi].shape[0]
+                    indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
+                    if total_frames - 1 not in indices:
+                        indices = np.append(indices, total_frames - 1)
+                    video_inputs[vi] = video_inputs[vi][indices]

229-233: Add a separator before reasoning_prompt.

Avoid concatenating words when context doesn’t end with whitespace.

-                if self.reasoning_prompt:
-                    context = context.strip() + self.reasoning_prompt
+                if self.reasoning_prompt:
+                    context = context.rstrip() + "\n" + self.reasoning_prompt.lstrip()

1-7: Drop now-unused imports if you switch to PIL-only images.

base64 and BytesIO become dead after the suggested change.

-import base64
@@
-from io import BytesIO
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80097a2 and 9cb96f8.

📒 Files selected for processing (2)
  • examples/models/llava_onevision1_5.sh (1 hunks)
  • lmms_eval/models/simple/llava_onevision1_5.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/models/simple/llava_onevision1_5.py
🧬 Code graph analysis (1)
lmms_eval/models/simple/llava_onevision1_5.py (4)
lmms_eval/api/instance.py (2)
  • Instance (6-29)
  • args (25-29)
lmms_eval/api/model.py (2)
  • lmms (26-331)
  • add_partial (348-352)
lmms_eval/api/registry.py (1)
  • register_model (11-24)
lmms_eval/utils.py (3)
  • Collator (865-1020)
  • chunks (136-171)
  • get_batched (891-913)
🪛 Shellcheck (0.10.0)
examples/models/llava_onevision1_5.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 1-1: Tilde does not expand in quotes. Use $HOME.

(SC2088)

🪛 Ruff (0.12.2)
lmms_eval/models/simple/llava_onevision1_5.py

63-63: Avoid specifying long messages outside the exception class

(TRY003)


71-71: Avoid specifying long messages outside the exception class

(TRY003)


212-212: Prefer TypeError exception for invalid type

(TRY004)


212-212: Avoid specifying long messages outside the exception class

(TRY003)


254-254: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

Comment on lines +129 to +169
    @property
    def config(self):
        # return the associated transformers.AutoConfig for the given pretrained model.
        return self._config

    @property
    def tokenizer(self):
        return self._tokenizer

    @property
    def model(self):
        # returns the model, unwrapping it if using Accelerate
        if hasattr(self, "accelerator"):
            return self.accelerator.unwrap_model(self._model)
        else:
            return self._model

    @property
    def eot_token_id(self):
        return self.tokenizer.eos_token_id

    @property
    def max_length(self):
        return self._max_length

    @property
    def batch_size(self):
        return self.batch_size_per_gpu

    @property
    def device(self):
        return self._device

    @property
    def rank(self):
        return self._rank

    @property
    def world_size(self):
        return self._world_size


🛠️ Refactor suggestion

🧩 Analysis chain

Add return type annotations to public properties.

config/tokenizer/model/eot_token_id/max_length/batch_size/device/rank/world_size should declare return types to satisfy “Type hints required for all Python code.”

Run this to list untyped defs/props:


🏁 Script executed:

#!/usr/bin/env bash
rg -nP '^\s*(def|@property)\s+\w+\([^)]*\)(?!\s*->)' -n lmms_eval/models/simple/llava_onevision1_5.py

Length of output: 83


🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail

python3 - <<'PY'
import ast, sys, os

path = "lmms_eval/models/simple/llava_onevision1_5.py"
if not os.path.exists(path):
    print(f"ERROR: {path} not found", file=sys.stderr)
    sys.exit(2)

with open(path, "r", encoding="utf-8") as f:
    src = f.read()

tree = ast.parse(src, filename=path)

def ann_to_str(n):
    if n is None:
        return None
    try:
        return ast.unparse(n)
    except Exception:
        try:
            import astunparse
            return astunparse.unparse(n).strip()
        except Exception:
            return repr(n)

entries = []
for node in tree.body:
    if isinstance(node, ast.FunctionDef):
        entries.append((node.name, node.lineno, any(isinstance(d, ast.Name) and d.id == "property" for d in node.decorator_list), ann_to_str(node.returns)))
    if isinstance(node, ast.ClassDef):
        for n in node.body:
            if isinstance(n, ast.FunctionDef):
                has_prop = any(
                    (isinstance(d, ast.Name) and d.id == "property")
                    or (isinstance(d, ast.Attribute) and getattr(d, "attr", None) == "property")
                    for d in n.decorator_list
                )
                entries.append((f"{node.name}.{n.name}", n.lineno, has_prop, ann_to_str(n.returns)))

entries.sort(key=lambda x: x[1])
for name, lineno, is_prop, ann in entries:
    print(f"{name} (line {lineno}) {'@property' if is_prop else 'def'} -> return: {ann}")

missing = [e for e in entries if e[3] is None]
print()
print(f"Total functions/properties: {len(entries)}")
print(f"Without return annotations: {len(missing)}")
for name, lineno, is_prop, ann in missing:
    print(f"  - {name} (line {lineno}) {'@property' if is_prop else 'def'}")
PY

Length of output: 1522


Add return type annotations to public properties (required)

These @property methods and the flatten function in lmms_eval/models/simple/llava_onevision1_5.py lack return type annotations; add explicit return types to satisfy the repo rule "Type hints required for all Python code." A brief sketch follows the list below.

  • Llava_OneVision1_5.config — line 130 — e.g. transformers.PretrainedConfig / transformers.AutoConfig
  • Llava_OneVision1_5.tokenizer — line 135 — e.g. transformers.PreTrainedTokenizerBase
  • Llava_OneVision1_5.model — line 139 — e.g. torch.nn.Module | transformers.PreTrainedModel
  • Llava_OneVision1_5.eot_token_id — line 147 — int
  • Llava_OneVision1_5.max_length — line 151 — int
  • Llava_OneVision1_5.batch_size — line 155 — int
  • Llava_OneVision1_5.device — line 159 — torch.device | str
  • Llava_OneVision1_5.rank — line 163 — int
  • Llava_OneVision1_5.world_size — line 167 — int
  • Llava_OneVision1_5.flatten — line 173 — add explicit return annotation (pick the precise type from implementation)
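
In code, the annotated subset would look roughly like this; the types are the suggestions listed above and should be checked against the actual runtime objects rather than taken as verified.

from typing import Any

import torch


class _AnnotatedPropertiesSketch:
    """Illustration only: a few of the properties with return annotations."""

    _config: Any
    _device: torch.device
    _rank: int

    @property
    def config(self) -> Any:
        return self._config

    @property
    def device(self) -> torch.device:
        return self._device

    @property
    def rank(self) -> int:
        return self._rank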

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
examples/models/llava_onevision1_5.sh (2)

8-8: Quote --model_args to avoid accidental word splitting; content otherwise looks good.

Quoting is safer and mirrors earlier feedback; your previous double-comma issue is resolved.

-    --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
+    --model_args="pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000" \

8-8: Resolved: removed redundant interleave_visuals override noted earlier.

No duplicate interleave_visuals flag anymore. Looks good.

🧹 Nitpick comments (3)
examples/models/llava_onevision1_5.sh (3)

1-1: Add a shebang and strict mode for predictable execution.

Without a shebang, the target shell is unknown (ShellCheck SC2148). Add bash shebang and strict flags.

+#!/usr/bin/env bash
+set -euo pipefail
 export HF_HOME="~/.cache/huggingface"

1-1: Fix tilde expansion in HF_HOME.

Quoted ~ does not expand (ShellCheck SC2088). Use $HOME.

-export HF_HOME="~/.cache/huggingface"
+export HF_HOME="$HOME/.cache/huggingface"

9-10: Optional: factor tasks into a variable and quote for readability and safe expansion.

Keeps the command cleaner and prevents surprises if a task name ever includes spaces.

-    --tasks=mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench \
+TASKS="mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench"
+accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
+    --model=llava_onevision1_5 \
+    --model_args="pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000" \
+    --tasks="$TASKS" \
     --batch_size=1
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9cb96f8 and d82f95d.

📒 Files selected for processing (1)
  • examples/models/llava_onevision1_5.sh (1 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
examples/models/llava_onevision1_5.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 1-1: Tilde does not expand in quotes. Use $HOME.

(SC2088)

Comment on lines +6 to +10
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
    --tasks=mmerealworld,mmerealworld_cn,chartqa,docvqa_val,infovqa_val,mmstar,ocrbench \
    --batch_size=1

⚠️ Potential issue

🧩 Analysis chain

Sanity check: ensure flash_attention_2 is available on the target machines.

If FA2 isn’t installed or supported for the GPU/driver stack, runs will fail. Verify ahead of time.


🏁 Script executed:

#!/bin/bash
# Verify flash-attn import and CUDA compatibility.
python - <<'PY'
try:
    import flash_attn # type: ignore
    print("flash_attn import: OK")
except Exception as e:
    print("flash_attn import failed:", e)
PY

Length of output: 54


Critical: FlashAttention2 missing — install or guard attn_implementation usage (examples/models/llava_onevision1_5.sh: lines 6–10)

Local check failed: python import error "No module named 'flash_attn'". Runs with attn_implementation=flash_attention_2 will fail.

  • Fix A: Install FlashAttention2 on target machines and document required CUDA/PyTorch/driver versions.
  • Fix B: Detect availability at runtime in the script and fall back to a supported/default attention implementation or remove the flag.
🤖 Prompt for AI Agents
In examples/models/llava_onevision1_5.sh around lines 6 to 10, the script forces
attn_implementation=flash_attention_2 which will fail if FlashAttention2 is not
installed (python import error: No module named 'flash_attn'); either document
and instruct installing FlashAttention2 with required CUDA/PyTorch/driver
versions, or modify the script to detect FlashAttention2 availability at runtime
and if absent remove or replace the attn_implementation flag with a supported
default (e.g., None or a safe implementation) so the launch command won’t crash
on machines without flash_attn.
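
For Fix B, the availability probe does not need to import flash_attn itself; a small Python helper along these lines (the function name is illustrative) could pick the backend before model_args are assembled:

import importlib.util


def pick_attn_implementation(preferred: str = "flash_attention_2") -> str:
    """Fall back to PyTorch SDPA when the flash_attn package is not installed."""
    if preferred == "flash_attention_2" and importlib.util.find_spec("flash_attn") is None:
        return "sdpa"
    return preferred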

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (3)
lmms_eval/models/simple/llava_onevision1_5.py (3)

215-217: Generalize Qwen-specific stopper comment.

Keep guidance but remove model-specific wording.

-            # Avoid using '\n\n' as a stopper for Qwen2.5VL to prevent truncation, which can lead to incorrect results
+            # Avoid overly generic newline stoppers; they can truncate useful content.

24-28: Remove Qwen dependency and add safe vision fallback (prevents NameError and LLaVA mismatch).

qwen_vl_utils is optional but later called unguarded; this will crash if not installed and is Qwen‑specific. Guard the import and provide a processor‑only PIL image fallback aligned with LLaVA.

Apply:

@@
 try:
     from qwen_vl_utils import process_vision_info
 except ImportError:
-    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")
+    process_vision_info = None  # type: ignore[assignment]
+    eval_logger.warning(
+        "qwen_vl_utils not installed; falling back to processor-only vision handling"
+    )
@@
-            texts = [self.processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batched_messages]
-            image_inputs, video_inputs = process_vision_info(batched_messages)
+            texts = [
+                self.processor.apply_chat_template(
+                    msg, tokenize=False, add_generation_prompt=True
+                )
+                for msg in batched_messages
+            ]
+            if process_vision_info is not None:
+                image_inputs, video_inputs = process_vision_info(batched_messages)
+            else:
+                # LLaVA: collect PIL images directly; no video support without qwen_vl_utils
+                image_inputs, video_inputs = [], None
+                for msg in batched_messages:
+                    user = next(p for p in msg if p["role"] == "user")["content"]
+                    imgs = [
+                        part["image"]
+                        for part in user
+                        if isinstance(part, dict)
+                        and part.get("type") == "image"
+                        and isinstance(part.get("image"), Image.Image)
+                    ]
+                    image_inputs.append(imgs or None)

Also applies to: 279-289


296-323: Never pass None to generate; add pad_token fallback and inference_mode.

Avoid None for temperature/top_p, ensure pad_token fallback, and wrap in inference mode.

@@
-            pad_token_id = self.tokenizer.pad_token_id
-
-            if current_gen_kwargs["temperature"] > 0:
-                current_gen_kwargs["do_sample"] = True
-            else:
-                current_gen_kwargs["do_sample"] = False
-                current_gen_kwargs["temperature"] = None
-                current_gen_kwargs["top_p"] = None
-
-            cont = self.model.generate(
-                **inputs,
-                eos_token_id=self.tokenizer.eos_token_id,
-                pad_token_id=pad_token_id,
-                do_sample=current_gen_kwargs["do_sample"],
-                temperature=current_gen_kwargs["temperature"],
-                top_p=current_gen_kwargs["top_p"],
-                num_beams=current_gen_kwargs["num_beams"],
-                max_new_tokens=current_gen_kwargs["max_new_tokens"],
-                use_cache=self.use_cache,
-            )
+            pad_token_id = self.tokenizer.pad_token_id or self.tokenizer.eos_token_id
+            do_sample = bool(current_gen_kwargs.get("temperature", 0) and current_gen_kwargs["temperature"] > 0)
+            gen_args = {
+                **inputs,
+                "eos_token_id": self.tokenizer.eos_token_id,
+                "pad_token_id": pad_token_id,
+                "num_beams": current_gen_kwargs["num_beams"],
+                "max_new_tokens": current_gen_kwargs["max_new_tokens"],
+                "use_cache": self.use_cache,
+            }
+            if do_sample:
+                gen_args.update(
+                    do_sample=True,
+                    temperature=float(current_gen_kwargs.get("temperature", 1.0)),
+                    top_p=float(current_gen_kwargs.get("top_p", 1.0)),
+                )
+            with torch.inference_mode():
+                cont = self.model.generate(**gen_args)
🧹 Nitpick comments (6)
lmms_eval/models/simple/llava_onevision1_5.py (6)

1-4: Remove unused imports.

base64 and BytesIO are unused.

-import base64
 import re
-from io import BytesIO
-from typing import List, Optional, Tuple, Union
+from typing import Any, Iterable, List, Optional, Tuple, Union, TypeVar

171-173: Fix stale Qwen reference in error.

Message should reference LLaVA.

-        raise NotImplementedError("Loglikelihood is not implemented for Qwen2.5_VL")
+        raise NotImplementedError(
+            "Loglikelihood is not implemented for LLaVA-One-Vision yet"
+        )

221-224: Remove duplicate '<image>' placeholder stripping.

The context cleanup happens twice.

-            for i in range(len(contexts)):
-                if "<image>" in contexts[i]:
-                    contexts[i] = contexts[i].replace("<image>", "")
-
             batched_messages = []
             for i, context in enumerate(contexts):
-                if "<image>" in context:
-                    context = context.replace("<image>", "")
-
                 message = [{"role": "system", "content": self.system_prompt}]

Also applies to: 226-230


237-246: Avoid unnecessary video I/O on hot path.

Reading the first frame only to compute dims is wasted; you don’t use it.

-                    if isinstance(visual, str) and visual.endswith((".mp4", ".avi", ".mov")):  # Video file
-                        vr = decord.VideoReader(visual)
-                        first_frame = vr[0].asnumpy()
-                        height, width = first_frame.shape[:2]
-                        # max_pixels = height * width
-                        processed_visuals.append({"type": "video", "video": visual, "max_pixels": self.max_pixels, "min_pixels": self.min_pixels})
+                    if isinstance(visual, str) and visual.endswith((".mp4", ".avi", ".mov")):  # Video file
+                        processed_visuals.append({"type": "video", "video": visual})

252-253: Ruff RUF005: prefer iterable unpacking over list concat.

-                            "content": processed_visuals + [{"type": "text", "text": context}],
+                            "content": [*processed_visuals, {"type": "text", "text": context}],

30-35: Optional: rename class to PascalCase.

PEP 8: class names shouldn’t use underscores. Suggested: LlavaOneVision15. The registry alias can stay "llava_onevision1_5".

-@register_model("llava_onevision1_5")
-class Llava_OneVision1_5(lmms):
+@register_model("llava_onevision1_5")
+class LlavaOneVision15(lmms):
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d82f95d and 46f0ff0.

📒 Files selected for processing (1)
  • lmms_eval/models/simple/llava_onevision1_5.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/models/simple/llava_onevision1_5.py
🧬 Code graph analysis (1)
lmms_eval/models/simple/llava_onevision1_5.py (4)
lmms_eval/api/instance.py (2)
  • Instance (6-29)
  • args (25-29)
lmms_eval/api/model.py (2)
  • lmms (26-331)
  • add_partial (348-352)
lmms_eval/api/registry.py (1)
  • register_model (11-24)
lmms_eval/utils.py (3)
  • Collator (865-1020)
  • chunks (136-171)
  • get_batched (891-913)
🪛 Ruff (0.12.2)
lmms_eval/models/simple/llava_onevision1_5.py

64-64: Avoid specifying long messages outside the exception class

(TRY003)


72-72: Avoid specifying long messages outside the exception class

(TRY003)


213-213: Prefer TypeError exception for invalid type

(TRY004)


213-213: Avoid specifying long messages outside the exception class

(TRY003)


252-252: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

except ImportError:
    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")



🛠️ Refactor suggestion

Add required type hints; avoid shadowing builtins in flatten.

Properties lack return annotations; flatten lacks type hints and shadows the builtin name input.

@@
+T = TypeVar("T")
@@
-    def config(self):
+    def config(self) -> Any:
@@
-    def tokenizer(self):
+    def tokenizer(self) -> Any:
@@
-    def model(self):
+    def model(self) -> torch.nn.Module:
@@
-    def eot_token_id(self):
+    def eot_token_id(self) -> int:
@@
-    def max_length(self):
+    def max_length(self) -> int:
@@
-    def batch_size(self):
+    def batch_size(self) -> int:
@@
-    def device(self):
+    def device(self) -> torch.device:
@@
-    def rank(self):
+    def rank(self) -> int:
@@
-    def world_size(self):
+    def world_size(self) -> int:
@@
-    def flatten(self, input):
-        new_list = []
-        for i in input:
-            for j in i:
-                new_list.append(j)
-        return new_list
+    def flatten(self, items: Iterable[Iterable[T]]) -> List[T]:
+        return [j for i in items for j in i]

Also applies to: 130-169, 174-179

🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around line 29 (and also for
the functions spanning lines 130-169 and 174-179), add explicit type hints for
property return types and function signatures and stop shadowing builtins:
change the flatten function parameter name from input to inputs (or similar),
add parameter and return types (e.g., inputs: Iterable[Any] -> List[Any] or
Sequence[Any] -> List[Any]), and annotate all properties with their concrete
return types (for example -> torch.Tensor or -> Sequence[torch.Tensor] as
appropriate) and add typing imports (from typing import Any, Iterable, Sequence,
List, Optional) at the top. Ensure signature types for the functions at 130-169
and 174-179 are updated consistently (parameters typed and return types
declared) to satisfy static type checkers.

                new_list.append(j)
        return new_list

    def generate_until(self, requests: List[Instance]) -> List[str]:

🛠️ Refactor suggestion

Add docstrings for public APIs.

Repo rule: public APIs must have docstrings.

 def generate_until(self, requests: List[Instance]) -> List[str]:
+        """
+        Generate responses until stop conditions for a batch of requests.
+        Expects each Instance to provide (context, gen_kwargs, doc_to_visual, doc_id, task, split).
+        """
@@
 def generate_until_multi_round(self, requests) -> List[str]:
-        raise NotImplementedError("TODO: Implement multi-round generation")
+        """
+        Multi-round chat generation for LLaVA-One-Vision.
+        """
+        raise NotImplementedError("TODO: Implement multi-round generation")

Also applies to: 343-345

🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around line 181 (method
generate_until) and lines 343-345 (other public APIs), add proper docstrings for
these public methods: include a one-line summary, parameter descriptions (types
and meanings for requests and any other args), the return value (type and
meaning), and any raised exceptions or side effects; follow the repo's docstring
style (e.g., Google or NumPy) used elsewhere in the project and keep it concise
and informative.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
lmms_eval/models/simple/llava_onevision1_5.py (2)

171-173: Fix model name in NotImplemented message.

-        raise NotImplementedError("Loglikelihood is not implemented for Qwen2.5_VL")
+        raise NotImplementedError(
+            "Loglikelihood is not implemented for LLaVA-One-Vision yet"
+        )

207-217: Generalize stopper comment; remove Qwen-specific wording.

-            # Avoid using '\n\n' as a stopper for Qwen2.5VL to prevent truncation, which can lead to incorrect results
+            # Avoid overly generic newline stoppers; they can truncate valuable content
+            # in some vision LMs. Prefer explicit, task-specific stoppers.
             until = [item for item in until if item != "\n\n"]
🧹 Nitpick comments (14)
lmms_eval/models/simple/llava_onevision1_5.py (14)

1-5: Remove unused imports (ruff F401).

base64 and BytesIO are unused.

-import base64
 import re
-from io import BytesIO
-from typing import List, Optional, Tuple, Union
+from typing import List, Optional, Tuple, Union

61-65: Promote valid attn implementations to a constant; wrap long message.

Improves readability and respects the 88-character line limit.

+VALID_ATTN_IMPLEMENTATIONS = (None, "flash_attention_2", "sdpa", "eager")
@@
-        valid_attn_implementations = [None, "flash_attention_2", "sdpa", "eager"]
-        if attn_implementation not in valid_attn_implementations:
-            raise ValueError(f"attn_implementation must be one of {valid_attn_implementations}, got {attn_implementation}")
+        if attn_implementation not in VALID_ATTN_IMPLEMENTATIONS:
+            raise ValueError(
+                f"attn_implementation must be one of "
+                f"{VALID_ATTN_IMPLEMENTATIONS}, got {attn_implementation}"
+            )

74-81: Add CUDA availability fallback.

Avoids device errors on CPU-only hosts.

-        if accelerator.num_processes > 1:
+        if accelerator.num_processes > 1:
             self._device = torch.device(f"cuda:{accelerator.local_process_index}")
             self.device_map = f"cuda:{accelerator.local_process_index}"
         else:
-            self._device = torch.device(device)
+            if device == "cuda" and not torch.cuda.is_available():
+                eval_logger.warning("CUDA not available; falling back to CPU.")
+                device = "cpu"
+            self._device = torch.device(device)
             self.device_map = device_map if device_map else device

98-106: Avoid Optional for booleans; keep types tight.

interleave_visuals and use_custom_video_loader need not be Optional.

-        use_custom_video_loader: Optional[bool] = False,
+        use_custom_video_loader: bool = False,
@@
-        interleave_visuals: Optional[bool] = False,
+        interleave_visuals: bool = False,

130-170: Add required return type annotations for public properties.

Repo rule: type hints required. Add concrete return types.

-from transformers import (
+from transformers import (
     AutoProcessor,
     AutoTokenizer,
-    AutoModelForCausalLM
+    AutoModelForCausalLM,
+    PreTrainedModel,
+    PretrainedConfig,
+    PreTrainedTokenizerBase,
 )
     @property
-    def config(self):
+    def config(self) -> PretrainedConfig:
@@
     @property
-    def tokenizer(self):
+    def tokenizer(self) -> PreTrainedTokenizerBase:
@@
     @property
-    def model(self):
+    def model(self) -> PreTrainedModel:
@@
     @property
-    def eot_token_id(self):
+    def eot_token_id(self) -> int:
@@
     @property
-    def max_length(self):
+    def max_length(self) -> int:
@@
     @property
-    def batch_size(self):
+    def batch_size(self) -> int:
@@
     @property
-    def device(self):
+    def device(self) -> torch.device:
@@
     @property
-    def rank(self):
+    def rank(self) -> int:
@@
     @property
-    def world_size(self):
+    def world_size(self) -> int:

174-180: Avoid shadowing builtins; type the helper; simplify.

-    def flatten(self, input):
-        new_list = []
-        for i in input:
-            for j in i:
-                new_list.append(j)
-        return new_list
+    from typing import Iterable, TypeVar, List
+    T = TypeVar("T")
+
+    def flatten(self, items: Iterable[Iterable[T]]) -> List[T]:
+        return [j for i in items for j in i]

181-183: Add docstring for public API.

-    def generate_until(self, requests: List[Instance]) -> List[str]:
+    def generate_until(self, requests: List[Instance]) -> List[str]:
+        """
+        Generate responses for a batch of Instances until stop tokens.
+        Args:
+            requests: List of Instances with (context, gen_kwargs, doc_to_visual,
+                doc_id, task, split).
+        Returns:
+            List[str]: Generated texts in the original request order.
+        """
         res = []

218-224: Remove redundant placeholder stripping loop.

Context replacement is already handled below.

-            if isinstance(contexts, tuple):
-                contexts = list(contexts)
-
-            for i in range(len(contexts)):
-                if "<image>" in contexts[i]:
-                    contexts[i] = contexts[i].replace("<image>", "")
+            if isinstance(contexts, tuple):
+                contexts = list(contexts)

248-255: Prefer iterable unpacking over concatenation (ruff RUF005).

-                    message.append(
-                        {
-                            "role": "user",
-                            "content": processed_visuals + [{"type": "text", "text": context}],
-                        }
-                    )
+                    message.append(
+                        {
+                            "role": "user",
+                            "content": [*processed_visuals, {"type": "text", "text": context}],
+                        }
+                    )

323-330: Prefer tokenizer for decoding.

Processor may proxy, but tokenizer is clearer.

-            answers = self.processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+            answers = self.tokenizer.batch_decode(
+                generated_ids_trimmed,
+                skip_special_tokens=True,
+                clean_up_tokenization_spaces=False,
+            )

341-343: Add docstring and type hints for generate_until_multi_round.

-    def generate_until_multi_round(self, requests) -> List[str]:
-        raise NotImplementedError("TODO: Implement multi-round generation")
+    def generate_until_multi_round(self, requests: List[Instance]) -> List[str]:
+        """
+        Multi-round chat generation for LLaVA-One-Vision.
+        """
+        raise NotImplementedError("TODO: Implement multi-round generation")

31-35: Class name style vs. repo ergonomics.

Guideline says PascalCase; current name includes underscores. Consider adding a PascalCase alias to avoid churn.

 @register_model("llava_onevision1_5")
 class Llava_OneVision1_5(lmms):
@@
     """

And optionally:

class LlavaOneVision15(Llava_OneVision1_5):
    """PascalCase alias for style compliance."""
    pass

210-214: Use TypeError for invalid type (ruff TRY004).

-            elif not isinstance(until, list):
-                raise ValueError(f"Expected `gen_kwargs['until']` to be of type Union[str, list], but got {type(until)}")
+            elif not isinstance(until, list):
+                raise TypeError(
+                    f"Expected gen_kwargs['until'] to be Union[str, list], "
+                    f"got {type(until)}"
+                )

288-294: Device move: use self.device consistently

self._device is set earlier (torch.device(...)); moving inputs to the literal "cuda" when device_map == "auto" can mismatch model placement. Use inputs = inputs.to(self.device) (or inputs.to(next(self.model.parameters()).device) for absolute alignment).

File: lmms_eval/models/simple/llava_onevision1_5.py lines 288-294
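For readers skimming the thread, a standalone sketch of the suggested device handling (the helper name and dict-based signature are illustrative only, not code from the PR):

```python
import torch


def move_to_model_device(inputs: dict, model: torch.nn.Module) -> dict:
    """Move processor outputs to wherever the model's weights actually live.

    Hypothetical helper mirroring the suggestion above: it asks the model for
    its device instead of assuming a literal "cuda", so CPU-only runs and
    device_map="auto" sharding both work.
    """
    device = next(model.parameters()).device
    return {k: v.to(device) if torch.is_tensor(v) else v for k, v in inputs.items()}
```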

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 46f0ff0 and 6d189cc.

📒 Files selected for processing (1)
  • lmms_eval/models/simple/llava_onevision1_5.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/models/simple/llava_onevision1_5.py
🧬 Code graph analysis (1)
lmms_eval/models/simple/llava_onevision1_5.py (3)
lmms_eval/api/instance.py (2)
  • Instance (6-29)
  • args (25-29)
lmms_eval/api/model.py (2)
  • lmms (26-331)
  • add_partial (348-352)
lmms_eval/utils.py (3)
  • Collator (865-1020)
  • chunks (136-171)
  • get_batched (891-913)
🪛 Ruff (0.12.2)
lmms_eval/models/simple/llava_onevision1_5.py

64-64: Avoid specifying long messages outside the exception class

(TRY003)


72-72: Avoid specifying long messages outside the exception class

(TRY003)


213-213: Prefer TypeError exception for invalid type

(TRY004)


213-213: Avoid specifying long messages outside the exception class

(TRY003)


252-252: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🔇 Additional comments (4)
lmms_eval/models/simple/llava_onevision1_5.py (4)

107-121: Accelerate wrapping looks correct.

Unwrap in property is handled; logging is helpful.
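For context, a minimal sketch of the wrap/unwrap arrangement being approved here (attribute names such as `_model` are assumptions based on the review, not verbatim from the file):

```python
import torch
from accelerate import Accelerator


class WrappedModel:
    """Illustrative wrapper showing the Accelerate pattern referenced above."""

    def __init__(self, model: torch.nn.Module) -> None:
        self.accelerator = Accelerator()
        # prepare_model wraps the module (e.g. in DDP) for multi-process eval.
        self._model = self.accelerator.prepare_model(model, evaluation_mode=True)

    @property
    def model(self) -> torch.nn.Module:
        # Callers always get the bare module, even when Accelerate wrapped it.
        return self.accelerator.unwrap_model(self._model)
```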


237-247: Drop unused video shape inspection; keep minimal video descriptor.

height/width are unused; opening the video just to read frame 0 is wasteful.

-                    if isinstance(visual, str) and visual.endswith((".mp4", ".avi", ".mov")):  # Video file
-                        vr = decord.VideoReader(visual)
-                        first_frame = vr[0].asnumpy()
-                        height, width = first_frame.shape[:2]
-                        # max_pixels = height * width
-                        processed_visuals.append({"type": "video", "video": visual, "max_pixels": self.max_pixels, "min_pixels": self.min_pixels})
+                    if isinstance(visual, str) and visual.endswith((".mp4", ".avi", ".mov")):
+                        processed_visuals.append(
+                            {
+                                "type": "video",
+                                "video": visual,
+                                "max_pixels": self.max_pixels,
+                                "min_pixels": self.min_pixels,
+                            }
+                        )

295-321: Never pass None into generate; inference_mode added.

Good fix: do_sample gating, pad_token fallback, inference_mode use.
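Condensed into a standalone helper, the pattern being praised looks roughly like this (the function and its defaults are hypothetical; only the Hugging Face `generate` kwargs are real API):

```python
import torch
from transformers import PreTrainedModel, PreTrainedTokenizerBase


def safe_generate(
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizerBase,
    inputs: dict,
    gen_kwargs: dict,
) -> torch.Tensor:
    """Illustrative generate() call: eos fallback for padding, sampling knobs
    only when do_sample is True, and no autograd state kept."""
    pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id
    generate_kwargs = {
        "max_new_tokens": gen_kwargs.get("max_new_tokens", 128),
        "pad_token_id": pad_token_id,
        "do_sample": gen_kwargs.get("do_sample", False),
    }
    if generate_kwargs["do_sample"]:
        # Temperature/top_p are meaningless (and produce warnings) without sampling.
        generate_kwargs["temperature"] = gen_kwargs.get("temperature", 1.0)
        generate_kwargs["top_p"] = gen_kwargs.get("top_p", 1.0)
    with torch.inference_mode():
        return model.generate(**inputs, **generate_kwargs)
```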


14-17: Sort/expand imports to include types used in type hints.

Import any types referenced by property/type hints (e.g. PreTrainedModel, PretrainedConfig, PreTrainedTokenizerBase, TypeVar, Iterable). The supplied search returned no matches in lmms_eval/models/simple/llava_onevision1_5.py — verify and add missing imports if you changed signatures.

Location: lmms_eval/models/simple/llava_onevision1_5.py (lines 14–17).
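A sketch of what the expanded import block could look like if the hinted signatures are adopted (exact names depend on which annotations actually land):

```python
# Hypothetical import block matching the type hints suggested in this review.
from typing import Any, Iterable, List, Optional, Tuple, TypeVar, Union

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    AutoTokenizer,
    PretrainedConfig,
    PreTrainedModel,
    PreTrainedTokenizerBase,
)

T = TypeVar("T")
```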

Comment on lines +24 to +28
try:
    from qwen_vl_utils import process_vision_info
except ImportError:
    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")


⚠️ Potential issue

Fix hard dependency on qwen_vl_utils; add safe fallback for images and guard videos.

As written, ImportError leaves process_vision_info undefined, causing a NameError at runtime. Provide a guarded import and a fallback path that collects PIL images; explicitly raise for videos when qwen_vl_utils is unavailable.

@@
-try:
-    from qwen_vl_utils import process_vision_info
-except ImportError:
-    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")
+try:
+    from qwen_vl_utils import process_vision_info  # type: ignore
+except ImportError:
+    process_vision_info = None  # fallback in generate_until
+    eval_logger.warning(
+        "qwen_vl_utils not installed; falling back to processor-only image handling."
+    )
@@
-            texts = [self.processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batched_messages]
-            image_inputs, video_inputs = process_vision_info(batched_messages)
+            texts = [
+                self.processor.apply_chat_template(
+                    msg, tokenize=False, add_generation_prompt=True
+                )
+                for msg in batched_messages
+            ]
+            if process_vision_info is not None:
+                image_inputs, video_inputs = process_vision_info(batched_messages)
+            else:
+                # Fallback: collect PIL images only; videos unsupported without qwen_vl_utils
+                image_inputs, video_inputs = [], None
+                for msg in batched_messages:
+                    user = next(p for p in msg if p["role"] == "user")["content"]
+                    imgs = [
+                        part["image"]
+                        for part in user
+                        if isinstance(part, dict)
+                        and part.get("type") == "image"
+                        and isinstance(part.get("image"), Image.Image)
+                    ]
+                    if any(isinstance(part, dict) and part.get("type") == "video" for part in user):
+                        raise NotImplementedError(
+                            "Video inputs require qwen_vl_utils or a native LLaVA video path."
+                        )
+                    image_inputs.append(imgs or None)

Also applies to: 279-289

Comment on lines +66 to +73
        self.use_custom_video_loader = use_custom_video_loader
        self.fps = fps
        # if self.fps and not self.use_custom_video_loader:
        #     raise ValueError("FPS is only applicable if use_custom_video_loader is True")
        self.max_image_size = max_image_size
        if self.max_image_size and not self.use_custom_video_loader:
            raise ValueError("max_image_size is only applicable if use_custom_video_loader is True")


⚠️ Potential issue

Enforce fps/max_image_size coupling with use_custom_video_loader.

The fps check is commented out; enforce both constraints.

         self.use_custom_video_loader = use_custom_video_loader
         self.fps = fps
-        # if self.fps and not self.use_custom_video_loader:
-        #     raise ValueError("FPS is only applicable if use_custom_video_loader is True")
+        if self.fps is not None and not self.use_custom_video_loader:
+            raise ValueError(
+                "fps is only applicable if use_custom_video_loader is True"
+            )
         self.max_image_size = max_image_size
         if self.max_image_size and not self.use_custom_video_loader:
             raise ValueError("max_image_size is only applicable if use_custom_video_loader is True")
🧰 Tools
🪛 Ruff (0.12.2)

72-72: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around lines 66 to 73, the fps
validation is currently commented out while max_image_size is enforced;
re-enable and enforce the same coupling for fps: if self.fps is set and not
self.use_custom_video_loader, raise a ValueError with a clear message like "fps
is only applicable if use_custom_video_loader is True"; keep the existing
max_image_size check as-is so both parameters require use_custom_video_loader to
be True.

Comment on lines +281 to +287
            if video_inputs is not None:
                total_frames = video_inputs[0].shape[0]
                indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
                # Append the last frame index if not already included
                if total_frames - 1 not in indices:
                    indices = np.append(indices, total_frames - 1)
                video_inputs[0] = video_inputs[0][indices]

⚠️ Potential issue

Process all videos, not just index 0.

Only the first video is subsampled; others are ignored.

-            if video_inputs is not None:
-                total_frames = video_inputs[0].shape[0]
-                indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
-                # Append the last frame index if not already included
-                if total_frames - 1 not in indices:
-                    indices = np.append(indices, total_frames - 1)
-                video_inputs[0] = video_inputs[0][indices]
+            if video_inputs is not None:
+                for vi in range(len(video_inputs)):
+                    total_frames = video_inputs[vi].shape[0]
+                    indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
+                    if total_frames - 1 not in indices:
+                        indices = np.append(indices, total_frames - 1)
+                    video_inputs[vi] = video_inputs[vi][indices]
🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around lines 281 to 287, the
current logic only subsamples video_inputs[0], ignoring other videos; update it
to iterate over all entries in video_inputs (e.g., for i in
range(len(video_inputs)) or for idx, vid in enumerate(video_inputs)) and for
each video compute total_frames = vid.shape[0], create indices = np.linspace(0,
total_frames - 1, self.max_num_frames, dtype=int), append total_frames - 1 if
not present, and then assign video_inputs[i] = vid[indices] so every video is
subsampled consistently. Ensure you handle None and empty videos the same way as
before.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (5)
lmms_eval/models/simple/llava_onevision1_5.py (5)

62-69: Enforce fps/max_image_size coupling to custom loader.

Validate both only when use_custom_video_loader=True; use explicit None checks.

         self.use_custom_video_loader = use_custom_video_loader
         self.fps = fps
-        # if self.fps and not self.use_custom_video_loader:
-        #     raise ValueError("FPS is only applicable if use_custom_video_loader is True")
+        if self.fps is not None and not self.use_custom_video_loader:
+            raise ValueError("fps is only applicable if use_custom_video_loader is True")
         self.max_image_size = max_image_size
-        if self.max_image_size and not self.use_custom_video_loader:
+        if self.max_image_size is not None and not self.use_custom_video_loader:
             raise ValueError("max_image_size is only applicable if use_custom_video_loader is True")

285-312: LGTM: safer generate() args and inference_mode.

Pad fallback, conditional sampling args, and inference context resolve prior pitfalls.


20-24: Guard qwen_vl_utils and add image-only fallback (prevents NameError and mismatched vision path).

process_vision_info is unbound if qwen_vl_utils isn’t installed, and LLaVA can operate on raw PIL images without qwen helpers. Add a guarded import and a safe fallback that collects PIL images; explicitly raise for videos without a native path. Also keep processors’ video subsampling consistent.

Apply:

@@
-try:
-    from qwen_vl_utils import process_vision_info
-except ImportError:
-    eval_logger.warning("Failed to import qwen_vl_utils; Please install it via `pip install qwen-vl-utils`")
+try:
+    from qwen_vl_utils import process_vision_info  # type: ignore
+except ImportError:
+    process_vision_info = None  # fallback in generate_until
+    eval_logger.warning(
+        "qwen_vl_utils not installed; falling back to processor-only image handling."
+    )
@@
-            texts = [self.processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batched_messages]
-            image_inputs, video_inputs = process_vision_info(batched_messages)
-            if video_inputs is not None:
-                total_frames = video_inputs[0].shape[0]
-                indices = np.linspace(0, total_frames - 1, self.max_num_frames, dtype=int)
-                # Append the last frame index if not already included
-                if total_frames - 1 not in indices:
-                    indices = np.append(indices, total_frames - 1)
-                video_inputs[0] = video_inputs[0][indices]
-            inputs = self.processor(text=texts, images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
+            texts = [
+                self.processor.apply_chat_template(
+                    msg, tokenize=False, add_generation_prompt=True
+                )
+                for msg in batched_messages
+            ]
+            if process_vision_info is not None:
+                image_inputs, video_inputs = process_vision_info(batched_messages)
+            else:
+                # Fallback: collect PIL images only; videos unsupported without qwen_vl_utils
+                image_inputs, video_inputs = [], None
+                for msg in batched_messages:
+                    user = next(p for p in msg if p["role"] == "user")["content"]
+                    imgs = [
+                        part["image"]
+                        for part in user
+                        if isinstance(part, dict)
+                        and part.get("type") == "image"
+                        and isinstance(part.get("image"), Image.Image)
+                    ]
+                    if any(
+                        isinstance(part, dict) and part.get("type") == "video"
+                        for part in user
+                    ):
+                        raise NotImplementedError(
+                            "Video inputs require qwen_vl_utils or a native LLaVA video path."
+                        )
+                    image_inputs.append(imgs or None)
+            if video_inputs is not None:
+                for vi in range(len(video_inputs)):
+                    total_frames = video_inputs[vi].shape[0]
+                    indices = np.linspace(
+                        0, total_frames - 1, self.max_num_frames, dtype=int
+                    )
+                    if total_frames - 1 not in indices:
+                        indices = np.append(indices, total_frames - 1)
+                    video_inputs[vi] = video_inputs[vi][indices]
+            inputs = self.processor(
+                text=texts,
+                images=image_inputs,
+                videos=video_inputs,
+                padding=True,
+                return_tensors="pt",
+            )

Also applies to: 269-279


122-162: Add required type hints for properties and fix flatten’s signature/name.

Repo rule: type hints required; avoid shadowing builtins with parameter name input.

     @property
-    def config(self):
+    def config(self) -> Any:
@@
     @property
-    def tokenizer(self):
+    def tokenizer(self) -> Any:
@@
     @property
-    def model(self):
+    def model(self) -> torch.nn.Module:
@@
     @property
-    def eot_token_id(self):
+    def eot_token_id(self) -> int:
@@
     @property
-    def max_length(self):
+    def max_length(self) -> int:
@@
     @property
-    def batch_size(self):
+    def batch_size(self) -> int:
@@
     @property
-    def device(self):
+    def device(self) -> torch.device:
@@
     @property
-    def rank(self):
+    def rank(self) -> int:
@@
     @property
-    def world_size(self):
+    def world_size(self) -> int:
         return self._world_size
@@
-    def flatten(self, input):
-        new_list = []
-        for i in input:
-            for j in i:
-                new_list.append(j)
-        return new_list
+    def flatten(self, items: Iterable[Iterable[T]]) -> List[T]:
+        return [j for i in items for j in i]

Also applies to: 166-171


173-175: Add docstrings for public APIs (repo requirement).

Brief summaries and param/return docs are required.

 def generate_until(self, requests: List[Instance]) -> List[str]:
-        res = []
+        """
+        Generate text for a batch of Instances until stop conditions.
+
+        Args:
+            requests: Batch providing (context, gen_kwargs, doc_to_visual,
+                doc_id, task, split).
+        Returns:
+            Generated strings in the original request order.
+        """
+        res = []
@@
-    def generate_until_multi_round(self, requests) -> List[str]:
-        raise NotImplementedError("TODO: Implement multi-round generation")
+    def generate_until_multi_round(self, requests) -> List[str]:
+        """
+        Multi-round chat generation for LLaVA-One-Vision.
+        """
+        raise NotImplementedError("TODO: Implement multi-round generation")

Also applies to: 331-333

🧹 Nitpick comments (6)
lmms_eval/models/simple/llava_onevision1_5.py (6)

1-4: Remove unused imports; expand typing for upcoming hints.

base64 and BytesIO are unused; add Any/Iterable/TypeVar for annotations.

-import base64
-import re
-from io import BytesIO
-from typing import List, Optional, Tuple, Union
+import re
+from typing import Any, Iterable, List, Optional, Tuple, TypeVar, Union

Add near the imports:

+T = TypeVar("T")

163-165: Fix copy/paste in NotImplemented message (Qwen → LLaVA).

Message should reference this model.

-        raise NotImplementedError("Loglikelihood is not implemented for Qwen2.5_VL")
+        raise NotImplementedError(
+            "Loglikelihood is not implemented for LLaVA-One-Vision yet"
+        )

199-206: Raise TypeError for invalid ‘until’ type, keep message concise.

Aligns with Ruff TRY004 and clearer API erroring.

             if isinstance(until, str):
                 until = [until]
             elif not isinstance(until, list):
-                raise ValueError(f"Expected `gen_kwargs['until']` to be of type Union[str, list], but got {type(until)}")
+                raise TypeError(
+                    "gen_kwargs['until'] must be a str or List[str]; "
+                    f"got {type(until).__name__}"
+                )

238-244: Use iterable unpacking instead of list concatenation.

Cleaner and satisfies Ruff RUF005.

-                    message.append(
-                        {
-                            "role": "user",
-                            "content": processed_visuals + [{"type": "text", "text": context}],
-                        }
-                    )
+                    message.append(
+                        {
+                            "role": "user",
+                            "content": [*processed_visuals, {"type": "text", "text": context}],
+                        }
+                    )

210-217: Remove duplicated placeholder cleanup (done again per-message).

Avoid double replace; keep per-message cleanup only.

-            for i in range(len(contexts)):
-                if "<image>" in contexts[i]:
-                    contexts[i] = contexts[i].replace("<image>", "")
+            # Defer <image> placeholder cleanup to the per-message loop below.

26-31: Class name style nit: consider PascalCase (no underscores).

Repo guideline says class names use PascalCase. If rename would break registry/imports, consider adding a PascalCase alias.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6d189cc and 8ca3df2.

📒 Files selected for processing (1)
  • lmms_eval/models/simple/llava_onevision1_5.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/models/simple/llava_onevision1_5.py
🧬 Code graph analysis (1)
lmms_eval/models/simple/llava_onevision1_5.py (4)
lmms_eval/api/instance.py (2)
  • Instance (6-29)
  • args (25-29)
lmms_eval/api/model.py (2)
  • lmms (26-331)
  • add_partial (348-352)
lmms_eval/api/registry.py (1)
  • register_model (11-24)
lmms_eval/utils.py (3)
  • Collator (865-1020)
  • chunks (136-171)
  • get_batched (891-913)
🪛 Ruff (0.12.2)
lmms_eval/models/simple/llava_onevision1_5.py

60-60: Avoid specifying long messages outside the exception class

(TRY003)


68-68: Avoid specifying long messages outside the exception class

(TRY003)


205-205: Prefer TypeError exception for invalid type

(TRY004)


205-205: Avoid specifying long messages outside the exception class

(TRY003)


242-242: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

Comment on lines +281 to +283
                inputs = inputs.to("cuda")
            else:
                inputs = inputs.to(self.device)

⚠️ Potential issue

🧩 Analysis chain

Verify CPU path when device_map='auto'.

inputs.to("cuda") on the auto path may break CPU-only runs. Consider always using inputs.to(self.device) or detect the first model device.


🏁 Script executed:

#!/usr/bin/env bash
# Grep for other generate_until implementations to confirm device handling patterns.
rg -n -C2 'inputs\.to\(' lmms_eval/models | sed -n '1,120p'

Length of output: 5931


Do not unconditionally use "cuda" when device_map == 'auto'.

inputs.to("cuda") breaks CPU-only runs — use inputs = inputs.to(self.device) or determine the model device (e.g. self.model.device or next(self.model.parameters()).device) and preserve dtype with .to(self.model.dtype) if needed. Affected: lmms_eval/models/simple/llava_onevision1_5.py:281-283 (same pattern found in whisper.py:180, qwen2_vl.py:353, qwen2_audio.py:242, qwen2_5_vl_interleave.py:355, qwen2_5_vl.py:300, chat/thyme.py:135, chat/qwen2_5_vl.py:81, chat/huggingface.py:242).

🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around lines 281-283, the code
unconditionally moves inputs to "cuda" when device_map == 'auto', which breaks
CPU-only runs; change this to move inputs to the actual model/device (e.g. use
inputs = inputs.to(self.device) or determine the model device via
self.model.device or next(self.model.parameters()).device) and preserve dtype if
needed by using .to(device, dtype=self.model.dtype) so CPU-only environments
work and tensor dtype is consistent with the model.

Comment on lines +314 to +315
answers = self.processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
for i, ans in enumerate(answers):

⚠️ Potential issue

Decode with tokenizer, not processor (avoids AttributeError on some processors).

AutoProcessor may not expose batch_decode; tokenizer does.

-            answers = self.processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+            answers = self.tokenizer.batch_decode(
+                generated_ids_trimmed,
+                skip_special_tokens=True,
+                clean_up_tokenization_spaces=False,
+            )
🤖 Prompt for AI Agents
In lmms_eval/models/simple/llava_onevision1_5.py around lines 314-315, the code
calls self.processor.batch_decode which can raise AttributeError because
AutoProcessor may not implement batch_decode; replace this call to use the
tokenizer's batch_decode (e.g.,
self.tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=True,
clean_up_tokenization_spaces=False)) or, if tokenizer might be missing, guard
with a fallback to processor by checking hasattr(self, "tokenizer") and
hasattr(self.tokenizer, "batch_decode") and using processor only as fallback.

@@ -0,0 +1,10 @@
export HF_HOME="~/.cache/huggingface"

# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git

Is this link correct?


The link should be fixed.

@Luodian Luodian merged commit 1676c57 into EvolvingLMMs-Lab:main Sep 18, 2025
1 of 2 checks passed
@KevinZeng08

Dear authors, I notice that Hugging Face transformers does not currently have the Llava-OneVision-1.5 model. Is this PR ready to use for evaluation?
Thanks for your wonderful models!

@yiyexy

yiyexy commented Sep 24, 2025

Dear authors, I notice that Hugging Face transformers does not currently have the Llava-OneVision-1.5 model. Is this PR ready to use for evaluation? Thanks for your wonderful models!

Sure! We are working hard on it.
