Question about metrics when evaluating LVBench #11

@Yusux

Description

Thanks for your hard work.

I recently tried to reproduce the evaluation of your LongVT-RFT model on LVBench, but the results deviate from the metric reported in the paper:

Tasks           Version  Filter  n-shot  Metric         Value    Stderr
lvbench-custom  Yaml     none    0       lvbench_score  0.2479   ± 0.011

Following the README, I referred to lmms-eval (https://github.com/EvolvingLMMs-Lab/lmms-eval) and modified the task YAML and utils.py files (https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/lmms_eval/tasks/lvbench) so that they point to the correct local path and make max_new_tokens usable:

dataset_path: lmms-lab/LVBench
dataset_kwargs:
  token: False
  cache_dir: /path/to/data/LVBench
  video: True
  # From_YouTube: True
test_split: train
task: lvbench-custom
output_type: generate_until
doc_to_visual: !function utils.lvbench_doc_to_visual
doc_to_text: !function utils.lvbench_doc_to_text
doc_to_target: "answer"
generation_kwargs:
  max_new_tokens: 4096
# The return value of process_results will be used by metrics
process_results: !function utils.lvbench_process_results
# Note that the metric name can be either a registered metric function (such as in the case of GQA) or a key name returned by process_results
metric_list:
  - metric: lvbench_score
    aggregation: mean
    higher_is_better: true
lmms_eval_specific_kwargs:
  default:
    pre_prompt: ""
    post_prompt: "\nAnswer the question with the option letter"
metadata:
  - version: 0.0

And the modified utils.py:
import os
import re
from pathlib import Path

import yaml

with open(Path(__file__).parent / "lmms_eval.yaml", "r") as f:
    raw_data = f.readlines()
    safe_data = []
    for i, line in enumerate(raw_data):
        # remove function definition since yaml load cannot handle it
        if "!function" not in line:
            safe_data.append(line)
cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"]


def lvbench_doc_to_visual(doc):
    cache_dir = cache_name
    video_path = doc["video_path"]
    assert os.path.exists(os.path.join(cache_dir, "video_chunks", video_path))
    video_path = os.path.join(cache_dir, "video_chunks", video_path)
    return [video_path]


def lvbench_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    if lmms_eval_specific_kwargs is None:
        lmms_eval_specific_kwargs = {}
    if "pre_prompt" not in lmms_eval_specific_kwargs:
        lmms_eval_specific_kwargs["pre_prompt"] = ""
    if "post_prompt" not in lmms_eval_specific_kwargs:
        lmms_eval_specific_kwargs["post_prompt"] = "\nAnswer the question with the option letter"
    return lmms_eval_specific_kwargs["pre_prompt"] + doc["question"] + lmms_eval_specific_kwargs["post_prompt"]


def extract_characters_regex(s):
    s = s.strip()
    answer_prefixes = [
        "The best answer is",
        "The correct answer is",
        "The answer is",
        "The answer",
        "The best option is" "The correct option is",
        "Best answer:" "Best option:",
    ]
    for answer_prefix in answer_prefixes:
        s = s.replace(answer_prefix, "")

    if len(s.split()) > 10 and not re.search("[ABCD]", s):
        return ""

    matches = re.search(r"[ABCD]", s)
    if matches is None:
        return ""
    return matches[0]


def lvbench_process_results(doc, results):
    """
    Args:
        doc: an instance of the eval dataset
        results: [pred]
    Returns:
        a dictionary with key: metric name (in this case lvbench_score), value: metric value
    """
    pred = results[0]
    pred_ans = extract_characters_regex(pred)
    # gt_ans = doc["answer"].lower().strip().replace(".", "")
    gt_ans = doc["answer"]
    score = pred_ans == gt_ans

    # return {f"videomme_perception_score": data_dict for metric in matrices}
    return {f"lvbench_score": score}

The vLLM version is 0.13.0, and the serve command is:

CUDA_VISIBLE_DEVICES=0 vllm serve /path/to/weights/LongVT-RFT --port 28100 --tool-call-parser hermes --enable-auto-tool-choice --trust-remote-code --chat-template /path/to/LongVT/examples/eval/tool_call_qwen2_5_vl.jinja
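
For reference, a minimal sketch to check that the served endpoint responds, using the openai Python client (the base URL and model path are the same placeholders as in the script below):

from openai import OpenAI

# Quick check that the vLLM OpenAI-compatible server started above is reachable.
client = OpenAI(base_url="http://<server_ip>:28100/v1", api_key="EMPTY")

# The served model list should contain /path/to/weights/LongVT-RFT.
print([m.id for m in client.models.list().data])

# Minimal chat request to confirm the model answers at all.
resp = client.chat.completions.create(
    model="/path/to/weights/LongVT-RFT",
    messages=[{"role": "user", "content": "Reply with the single letter A."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)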

And the evaluation script used is:

#!/bin/bash

export OMP_NUM_THREADS=16
export TORCHCODEC_NUM_THREADS=16

# Environment variables for LLM Judge
export OPENAI_API_BASE="http://<server_ip>:28100/v1"
export OPENAI_API_KEY="EMPTY"
export OPENAI_MODEL_NAME="/path/to/weights/LongVT-RFT"

# Cache settings
export DECORD_EOF_RETRY_MAX=409600
export LMMS_EVAL_USE_CACHE=True
export LMMS_EVAL_HOME="./cache"  # Set your cache directory

# Arguments
NUM_CPUS=1
TASK_NAME=lvbench-custom
IS_QWEN3_VL=False
MAX_FRAME_NUM=768

# Path to MCP server for tool calling
MCP_PATH="/path/to/LongVT/examples/video_tools/mcp_server.py"

# Run evaluation
accelerate launch --cpu --num_processes=1 --num_machines=1 --mixed_precision=no --dynamo_backend=no -m lmms_eval \
    --model async_openai \
    --model_args model_version=$OPENAI_MODEL_NAME,mcp_server_path=$MCP_PATH,fps=1,max_frames=$MAX_FRAME_NUM,max_pixels=50176,base_url=$OPENAI_API_BASE,api_key=$OPENAI_API_KEY,num_cpus=$NUM_CPUS,timeout=12000,is_qwen3_vl=$IS_QWEN3_VL \
    --tasks $TASK_NAME \
    --batch_size 1 \
    --output_path ./eval_logs \
    --log_samples \
    --include_path .

It may be worth noting that the cached response contains duplicate <|vision_start|> <|vision_end|> tokens. Here is the cache file https://drive.google.com/file/d/1VlwJxTBe33_2v5QGYMyzQSKUfynLSOAh.
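
A minimal sketch of one way to spot the duplicated tokens (assuming a cached response has been loaded as a plain string; the loading itself depends on the cache format):

import re

def count_vision_tokens(text: str) -> dict:
    # Count the Qwen2.5-VL vision delimiter tokens and back-to-back empty pairs.
    return {
        "vision_start": len(re.findall(r"<\|vision_start\|>", text)),
        "vision_end": len(re.findall(r"<\|vision_end\|>", text)),
        "empty_pairs": len(re.findall(r"<\|vision_start\|>\s*<\|vision_end\|>", text)),
    }

# Example: two empty start/end pairs right before the question text.
sample = "<|vision_start|><|vision_end|><|vision_start|><|vision_end|> What happens next?"
print(count_vision_tokens(sample))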

I would be extremely grateful if you could help me solve this problem.
