Thanks for your hard work.
I recently tried to reproduce the evaluation of your LongVT-RFT model on LVBench, but for reasons I have not been able to pin down, the results deviate from the metric reported in the paper:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| lvbench-custom | Yaml | none | 0 | lvbench_score | ↑ | 0.2479 | ± | 0.011 |
Following the README, I referred to lmms-eval (https://github.com/EvolvingLMMs-Lab/lmms-eval) and modified the LVBench task yaml and utils.py files (https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/lmms_eval/tasks/lvbench) so that the dataset is loaded from the correct local path and max_new_tokens takes effect:
```yaml
dataset_path: lmms-lab/LVBench
dataset_kwargs:
  token: False
  cache_dir: /path/to/data/LVBench
  video: True
  # From_YouTube: True
test_split: train
task: lvbench-custom
output_type: generate_until
doc_to_visual: !function utils.lvbench_doc_to_visual
doc_to_text: !function utils.lvbench_doc_to_text
doc_to_target: "answer"
generation_kwargs:
  max_new_tokens: 4096
# The return value of process_results will be used by metrics
process_results: !function utils.lvbench_process_results
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
metric_list:
  - metric: lvbench_score
    aggregation: mean
    higher_is_better: true
lmms_eval_specific_kwargs:
  default:
    pre_prompt: ""
    post_prompt: "\nAnswer the question with the option letter"
metadata:
  - version: 0.0
```

And the modified utils.py:

```python
import os
import re
from pathlib import Path

import yaml

with open(Path(__file__).parent / "lmms_eval.yaml", "r") as f:
    raw_data = f.readlines()
    safe_data = []
    for i, line in enumerate(raw_data):
        # remove function definition since yaml load cannot handle it
        if "!function" not in line:
            safe_data.append(line)

    cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"]


def lvbench_doc_to_visual(doc):
    cache_dir = cache_name
    video_path = doc["video_path"]
    assert os.path.exists(os.path.join(cache_dir, "video_chunks", video_path))
    video_path = os.path.join(cache_dir, "video_chunks", video_path)
    return [video_path]


def lvbench_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    if lmms_eval_specific_kwargs is None:
        lmms_eval_specific_kwargs = {}
    if "pre_prompt" not in lmms_eval_specific_kwargs:
        lmms_eval_specific_kwargs["pre_prompt"] = ""
    if "post_prompt" not in lmms_eval_specific_kwargs:
        lmms_eval_specific_kwargs["post_prompt"] = "\nAnswer the question with the option letter"
    return lmms_eval_specific_kwargs["pre_prompt"] + doc["question"] + lmms_eval_specific_kwargs["post_prompt"]


def extract_characters_regex(s):
    s = s.strip()
    answer_prefixes = [
        "The best answer is",
        "The correct answer is",
        "The answer is",
        "The answer",
        "The best option is",
        "The correct option is",
        "Best answer:",
        "Best option:",
    ]
    for answer_prefix in answer_prefixes:
        s = s.replace(answer_prefix, "")

    if len(s.split()) > 10 and not re.search("[ABCD]", s):
        return ""
    matches = re.search(r"[ABCD]", s)
    if matches is None:
        return ""
    return matches[0]


def lvbench_process_results(doc, results):
    """
    Args:
        doc: an instance of the eval dataset
        results: [pred]
    Returns:
        a dictionary with key: metric name (in this case lvbench_score), value: metric value
    """
    pred = results[0]
    pred_ans = extract_characters_regex(pred)
    # gt_ans = doc["answer"].lower().strip().replace(".", "")
    gt_ans = doc["answer"]
    score = pred_ans == gt_ans
    return {"lvbench_score": score}
```
The vLLM version used is 0.13.0, and the command used is:

```bash
CUDA_VISIBLE_DEVICES=0 vllm serve /path/to/weights/LongVT-RFT --port 28100 --tool-call-parser hermes --enable-auto-tool-choice --trust-remote-code --chat-template /path/to/LongVT/examples/eval/tool_call_qwen2_5_vl.jinja
```

And the script used:

```bash
#!/bin/bash
export OMP_NUM_THREADS=16
export TORCHCODEC_NUM_THREADS=16
# Environment variables for LLM Judge
export OPENAI_API_BASE="http://<server_ip>:28100/v1"
export OPENAI_API_KEY="EMPTY"
export OPENAI_MODEL_NAME="/path/to/weights/LongVT-RFT"
# Cache settings
export DECORD_EOF_RETRY_MAX=409600
export LMMS_EVAL_USE_CACHE=True
export LMMS_EVAL_HOME="./cache" # Set your cache directory
# Arguments
NUM_CPUS=1
TASK_NAME=lvbench-custom
IS_QWEN3_VL=False
MAX_FRAME_NUM=768
# Path to MCP server for tool calling
MCP_PATH="/path/to/LongVT/examples/video_tools/mcp_server.py"
# Run evaluation
accelerate launch --cpu --num_processes=1 --num_machines=1 --mixed_precision=no --dynamo_backend=no -m lmms_eval \
--model async_openai \
--model_args model_version=$OPENAI_MODEL_NAME,mcp_server_path=$MCP_PATH,fps=1,max_frames=$MAX_FRAME_NUM,max_pixels=50176,base_url=$OPENAI_API_BASE,api_key=$OPENAI_API_KEY,num_cpus=$NUM_CPUS,timeout=12000,is_qwen3_vl=$IS_QWEN3_VL \
--tasks $TASK_NAME \
--batch_size 1 \
--output_path ./eval_logs \
--log_samples \
    --include_path .
```

It may be worth noting that the cached response contains duplicate `<|vision_start|>` / `<|vision_end|>` tokens. Here is the cache file: https://drive.google.com/file/d/1VlwJxTBe33_2v5QGYMyzQSKUfynLSOAh.
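In case it is useful for debugging, this is roughly how the duplicated markers can be counted. It is only a sketch and assumes the cached request/response has first been dumped to a plain-text file; the command-line argument is hypothetical and this is not the actual lmms-eval cache layout.

```python
# Hypothetical helper (not part of lmms-eval): count vision markers in a dumped
# cache/log file to confirm the duplication. Pass the dumped file path as argv[1].
import re
import sys

with open(sys.argv[1], encoding="utf-8") as f:
    text = f.read()

starts = len(re.findall(r"<\|vision_start\|>", text))
ends = len(re.findall(r"<\|vision_end\|>", text))
print(f"<|vision_start|>: {starts}  <|vision_end|>: {ends}")
```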
I would be extremely grateful if you could help me solve this problem.