I ran the scripts provided by lmms_eval directly (vllm_qwen3vl.sh or vllm_generate_qwen3vl.sh) to test qwen3vl-8b-instruct and qwen3vl-8b-thinking on the videomme, videommmu, and mvbench tasks. The results I got were significantly lower than the official report, by more than 5%, and in some cases more than 10%. For example, for videommmu I ran vllm_qwen3vl.sh with only the model and task changed; the model was qwen3vl_8b_thinking.
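For reference, the command I effectively ran was roughly the following. This is only a sketch of the shipped script with the model and task swapped; the exact model_args keys and the checkpoint path Qwen/Qwen3-VL-8B-Thinking are my assumptions and may differ from what vllm_qwen3vl.sh actually sets.

```bash
# Sketch, assuming the standard lmms_eval CLI flags; only --tasks and the
# model checkpoint were changed relative to the shipped vllm_qwen3vl.sh.
# The model_args key names and checkpoint path below are assumptions.
python3 -m lmms_eval \
    --model vllm \
    --model_args model=Qwen/Qwen3-VL-8B-Thinking \
    --tasks videommmu \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```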
The official report states 72.8, and even allowing for differences in parameters such as the maximum frame rate, the gap should not be more than 10%. Have the authors encountered this problem before?