The test results for Qwen3VL show significant discrepancies compared to the official report. #932

@Fu-Fu-Fu-Fu

I ran the scripts provided by lmms_eval directly (vllm_qwen3vl.sh or vllm_generate_qwen3vl.sh) to test qwen3vl-8b-instruct and its thinking variant on the videomme, videommmu, and mvbench tasks. The results were significantly lower than the official report: by more than 5%, and in some cases by more than 10%. For example, for videommmu I ran vllm_qwen3vl.sh, changing only the model and the task. The model used was qwen3vl_8b_thinking.

Image

The official report states 72.8. Even setting aside parameters such as the maximum frame rate, the difference should not be more than 10%. Have the authors encountered this problem before?
