I ran the scripts provided by lmms_eval directly (vllm_qwen3vl.sh or vllm_generate_qwen3vl.sh) to test qwen3vl-8b-instruct and qwen3vl-8b-thinking on the videomme, videommmu, and mvbench tasks. The results I got were significantly lower than the official report, by more than 5%, and in some cases more than 10%. For example, for videommmu I ran vllm_qwen3vl.sh with only the model and task changed; the model was qwen3vl_8b_thinking.
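For reference, the command I effectively ran was roughly the following. This is only a sketch of the shipped script with the model and task swapped; the exact model_args keys and the checkpoint path Qwen/Qwen3-VL-8B-Thinking are my assumptions and may differ from what vllm_qwen3vl.sh actually sets.

```bash
# Sketch, assuming the standard lmms_eval CLI flags; only --tasks and the
# model checkpoint were changed relative to the shipped vllm_qwen3vl.sh.
# The model_args key names and checkpoint path below are assumptions.
python3 -m lmms_eval \
    --model vllm \
    --model_args model=Qwen/Qwen3-VL-8B-Thinking \
    --tasks videommmu \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```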
The official report states 72.8, and even allowing for differences in parameters such as the maximum frame rate, the gap should not be more than 10%. Have the authors encountered this problem before?