Thank you for this awesome work! However, I cannot reproduce the Qwen2.5-VL 3B/7B baseline results on ScreenSpot and ScreenSpot-v2 as reported in the paper. I used the prompt here. Could you share the prompt used for evaluation on these benchmarks? Thank you!