[Paper][ACL 2026 Findings] What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning
- Our paper was accepted to ACL 2026 Findings. 🎉🎉🎉
- Our code has been released.
- Our UI Comprehension-Bench is being finalized and will be available soon. 🚧
Left: Evaluation of existing methods on UI element localization, semantic function description, and practical usage. Middle: Performance gains with correct vs. misleading UI information, compared to no UI information. Right: Comparison of UILoop against existing "Screen-to-Action" methods on the SR metric for AndroidControl-High.
We demonstrate that comprehensive UI understanding significantly enhances reasoning in existing GUI agents. Building on this insight, we propose the UILoop paradigm, which moves beyond conventional "Screen-to-Action" approaches by reframing GUI reasoning as a cyclic "Screen–UI Elements–Action" loop. Through UI Element–Driven Reinforcement Fine-Tuning, UILoop improves model comprehension of interface elements, thereby advancing multimodal GUI reasoning and interpretability.
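The loop above can be sketched as a single inference step. This is a minimal, hypothetical illustration of the data flow only: all names (`locate`, `lingualize`, `act`) are illustrative stand-ins, not the repo's actual API.

```python
# Hypothetical sketch of one "Screen -> UI Elements -> Action" pass.
# Instead of mapping a screenshot directly to an action, the agent first
# enumerates UI elements, describes their semantic function, and then
# grounds the predicted action in those descriptions.

def uiloop_step(screenshot, instruction, locate, lingualize, act):
    elements = locate(screenshot)                                    # 1) find candidate elements
    described = [(e, lingualize(screenshot, e)) for e in elements]   # 2) describe each one
    return act(screenshot, instruction, described)                   # 3) act conditioned on them

# Toy stand-ins showing the data flow (a real multimodal model replaces these):
locate = lambda img: [{"bbox": (10, 20, 80, 40), "type": "button"}]
lingualize = lambda img, e: f"a {e['type']} that submits the form"
act = lambda img, task, ctx: {"action": "click", "target": ctx[0][0]["bbox"]}

print(uiloop_step("screenshot.png", "submit the form", locate, lingualize, act))
# -> {'action': 'click', 'target': (10, 20, 80, 40)}
```

The design point is that the action is never chosen from the raw screen alone; the intermediate element descriptions make the decision inspectable.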
Statistics of Our UI Comprehension-Bench. Left: Proportion and distribution of GT UI elements; token length of their semantic descriptions. Right: Proportion of GT UI elements effectively used in action inference.
We introduce the more challenging UI Comprehension task with three dedicated evaluation metrics (UI Locate, Lingualize, Leverage) to assess how well existing methods understand UI elements. To support this, we advance community research by contributing UI Comprehension-Bench, a 26K-sample benchmark for comprehensive UI capability assessment.
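As a rough intuition for how a metric like UI Locate might be scored, here is a hedged sketch assuming the common GUI-grounding convention that a prediction counts as correct when the predicted point falls inside the ground-truth element's bounding box. The benchmark's actual protocol may differ, and the function names are illustrative.

```python
# Hypothetical UI Locate scoring sketch (not the benchmark's actual code):
# a predicted click point is a hit if it lands inside the GT bounding box.

def point_in_box(point, box):
    """box = (x1, y1, x2, y2); point = (x, y)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def locate_accuracy(predicted_points, gt_boxes):
    """Fraction of predicted points that fall inside their GT boxes."""
    hits = sum(point_in_box(p, b) for p, b in zip(predicted_points, gt_boxes))
    return hits / len(gt_boxes)

preds = [(50, 30), (200, 200)]
boxes = [(10, 20, 80, 40), (0, 0, 100, 100)]
print(locate_accuracy(preds, boxes))  # first point hits, second misses -> 0.5
```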
conda create -n uiloop python=3.10
conda activate uiloop
pip install -r requirements.txt
Our repository supports the Qwen2.5-VL series of models (including the 3B and 7B variants).
bash ./examples/qwen2_5_vl_gui_grpo.sh
Inference and evaluation on AndroidControl-High and ScreenSpot-Pro.
bash ./uiloop/inference.sh
bash ./uiloop/eval.sh
Inference and evaluation on our UI Comprehension-Bench. Running these scripts produces scores for UI Locate, Lingualize, and Leverage.
bash ./uiloop/uiloop_bench_inference.sh
bash ./uiloop/eval_uiloop.sh
We would like to express our sincere gratitude to QwenVL, EasyR1, Verl, and GUI-R1 for providing the open-source resources that contributed to the development of this project.
If you find this repo useful for your research, please consider citing the paper.
@article{li2026s,
title={What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning},
author={Li, Songze and Guo, Xiaoke and Liu, Tianqi and Yi, Biao and Gong, Zhaoyan and Liu, Zhiqiang and Chen, Huajun and Zhang, Wen},
journal={arXiv preprint arXiv:2604.06995},
year={2026}
}

