zjukg/UILoop
UILoop

[Paper][ACL 2026 Findings] What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

🔥 News

  • Our paper was accepted by ACL 2026 Findings. 🎉🎉🎉
  • Our code has been released.
  • Our UI Comprehension Bench is being finalized and will be available soon. 🚧

✨ Our Findings

Left: Evaluation of existing methods on UI element localization, semantic function description, and practical usage. Middle: Performance gains with correct vs. misleading UI info compared to without UI info. Right: Comparison of UILoop against existing "Screen-to-Action" methods on SR metric for Android Control-High.

We demonstrate that comprehensive UI understanding significantly enhances reasoning in existing GUI agents. Building on this insight, we propose the UILoop paradigm, which moves beyond conventional "Screen-to-Action" approaches by reframing GUI reasoning as a cyclic "Screen–UI Elements–Action" loop. Through UI Element–Driven Reinforcement Fine-Tuning, UILoop improves model comprehension of interface elements, thereby advancing multimodal GUI reasoning and interpretability.
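The cyclic loop above can be sketched in a few lines of Python. This is purely illustrative: all function names (`perceive_screen`, `extract_ui_elements`, `decide_action`) are hypothetical placeholders, not the repository's actual API, and the returned structures are stand-ins for real model outputs.

```python
# Illustrative sketch of the cyclic "Screen-UI Elements-Action" loop.
# All names below are hypothetical placeholders, not the repo's API.

def perceive_screen(env_state):
    """Capture the current screenshot (placeholder)."""
    return {"screenshot": env_state}

def extract_ui_elements(screen):
    """Locate and semantically describe UI elements before acting (placeholder)."""
    return [{"bbox": (0, 0, 10, 10), "desc": "settings button"}]

def decide_action(screen, elements, goal):
    """Choose an action grounded in the extracted elements (placeholder)."""
    target = elements[0]
    return {"type": "click", "bbox": target["bbox"]}

def uiloop_step(env_state, goal):
    screen = perceive_screen(env_state)
    # The extra "UI Elements" stage distinguishes UILoop from Screen-to-Action.
    elements = extract_ui_elements(screen)
    return decide_action(screen, elements, goal)

action = uiloop_step("home_screen", "open settings")
print(action)
```

The key difference from a "Screen-to-Action" agent is the intermediate element-extraction stage, which grounds the chosen action in explicit UI understanding.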

🌱 UI Comprehension Bench

Statistics of Our UI Comprehension-Bench. Left: Proportion and distribution of GT UI elements; token length of their semantic descriptions. Right: Proportion of GT UI elements effectively used in action inference.

We introduce the more challenging UI Comprehension task with three dedicated evaluation metrics (UI Locate, Lingualize, Leverage) to assess how existing methods master UI elements. To support this, we advance community research by contributing UI Comprehension-Bench, a 26K benchmark for comprehensive UI capability assessment.
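As a rough intuition for the UI Locate metric, one plausible formulation checks whether a predicted click point falls inside the ground-truth element's bounding box. This is an assumed, simplified scoring rule for illustration only; the benchmark's actual metric definitions are given in the paper.

```python
# Hedged sketch: a plausible point-in-box formulation of a locate score.
# This is NOT the benchmark's official metric, only an illustration.

def point_in_box(point, box):
    """box = (x1, y1, x2, y2); point = (x, y)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def locate_accuracy(predictions, gt_boxes):
    """Fraction of predicted points landing inside their GT element boxes."""
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, gt_boxes))
    return hits / len(gt_boxes)

score = locate_accuracy([(5, 5), (50, 50)], [(0, 0, 10, 10), (0, 0, 10, 10)])
# First prediction hits its box, the second misses.
```

Lingualize and Leverage would additionally score the quality of the element's semantic description and whether that description is actually used in action inference, respectively.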

📦 Environment

conda create -n uiloop python=3.10
conda activate uiloop
pip install -r requirements.txt

🚀 UI Element-Driven RFT

Our repository supports the Qwen 2.5 VL series models (including 3B and 7B).

bash ./examples/qwen2_5_vl_gui_grpo.sh

📊 Inference and Evaluation

Inference and evaluation of AndroidControl-High and Screenspot Pro.

bash ./uiloop/inference.sh
bash ./uiloop/eval.sh

Inference and evaluation on our UI Comprehension Bench. Running these scripts will give you scores for UI Locate, Lingualize, and Leverage.

bash ./uiloop/uiloop_bench_inference.sh
bash ./uiloop/eval_uiloop.sh

💐 Acknowledgements

We would like to express our sincere gratitude to QwenVL, EasyR1, Verl and GUI-R1 for providing open-source resources that contributed to the development of this project.

⭐ Citation

If you find this repo useful for your research, please consider citing the paper.

@article{li2026s,
  title={What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning},
  author={Li, Songze and Guo, Xiaoke and Liu, Tianqi and Yi, Biao and Gong, Zhaoyan and Liu, Zhiqiang and Chen, Huajun and Zhang, Wen},
  journal={arXiv preprint arXiv:2604.06995},
  year={2026}
}
