This is the official repository for the paper InfiGUI-R1.
In this work, we develop InfiGUI-R1, a multimodal large language model (MLLM)-based GUI agent trained primarily with Reinforcement Learning to strengthen planning and error-recovery skills for GUI tasks. The agent is trained with a two-stage framework: first, we inject spatial reasoning capabilities by distilling reasoning trajectories from teacher models; second, we enhance the agent's planning and error-recovery skills with Reinforcement Learning, using techniques such as rewarding accurate sub-goal generation and training on constructed error-recovery scenarios.
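To make the RL stage more concrete, below is a minimal sketch of a rule-based, per-step reward of the kind described above: a format check, an action-type match, a grounding check, and a bonus for an accurate sub-goal. The action schema, the `gui_step_reward` name, and the weights are illustrative assumptions, not the reward actually used to train InfiGUI-R1.

```python
# Illustrative sketch only: a rule-based reward of the kind described above.
# The action schema, weights, and field names are hypothetical assumptions,
# not the reward actually used to train InfiGUI-R1.

def gui_step_reward(pred: dict, gold: dict) -> float:
    """Score one predicted GUI step against its annotated ground truth."""
    reward = 0.0

    # 1) Format reward: the rollout must parse into a well-formed action dict.
    if not isinstance(pred, dict) or "action" not in pred:
        return -1.0
    reward += 0.1

    # 2) Action-type reward: e.g. click / scroll / type_text must match.
    if pred["action"] == gold["action"]:
        reward += 0.4
    else:
        return reward  # wrong action type: no credit for arguments

    # 3) Grounding reward: for clicks, the predicted point must fall inside
    #    the ground-truth bounding box (x1, y1, x2, y2).
    if gold["action"] == "click":
        x, y = pred.get("coordinate", (-1, -1))
        x1, y1, x2, y2 = gold["bbox"]
        if x1 <= x <= x2 and y1 <= y <= y2:
            reward += 0.4

    # 4) Sub-goal reward: small bonus when the stated sub-goal overlaps the
    #    annotated one (token-level F1 as a cheap similarity proxy).
    p = set(pred.get("sub_goal", "").lower().split())
    g = set(gold.get("sub_goal", "").lower().split())
    if p and g and (p & g):
        prec, rec = len(p & g) / len(p), len(p & g) / len(g)
        reward += 0.1 * (2 * prec * rec / (prec + rec))

    return reward
```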
- 🔥 2025/05/15: Our paper "OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use" is accepted by ACL 2025.
- 🔥 2025/4/19: Our paper "InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners" released.
- 🔥 2025/1/9: Our paper "InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection" released.
- 🔥 2024/12/12: Our paper "OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use" released.
- 2024/4/2: Our paper "InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks" is accepted by ICML 2024.

- 2025/05/21: Added AndroidControl evaluation methods and uploaded the corresponding test data.
- 2025/04/20: Model weights have been uploaded to Hugging Face.
- 2025/04/19: Our preprint has been published on arXiv.
On the cross-platform ScreenSpot benchmark, InfiGUI-R1-3B achieves an average accuracy of 87.5%. It leads across text and icon localization tasks on Mobile, Desktop, and Web platforms, reaching the state-of-the-art level for models of similar parameter size (as of 2025/04/19):
| Model | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg. |
|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | |
| GPT-4o | 30.5 | 23.2 | 20.6 | 19.4 | 11.1 | 7.8 | 18.8 |
| Claude Computer Use | - | - | - | - | - | - | 83.0 |
| Gemini 2.0 (Project Mariner) | - | - | - | - | - | - | 84.0 |
| **General Open-source Models** | | | | | | | |
| Qwen2-VL-7B | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.9 |
| Qwen2.5-VL-3B | - | - | - | - | - | - | 55.5 |
| Qwen2.5-VL-7B | - | - | - | - | - | - | 84.7 |
| **GUI-specific Models** | | | | | | | |
| CogAgent | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
| SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| UGround-7B | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 82.5 |
| ShowUI-2B | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Aguvis-7B | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 84.4 |
| UI-R1-3B | - | - | 90.2 | 59.3 | 85.2 | 73.3 | - |
| GUI-R1-3B | - | - | 93.8 | 64.8 | 89.6 | 72.1 | - |
| GUI-R1-7B | - | - | 91.8 | 73.6 | 91.3 | 75.7 | - |
| UI-TARS-2B | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | 82.3 |
| **Ours** | | | | | | | |
| InfiGUI-R1-3B | 97.1 | 81.2 | 94.3 | 77.1 | 91.7 | 77.6 | 87.5 |
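For reference, grounding accuracy on ScreenSpot-style benchmarks is conventionally the fraction of instructions whose predicted click point lands inside the ground-truth element's bounding box. The snippet below is a minimal sketch of that check under assumed input formats; it is not this repository's evaluation code.

```python
# Minimal sketch of point-in-box grounding accuracy as commonly used for
# ScreenSpot-style benchmarks; not this repository's evaluation code.

def grounding_accuracy(preds, gts):
    """preds: list of (x, y) click points; gts: list of (x1, y1, x2, y2) boxes."""
    hits = 0
    for (x, y), (x1, y1, x2, y2) in zip(preds, gts):
        if x1 <= x <= x2 and y1 <= y <= y2:
            hits += 1
    return hits / len(gts) if gts else 0.0

# Example: one hit out of two predictions -> 0.5
print(grounding_accuracy([(10, 20), (300, 400)], [(0, 0, 50, 50), (0, 0, 50, 50)]))
```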
On the more challenging ScreenSpot-Pro benchmark, which focuses on complex, high-resolution desktop applications, InfiGUI-R1-3B achieves an average accuracy of 35.7%. This matches strong, larger 7B models such as UI-TARS-7B and reaches the state-of-the-art level for models of similar parameter size (as of 2025/04/19):
| Model | Avg. Text | Avg. Icon | Avg. | CAD Text | CAD Icon | CAD Avg. | Dev. Text | Dev. Icon | Dev. Avg. | Creative Text | Creative Icon | Creative Avg. | Sci. Text | Sci. Icon | Sci. Avg. | Office Text | Office Icon | Office Avg. | OS Text | OS Icon | OS Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | | | | | | | | | | | | | |
| GPT-4o | 1.3 | 0.0 | 0.8 | 2.0 | 0.0 | 1.5 | 1.3 | 0.0 | 0.7 | 1.0 | 0.0 | 0.6 | 2.1 | 0.0 | 1.2 | 1.1 | 0.0 | 0.9 | 0.0 | 0.0 | 0.0 |
| Claude Computer Use | 23.4 | 7.1 | 17.1 | 14.5 | 3.7 | 11.9 | 22.0 | 3.9 | 12.6 | 25.9 | 3.4 | 16.8 | 33.9 | 15.8 | 25.8 | 30.1 | 16.3 | 26.9 | 11.0 | 4.5 | 8.1 |
| **General Open-source Models** | | | | | | | | | | | | | | | | | | | | | |
| Qwen2-VL-7B | 2.5 | 0.2 | 1.6 | 0.5 | 0.0 | 0.4 | 2.6 | 0.0 | 1.3 | 1.5 | 0.0 | 0.9 | 6.3 | 0.0 | 3.5 | 3.4 | 1.9 | 3.0 | 0.9 | 0.0 | 0.5 |
| Qwen2.5-VL-3B | - | - | 23.9 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Qwen2.5-VL-7B | - | - | 29.0 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Kimi-VL | - | - | 34.5 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| **GUI-specific Models** | | | | | | | | | | | | | | | | | | | | | |
| SeeClick | 1.8 | 0.0 | 1.1 | 2.5 | 0.0 | 1.9 | 0.6 | 0.0 | 0.3 | 1.0 | 0.0 | 0.6 | 3.5 | 0.0 | 2.0 | 1.1 | 0.0 | 0.9 | 2.8 | 0.0 | 1.5 |
| CogAgent-18B | 12.0 | 0.8 | 7.7 | 7.1 | 3.1 | 6.1 | 14.9 | 0.7 | 8.0 | 9.6 | 0.0 | 5.6 | 22.2 | 1.8 | 13.4 | 13.0 | 0.0 | 10.0 | 5.6 | 0.0 | 3.1 |
| Aria-UI | 17.1 | 2.0 | 11.3 | 7.6 | 1.6 | 6.1 | 16.2 | 0.0 | 8.4 | 23.7 | 2.1 | 14.7 | 27.1 | 6.4 | 18.1 | 20.3 | 1.9 | 16.1 | 4.7 | 0.0 | 2.6 |
| OS-Atlas-4B | 5.0 | 1.7 | 3.7 | 2.0 | 0.0 | 1.5 | 7.1 | 0.0 | 3.7 | 3.0 | 1.4 | 2.3 | 9.0 | 5.5 | 7.5 | 5.1 | 3.8 | 4.8 | 5.6 | 0.0 | 3.1 |
| OS-Atlas-7B | 28.1 | 4.0 | 18.9 | 12.2 | 4.7 | 10.3 | 33.1 | 1.4 | 17.7 | 28.8 | 2.8 | 17.9 | 37.5 | 7.3 | 24.4 | 33.9 | 5.7 | 27.4 | 27.1 | 4.5 | 16.8 |
| ShowUI-2B | 10.8 | 2.6 | 7.7 | 2.5 | 0.0 | 1.9 | 16.9 | 1.4 | 9.4 | 9.1 | 0.0 | 5.3 | 13.2 | 7.3 | 10.6 | 15.3 | 7.5 | 13.5 | 10.3 | 2.2 | 6.6 |
| UGround-7B | 25.0 | 2.8 | 16.5 | 14.2 | 1.6 | 11.1 | 26.6 | 2.1 | 14.7 | 27.3 | 2.8 | 17.0 | 31.9 | 2.7 | 19.3 | 31.6 | 11.3 | 27.0 | 17.8 | 0.0 | 9.7 |
| UGround-V1-7B | - | - | 31.1 | - | - | 13.5 | - | - | 35.5 | - | - | 27.8 | - | - | 38.8 | - | - | 48.8 | - | - | 26.1 |
| UI-R1-3B | - | - | 17.8 | 11.2 | 6.3 | - | 22.7 | 4.1 | - | 27.3 | 3.5 | - | 42.4 | 11.8 | - | 32.2 | 11.3 | - | 13.1 | 4.5 | - |
| GUI-R1-3B | - | - | - | 26.4 | 7.8 | - | 33.8 | 4.8 | - | 40.9 | 5.6 | - | 61.8 | 17.3 | - | 53.6 | 17.0 | - | 28.1 | 5.6 | - |
| GUI-R1-7B | - | - | - | 23.9 | 6.3 | - | 49.4 | 4.8 | - | 38.9 | 8.4 | - | 55.6 | 11.8 | - | 58.7 | 26.4 | - | 42.1 | 16.9 | - |
| UI-TARS-2B | 39.6 | 8.4 | 27.7 | 17.8 | 4.7 | 14.6 | 47.4 | 4.1 | 26.4 | 42.9 | 6.3 | 27.6 | 56.9 | 17.3 | 39.8 | 50.3 | 17.0 | 42.6 | 21.5 | 5.6 | 14.3 |
| UI-TARS-7B | 47.8 | 16.2 | 35.7 | 20.8 | 9.4 | 18.0 | 58.4 | 12.4 | 36.1 | 50.0 | 9.1 | 32.8 | 63.9 | 31.8 | 50.0 | 63.3 | 20.8 | 53.5 | 30.8 | 16.9 | 24.5 |
| **Ours** | | | | | | | | | | | | | | | | | | | | | |
| InfiGUI-R1-3B | 49.1 | 14.1 | 35.7 | 33.0 | 14.1 | 28.4 | 51.3 | 12.4 | 32.4 | 44.9 | 7.0 | 29.0 | 58.3 | 20.0 | 41.7 | 65.5 | 28.3 | 57.0 | 43.9 | 12.4 | 29.6 |
On the AndroidControl benchmark, which covers diverse Android trajectory tasks at Low and High difficulty levels, InfiGUI-R1-3B achieves success rates of 92.1% (Low) and 71.1% (High) on the test split, reaching the state-of-the-art level for models of similar parameter size (as of 2025/04/19):
| Model | Low Type | Low Grounding | Low SR | High Type | High Grounding | High SR |
|---|---|---|---|---|---|---|
| GPT-4o | 74.3 | 0.0 | 19.4 | 66.3 | 0.0 | 20.8 |
| Aria-UI | - | 87.7 | 67.3 | - | 43.2 | 10.2 |
| OS-Atlas-4B | 91.9 | 83.8 | 80.6 | 84.7 | 73.8 | 67.5 |
| Aguvis-7B | - | - | 80.5 | - | - | 61.5 |
| Aguvis-72B | - | - | 84.4 | - | - | 66.4 |
| UI-R1 | 94.3 | 82.6 | 88.5 | - | - | - |
| GUI-R1-3B | - | - | - | 58.0 | 56.2 | 46.6 |
| GUI-R1-7B | - | - | - | 71.6 | 65.6 | 51.7 |
| UI-TARS-2B | 98.1 | 87.3 | 89.3 | 81.2 | 78.4 | 68.9 |
| **Ours** | | | | | | |
| InfiGUI-R1-3B | 96.0 | 93.2 | 92.1 | 82.7 | 74.4 | 71.1 |
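The Type, Grounding, and SR columns follow the usual step-level convention for AndroidControl: action-type accuracy, click-location accuracy against the target element, and full step success (action type plus arguments correct). The sketch below illustrates that convention with assumed field names only; the repository's `android_control.py` script is the authoritative implementation.

```python
# Illustrative sketch of step-level AndroidControl-style metrics under assumed
# field names; see eval/android_control/android_control.py for the real logic.

def step_metrics(preds, gts):
    """Each item: {"type": str, "point": (x, y) or None,
    "bbox": (x1, y1, x2, y2) or None, "text": str or None}."""
    type_ok = ground_ok = step_ok = 0
    grounded_total = 0
    for p, g in zip(preds, gts):
        t = p["type"] == g["type"]
        type_ok += t

        grounded = True
        if g.get("bbox") is not None:          # grounding scored only on click-like steps
            grounded_total += 1
            x, y = p.get("point") or (-1, -1)
            x1, y1, x2, y2 = g["bbox"]
            grounded = x1 <= x <= x2 and y1 <= y <= y2
            ground_ok += grounded

        text_ok = (g.get("text") is None) or (p.get("text") == g["text"])
        step_ok += t and grounded and text_ok  # step success: type + arguments correct

    n = max(len(gts), 1)
    return {
        "Type": type_ok / n,
        "Grounding": ground_ok / max(grounded_total, 1),
        "SR": step_ok / n,
    }
```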
To reproduce the results on AndroidControl:
- Install the `vllm` library: `pip install vllm`
- Navigate to the evaluation directory (`eval/android_control`): `cd eval/android_control`
- Download the processed test set from Hugging Face (`Reallm-Labs/android_control_test`): `huggingface-cli download --repo-type dataset --resume-download Reallm-Labs/android_control_test --local-dir ./`
- Extract the downloaded data: `tar -xzf android_control_test.tar.gz`
- Run the evaluation scripts for the high- and low-difficulty tasks:
  - `python android_control.py --model_path Reallm-Labs/InfiGUI-R1-3B --eval_type high --thinking`
  - `python android_control.py --model_path Reallm-Labs/InfiGUI-R1-3B --eval_type low --thinking`
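For a quick sanity check outside the evaluation scripts, the sketch below runs single-screenshot inference with the released checkpoint. It assumes the checkpoint follows the standard Qwen2.5-VL interface of its base model; the screenshot path and instruction are placeholders, and it additionally requires `pip install transformers qwen-vl-utils`.

```python
# Hypothetical single-screenshot inference sketch; assumes InfiGUI-R1-3B uses
# the standard Qwen2.5-VL interface of its base model. The image path and
# instruction below are placeholders.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Reallm-Labs/InfiGUI-R1-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "screenshot.png"},          # placeholder screenshot
        {"type": "text", "text": "Click the settings icon."},  # placeholder instruction
    ],
}]

# Build the chat prompt and pack the image the way Qwen2.5-VL expects.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```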
If you find this work useful, please consider citing the following papers:
@article{liu2025infigui,
title={InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners},
author={Liu, Yuhang and Li, Pengxiang and Xie, Congkai and Hu, Xavier and Han, Xiaotian and Zhang, Shengyu and Yang, Hongxia and Wu, Fei},
journal={arXiv preprint arXiv:2504.14239},
year={2025}
}
@article{liu2025infiguiagent,
title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
journal={arXiv preprint arXiv:2501.04575},
year={2025}
}
We would like to express our gratitude to the following open-source projects: EasyR1, VERL, LLaMA-Factory, and Qwen2.5-VL.