This is the official repository for the paper InfiGUI-R1.
In this work, we develop InfiGUI-R1, a multimodal large language model (MLLM)-based GUI agent trained primarily with Reinforcement Learning to strengthen planning and error-recovery skills for GUI tasks. The agent is trained with a two-stage framework: first, we inject spatial reasoning capabilities by distilling reasoning trajectories from teacher models; second, we enhance the agent's planning and error-recovery skills with Reinforcement Learning, using techniques such as rewarding accurate sub-goal generation and training on constructed error-recovery scenarios.
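To make the RL stage more concrete, below is a minimal sketch of a rule-based, per-step reward of the kind described above: a format check, an action-type match, a grounding check, and a bonus for an accurate sub-goal. The action schema, the `gui_step_reward` name, and the weights are illustrative assumptions, not the reward actually used to train InfiGUI-R1.

```python
# Illustrative sketch only: a rule-based reward of the kind described above.
# The action schema, weights, and field names are hypothetical assumptions,
# not the reward actually used to train InfiGUI-R1.

def gui_step_reward(pred: dict, gold: dict) -> float:
    """Score one predicted GUI step against its annotated ground truth."""
    reward = 0.0

    # 1) Format reward: the rollout must parse into a well-formed action dict.
    if not isinstance(pred, dict) or "action" not in pred:
        return -1.0
    reward += 0.1

    # 2) Action-type reward: e.g. click / scroll / type_text must match.
    if pred["action"] == gold["action"]:
        reward += 0.4
    else:
        return reward  # wrong action type: no credit for arguments

    # 3) Grounding reward: for clicks, the predicted point must fall inside
    #    the ground-truth bounding box (x1, y1, x2, y2).
    if gold["action"] == "click":
        x, y = pred.get("coordinate", (-1, -1))
        x1, y1, x2, y2 = gold["bbox"]
        if x1 <= x <= x2 and y1 <= y <= y2:
            reward += 0.4

    # 4) Sub-goal reward: small bonus when the stated sub-goal overlaps the
    #    annotated one (token-level F1 as a cheap similarity proxy).
    p = set(pred.get("sub_goal", "").lower().split())
    g = set(gold.get("sub_goal", "").lower().split())
    if p and g and (p & g):
        prec, rec = len(p & g) / len(p), len(p & g) / len(g)
        reward += 0.1 * (2 * prec * rec / (prec + rec))

    return reward
```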
- 🔥 2025/05/15: Our paper "OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use" is accepted by ACL 2025.
- 🔥 2025/4/19: Our paper "InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners" released.
- 🔥 2025/1/9: Our paper "InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection" released.
- 🔥 2024/12/12: Our paper "OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use" released.
- 2024/4/2: Our paper "InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks" is accepted by ICML 2024.

- 2025/05/21: Added AndroidControl evaluation methods and uploaded the corresponding test data.
- 2025/04/20: Model weights have been uploaded to Hugging Face.
- 2025/04/19: Our preprint has been published on arXiv.
On the cross-platform ScreenSpot benchmark, InfiGUI-R1-3B achieves an average accuracy of 87.5%. It leads across text and icon localization tasks on Mobile, Desktop, and Web platforms, reaching the state-of-the-art level for models of similar parameter size (as of 2025/04/19):
| Model | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg. |
|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | |
| GPT-4o | 30.5 | 23.2 | 20.6 | 19.4 | 11.1 | 7.8 | 18.8 |
| Claude Computer Use | - | - | - | - | - | - | 83.0 |
| Gemini 2.0 (Project Mariner) | - | - | - | - | - | - | 84.0 |
| **General Open-source Models** | | | | | | | |
| Qwen2-VL-7B | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.9 |
| Qwen2.5-VL-3B | - | - | - | - | - | - | 55.5 |
| Qwen2.5-VL-7B | - | - | - | - | - | - | 84.7 |
| **GUI-specific Models** | | | | | | | |
| CogAgent | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
| SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| UGround-7B | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 82.5 |
| ShowUI-2B | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Aguvis-7B | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 84.4 |
| UI-R1-3B | - | - | 90.2 | 59.3 | 85.2 | 73.3 | - |
| GUI-R1-3B | - | - | 93.8 | 64.8 | 89.6 | 72.1 | - |
| GUI-R1-7B | - | - | 91.8 | 73.6 | 91.3 | 75.7 | - |
| UI-TARS-2B | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | 82.3 |
| **Ours** | | | | | | | |
| InfiGUI-R1-3B | 97.1 | 81.2 | 94.3 | 77.1 | 91.7 | 77.6 | 87.5 |
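For reference, grounding accuracy on ScreenSpot-style benchmarks is conventionally the fraction of instructions whose predicted click point lands inside the ground-truth element's bounding box. The snippet below is a minimal sketch of that check under assumed input formats; it is not this repository's evaluation code.

```python
# Minimal sketch of point-in-box grounding accuracy as commonly used for
# ScreenSpot-style benchmarks; not this repository's evaluation code.

def grounding_accuracy(preds, gts):
    """preds: list of (x, y) click points; gts: list of (x1, y1, x2, y2) boxes."""
    hits = 0
    for (x, y), (x1, y1, x2, y2) in zip(preds, gts):
        if x1 <= x <= x2 and y1 <= y <= y2:
            hits += 1
    return hits / len(gts) if gts else 0.0

# Example: one hit out of two predictions -> 0.5
print(grounding_accuracy([(10, 20), (300, 400)], [(0, 0, 50, 50), (0, 0, 50, 50)]))
```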
On the more challenging ScreenSpot-Pro benchmark, which focuses on complex, high-resolution desktop applications, InfiGUI-R1-3B achieves an average accuracy of 35.7%. This matches strong, larger 7B models such as UI-TARS-7B and reaches the state-of-the-art level for models of similar parameter size (as of 2025/04/19):
| Model | Avg. Text | Avg. Icon | Avg. | CAD Text | CAD Icon | CAD Avg. | Dev. Text | Dev. Icon | Dev. Avg. | Creative Text | Creative Icon | Creative Avg. | Sci. Text | Sci. Icon | Sci. Avg. | Office Text | Office Icon | Office Avg. | OS Text | OS Icon | OS Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | | | | | | | | | | | | | |
| GPT-4o | 1.3 | 0.0 | 0.8 | 2.0 | 0.0 | 1.5 | 1.3 | 0.0 | 0.7 | 1.0 | 0.0 | 0.6 | 2.1 | 0.0 | 1.2 | 1.1 | 0.0 | 0.9 | 0.0 | 0.0 | 0.0 |
| Claude Computer Use | 23.4 | 7.1 | 17.1 | 14.5 | 3.7 | 11.9 | 22.0 | 3.9 | 12.6 | 25.9 | 3.4 | 16.8 | 33.9 | 15.8 | 25.8 | 30.1 | 16.3 | 26.9 | 11.0 | 4.5 | 8.1 |
| **General Open-source Models** | | | | | | | | | | | | | | | | | | | | | |
| Qwen2-VL-7B | 2.5 | 0.2 | 1.6 | 0.5 | 0.0 | 0.4 | 2.6 | 0.0 | 1.3 | 1.5 | 0.0 | 0.9 | 6.3 | 0.0 | 3.5 | 3.4 | 1.9 | 3.0 | 0.9 | 0.0 | 0.5 |
| Qwen2.5-VL-3B | - | - | 23.9 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Qwen2.5-VL-7B | - | - | 29.0 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Kimi-VL | - | - | 34.5 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| **GUI-specific Models** | | | | | | | | | | | | | | | | | | | | | |
| SeeClick | 1.8 | 0.0 | 1.1 | 2.5 | 0.0 | 1.9 | 0.6 | 0.0 | 0.3 | 1.0 | 0.0 | 0.6 | 3.5 | 0.0 | 2.0 | 1.1 | 0.0 | 0.9 | 2.8 | 0.0 | 1.5 |
| CogAgent-18B | 12.0 | 0.8 | 7.7 | 7.1 | 3.1 | 6.1 | 14.9 | 0.7 | 8.0 | 9.6 | 0.0 | 5.6 | 22.2 | 1.8 | 13.4 | 13.0 | 0.0 | 10.0 | 5.6 | 0.0 | 3.1 |
| Aria-UI | 17.1 | 2.0 | 11.3 | 7.6 | 1.6 | 6.1 | 16.2 | 0.0 | 8.4 | 23.7 | 2.1 | 14.7 | 27.1 | 6.4 | 18.1 | 20.3 | 1.9 | 16.1 | 4.7 | 0.0 | 2.6 |
| OS-Atlas-4B | 5.0 | 1.7 | 3.7 | 2.0 | 0.0 | 1.5 | 7.1 | 0.0 | 3.7 | 3.0 | 1.4 | 2.3 | 9.0 | 5.5 | 7.5 | 5.1 | 3.8 | 4.8 | 5.6 | 0.0 | 3.1 |
| OS-Atlas-7B | 28.1 | 4.0 | 18.9 | 12.2 | 4.7 | 10.3 | 33.1 | 1.4 | 17.7 | 28.8 | 2.8 | 17.9 | 37.5 | 7.3 | 24.4 | 33.9 | 5.7 | 27.4 | 27.1 | 4.5 | 16.8 |
| ShowUI-2B | 10.8 | 2.6 | 7.7 | 2.5 | 0.0 | 1.9 | 16.9 | 1.4 | 9.4 | 9.1 | 0.0 | 5.3 | 13.2 | 7.3 | 10.6 | 15.3 | 7.5 | 13.5 | 10.3 | 2.2 | 6.6 |
| UGround-7B | 25.0 | 2.8 | 16.5 | 14.2 | 1.6 | 11.1 | 26.6 | 2.1 | 14.7 | 27.3 | 2.8 | 17.0 | 31.9 | 2.7 | 19.3 | 31.6 | 11.3 | 27.0 | 17.8 | 0.0 | 9.7 |
| UGround-V1-7B | - | - | 31.1 | - | - | 13.5 | - | - | 35.5 | - | - | 27.8 | - | - | 38.8 | - | - | 48.8 | - | - | 26.1 |
| UI-R1-3B | - | - | 17.8 | 11.2 | 6.3 | - | 22.7 | 4.1 | - | 27.3 | 3.5 | - | 42.4 | 11.8 | - | 32.2 | 11.3 | - | 13.1 | 4.5 | - |
| GUI-R1-3B | - | - | - | 26.4 | 7.8 | - | 33.8 | 4.8 | - | 40.9 | 5.6 | - | 61.8 | 17.3 | - | 53.6 | 17.0 | - | 28.1 | 5.6 | - |
| GUI-R1-7B | - | - | - | 23.9 | 6.3 | - | 49.4 | 4.8 | - | 38.9 | 8.4 | - | 55.6 | 11.8 | - | 58.7 | 26.4 | - | 42.1 | 16.9 | - |
| UI-TARS-2B | 39.6 | 8.4 | 27.7 | 17.8 | 4.7 | 14.6 | 47.4 | 4.1 | 26.4 | 42.9 | 6.3 | 27.6 | 56.9 | 17.3 | 39.8 | 50.3 | 17.0 | 42.6 | 21.5 | 5.6 | 14.3 |
| UI-TARS-7B | 47.8 | 16.2 | 35.7 | 20.8 | 9.4 | 18.0 | 58.4 | 12.4 | 36.1 | 50.0 | 9.1 | 32.8 | 63.9 | 31.8 | 50.0 | 63.3 | 20.8 | 53.5 | 30.8 | 16.9 | 24.5 |
| **Ours** | | | | | | | | | | | | | | | | | | | | | |
| InfiGUI-R1-3B | 49.1 | 14.1 | 35.7 | 33.0 | 14.1 | 28.4 | 51.3 | 12.4 | 32.4 | 44.9 | 7.0 | 29.0 | 58.3 | 20.0 | 41.7 | 65.5 | 28.3 | 57.0 | 43.9 | 12.4 | 29.6 |
On the AndroidControl benchmark, which covers diverse Android trajectory tasks at Low and High difficulty levels, InfiGUI-R1-3B achieves success rates of 92.1% (Low) and 71.1% (High) on the test split, reaching the state-of-the-art level for models of similar parameter size (as of 2025/04/19):
| Model | Low Type | Low Grounding | Low SR | High Type | High Grounding | High SR |
|---|---|---|---|---|---|---|
| GPT-4o | 74.3 | 0.0 | 19.4 | 66.3 | 0.0 | 20.8 |
| Aria-UI | - | 87.7 | 67.3 | - | 43.2 | 10.2 |
| OS-Atlas-4B | 91.9 | 83.8 | 80.6 | 84.7 | 73.8 | 67.5 |
| Aguvis-7B | - | - | 80.5 | - | - | 61.5 |
| Aguvis-72B | - | - | 84.4 | - | - | 66.4 |
| UI-R1 | 94.3 | 82.6 | 88.5 | - | - | - |
| GUI-R1-3B | - | - | - | 58.0 | 56.2 | 46.6 |
| GUI-R1-7B | - | - | - | 71.6 | 65.6 | 51.7 |
| UI-TARS-2B | 98.1 | 87.3 | 89.3 | 81.2 | 78.4 | 68.9 |
| **Ours** | | | | | | |
| InfiGUI-R1-3B | 96.0 | 93.2 | 92.1 | 82.7 | 74.4 | 71.1 |
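The Type, Grounding, and SR columns follow the usual step-level convention for AndroidControl: action-type accuracy, click-location accuracy against the target element, and full step success (action type plus arguments correct). The sketch below illustrates that convention with assumed field names only; the repository's `android_control.py` script is the authoritative implementation.

```python
# Illustrative sketch of step-level AndroidControl-style metrics under assumed
# field names; see eval/android_control/android_control.py for the real logic.

def step_metrics(preds, gts):
    """Each item: {"type": str, "point": (x, y) or None,
    "bbox": (x1, y1, x2, y2) or None, "text": str or None}."""
    type_ok = ground_ok = step_ok = 0
    grounded_total = 0
    for p, g in zip(preds, gts):
        t = p["type"] == g["type"]
        type_ok += t

        grounded = True
        if g.get("bbox") is not None:          # grounding scored only on click-like steps
            grounded_total += 1
            x, y = p.get("point") or (-1, -1)
            x1, y1, x2, y2 = g["bbox"]
            grounded = x1 <= x <= x2 and y1 <= y <= y2
            ground_ok += grounded

        text_ok = (g.get("text") is None) or (p.get("text") == g["text"])
        step_ok += t and grounded and text_ok  # step success: type + arguments correct

    n = max(len(gts), 1)
    return {
        "Type": type_ok / n,
        "Grounding": ground_ok / max(grounded_total, 1),
        "SR": step_ok / n,
    }
```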
To reproduce the results on AndroidControl:
- Install the `vllm` library: `pip install vllm`
- Navigate to the evaluation directory (`eval/android_control`): `cd eval/android_control`
- Download the processed test set from Hugging Face (`Reallm-Labs/android_control_test`): `huggingface-cli download --repo-type dataset --resume-download Reallm-Labs/android_control_test --local-dir ./`
- Extract the downloaded data: `tar -xzf android_control_test.tar.gz`
- Run the evaluation scripts for the high- and low-difficulty tasks:
  - `python android_control.py --model_path Reallm-Labs/InfiGUI-R1-3B --eval_type high --thinking`
  - `python android_control.py --model_path Reallm-Labs/InfiGUI-R1-3B --eval_type low --thinking`
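For a quick sanity check outside the evaluation scripts, the sketch below runs single-screenshot inference with the released checkpoint. It assumes the checkpoint follows the standard Qwen2.5-VL interface of its base model; the screenshot path and instruction are placeholders, and it additionally requires `pip install transformers qwen-vl-utils`.

```python
# Hypothetical single-screenshot inference sketch; assumes InfiGUI-R1-3B uses
# the standard Qwen2.5-VL interface of its base model. The image path and
# instruction below are placeholders.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Reallm-Labs/InfiGUI-R1-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "screenshot.png"},          # placeholder screenshot
        {"type": "text", "text": "Click the settings icon."},  # placeholder instruction
    ],
}]

# Build the chat prompt and pack the image the way Qwen2.5-VL expects.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```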
If you find this work useful, please consider citing the following papers:
@article{liu2025infigui,
title={InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners},
author={Liu, Yuhang and Li, Pengxiang and Xie, Congkai and Hu, Xavier and Han, Xiaotian and Zhang, Shengyu and Yang, Hongxia and Wu, Fei},
journal={arXiv preprint arXiv:2504.14239},
year={2025}
}
@article{liu2025infiguiagent,
title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
journal={arXiv preprint arXiv:2501.04575},
year={2025}
}
We would like to express our gratitude to the following open-source projects: EasyR1, VERL, LLaMA-Factory, and Qwen2.5-VL.