# InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

[arXiv Paper](https://arxiv.org/abs/2504.14239) | [Hugging Face Paper](https://huggingface.co/papers/2504.14239) | [Hugging Face Model](https://huggingface.co/Reallm-Labs/InfiGUI-R1-3B)


This is the official repository for the paper InfiGUI-R1.

## 🌟 Overview

In this work, we present InfiGUI-R1, a multimodal large language model (MLLM)-based GUI agent trained primarily with reinforcement learning to strengthen planning and error recovery for GUI tasks. The agent is trained in a two-stage framework: first, we inject spatial reasoning capabilities by distilling reasoning trajectories from teacher models; second, we enhance the agent's planning and error recovery skills with reinforcement learning, using techniques such as rewarding accurate sub-goal generation and training on constructed error-recovery scenarios.

*Figure: Method overview of our two-stage training framework.*

*Figure: Performance results on the ScreenSpot-Pro benchmark.*
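To make the reinforcement learning stage concrete, the sketch below shows a toy rule-based reward combining action correctness, sub-goal accuracy, and output format. Everything here is illustrative: the weights, the exact-match sub-goal check, and the `<think>` format convention are our assumptions, not the paper's actual reward design.

```python
# Illustrative sketch only: a rule-based RL reward in the spirit of the second
# training stage described above. The weights, exact-match sub-goal scoring,
# and <think> format check are assumptions, not the paper's actual design.
import re


def format_reward(response: str) -> float:
    """1.0 if the response puts its reasoning in <think>...</think> before acting."""
    return 1.0 if re.match(r"(?s)^<think>.+?</think>", response.strip()) else 0.0


def subgoal_reward(predicted: str, reference: str) -> float:
    """Hypothetical exact-match check on the generated sub-goal."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def total_reward(response: str, predicted_subgoal: str,
                 reference_subgoal: str, action_correct: bool) -> float:
    """Weighted mix of action correctness, sub-goal accuracy, and format."""
    return (0.6 * float(action_correct)
            + 0.3 * subgoal_reward(predicted_subgoal, reference_subgoal)
            + 0.1 * format_reward(response))
```

A GRPO/PPO-style trainer (e.g., EasyR1 or VERL, acknowledged below) would then maximize such a scalar reward over sampled trajectories.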

## 🔥 News

## 🚀 Updates

## 📊 Results

### ScreenSpot Results

On the cross-platform ScreenSpot benchmark, InfiGUI-R1-3B achieves an average accuracy of 87.5%, leading on both text and icon localization across the Mobile, Desktop, and Web platforms and reaching state-of-the-art performance among models of similar parameter size (as of 2025/04/19):

| Model | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | | |
| GPT-4o | 30.5 | 23.2 | 20.6 | 19.4 | 11.1 | 7.8 | 18.8 |
| Claude Computer Use | - | - | - | - | - | - | 83.0 |
| Gemini 2.0 (Project Mariner) | - | - | - | - | - | - | 84.0 |
| **General Open-source Models** | | | | | | | |
| Qwen2-VL-7B | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.9 |
| Qwen2.5-VL-3B | - | - | - | - | - | - | 55.5 |
| Qwen2.5-VL-7B | - | - | - | - | - | - | 84.7 |
| **GUI-specific Models** | | | | | | | |
| CogAgent | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
| SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| UGround-7B | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 82.5 |
| ShowUI-2B | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Aguvis-7B | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 84.4 |
| UI-R1-3B | - | - | 90.2 | 59.3 | 85.2 | 73.3 | - |
| GUI-R1-3B | - | - | 93.8 | 64.8 | 89.6 | 72.1 | - |
| GUI-R1-7B | - | - | 91.8 | 73.6 | 91.3 | 75.7 | - |
| UI-TARS-2B | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | 82.3 |
| **Ours** | | | | | | | |
| **InfiGUI-R1-3B** | **97.1** | **81.2** | **94.3** | **77.1** | **91.7** | **77.6** | **87.5** |

### ScreenSpot-Pro Results

On the more challenging ScreenSpot-Pro benchmark, which targets complex, high-resolution desktop applications, InfiGUI-R1-3B achieves an average accuracy of 35.7%. This matches strong 7B models and reaches state-of-the-art performance among models of similar parameter size (as of 2025/04/19).

| Model | Avg Text | Avg Icon | Avg | CAD Text | CAD Icon | CAD Avg | Dev Text | Dev Icon | Dev Avg | Creative Text | Creative Icon | Creative Avg | Sci Text | Sci Icon | Sci Avg | Office Text | Office Icon | Office Avg | OS Text | OS Icon | OS Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | | | | | | | | | | | | | | | | |
| GPT-4o | 1.3 | 0.0 | 0.8 | 2.0 | 0.0 | 1.5 | 1.3 | 0.0 | 0.7 | 1.0 | 0.0 | 0.6 | 2.1 | 0.0 | 1.2 | 1.1 | 0.0 | 0.9 | 0.0 | 0.0 | 0.0 |
| Claude Computer Use | 23.4 | 7.1 | 17.1 | 14.5 | 3.7 | 11.9 | 22.0 | 3.9 | 12.6 | 25.9 | 3.4 | 16.8 | 33.9 | 15.8 | 25.8 | 30.1 | 16.3 | 26.9 | 11.0 | 4.5 | 8.1 |
| **General Open-source Models** | | | | | | | | | | | | | | | | | | | | | |
| Qwen2-VL-7B | 2.5 | 0.2 | 1.6 | 0.5 | 0.0 | 0.4 | 2.6 | 0.0 | 1.3 | 1.5 | 0.0 | 0.9 | 6.3 | 0.0 | 3.5 | 3.4 | 1.9 | 3.0 | 0.9 | 0.0 | 0.5 |
| Qwen2.5-VL-3B | - | - | 23.9 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Qwen2.5-VL-7B | - | - | 29.0 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Kimi-VL | - | - | 34.5 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| **GUI-specific Models** | | | | | | | | | | | | | | | | | | | | | |
| SeeClick | 1.8 | 0.0 | 1.1 | 2.5 | 0.0 | 1.9 | 0.6 | 0.0 | 0.3 | 1.0 | 0.0 | 0.6 | 3.5 | 0.0 | 2.0 | 1.1 | 0.0 | 0.9 | 2.8 | 0.0 | 1.5 |
| CogAgent-18B | 12.0 | 0.8 | 7.7 | 7.1 | 3.1 | 6.1 | 14.9 | 0.7 | 8.0 | 9.6 | 0.0 | 5.6 | 22.2 | 1.8 | 13.4 | 13.0 | 0.0 | 10.0 | 5.6 | 0.0 | 3.1 |
| Aria-UI | 17.1 | 2.0 | 11.3 | 7.6 | 1.6 | 6.1 | 16.2 | 0.0 | 8.4 | 23.7 | 2.1 | 14.7 | 27.1 | 6.4 | 18.1 | 20.3 | 1.9 | 16.1 | 4.7 | 0.0 | 2.6 |
| OS-Atlas-4B | 5.0 | 1.7 | 3.7 | 2.0 | 0.0 | 1.5 | 7.1 | 0.0 | 3.7 | 3.0 | 1.4 | 2.3 | 9.0 | 5.5 | 7.5 | 5.1 | 3.8 | 4.8 | 5.6 | 0.0 | 3.1 |
| OS-Atlas-7B | 28.1 | 4.0 | 18.9 | 12.2 | 4.7 | 10.3 | 33.1 | 1.4 | 17.7 | 28.8 | 2.8 | 17.9 | 37.5 | 7.3 | 24.4 | 33.9 | 5.7 | 27.4 | 27.1 | 4.5 | 16.8 |
| ShowUI-2B | 10.8 | 2.6 | 7.7 | 2.5 | 0.0 | 1.9 | 16.9 | 1.4 | 9.4 | 9.1 | 0.0 | 5.3 | 13.2 | 7.3 | 10.6 | 15.3 | 7.5 | 13.5 | 10.3 | 2.2 | 6.6 |
| UGround-7B | 25.0 | 2.8 | 16.5 | 14.2 | 1.6 | 11.1 | 26.6 | 2.1 | 14.7 | 27.3 | 2.8 | 17.0 | 31.9 | 2.7 | 19.3 | 31.6 | 11.3 | 27.0 | 17.8 | 0.0 | 9.7 |
| UGround-V1-7B | - | - | 31.1 | - | - | 13.5 | - | - | 35.5 | - | - | 27.8 | - | - | 38.8 | - | - | 48.8 | - | - | 26.1 |
| UI-R1-3B | - | - | 17.8 | 11.2 | 6.3 | - | 22.7 | 4.1 | - | 27.3 | 3.5 | - | 42.4 | 11.8 | - | 32.2 | 11.3 | - | 13.1 | 4.5 | - |
| GUI-R1-3B | - | - | - | 26.4 | 7.8 | - | 33.8 | 4.8 | - | 40.9 | 5.6 | - | 61.8 | 17.3 | - | 53.6 | 17.0 | - | 28.1 | 5.6 | - |
| GUI-R1-7B | - | - | - | 23.9 | 6.3 | - | 49.4 | 4.8 | - | 38.9 | 8.4 | - | 55.6 | 11.8 | - | 58.7 | 26.4 | - | 42.1 | 16.9 | - |
| UI-TARS-2B | 39.6 | 8.4 | 27.7 | 17.8 | 4.7 | 14.6 | 47.4 | 4.1 | 26.4 | 42.9 | 6.3 | 27.6 | 56.9 | 17.3 | 39.8 | 50.3 | 17.0 | 42.6 | 21.5 | 5.6 | 14.3 |
| UI-TARS-7B | 47.8 | 16.2 | 35.7 | 20.8 | 9.4 | 18.0 | 58.4 | 12.4 | 36.1 | 50.0 | 9.1 | 32.8 | 63.9 | 31.8 | 50.0 | 63.3 | 20.8 | 53.5 | 30.8 | 16.9 | 24.5 |
| **Ours** | | | | | | | | | | | | | | | | | | | | | |
| **InfiGUI-R1-3B** | 49.1 | 14.1 | 35.7 | 33.0 | 14.1 | 28.4 | 51.3 | 12.4 | 32.4 | 44.9 | 7.0 | 29.0 | 58.3 | 20.0 | 41.7 | 65.5 | 28.3 | 57.0 | 43.9 | 12.4 | 29.6 |

### AndroidControl Results

On AndroidControl, which comprises diverse Android trajectory tasks at Low and High difficulty levels, InfiGUI-R1-3B achieves success rates of 92.1% and 71.1% on the respective test splits, reaching state-of-the-art performance among models of similar parameter size (as of 2025/04/19).

| Model | Low: Type | Low: Grounding | Low: SR | High: Type | High: Grounding | High: SR |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | 74.3 | 0.0 | 19.4 | 66.3 | 0.0 | 20.8 |
| Aria-UI | – | 87.7 | 67.3 | – | 43.2 | 10.2 |
| OS-Atlas-4B | 91.9 | 83.8 | 80.6 | 84.7 | 73.8 | 67.5 |
| Aguvis-7B | – | – | 80.5 | – | – | 61.5 |
| Aguvis-72B | – | – | 84.4 | – | – | 66.4 |
| UI-R1 | 94.3 | 82.6 | 88.5 | - | - | - |
| GUI-R1-3B | - | - | - | 58.0 | 56.2 | 46.6 |
| GUI-R1-7B | - | - | - | 71.6 | 65.6 | 51.7 |
| UI-TARS-2B | 98.1 | 87.3 | 89.3 | 81.2 | 78.4 | 68.9 |
| **Ours** | | | | | | |
| **InfiGUI-R1-3B** | **96.0** | **93.2** | **92.1** | **82.7** | **74.4** | **71.1** |

## 🧪 Evaluation

### AndroidControl

To reproduce the results on AndroidControl, follow the steps below (a minimal standalone inference sketch follows the list):

1. Install the vllm library:

        pip install vllm

2. Navigate to the evaluation directory (`eval/android_control`):

        cd eval/android_control

3. Download the processed test set from Hugging Face (`Reallm-Labs/android_control_test`):

        huggingface-cli download --repo-type dataset --resume-download Reallm-Labs/android_control_test --local-dir ./

4. Extract the downloaded data:

        tar -xzf android_control_test.tar.gz

5. Run the evaluation scripts for the high- and low-difficulty tasks:

        python android_control.py --model_path Reallm-Labs/InfiGUI-R1-3B --eval_type high --thinking
        python android_control.py --model_path Reallm-Labs/InfiGUI-R1-3B --eval_type low --thinking
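Beyond the benchmark harness, you can sanity-check the model on a single screenshot directly with vLLM. The sketch below is a minimal illustration, not part of the official evaluation code: it assumes the Qwen2.5-VL chat template (the model's base), and `screenshot.png` together with the instruction text are placeholders.

```python
# Minimal sanity-check sketch (not from this repo): a single-screenshot
# grounding query against InfiGUI-R1-3B served by vLLM. Assumes the
# Qwen2.5-VL chat template; the image path and instruction are placeholders.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="Reallm-Labs/InfiGUI-R1-3B", limit_mm_per_prompt={"image": 1})
sampling = SamplingParams(temperature=0.0, max_tokens=512)

image = Image.open("screenshot.png")  # any GUI screenshot
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>"
    "Locate the 'Settings' icon and output its coordinates.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Pass the image alongside the prompt via vLLM's multi-modal input dict.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling,
)
print(outputs[0].outputs[0].text)
```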

## 📚 Citation Information

If you find this work useful, please consider citing the following papers:

    @article{liu2025infigui,
      title={InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners},
      author={Liu, Yuhang and Li, Pengxiang and Xie, Congkai and Hu, Xavier and Han, Xiaotian and Zhang, Shengyu and Yang, Hongxia and Wu, Fei},
      journal={arXiv preprint arXiv:2504.14239},
      year={2025}
    }

    @article{liu2025infiguiagent,
      title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
      author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
      journal={arXiv preprint arXiv:2501.04575},
      year={2025}
    }

๐Ÿ™ Acknowledgements

We would like to express our gratitude to the following open-source projects: EasyR1, VERL, LLaMA-Factory, and Qwen2.5-VL.
