GML-FMGroup/awesome_autonomous_agents

Awesome Autonomous Agents

Paper

  1. Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks. [project] [code] Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji. Preprint'25
  2. Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage. [project] Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li. Preprint'24
  3. UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning. [code] Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, Hongsheng Li. Preprint'25
  4. AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents. [project] [code] Jiabin Tang, Tianyu Fan, Chao Huang. Preprint'25
  5. InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection. [project] [code] Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu. Preprint'25
  6. AppAgentX: Evolving GUI Agents as Proficient Smartphone Users. [project] [code] Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Chi Zhang. Preprint'25
  7. OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning. [project] [code] Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, James Zou. Preprint'25
  8. OS-ATLAS: Foundation Action Model for Generalist GUI Agents. [project] [code] [model] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao. ICLR 2025 Spotlight
  9. UI-TARS: Pioneering Automated GUI Interaction with Native Agents. [code] [model] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi. Preprint'25
  10. GPT-4V(ision) is a Generalist Web Agent, if Grounded. [project] [code] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su. Preprint'24
  11. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. [code] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, Dong Yu. Preprint'24
  12. ScreenAgent: A Vision Language Model-driven Computer Control Agent. [code] Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, Qi Wang. Preprint'24
  13. Dual-View Visual Contextualization for Web Navigation. Jihyung Kil, Chan Hee Song, Boyuan Zheng, Xiang Deng, Yu Su, Wei-Lun Chao. Preprint'24
  14. Training Software Engineering Agents and Verifiers with SWE-Gym. [code] [model] Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang. Preprint'24
  15. OS-Copilot: Towards Generalist Computer Agents with Self-Improvement. [code] Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, Lingpeng Kong. Preprint'24
  16. UFO: A UI-Focused Agent for Windows OS Interaction. [code] Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang. Preprint'24
  17. Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration. [code] Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang. NeurIPS 2024 poster
  18. OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis. [code] [model] Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu. Preprint'24
  19. Multimodal Web Navigation with Instruction-Finetuned Foundation Models. [project] Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, Izzeddin Gur. Preprint'24
  20. Cradle: Empowering Foundation Agents Towards General Computer Control. [project] [code] Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, Börje F. Karlsson, Bo An, Shuicheng Yan, Zongqing Lu. Preprint'24
  21. AutoWebGLM: A Large Language Model-based Web Navigating Agent. [code] Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang. Preprint'24
  22. Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning. [code] Moghis Fereidouni, A.B. Siddique. Preprint'24
  23. Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning. Lucas-Andreï Thil, Mirela Popa, Gerasimos Spanakis. Preprint'24
  24. GUICourse: From General Vision Language Models to Versatile GUI Agents. [code] Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun. Preprint'24
  25. On the Effects of Data Scale on UI Control Agents. Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, Oriana Riva. Preprint'24
  26. AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents. Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li. Preprint'24
  27. Android in the Zoo: Chain-of-Action-Thought for GUI Agents. [code] Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, Duyu Tang. Preprint'24
  28. E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion. Ke Wang, Tianyu Xia, Zhangxuan Gu, Yi Zhao, Shuheng Shen, Changhua Meng, Weiqiang Wang, Ke Xu. Preprint'24
  29. ScreenAI: A Vision-Language Model for UI and Infographics Understanding. [code] Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma. Preprint'24
  30. Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems. [code] Tamer Abuelsaad, Deepak Akkil, Prasenjit Dey, Ashish Jagmohan, Aditya Vempaty, Ravi Kokku. Preprint'24
  31. Tree Search for Language Model Agents. [code] Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov. Preprint'24
  32. Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents. [code] Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov. Preprint'24
  33. Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions. [code] Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao. Preprint'24
  34. OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models. [code] Iat Long Iong, Xiao Liu, Yuxuan Chen, Hanyu Lai, Shuntian Yao, Pengbo Shen, Hao Yu, Yuxiao Dong, Jie Tang. ACL'24
  35. WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration. [project] [code] Yao Zhang, Zijian Ma, Yunpu Ma, Zhen Han, Yu Wu, Volker Tresp. Preprint'24
  36. MobileViews: A Large-Scale Mobile GUI Dataset. [project] Longxi Gao, Li Zhang, Shihe Wang, Shangguang Wang, Yuanchun Li, Mengwei Xu. Preprint'24
  37. MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding. [code] Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang. Preprint'24
  38. Steward: Natural Language Web Automation. [code] Brian Tang, Kang G. Shin. Preprint'24
  39. xLAM: A Family of Large Action Models to Empower AI Agent Systems. [code] Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Awalgaonkar, Rithesh Murthy, Eric Hu, Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong. Preprint'24
  40. AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents. [code] Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik Chaudhari, George Karypis, Huzefa Rangwala. Preprint'24
  41. Beyond Browsing: API-Based Web Agents. [code] Yueqi Song, Frank Xu, Shuyan Zhou, Graham Neubig. Preprint'24
  42. NNetscape navigator: complex demonstrations for web agents without a demonstrator. [code] Shikhar Murty, Dzmitry Bahdanau, Christopher D. Manning. Preprint'24
  43. OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization. [code] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Hongming Zhang, Tianqing Fang, Zhenzhong Lan, Dong Yu. Preprint'24
  44. Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation. [code] Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, Jinyoung Yeo. Preprint'24
  45. Agent S: An Open Agentic Framework that Uses Computers Like a Human. [code] Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, Xin Eric Wang. Preprint'24
  46. AssistEditor: Multi-Agent Collaboration for GUI Workflow Automation in Video Creation. Difei Gao, Siyuan Hu, Zechen Bai, Qinghong Lin, Mike Zheng Shou. MM'24
  47. AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations. Gaurav Verma, Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Tucker Balch, Manuela Veloso. Preprint'24
  48. Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents. [code] Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su. Preprint'24
  49. ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data. [code] Junhong Shen, Atishay Jain, Zedian Xiao, Ishan Amlekar, Mouad Hadji, Aaron Podolny, Ameet Talwalkar. Preprint'24
  50. WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning. [code] Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Xinyue Yang, Jiadai Sun, Yu Yang, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, Yuxiao Dong. Preprint'24
  51. ShowUI: One Vision-Language-Action Model for GUI Visual Agent. [code] Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou. Preprint'24
  52. The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use. [project] [code] Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou. Preprint'24
  53. AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials. [project] [code] Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu. Preprint'24
  54. Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction. [project] [code] Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong. Preprint'24
  55. AutoGLM: Autonomous Foundation Agents for GUIs. [project] Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, Jie Tang. Preprint'24
  56. AGILE: A Novel Reinforcement Learning Framework of LLM Agents. [code] Peiyuan Feng, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, Hang Li. Preprint'24
  57. DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents. [code] Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao. Preprint'24
  58. Latent State Estimation Helps UI Agents to Reason. William E Bishop, Alice Li, Christopher Rawles, Oriana Riva. Preprint'24
  59. Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms. Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, Zhe Gan. Preprint'24
  60. From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces. [code] Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, Kristina Toutanova. Preprint'23
  61. DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning. [code] Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar. Preprint'23
  62. CogAgent: A Visual Language Model for GUI Agents. [code] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, Jie Tang. Preprint'23
  63. LASER: LLM Agent with State-Space Exploration for Web Navigation. [code](https://github.com/mayer123/laser) Kaixin Ma, Hongming Zhang, Hongwei Wang, Xiaoman Pan, Dong Yu. Preprint'23
  64. OpenAgents: An Open Platform for Language Agents in the Wild. [code] Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, Tao Yu. Preprint'23
  65. ProAgent: From Robotic Process Automation to Agentic Process Automation. [code] Yining Ye, Xin Cong, Shizuo Tian, Jiannan Cao, Hao Wang, Yujia Qin, Yaxi Lu, Heyang Yu, Huadong Wang, Yankai Lin, Zhiyuan Liu, Maosong Sun. Preprint'23
  66. ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation. [project] [code] Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou. Preprint'23
  67. WebVLN: Vision-and-Language Navigation on Websites. [code] Qi Chen, Dileepa Pitawela, Chongyang Zhao, Gengze Zhou, Hsiang-Ting Chen, Qi Wu. Preprint'23
  68. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. [project] [code] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, Bernard Ghanem. Preprint'23
  69. Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents. [project] [code] Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang. Preprint'25
  70. OSCAR: Operating system control via state-aware reasoning and re-planning. Xiaoqiang Wang, Bang Liu. ICLR 2025 Poster
  71. STEVE: A Step Verification Pipeline for Computer-use Agent Training. [code] Fanbin Lu, Zhisheng Zhong, Ziqin Wei, Shu Liu, Chi-Wing Fu, Jiaya Jia. Preprint'25
  72. UFO2: The Desktop AgentOS. [project] [code] Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, Liqun Li, Yu Kang, Zhao Jiang, Suzhen Zheng, Rujia Wang, Jiaxu Qian, Minghua Ma, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang. Preprint'25
  73. Breaking the Data Barrier -- Building GUI Agents Through Task Generalization. [code] Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He. Preprint'25
  74. LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark. [project] [code] Guangyi Liu, Pengxiang Zhao, Liang Liu, Zhiming Chen, Yuxiang Chai, Shuai Ren, Hao Wang, Shibo He, Wenchao Meng. Preprint'25
  75. Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems. Fei Tang, Yongliang Shen, Hang Zhang, Siqi Chen, Guiyang Hou, Wenqi Zhang, Wenqiao Zhang, Kaitao Song, Weiming Lu, Yueting Zhuang. Preprint'25
  76. TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials. [project] [code] Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, Qing Li. Preprint'25
  77. ScaleTrack: Scaling and back-tracking Automated GUI Agents. Jing Huang, Zhixiong Zeng, Wenkang Han, Yufeng Zhong, Liming Zheng, Shuai Fu, Jingyuan Chen, Lin Ma. Preprint'25
  78. InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners. [code] Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, Fei Wu. Preprint'25
  79. GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents. [code] Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, Jun Xu. Preprint'25
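  80. GUI-G2: Gaussian Reward Modeling for GUI Grounding. [project] [code] Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang. Preprint'25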
  81. BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism. Qinzhuo Wu, Pengzhi Gao, Wei Liu, Jian Luan. Preprint'25
  82. InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction. [code] Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding. Preprint'25
  83. UI-Evol: Automatic Knowledge Evolving for Computer Use Agents. Ziyun Zhang, Xinyi Liu, Xiaoyi Zhang, Jun Wang, Gang Chen, Yan Lu. Preprint'25
  84. Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation. [code] Yuyang Wanyan, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Jiabo Ye, Yutong Kou, Ming Yan, Fei Huang, Xiaoshan Yang, Weiming Dong, Changsheng Xu. Preprint'25
  85. ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay. [code] Fanbin Lu, Zhisheng Zhong, Shu Liu, Chi-Wing Fu, Jiaya Jia. Preprint'25
  86. ZeroGUI: Automating Online GUI Learning at Zero Human Cost. [code] Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai. Preprint'25
  87. GTA1: GUI Test-time Scaling Agent. [code] Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Caiming Xiong, Junnan Li. Preprint'25
  88. GUI-G2: Gaussian Reward Modeling for GUI Grounding. [project] [code] Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang. Preprint'25

Benchmark

  1. ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use. [code] Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, Tat-Seng Chua. Preprint'25
  2. VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [code] Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue. Preprint'25
  3. SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation. [code] Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao. Preprint'25
  4. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. [project] [code] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu. Preprint'24
  5. Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale. [code] Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui. Preprint'24
  6. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. [code] Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva. Preprint'24
  7. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust. Preprint'24
  8. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. [code] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, Zhiyong Wu. Preprint'24
  9. VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. [project] [code] Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried. Preprint'24
  10. OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web. Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov. Preprint'24
  11. On the Multi-turn Instruction Following for Conversational Web Agents. [code] Yang Deng, Xuan Zhang, Wenxuan Zhang, Yifei Yuan, See-Kiong Ng, Tat-Seng Chua. Preprint'24
  12. Understanding the Weakness of Large Language Model Agents within a Complex Android Environment. [code] Mingzhe Xing, Rongkai Zhang, Hui Xue, Qi Chen, Fan Yang, Zhen Xiao. Preprint'24
  13. AgentStudio: A Toolkit for Building General Virtual Agents. [project] [code] Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan. Preprint'24
  14. Tur[k]ingBench: A Challenge Benchmark for Web Agents. Kevin Xu, Yeganeh Kordi, Tanay Nayak, Adi Asija, Yizhong Wang, Kate Sanders, Adam Byerly, Jingyu Zhang, Benjamin Van Durme, Daniel Khashabi. Preprint'24
  15. ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents. [code] Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov. Preprint'24
  16. VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks. [project] Lawrence Jang, Yinheng Li, Dan Zhao, Charles Ding, Justin Lin, Paul Pu Liang, Rogerio Bonatti, Kazuhito Koishida. Preprint'24
  17. WebOlympus: An Open Platform for Web Agents on Live Websites. Boyuan Zheng, Boyu Gou, Scott Salisbury, Zheng Du, Huan Sun, Yu Su. EMNLP'24
  18. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [code] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, Alexandre Lacoste. Preprint'24
  19. Benchmarking Mobile Device Control Agents across Diverse Configurations. [project] [code] Juyong Lee, Taywon Min, Minyong An, Changyeon Kim, Kimin Lee. Preprint'24
  20. LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Task Automation. [code] Li Zhang, Shihe Wang, Xianqing Jia, Zhihan Zheng, Yunhe Yan, Longxi Gao, Yuanchun Li, Mengwei Xu. Preprint'24
  21. MMInA: Benchmarking Multihop Multimodal Internet Agents. [code] Ziniu Zhang, Shulin Tian, Liangyu Chen, Ziwei Liu. Preprint'24
  22. GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices. [code] Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo. Preprint'24
  23. GUI Action Narrator: Where and When Did That Action Take Place? Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou. Preprint'24
  24. MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents. [project] [code] Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, Shoufa Chen. Preprint'24
  25. Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding. [code] Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang. Preprint'24
  26. VideoGUI: A Benchmark for GUI Automation from Instructional Videos. [code] Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen WU, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou. Preprint'24
  27. WONDERBREAD: A Benchmark for Evaluating Multimodal Foundation Models on Business Process Management Tasks. [code] Michael Wornow, Avanika Narayan, Ben Viggiano, Ishan S. Khare, Tathagat Verma, Tibor Thompson, Miguel Angel Fuentes Hernandez, Sudharsan Sundar, Chloe Trujillo, Krrish Chawla, Rongfei Lu, Justin Shen, Divya Nagaraj, Joshua Martinez, Vardhan Agrawal, Althea Hudson, Nigam H. Shah, Christopher Re. Preprint'24
  28. OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation. [code] Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang. Preprint'24
  29. WebCanvas: Benchmarking Web Agents in Online Environments. [project] Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu. Preprint'24
  30. CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. [code] Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li. Preprint'24
  31. Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents. [code] Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Rui Yan, Shuo Shang. ACL'24
  32. Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [code] Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu. Preprint'24
  33. VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents. [code] Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang. Preprint'24
  34. WebLINX: Real-World Website Navigation with Multi-Turn Dialogue. [project] [code] Xing Han Lù, Zdeněk Kasner, Siva Reddy. Preprint'24
  35. NaviQAte: Functionality-Guided Web Application Navigation. [code] Mobina Shahbandeh, Parsa Alian, Noor Nashid, Ali Mesbah. Preprint'24
  36. AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents. [code] Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong. Preprint'24
  37. WebArena: A Realistic Web Environment for Building Autonomous Agents. [code] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig. Preprint'23
  38. AutoDroid: LLM-powered Task Automation in Android. [code] Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu. Preprint'23
  39. Android in the Wild: A Large-Scale Dataset for Android Device Control. Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, Timothy Lillicrap. Preprint'23
  40. GAIA: a benchmark for General AI Assistants. [code] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom. Preprint'23
  41. Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web. [code] Hiroki Furuta, Yutaka Matsuo, Aleksandra Faust, Izzeddin Gur. Preprint'23
  42. Mind2Web: Towards a Generalist Agent for the Web. [project] [code] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, Yu Su. Preprint'23
  43. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. [project] [code] Shunyu Yao, Howard Chen, John Yang, Karthik Narasimhan. Preprint'23
  44. Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction. [code] Danyang Zhang, Lu Chen, Zihan Zhao, Ruisheng Cao, Kai Yu. Preprint'23
  45. Grounding Open-Domain Instructions to Automate Web Support Tasks. [code] Nancy Xu, Sam Masling, Michael Du, Giovanni Campagna, Larry Heck, James Landay, Monica S Lam. ACL'21
  46. AndroidEnv: A Reinforcement Learning Platform for Android. [code] Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, Doina Precup. Preprint'21
  47. Mapping Natural Language Instructions to Mobile UI Action Sequences. [code] Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge. Preprint'20
  48. Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration. [project] [code] Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, Percy Liang. Preprint'18
  49. WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation. [code] Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, Mike Zheng Shou. Preprint'25
  50. MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents. [code] Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, Wenhai Wang. Preprint'25

Survey

  1. OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use. [project] Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shawn Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu. Github'25
  2. Large Language Model-Brained GUI Agents: A Survey. [code] Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang. Preprint'25
  3. Foundations and Recent Trends in Multimodal Mobile Agents: A Survey. Biao Wu, Yanda Li, Meng Fang, Zirui Song, Zhiwei Zhang, Yunchao Wei, Ling Chen. Preprint'24
  4. GUI Agents with Foundation Models: A Comprehensive Survey. Shuai Wang, Weiwen Liu, Jingxuan Chen, Weinan Gan, Xingshan Zeng, Shuai Yu, Xinlong Hao, Kun Shao, Yasheng Wang, Ruiming Tang. Preprint'24
  5. A Survey on the Memory Mechanism of Large Language Model based Agents. [code] Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, Ji-Rong Wen. Preprint'24
  6. LLM-based Multi-Agent Reinforcement Learning: Current and Future Directions. Chuanneng Sun, Songjun Huang, Dario Pompili. Preprint'24
  7. A Survey on Evaluation of Multimodal Large Language Models. Jiaxing Huang, Jingyi Zhang. Preprint'24
  8. A Survey on Multimodal Benchmarks: In the Era of Large AI Models. [code] Lin Li, Guikun Chen, Hanrong Shi, Jun Xiao, Long Chen. Preprint'24
  9. LLM With Tools: A Survey. Zhuocheng Shen. Preprint'24
  10. Exploring Large Language Model based Intelligent Agents: Definitions, Methods, and Prospects. [code] Yuheng Cheng, Ceyao Zhang, Zhengwen Zhang, Xiangrui Meng, Sirui Hong, Wenhao Li, Zihao Wang, Zekai Wang, Feng Yin, Junhua Zhao, Xiuqiang He. Preprint'24
  11. Large Language Model based Multi-Agents: A Survey of Progress and Challenges. [code] Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang. Preprint'24
  12. Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security. [code] Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu. Preprint'24
  13. Large Multimodal Agents: A Survey. Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, Guanbin Li. Preprint'24
  14. LLM Multi-Agent Systems: Challenges and Open Problems. Shanshan Han, Qifan Zhang, Yuhang Yao, Weizhao Jin, Zhaozhuo Xu, Chaoyang He. Preprint'24
  15. Understanding the planning of LLM agents: A survey. Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, Enhong Chen. Preprint'24
  16. Task Automation Intelligent Agents: A Review. Wali Abdul, Saipunidzam Mahamad, Suziah Sulaiman. Future Internet'23
  17. An In-depth Survey of Large Language Model-based Artificial Intelligence Agents. Pengyu Zhao, Zijian Jin, Ning Cheng. Preprint'23
  18. GUI-Based Software Testing: An Automated Approach Using GPT-4 and Selenium WebDriver. Daniel Zimmermann, Anne Koziolek. 2023 38th IEEE/ACM International Conference on Automated Software Engineering Workshops
  19. The Rise and Potential of Large Language Model Based Agents: A Survey. [code] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Qin Liu, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, Tao Gui. Preprint'23
  20. Agent AI: Surveying the Horizons of Multimodal Interaction. [code] Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, Katsushi Ikeuchi, Hoi Vo, Li Fei-Fei, Jianfeng Gao. Preprint'23
  21. A Survey on Evaluation of Large Language Models. [code] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie. Preprint'23
  22. A Survey on Large Language Model based Autonomous Agents. [code] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen. Preprint'23

Repo

  1. https://github.com/camel-ai/owl
  2. https://github.com/mannaandpoem/OpenManus
  3. https://github.com/HKUDS/AutoAgent
  4. https://github.com/Darwin-lfl/langmanus
  5. https://github.com/browser-use
  6. https://github.com/TheAgenticAI/CortexON

Contributing

This is an active repository and your contributions are always welcome!
