ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

  📑 Paper   |   🤗 Dataset   |   🤖 Model   |   🖥️ Model Demo  

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that can operate GUIs autonomously, showing great potential. However, developing robust CUAs requires extensive in-domain knowledge about software interfaces and operations. Unlike image–text pairs, which are widely available on the Internet, computer-use data, particularly operation trajectories, are scarce and costly to collect. Consequently, progress in this field remains constrained by both data scale and the limited transferability of existing VLMs. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose cross-platform CUAs.


🤖 Video Demo

scalecua_demo.mp4


🎉 News

  • 2025/09/19: ScaleCUA-Data is being uploaded to Hugging Face. Please be patient.
  • 2025/09/19: We have released the models and code of ScaleCUA.

🚀 Key Features

  • ScaleCUA-Data: A large-scale, cross-platform dataset spanning 6 operating systems and 3 GUI-centric task domains.
  • ScaleCUA-Models: Cross-platform, general-purpose agents that excel at GUI-centric task completion in various environments.
  • SFT Codebase: A comprehensive training framework that supports training computer use agents based on Qwen2.5-VL and InternVL.
  • Interactive Playground: A series of realistic, interactive environments across Ubuntu, Android, and Web.
  • Online Evaluation Suite: A set of online benchmarks for evaluating agents' task-completion capabilities across various platforms.

📂 Project Structure

This repository is organized into three main components:

  • evaluation: Contains all the code and benchmarks for the end-to-end evaluation of our agents.
  • playground: Provides interactive environments (Android, Ubuntu, Web) and model implementations for users to experience the agent's capabilities firsthand.
  • agent-sft: Includes the training code, configurations, and instructions needed to reproduce ScaleCUA on the ScaleCUA-Data dataset.

⚙️ Setup

  1. Clone the repository:
    git clone https://github.com/OpenGVLab/ScaleCUA.git
    cd ScaleCUA
  2. Install dependencies:
    pip install -r requirements.txt

🎮 Playground

The Playground lets you experience the ScaleCUA agent's capabilities firsthand across the Ubuntu, Android, and Web platforms. For a complete guide, please see the [Playground].

Follow these two steps to begin:

  1. Deploy the ScaleCUA models with vLLM following our [Model Deployment] guide. We support two modes of operation: a Native Agentic Model, which uses a single model for both UI grounding and planning, and an Agentic Workflow, which uses two separate models for UI planning and grounding (see the client sketch below).

  2. Set up your environment following [Playground Environment]. We provide pre-configured, interactive virtual machines for Ubuntu, Android, and Web to simplify this process.

Now, you can try our agent in the interactive environment!
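
As a quick sanity check after deployment, the snippet below queries the served model through vLLM's OpenAI-compatible API with a screenshot and an instruction. It is a minimal sketch: the endpoint URL, served model name, screenshot path, and instruction text are placeholders, so adjust them to match your [Model Deployment] setup.

    import base64
    from openai import OpenAI

    # Assumed local vLLM endpoint; vLLM ignores the API key by default.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    # Encode the current screenshot so it can be sent as an image_url payload.
    with open("screenshot.png", "rb") as f:
        screenshot_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="ScaleCUA",  # placeholder: use the name the model was served under
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
                {"type": "text",
                 "text": "Open the Settings app and enable dark mode."},
            ],
        }],
        temperature=0.0,
    )

    # Depending on the deployed mode, the reply contains a plan and/or a
    # grounded GUI action (e.g. a click with screen coordinates).
    print(response.choices[0].message.content)

In the Agentic Workflow mode, the same request pattern applies, except that planning and grounding requests go to two different served models.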

📊 Evaluation

We provide a suite of benchmarks for end-to-end agent evaluation in a vision-only setup. ScaleCUA supports deployment with vLLM and evaluation through an OpenAI-compatible API. To run the evaluation benchmarks, please refer to the specific instructions within the [Evaluation] directory.
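
For orientation, the skeleton below shows what one end-to-end, vision-only episode typically looks like: the agent repeatedly receives a screenshot, returns an action, and the environment executes it until the agent terminates or a step budget is reached. The env interface (reset/screenshot/step/evaluate) and the query_agent helper are hypothetical placeholders, not the benchmarks' actual APIs; each harness under [Evaluation] defines its own.

    # Hypothetical skeleton of one vision-only evaluation episode.
    # The environment interface and query_agent are illustrative placeholders;
    # each benchmark harness in evaluation/ defines its own equivalents.
    MAX_STEPS = 15  # assumed per-task step budget

    def run_episode(env, task, query_agent):
        env.reset(task)                              # start the task in a fresh VM / browser
        for _ in range(MAX_STEPS):
            screenshot = env.screenshot()            # vision-only: the agent sees pixels, no DOM or a11y tree
            action = query_agent(screenshot, task)   # OpenAI-compatible call to the served ScaleCUA model
            if action.get("type") == "terminate":    # the agent decides the task is finished
                break
            env.step(action)                         # execute click / type / scroll, etc.
        return env.evaluate(task)                    # benchmark-specific success check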

Our evaluation suite covers desktop, mobile, and web environments:

  • Android: AndroidWorld, AndroidLab
  • Ubuntu: OSWorld
  • macOS: MacOSArena
  • Web: WebArena-Lite-v2 (a refined version of WebArena-Lite suited to vision-based agents)
  • Windows: WindowsAgentArena

🧠 Training

The agent-sft/ directory contains all the code and configuration files needed to train ScaleCUA from scratch using our ScaleCUA-Data. We support training Computer Use Agents with both InternVL and Qwen-VL.

💐 Acknowledgements

Thanks to the following open-source projects:

OSWorld, WindowsAgentArena, WebArena, AndroidWorld, ScienceBoard, AGUVIS, MMBench-GUI, Qwen-VL, InternVL

⚖️ License

This project is licensed under the Apache 2.0 License.

📜 Citation

If you find our work useful, please consider citing our paper:

@article{liu2025scalecua,
  title        = {ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data},
  author       = {Liu, Zhaoyang and Xie, Jingjing and Ding, Zichen and Li, Zehao and Yang, Bowen and Wu, Zhenyu and Wang, Xuehui and Sun, Qiushi and Liu, Shi and Wang, Weiyun and Ye, Shenglong and Li, Qingyun and Dong, Xuan and Yu, Yue and Lu, Chenyu and Mo, YunXiang and Yan, Yao and Tian, Zeyue and Zhang, Xiao and Huang, Yuan and Liu, Yiqian and Su, Weijie and Luo, Gen and Yue, Xiangyu and Qi, Biqing and Chen, Kai and Zhou, Bowen and Qiao, Yu and Chen, Qifeng and Wang, Wenhai},
  journal      = {arXiv preprint arXiv:2509.15221},
  year         = {2025},
  note         = {Preprint},
  url          = {https://github.com/OpenGVLab/ScaleCUA}
}
