This repository contains a high-performance GUI agent built on the AWorld Framework, specifically designed to tackle complex desktop automation tasks within the OSWorld-verified benchmark. Our agent achieves a 58.04% pass@1 score on the osworld-verified leaderboard (max_step=50).
The core logic for our agent's perception and reasoning is adapted from the great work of the AgentS2 project. We have built upon their foundation by introducing a suite of new executable tools that enhance the agent's ability to interact with the OS environment, leading to significant improvements in the stability and robustness of the Computer Use Agent (CUA).
Our agent demonstrates leading performance on the OSWorld-verified benchmark (max_step=50), using openai/o3 as the base model with temperature=1.0.
| Agent | Score (pass@1) | Success/Total | chrome | gimp | libreoffice_calc | libreoffice_impress | libreoffice_writer | multi_apps | os | thunderbird | vlc | vs_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aworldAgent (ours) | 58.04% | 209.55/361 | 22.96/46 | 19/26 | 33/47 | 28.39/47 | 13/23 | 40.41/93 | 16/24 | 11/15 | 9.79/17 | 16/23 |
| Agentic-Lybic-Maestro | 57.1% | 205.47/360 | 27.96/46 | 22/26 | 24/47 | 27.96/47 | 16/23 | 32.71/92 | 16/24 | 11/15 | 10.84/17 | 17/23 |
| CoACT-1 | 56.4% | 203.55/361 | 20.96/46 | 16/26 | 32/47 | 21.96/47 | 17/23 | 39.40/93 | 17/24 | 10/15 | 11.23/17 | 18/23 |
| agent s2.5 w/ o3 | 54.2% | 200.02/369 | 23.96/46 | 20/26 | 26/47 | 25.99/47 | 11/23 | 39.93/101 | 18/24 | 11/15 | 7.14/17 | 17/23 |
Follow these steps to set up the environment and reproduce our results.
-
Set Up OSWorld Environment:
- First, ensure you have a fully functional OSWorld environment. Please follow the official OSWorld setup guide meticulously.
-
Install AWorld Framework:
- Install the specific version of
aworldused in our experiments.
git clone https://github.com/inclusionAI/AWorld.git cd AWorld git checkout osworld_benchmark python setup.py install - Install the specific version of
-
Deploy Agent Code:
- Copy the
aworldAgentfolder and therun_multienv_aworldAgent.pyscript into the root directory of your OSWorld project.
- Copy the
-
Run the Evaluation Script:
- Our results were achieved using
openai/o3for reasoning andbytedance/ui-tars-1.5-7bfor visual grounding, both accessed via OpenRouter. - Activate your conda environment and run the evaluation script. Remember to replace placeholders like
YOUR_OPENROUTER_API_KEYand/path/to/your/vm/Ubuntu.vmxwith your actual credentials and paths.
# Activate your OSWorld conda environment (e.g., osworld_env) conda activate osworld_env # Run the evaluation with the recommended settings python run_multienv_aworldAgent.py \ --headless \ --ground_url YOUR_BASE_URL \ --ground_api_key YOUR_API_KEY \ --ground_model bytedance/ui-tars-1.5-7b \ --ground_provider open_router \ --model_url YOUR_BASE_URL \ --model_api_key YOUR_API_KEY \ --model_temperature 1.0 \ --provider_name aws \ --max_steps 50 \ --model_provider open_router \ --model openai/o3 \ --grounding_width 1920 \ --grounding_height 1080 \ --test_all_meta_path evaluation_examples/test_all.json \ --result_dir ./results \ --observation_type screenshot \ --num_envs 1 \ --region us-east-1 \ --client_password osworld-public-evaluation
- Our results were achieved using
osworld/
├── aworldAgent/ # Core code for our agent
│ ├── agent.py # Main agent logic for reasoning and action generation
│ ├── grounding.py # Grounding module for visual perception of UI elements
│ ├── prompt.py # Contains all prompts used by the agent
│ ├── utils.py # Shared utility functions
│ └── workflow.py # Defines the core execution loop and workflow
├── run_multienv_aworldAgent.py # Main script to run the evaluation
├── evaluation_examples/ # Task definitions for OSWorld
├── desktop_env/ # Environment code for OSWorld
├── requirements.txt # Dependencies for OSWorld
└── ... # Other OSWorld project files
The run_multienv_aworldAgent.py script is configured via command-line arguments. Key parameters are explained below:
| Argument | Type | Description | Example |
|---|---|---|---|
--model |
str | The primary language model for task planning and action generation. | "openai/o3" |
--model_provider |
str | The provider of the LLM service. | "open_router" |
--model_api_key |
str | API key for the LLM service. | "sk-or-v1-..." |
--ground_model |
str | The specific name of the grounding model. | "bytedance/ui-tars-1.5-7b" |
--ground_provider |
str | The provider for the visual grounding model. | "open_router" |
--path_to_vm |
str | The local file path to your VMware virtual machine .vmx file. |
"/path/to/your/vm/Ubuntu.vmx" |
--provider_name |
str | The virtualization provider. | "vmware" |
--client_password |
str | The VNC password for the client VM. | "YOUR_VM_PASSWORD" |
--max_steps |
int | The maximum number of steps the agent can take per task. | 50 |
--num_envs |
int | The number of parallel VM environments to run for evaluation. | 1 |
The evaluation process generates the following outputs:
- Log Files: Stored in the
logs/directory, containing detailed runtime information for debugging. - Results Directory: Located at the path specified by
--result_dir(defaults to./results), with the following structure:results/[action_space]/[observation_type]/[model]/[domain]/[example_id]/traj.jsonl: A complete log of the agent's thought process and action sequence.result.txt: Contains the final score for the task (0.0 for failure, 1.0 for success).recording.mp4: A screen recording of the agent's execution process.
- State-of-the-Art Performance: Achieved the top rank on the OSWorld-verified benchmark.
- Enhanced Agent Stability: By integrating new executable tools, we have significantly improved the agent's robustness and ability to interact with the OS, building upon the AgentS2 foundation.
- AWorld Framework Integration: Leverages the modularity and scalability of the AWorld framework.
- Reproducible: Provides a clear, single-command script to facilitate result reproduction by the community.
This work would not have been possible without building upon the foundations of several incredible open-source projects.
-
AWorld Framework: We thank the developers of the AWorld Framework for providing a powerful and flexible platform for agent development.
-
AgentS2: We extend our sincere gratitude to the creators of the AgentS2 (Agent-S) project. The core agent logic in our implementation is adapted and enhanced from their codebase. We built upon their work by adding a suite of executable tools to improve the agent's interaction with the OS environment, which significantly boosted the stability and capability of our CUA Agent.
-
OSWorld Benchmark: We are grateful to the creators of the OSWorld Benchmark for developing a challenging and comprehensive testbed for GUI agents.