
MAI-UI


We present MAI-UI, a family of GUI agent foundation models spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. Our core contributions include:

  • 🔧 Agent-user interaction and MCP augmentation: enabling the agent to ask the user for missing information and to call MCP tools to complete tasks.
  • ☁️ Device–cloud collaboration system: dynamically selecting on-device or cloud execution based on task execution state and data sensitivity.
  • 📈 Dynamic RL Scaling: large-scale reinforcement learning that scales the number of parallel environments (up to 512) and the context length (up to 50).
  • 🏆 State-of-the-Art Performance: MAI-UI establishes new benchmark SOTA results across GUI grounding and navigation tasks.

Overview
Overview of MAI-UI performance

📰 News

  • [2026-01-15] 🥇 New Record on AndroidWorld: MAI-UI-235B takes #1 on the AndroidWorld Leaderboard for pure-vision, end-to-end models with a 76.7% success rate.
  • [2026-01-13] 🥇 MAI-UI Sweeps ScreenSpot-Pro: MAI-UI (32B, 8B, 2B) now ranks #1 in all size categories on the ScreenSpot-Pro leaderboard, with record scores of 67.9%, 65.7%, and 57.4% respectively, achieved without any zoom-in tricks.
  • [2026-01-04] 🤝 We're Hiring! We're actively looking for Research Scientists, Engineers, and Interns to work on foundational GUI agents and their applications. Interested candidates please send your resume to: yue.w@alibaba-inc.com
  • [2025-12-29] 🏆 New Leaderboard Record: MAI-UI achieves a 41.7% success rate on the MobileWorld benchmark, setting a new record for end-to-end model performance!
  • [2025-12-29] 📄 Technical Report & Website: Our technical report is now available on arXiv, and the official project website is live.
  • [2025-12-29] 🤗 Model Release: We are excited to release the weights for MAI-UI-8B and MAI-UI-2B on Hugging Face.


🎥 Demo

Demo 1 - Daily Life Scenario

The agent triggers ask_user to gather the additional information needed to complete the task.

Daily Life Demo.
User instruction: Go grocery shopping on Hema: buy one pack of marbled beef rolls, one baby cabbage, and one pack of enoki mushrooms, plus any soy product. Also check the to-do list in my calendar for anything my wife wants from Hema, and let me confirm whether to buy those items as well.

Demo 2 - Navigation

The agent uses mcp_call to invoke AMap tools for navigation.

Navigation Demo.
User instruction: I am currently at the Alibaba Yungu Campus. I need to withdraw cash at a China Merchants Bank first, then go to Chengxi Intime City. Plan a bus/metro route for me: choose a China Merchants Bank branch within 4 km that takes the least time, keep the total travel time of both legs under 2 hours, and record the plan in my notes for me to read later, with the title "Afternoon Itinerary" and the details of both legs as the content.

Demo 3 - Shopping

Cross-app collaboration to complete the task.

Shopping Demo.
User instruction: Search “timeless earth 2026” on Xiaohongshu, save one product image to your photo album, then use the saved image on Taobao to search for the same item and add it to my shopping cart.

Demo 4 - Work

Cross-app collaboration to complete the task.

Work Demo.
User instruction: I need to make an urgent business trip to Shanghai. Check 12306 for the earliest train from Hangzhou West Station to Shanghai Hongqiao that still has second-class seats, share the arrival time with everyone in the DingTalk "Frontier Technology Discussion" group, then move my meeting with 水番 to the same time tomorrow, @-mention him in the group, politely explain that the meeting is rescheduled because of the sudden business trip, and ask whether he is free tomorrow.

Demo 5 - Device-only

Device-cloud collaboration on a simple task: no cloud model invocation is needed.

Device-cloud Collaboration Demo.
User instruction: Search Fliggy for round-trip flights from Hangzhou to Sanya, departing on December 25 and returning on December 28.

Demo 6 - Device-cloud Collaboration

Device-cloud collaboration on a complex task: the cloud model is invoked when the task exceeds the on-device model's capabilities.

Device-cloud Collaboration Demo.
User instruction: Buy me one ticket for Zootopia 2 on Taopiaopiao for the afternoon of the 25th: choose the cinema at 亲橙里, pick a seat in the middle, add a single-person combo with cola and popcorn, and stop at the final order page.

🚀 Installation & Quick Start

Step 1: Clone the Repository

git clone https://github.com/Tongyi-MAI/MAI-UI.git
cd MAI-UI

Step 2: Start Model API Service with vLLM

Download the model from Hugging Face and deploy the API service using vLLM:

# Install vLLM
pip install vllm==0.11.0  # requires vllm==0.11.0 and transformers>=4.57.0

# Start vLLM API server (replace MODEL_PATH with your local model path or HuggingFace model ID)
python -m vllm.entrypoints.openai.api_server \
    --model <huggingface_model_path> \
    --served-model-name MAI-UI-8B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --trust-remote-code

💡 Tips:

  • IMPORTANT: vLLM 0.11.0 is required (install with pip install vllm==0.11.0)
  • Adjust --tensor-parallel-size based on your GPU count for multi-GPU inference
  • The model will be served at http://localhost:8000/v1
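
Once the server is up, a quick sanity check with the OpenAI Python client (pip install openai) confirms the endpoint responds. This is a minimal sketch assuming the default address and served model name from the command above; the prompt text is only illustrative:

# Quick sanity check of the vLLM OpenAI-compatible endpoint (pip install openai)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the API key

# List served models; expect the --served-model-name value, e.g. "MAI-UI-8B"
print([m.id for m in client.models.list().data])

# Send a trivial text-only request to confirm the model generates a reply
resp = client.chat.completions.create(
    model="MAI-UI-8B",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)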

Step 3: Install Dependencies

pip install -r requirements.txt

Step 4: Run cookbook notebooks

We provide two notebooks in the cookbook/ directory:

4.1 Grounding Demo

The grounding.ipynb notebook demonstrates how to use the MAI Grounding Agent to locate UI elements:

cd cookbook
jupyter notebook grounding.ipynb

Before running, update the API endpoint in the notebook:

agent = MAIGroundingAgent(
    llm_base_url="http://localhost:8000/v1",  # Update to your vLLM server address
    model_name="MAI-UI-8B",                   # Use the served model name
    runtime_conf={
        "history_n": 3,
        "temperature": 0.0,
        "top_k": -1,
        "top_p": 1.0,
        "max_tokens": 2048,
    },
)
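
If you want to see what the agent sends to the server, you can also hit the endpoint directly with a screenshot. The sketch below uses the generic OpenAI vision message format that vLLM accepts; the screenshot path and instruction text are placeholders, and MAIGroundingAgent's actual prompt template (see grounding.ipynb) may differ:

# Hedged sketch: send a screenshot plus a grounding instruction straight to the endpoint.
# The message layout is the generic OpenAI vision format; the screenshot path and the
# instruction text are placeholders for illustration.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("screenshot.png", "rb") as f:  # any UI screenshot of your own
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="MAI-UI-8B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Locate the 'Settings' button in this screenshot."},
        ],
    }],
    max_tokens=256,
)
print(resp.choices[0].message.content)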

4.2 Navigation Agent Demo

The run_agent.ipynb notebook demonstrates the full UI navigation agent:

cd cookbook
jupyter notebook run_agent.ipynb

Similarly, update the API endpoint configuration:

agent = MAIUINaivigationAgent(
    llm_base_url="http://localhost:8000/v1",  # Update to your vLLM server address
    model_name="MAI-UI-8B",                   # Use the served model name
    runtime_conf={
        "history_n": 3,
        "temperature": 0.0,
        "top_k": -1,
        "top_p": 1.0,
        "max_tokens": 2048,
    },
)
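
To drive a real Android device with the navigation agent, a typical loop alternates screenshot capture, model prediction, and action execution. The sketch below is an assumption about how such a loop could look: the screenshot capture uses standard adb commands, while agent.predict and the termination check are hypothetical placeholders; consult run_agent.ipynb for the interface the agent actually exposes.

# Hedged sketch of a screenshot -> predict -> act loop around the navigation agent.
# `adb exec-out screencap -p` is a standard Android command; `agent.predict(...)` and
# the `is_terminal` attribute are hypothetical placeholders for the real interface.
import subprocess

def capture_screenshot(path: str = "screen.png") -> str:
    """Grab the current screen from a connected device via adb."""
    png = subprocess.run(["adb", "exec-out", "screencap", "-p"],
                         check=True, capture_output=True).stdout
    with open(path, "wb") as f:
        f.write(png)
    return path

instruction = "Open the settings app and turn on dark mode."  # illustrative task
for step in range(10):  # cap the number of agent steps
    screenshot = capture_screenshot()
    action = agent.predict(instruction, screenshot)  # hypothetical method name
    print(f"step {step}: {action}")
    if getattr(action, "is_terminal", False):  # hypothetical termination signal
        break
    # Execute the predicted action on the device here, e.g. a tap via
    # `adb shell input tap <x> <y>`, depending on the action schema.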

📝 Citation

If you find this project useful for your research, please consider citing our works:

@misc{zhou2025maiuitechnicalreportrealworld,
      title={MAI-UI Technical Report: Real-World Centric Foundation GUI Agents}, 
      author={Hanzhang Zhou and Xu Zhang and Panrong Tong and Jianan Zhang and Liangyu Chen and Quyu Kong and Chenglin Cai and Chen Liu and Yue Wang and Jingren Zhou and Steven Hoi},
      year={2025},
      eprint={2512.22047},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.22047}, 
}
@misc{chen2025uiinsenhancingguigrounding,
      title={UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning}, 
      author={Liangyu Chen and Hanzhang Zhou and Chenglin Cai and Jianan Zhang and Panrong Tong and Quyu Kong and Xu Zhang and Chen Liu and Yuqi Liu and Wenxuan Wang and Yue Wang and Qin Jin and Steven Hoi},
      year={2025},
      eprint={2510.20286},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.20286}, 
}
@misc{kong2025mobileworldbenchmarkingautonomousmobile,
      title={MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments}, 
      author={Quyu Kong and Xu Zhang and Zhenyu Yang and Nolan Gao and Chen Liu and Panrong Tong and Chenglin Cai and Hanzhang Zhou and Jianan Zhang and Liangyu Chen and Zhidan Liu and Steven Hoi and Yue Wang},
      year={2025},
      eprint={2512.19432},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.19432}, 
}

📧 Contact

For questions and support, please contact:

📄 License

MAI-UI Mobile is a foundation GUI agent developed by Alibaba Cloud and licensed under the Apache License (Version 2.0).

This product contains various third-party components under other open source licenses. See the NOTICE file for more information.
