WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models
- Updates
- Overview
- Dataset
- Video Quality Evaluation
- Embodied Task Evaluation
- Leaderboard
- Submission
- Human Evaluation
- Citation
- [2026/03/06] Open for submissions.
- [2026/02/13] Code initial release.
- [2026/02/13] Leaderboard release.
WorldArena is a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through (1) video perception quality, measured with sixteen metrics across six sub-dimensions; (2) embodied task functionality, which evaluates world models as synthetic data engines, policy evaluators, and action planners; (3) human evaluations, including overall quality, physics adherence, instruction following and head-to-head win rate. Furthermore, we propose EWMScore, a holistic metric integrating multi-dimensional performance into a single interpretable index. This work provides a framework for tracking progress toward truly functional world models in embodied AI.
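The idea of folding multi-dimensional results into a single index like EWMScore can be pictured as a weighted combination of normalized sub-scores. The sketch below is purely illustrative: the dimension names, weights, and aggregation rule are assumptions, not WorldArena's official formula.

```python
# Hypothetical sketch of an EWMScore-style aggregation: combine normalized
# sub-dimension scores into one index via a weighted mean. The dimension
# names and weights below are illustrative assumptions, not the benchmark's
# actual definition.

def ewm_score(sub_scores: dict, weights: dict) -> float:
    """Weighted mean of sub-scores, each expected to lie in [0, 1]."""
    total_weight = sum(weights[k] for k in sub_scores)
    return sum(sub_scores[k] * weights[k] for k in sub_scores) / total_weight

# Example: perception quality, task functionality, and human preference
scores = {"perception": 0.82, "functionality": 0.61, "human": 0.74}
weights = {"perception": 0.4, "functionality": 0.4, "human": 0.2}
print(round(ewm_score(scores, weights), 4))
```

A weighted mean keeps the index interpretable: each sub-dimension's contribution is visible, and weights can be tuned to emphasize perception or functionality.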
The project builds on a curated subset of the RoboTwin 2.0 dataset, a simulation framework and benchmark for bimanual robotic manipulation. We use the Clean-50 configuration of RoboTwin 2.0, which includes 50 manipulation tasks (50 episodes per task; we officially use 40 for training and 10 for testing).
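The per-task episode split described above (50 episodes, 40 train / 10 test) can be sketched as follows. Whether episodes are selected by contiguous index order is an assumption for illustration; consult the released dataset for the official split.

```python
# Sketch of the Clean-50 per-task split: 50 episodes per task, 40 used for
# training and 10 for testing. Splitting by contiguous episode index is an
# assumption for illustration, not necessarily the official selection rule.

def split_episodes(num_episodes: int = 50, num_train: int = 40):
    """Return (train_ids, test_ids) for one task's episode indices."""
    episode_ids = list(range(num_episodes))
    return episode_ids[:num_train], episode_ids[num_train:]

train_ids, test_ids = split_episodes()
print(len(train_ids), len(test_ids))  # 40 10
```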
Please refer to the video quality metrics for implementation details.
Please refer to the embodied task evaluation for implementation details.
The official WorldArena leaderboard is hosted on HuggingFace. It provides standardized evaluation results across video perception quality, embodied task functionality, and the unified EWMScore. We welcome community submissions to benchmark new embodied world models under a fair and reproducible protocol. Join us in advancing truly functional world models for embodied AI.
Please refer to the submission guide for instructions on submitting results.
Note: Please use the latest version of the test_dataset (released on 2026/03/06) for your submission!
Be part of shaping the future of embodied world models! 👉 Start here: Human Evaluation
We invite you to participate in our human evaluation by providing your judgment about generated videos — it only takes a few minutes. Your feedback helps us uncover hidden failure cases and align automated metrics with real human perception. Every contribution strengthens a more trustworthy and community-driven leaderboard.
We acknowledge RoboTwin 2.0 for providing the dataset and simulation platform support that enables embodied task evaluation.
We thank VPP for providing the IDM framework used in our embodied action planning implementation.
For video quality evaluation, WorldArena references and partially builds upon the code implementations of the following projects: VBench, EWMBench, WorldScore, EvalCrafter, JEDI.
@article{shang2026worldarena,
  title={WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models},
  author={Shang, Yu and Li, Zhuohang and Ma, Yiding and Su, Weikang and Jin, Xin and Wang, Ziyou and Jin, Lei and Zhang, Xin and Tang, Yinzhou and Su, Haisheng and others},
  journal={arXiv preprint arXiv:2602.08971},
  year={2026}
}
