WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models
- Updates
- Overview
- Dataset
- Video Quality Evaluation
- Embodied Task Evaluation
- Leaderboard
- Submission
- Human Evaluation
- Citation
- [2026/03/06] Open for submissions.
- [2026/02/13] Code initial release.
- [2026/02/13] Leaderboard release.
WorldArena is a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through (1) video perception quality, measured with sixteen metrics across six sub-dimensions; (2) embodied task functionality, which evaluates world models as synthetic data engines, policy evaluators, and action planners; (3) human evaluations, including overall quality, physics adherence, instruction following and head-to-head win rate. Furthermore, we propose EWMScore, a holistic metric integrating multi-dimensional performance into a single interpretable index. This work provides a framework for tracking progress toward truly functional world models in embodied AI.
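The idea of folding multi-dimensional results into a single index like EWMScore can be pictured as a weighted combination of normalized sub-scores. The sketch below is purely illustrative: the dimension names, weights, and aggregation rule are assumptions, not WorldArena's official formula.

```python
# Hypothetical sketch of an EWMScore-style aggregation: combine normalized
# sub-dimension scores into one index via a weighted mean. The dimension
# names and weights below are illustrative assumptions, not the benchmark's
# actual definition.

def ewm_score(sub_scores: dict, weights: dict) -> float:
    """Weighted mean of sub-scores, each expected to lie in [0, 1]."""
    total_weight = sum(weights[k] for k in sub_scores)
    return sum(sub_scores[k] * weights[k] for k in sub_scores) / total_weight

# Example: perception quality, task functionality, and human preference
scores = {"perception": 0.82, "functionality": 0.61, "human": 0.74}
weights = {"perception": 0.4, "functionality": 0.4, "human": 0.2}
print(round(ewm_score(scores, weights), 4))
```

A weighted mean keeps the index interpretable: each sub-dimension's contribution is visible, and weights can be tuned to emphasize perception or functionality.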
The project builds on a curated subset of the RoboTwin 2.0 dataset, a simulation framework and benchmark for bimanual robotic manipulation. We use the Clean-50 configuration of RoboTwin 2.0, which includes 50 manipulation tasks (50 episodes per task; we officially use 40 for training and 10 for testing).
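The per-task episode split described above (50 episodes, 40 train / 10 test) can be sketched as follows. Whether episodes are selected by contiguous index order is an assumption for illustration; consult the released dataset for the official split.

```python
# Sketch of the Clean-50 per-task split: 50 episodes per task, 40 used for
# training and 10 for testing. Splitting by contiguous episode index is an
# assumption for illustration, not necessarily the official selection rule.

def split_episodes(num_episodes: int = 50, num_train: int = 40):
    """Return (train_ids, test_ids) for one task's episode indices."""
    episode_ids = list(range(num_episodes))
    return episode_ids[:num_train], episode_ids[num_train:]

train_ids, test_ids = split_episodes()
print(len(train_ids), len(test_ids))  # 40 10
```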
Please refer to the video quality metrics for implementation details.
Please refer to the embodied task evaluation for implementation details.
The official WorldArena leaderboard is hosted on HuggingFace. It provides standardized evaluation results across video perception quality, embodied task functionality, and the unified EWMScore. We welcome community submissions to benchmark new embodied world models under a fair and reproducible protocol. Join us in advancing truly functional world models for embodied AI.
Please refer to the submission guide for instructions on submitting results.
Note: Please use the latest version of the test_dataset (released on 2026/03/06) for your submission!
Be part of shaping the future of embodied world models! 👉 Start here: Human Evaluation
We invite you to participate in our human evaluation by providing your judgment about generated videos — it only takes a few minutes. Your feedback helps us uncover hidden failure cases and align automated metrics with real human perception. Every contribution strengthens a more trustworthy and community-driven leaderboard.
We acknowledge RoboTwin 2.0 for providing the dataset and simulation platform support that enables embodied task evaluation.
We thank VPP for providing the IDM framework used in our embodied action planning implementation.
For video quality evaluation, WorldArena references and partially builds upon the code implementations of the following projects: VBench, EWMBench, WorldScore, EvalCrafter, JEDI.
@article{shang2026worldarena,
  title={WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models},
  author={Shang, Yu and Li, Zhuohang and Ma, Yiding and Su, Weikang and Jin, Xin and Wang, Ziyou and Jin, Lei and Zhang, Xin and Tang, Yinzhou and Su, Haisheng and others},
  journal={arXiv preprint arXiv:2602.08971},
  year={2026}
}
