WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models

📢 Updates

  • [2026/03/06] Open for submissions.
  • [2026/02/13] Code initial release.
  • [2026/02/13] Leaderboard release.

🔍 Overview

WorldArena is a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through (1) video perception quality, measured with sixteen metrics across six sub-dimensions; (2) embodied task functionality, which evaluates world models as synthetic data engines, policy evaluators, and action planners; and (3) human evaluations, covering overall quality, physics adherence, instruction following, and head-to-head win rate. Furthermore, we propose EWMScore, a holistic metric that integrates multi-dimensional performance into a single interpretable index. This work provides a framework for tracking progress toward truly functional world models in embodied AI.
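The README does not specify how EWMScore combines the dimensions listed above. As a minimal sketch, a single interpretable index can be formed as a weighted average of normalized per-dimension scores; the dimension names, weights, and values below are illustrative assumptions, not the official formula:

```python
# Hypothetical sketch of aggregating multi-dimensional benchmark results
# into one index, in the spirit of EWMScore. The actual WorldArena formula
# and weights may differ; everything here is an assumption for illustration.

def aggregate_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores (each normalized to [0, 1]) into one index."""
    total_weight = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total_weight

# Illustrative numbers only (not real leaderboard results).
scores = {"perception": 0.82, "functionality": 0.65, "human_eval": 0.74}
weights = {"perception": 1.0, "functionality": 1.0, "human_eval": 1.0}
print(round(aggregate_score(scores, weights), 3))  # → 0.737
```

With equal weights this reduces to a plain mean; non-uniform weights let the index emphasize, say, functionality over raw perception quality.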

📦 Dataset

The project builds on a curated subset of the RoboTwin 2.0 dataset, a simulation framework and benchmark for bimanual robotic manipulation. We use the Clean-50 configuration of RoboTwin 2.0, which includes 50 manipulation tasks with 50 episodes per task; the official split uses 40 episodes per task for training and 10 for testing.
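The 40/10 split above can be sketched as a deterministic partition of each task's episode ids; the episode numbering and helper below are assumptions for illustration, not the official data loader:

```python
# Hypothetical sketch of the per-task 40/10 train/test episode split
# described above. Episode ids 0..49 are an assumed numbering scheme.

TRAIN_EPISODES = 40  # per task, as stated in the README
TEST_EPISODES = 10

def split_episodes(episode_ids: list[int]) -> tuple[list[int], list[int]]:
    """Deterministically split a task's 50 episode ids into train/test."""
    assert len(episode_ids) == TRAIN_EPISODES + TEST_EPISODES
    return episode_ids[:TRAIN_EPISODES], episode_ids[TRAIN_EPISODES:]

train, test = split_episodes(list(range(50)))
print(len(train), len(test))  # → 40 10
```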

🎬 Video Quality Evaluation

Please refer to video quality metrics for implementation details.

🤖 Embodied Task Evaluation

Please refer to embodied task for implementation details.

🏆 Leaderboard

The official WorldArena leaderboard is hosted on HuggingFace: Leaderboard. It provides standardized evaluation results across video perception quality, embodied task functionality, and the unified EWMScore. We welcome community submissions to benchmark new embodied world models under a fair and reproducible protocol. Join us in advancing truly functional world models for embodied AI.

📤 Submission

Please refer to submission for instructions on submitting results.

Note: Please use the latest version of the test_dataset (released on 2026/03/06) for your submission!

👥 Human Evaluation

Be part of shaping the future of embodied world models! 👉 Start here: Human Evaluation

We invite you to participate in our human evaluation by providing your judgment about generated videos — it only takes a few minutes. Your feedback helps us uncover hidden failure cases and align automated metrics with real human perception. Every contribution strengthens a more trustworthy and community-driven leaderboard.

🙌 Acknowledgement

We acknowledge RoboTwin 2.0 for providing the dataset and simulation platform support that enables embodied task evaluation.

We thank VPP for providing the IDM framework used in our embodied action planning implementation.

For video quality evaluation, WorldArena references and partially builds upon the code implementations of the following projects: VBench, EWMBench, WorldScore, EvalCrafter, JEDI.

📖 Citation

@article{shang2026worldarena,
  title={WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models},
  author={Shang, Yu and Li, Zhuohang and Ma, Yiding and Su, Weikang and Jin, Xin and Wang, Ziyou and Jin, Lei and Zhang, Xin and Tang, Yinzhou and Su, Haisheng and others},
  journal={arXiv preprint arXiv:2602.08971},
  year={2026}
}
