
Zixuan Wang1, Dingming Li1, Hongxing Li1, Shuo Chen1, Yuchen Yan1, Wenqi Zhang1,
Yongliang Shen1, Weiming Lu1, Jun Xiao1, Yueting Zhuang1
Quick Links:
- Project Homepage - Interactive demos, detailed analysis, and visual results
- arXiv Paper - Complete research paper with methodology and results
- Documentation - Installation guides, API reference, and tutorials
- Dataset - EAR-Bench evaluation dataset with 1,500 scenarios
- Expert Trajectories - SFT expert trajectories for model fine-tuning (1,982 samples)
Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains.
Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations.
Figure 1: Overview of the OmniEAR framework comprising three integrated components: OmniSimulator (left) uses structured text representation to model environments with objects, agents, and spatial relationships; EAR-Bench (right) presents our comprehensive evaluation matrix spanning single-agent and multi-agent tasks.
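To give a rough feel for what a structured text representation of a scene can look like, here is a minimal illustrative sketch in Python. It is hypothetical: the field names (`rooms`, `objects`, `properties`, `agents`) and values are invented for illustration and are not the actual EAR-Bench schema.

```python
# Hypothetical sketch of a text-based scene graph with rooms, objects that carry
# continuous physical properties, and an agent. Not the real EAR-Bench format.
scene = {
    "rooms": {
        "kitchen": {"connects_to": ["living_room"]},
        "living_room": {"connects_to": ["kitchen"]},
    },
    "objects": {
        "kettle": {"room": "kitchen", "properties": {"mass_kg": 1.2, "temperature_c": 95.0}},
        "mug": {"room": "living_room", "properties": {"mass_kg": 0.3, "temperature_c": 20.0}},
    },
    "agents": {
        "robot_1": {"room": "kitchen", "max_payload_kg": 5.0},
    },
}

def objects_in_room(scene: dict, room: str) -> list[str]:
    """Return the names of objects located in the given room."""
    return [name for name, obj in scene["objects"].items() if obj["room"] == room]

print(objects_in_room(scene, "kitchen"))  # -> ['kettle']
```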
- Novel Evaluation Framework: We introduce OmniEAR, the first framework to evaluate embodied reasoning through scenarios requiring agents to understand how physical properties determine actions, capabilities, and coordination needs.
- Comprehensive Benchmark: EAR-Bench provides 1,500 scenarios with continuous physical properties and dynamic capabilities, supported by OmniSimulator and an automated generation pipeline.
- Fundamental Insights: We provide empirical evidence that current language models lack core embodied reasoning capabilities, with performance degrading over 60% when moving from explicit instructions to embodied reasoning.
Figure 2: OmniEAR automated benchmark generation and evaluation framework showing the four-stage pipeline and comprehensive statistics.
Single-Agent Tasks:
- Direct Command (L1): Basic instruction following
- Attribute Reasoning (L2): Continuous property comparison and inference
- Tool Use (L2): Dynamic capability acquisition through tool manipulation
- Compound Reasoning (L3): Integrated multi-step planning with multiple challenges
Multi-Agent Tasks:
- Explicit Collaboration (L1): Clear coordination directives
- Implicit Collaboration (L2): Autonomous coordination need recognition
- Compound Collaboration (L3): Complex multi-agent scenarios requiring tool use and coordination
- 1,500 diverse scenarios across household and industrial domains
- 64,057 interactive objects with detailed physical properties
- 6,634 spatial nodes (rooms) with an average of 4.4 rooms per scene
- 1,481 unique task files generating 16,592 task instances
- 7 task categories spanning single-agent (65%) and multi-agent (35%) scenarios
- 1,123 distinct material types modeling realistic physical interactions
| Metric | Value |
|---|---|
| Total Scenarios | 1,500 |
| Total Task Files | 1,481 |
| Total Task Instances | 16,592 |
| Interactive Objects | 64,057 |
| Spatial Nodes (Rooms) | 6,634 |
| Average Objects per Scene | 42.7 |
| Average Rooms per Scene | 4.4 |
| Collaborative Agent Pairs | 1,481 |
- Object Distribution: Container (27.5%), Tool (23.6%), Appliance (14.0%), Furniture (9.7%), Consumable (7.6%), Others (17.6%)
- Top Materials (out of 1,123 types): Plastic (21.5%), Metal (17.6%), Wood (12.9%), Glass (9.8%), Fabric (7.9%), Ceramic (6.0%)
- Application Domains: Laboratory (39.0%), Office (18.8%), Industrial (11.5%), Medical (6.2%), Household (6.2%), Educational (4.2%), Retail (3.2%), Service (2.0%), Entertainment (1.8%), Transportation (1.5%)
- Room Types: Laboratory (28.3%), Storage (18.6%), Workspace (14.9%), Office (11.5%), Workshop (8.2%)
- EAR-Bench Dataset: The complete evaluation dataset is available in the `data/` directory, including task definitions, scene configurations, and evaluation metrics.
- Expert Trajectory SFT Dataset: High-quality expert demonstration trajectories for supervised fine-tuning (1,982 samples) are hosted on HuggingFace. Download using:
cd data/expert_trajectory_sft/
python download_expert_data.py
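Once downloaded, the trajectories can be inspected with a few lines of Python. This is a minimal sketch that assumes the samples are stored as JSON Lines files under `data/expert_trajectory_sft/`; the actual file names and format may differ, so check the download script's output first.

```python
import json
from pathlib import Path

# Sketch: assumes the expert trajectories were downloaded as *.jsonl files
# under data/expert_trajectory_sft/ (the actual layout may differ).
data_dir = Path("data/expert_trajectory_sft")

samples = []
for jsonl_file in sorted(data_dir.glob("*.jsonl")):
    with jsonl_file.open(encoding="utf-8") as f:
        samples.extend(json.loads(line) for line in f if line.strip())

print(f"Loaded {len(samples)} expert trajectory samples")
if samples:
    print(samples[0].keys())  # inspect the fields of one sample
```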
git clone https://github.com/ZJU-REAL/OmniEmbodied.git
cd OmniEmbodied/OmniSimulator
pip install -e .
cd ..
pip install -r requirements.txt
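As a quick post-install check, the following sketch verifies that the simulator package is importable. It assumes the editable install exposes a module named `OmniSimulator` (the directory name); adjust the name if the package uses a different import name.

```python
import importlib.util

# Sanity check after `pip install -e .`; the module name is an assumption.
for module in ("OmniSimulator",):
    found = importlib.util.find_spec(module) is not None
    print(f"{module}: {'OK' if found else 'NOT FOUND'}")
```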
Before running experiments, configure your LLM API key in `config/baseline/llm_config.yaml`:
api:
  provider: "deepseekv3"  # Choose your provider
  providers:
    deepseekv3:
      api_key: "your-api-key-here"  # Replace with your actual API key
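If you prefer to sanity-check the configuration programmatically, here is a short sketch using PyYAML. It assumes PyYAML is installed and that the key layout matches the snippet above (`api.provider` and `api.providers.<provider>.api_key`); adjust the lookups if your config is structured differently.

```python
import yaml  # PyYAML; install with `pip install pyyaml` if missing

# Sketch: assumes the key layout shown above.
with open("config/baseline/llm_config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

provider = config["api"]["provider"]
api_key = config["api"]["providers"][provider]["api_key"]

if api_key == "your-api-key-here":
    raise ValueError(f"Please set a real API key for provider '{provider}'")
print(f"Provider '{provider}' configured.")
```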
# Run basic evaluation
bash scripts/deepseekv3-wo.sh
Note: For scripts ending with `-wg.sh` (with global observation), you need to:
- Set the runtime parameter `--global` when running
- Configure `global_observation: true` in `config/simulator/simulator_config.yaml`
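If you want to flip that flag from Python rather than editing the file by hand, a small sketch follows. It assumes PyYAML and that `global_observation` is a top-level key in `config/simulator/simulator_config.yaml`; adjust the key path if it is nested.

```python
import yaml  # PyYAML

config_path = "config/simulator/simulator_config.yaml"

# Sketch: assumes global_observation is a top-level key; adjust if nested.
with open(config_path, encoding="utf-8") as f:
    config = yaml.safe_load(f) or {}

config["global_observation"] = True

# Note: rewriting with safe_dump drops any comments in the YAML file.
with open(config_path, "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f, default_flow_style=False)

print("global_observation set to true in", config_path)
```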
OmniEmbodied Documentation
The documentation includes:
- Installation & Quick Start: Setup guides and first steps
- OmniSimulator Guide: Core simulation engine and environment system
- Framework Usage: Evaluation system, agent modes, and data generation
- API Reference: Complete class and function documentation
- Developer Guide: Contributing, examples, and advanced integration
- Examples: Practical code samples in the `examples/` directory
- Configuration: Template files in `config/` for different setups
- Analysis: Results visualization with `examples/results_analysis.ipynb`
Figure 3: Performance comparison across all evaluated models showing systematic degradation from explicit instructions to constraint-based reasoning.
- Performance Degradation: All models show substantial performance drops when reasoning must emerge from physical constraints rather than explicit instructions.
- Scale Effects: Larger models (GPT-4o, Gemini-2.5-Flash) achieve better performance but still struggle with compound reasoning tasks.
- Reasoning Specialization: Chain-of-thought reasoning models (Deepseek-R1, QwQ-32B) excel at logical planning but fail to ground physical constraints effectively.
- Fine-tuning Limitations: Supervised fine-tuning dramatically improves single-agent performance (0.6% → 76.3%) but shows minimal multi-agent gains (1.5% → 5.5%).
Figure 4: Detailed performance analysis across task categories and model architectures.
- Text-based Environment Modeling: Efficient simulation through graph representation
- Dynamic Capability System: Tool-dependent action binding and unbinding
- Emergent Collaboration: Physics-constrained multi-agent interactions
- Neural-Symbolic Hybrid: LLM creativity with rule-based validation
- Physical Consistency: Automated verification of scenario feasibility
- Diverse Domains: Scenarios spanning household, industrial, and specialized environments
- Systematic Assessment: Standardized protocols across all models
- Multiple Metrics: Success rate, step efficiency, reasoning quality
- Statistical Reliability: Three independent runs with confidence intervals
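For the "three independent runs with confidence intervals" protocol, aggregation can be as simple as the sketch below. The run values are placeholders, not actual results, and the paper may construct its intervals differently; this sketch uses a two-sided 95% t-interval.

```python
import math
import statistics

# Placeholder success rates (fractions) from three independent runs of one model
# on one task category; these numbers are illustrative, not actual results.
runs = [0.72, 0.68, 0.70]

mean = statistics.mean(runs)
stdev = statistics.stdev(runs)       # sample standard deviation (n-1)
sem = stdev / math.sqrt(len(runs))   # standard error of the mean

# 95% confidence interval using the t-distribution with n-1 = 2 degrees of
# freedom (critical value t ≈ 4.303).
t_crit = 4.303
half_width = t_crit * sem

print(f"Success rate: {mean:.3f} ± {half_width:.3f} (95% CI over {len(runs)} runs)")
```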
If you use OmniEAR in your research, please cite our paper:
@misc{wang2025omniearbenchmarkingagentreasoning,
title={OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks},
author={Zixuan Wang and Dingming Li and Hongxing Li and Shuo Chen and Yuchen Yan and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
year={2025},
eprint={2508.05614},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.05614},
}
OmniEAR builds upon foundational research in embodied AI, multi-agent systems, and language model evaluation. We thank the research community for their contributions to understanding the challenges of embodied intelligence. Special thanks to the anonymous reviewers for their valuable feedback in improving this work. If you have any questions or suggestions, please feel free to email [email protected].