An awesome list of papers and code repositories accompanying the survey "A Comprehensive Survey of Embodied World Models".
- Genie: Generative Interactive Environments. ICML 2024. [Paper]
- Sora: Creating Video from Text. OpenAI 2024. [Website]
- Open-Sora: Democratizing Efficient Video Production for All. arXiv 2024. [Paper] [Code]
- Genie 2: A Large-Scale Foundation World Model. DeepMind 2024. [Blog]
- iVideoGPT: Interactive VideoGPTs Are Scalable World Models. NeurIPS 2024. [Paper] [Code]
- NOVA: Autoregressive Video Generation without Vector Quantization. ICLR 2025. [Paper] [Code]
- Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective. arXiv 2025. [Paper]
- MAGI-1: Autoregressive Video Generation at Scale. arXiv 2025. [Paper]
- Video-GPT: Video-GPT via Next Clip Diffusion. arXiv 2025. [Paper] [Code]
- CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer. ICLR 2025. [Paper] [Code]
- Vid2World: Crafting Video Diffusion Models to Interactive World Models. arXiv 2025. [Paper]
- Wan: Open and Advanced Large-Scale Video Generative Models. arXiv 2025. [Paper] [Code]
- Cosmos: World Foundation Model Platform for Physical AI. arXiv 2025. [Paper] [Code]
- Spmem: Video World Models with Long-term Spatial Memory. arXiv 2025. [Paper] [Project Page]
- GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control. arXiv 2025. [Paper] [Code]
- DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation. CVPR 2025. [Paper] [Code]
- ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration. CVPR 2025. [Paper]
- VGGT: Visual Geometry Grounded Transformer. CVPR 2025. [Paper]
- DeepVerse: 4D Autoregressive Video Generation as a World Model. arXiv 2025. [Paper]
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling. arXiv 2025. [Paper]
- UniFuture: Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception. arXiv 2025. [Paper] [Code]
- Aether: Geometric-Aware Unified World Modeling. ICCV 2025. [Paper] [Code]
- Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction. ICCV 2025 Highlight. [Paper] [Code]
- PosePilot: Steering Camera Pose for Generative World Models with Self-Supervised Depth. IROS 2025. [Paper]
- UniScene: Unified Occupancy-Centric Driving Scene Generation. CVPR 2025. [Paper] [Code]
- WonderFree: Enhancing Novel View Quality and Cross-View Consistency for 3D Scene Exploration. arXiv 2025. [Paper]
- GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction. CVPR 2025. [Paper]
- DriveWorld: 4D Pre-Trained Scene Understanding via World Models for Autonomous Driving. CVPR 2024. [Paper]
- DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation. ICCV 2025. [Paper] [Code]
- TesserAct: Learning 4D Embodied World Models. ICCV 2025. [Paper] [Code]
- FlowDreamer: A RGB-D World Model with Flow-Based Motion Representations for Robot Manipulation. arXiv 2025. [Paper] [Project Page]
- Geometry-Aware 4D Video Generation for Robot Manipulation. arXiv 2025. [Paper] [Code]
- ORV: 4D Occupancy-Centric Robot Video Generation. arXiv 2025. [Paper] [Code]
- Learning 3D Persistent Embodied World Models. arXiv 2025. [Paper]
- HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels. arXiv 2025. [Paper] [Code] [Project Page]
- PlaNet: Learning Latent Dynamics for Planning from Pixels. ICML 2019. [Paper] [Code] [Blog]
- Dreamer: Dream to Control: Learning Behaviors by Latent Imagination. ICLR 2020. [Paper] [Code]
- DreamerV2: Mastering Atari with Discrete World Models. ICLR 2021. [Paper] [Code]
- DreamerV3: Mastering Diverse Domains through World Models. Nature 2025. [Paper] [Code]
- I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. ICCV 2023. [Paper] [Code]
- V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video. TMLR 2024. [Paper] [Code]
- V-JEPA 2, V-JEPA 2-AC: V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. Meta 2025. [Paper] [Code] [Website] [Blog]
- TD-MPC: Temporal Difference Learning for Model Predictive Control. ICML 2022. [Paper] [Code] [Website]
- TD-MPC-offline: Finetuning Offline World Models in the Real World. CoRL 2023 Oral. [Paper] [Code] [Website]
- TD-MPC2: Scalable, Robust World Models for Continuous Control. ICLR 2024 Spotlight. [Paper] [Code] [Website]
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models. arXiv 2024. [Paper] [Code]
- RoboDreamer: Learning Compositional World Models for Robot Imagination. ICML 2024. [Paper] [Code] [Website]
- Pandora: Towards General World Model with Natural Language Actions and Video States. arXiv 2024. [Paper] [Code] [Website]
- Cosmos: World Foundation Model Platform for Physical AI. arXiv 2025. [Paper] [Code] [Website]
- Vid2World: Crafting Video Diffusion Models to Interactive World Models. arXiv 2025. [Paper] [Website]
- UWM: Unified World Models: Coupling Video and Action Diffusion for Robot Learning. ICML 2025. [Paper] [Code] [Website]
- EnerVerse-AC: Envisioning Embodied Environments with Action Condition. ICML 2025. [Paper] [Code] [Website]
- FLARE: Robot Learning with Implicit World Modeling. arXiv 2025. [Paper] [Website]
- RoboScape: Physics-Informed Embodied World Model. arXiv 2025. [Paper] [Code]
- TesserAct: Learning 4D Embodied World Models. arXiv 2025. [Paper] [Code] [Website]

- HMA: Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression. arXiv 2025. [Paper] [Code] [Website]
- UVA: Unified Video Action Model. RSS 2025. [Paper] [Code] [Website]
- WorldVLA: Towards Autoregressive Action World Model. DAMO 2025. [Paper] [Code]
- RLVR-World: Training World Models with Reinforcement Learning. arXiv 2025. [Paper] [Code] [Website]

- DreamGen: Unlocking Generalization in Robot Learning through Neural Trajectories. arXiv 2025. [Paper] [Code] [Website]
- RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer. arXiv 2025. [Paper] [Code] [Website]
- EnerVerse-AC: Envisioning Embodied Environments with Action Condition. arXiv 2025. [Paper] [Code] [Website]
- GenRL: Multimodal-Foundation World Models for Generalization in Embodied Agents. NeurIPS 2024. [Paper] [Code] [Website]
- iVideoGPT: Interactive VideoGPTs Are Scalable World Models. NeurIPS 2024. [Paper] [Code] [Website]
- DreamerV3: Mastering Diverse Domains through World Models. Nature 2025. [Paper] [Code]

- WorldEval: World Model as Real-World Robot Policies Evaluator. arXiv 2025. [Paper] [Code] [Website]
- EnerVerse-AC: Envisioning Embodied Environments with Action Condition. arXiv 2025. [Paper] [Code] [Website]
- RoboScape: Physics-Informed Embodied World Model. arXiv 2025. [Paper] [Code]
- GPC: Strengthening Generative Robot Policies through Predictive World Modeling. arXiv 2025. [Paper] [Website]
- VPP: Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations. ICML 2025 Spotlight. [Paper] [Code] [Website]
- V-JEPA 2-AC: V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. Meta 2025. [Paper] [Code] [Website] [Blog]
- VBench: Comprehensive Benchmark Suite for Video Generative Models. CVPR 2024 Highlight. [Paper] [Code] [Website]
- T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-Video Generation. CVPR 2025. [Paper] [Code] [Website]
- VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness. arXiv 2025. [Paper] [Code]
- VideoPhy: Evaluating Physical Commonsense for Video Generation. ICLR 2025 Poster. [Paper] [Code]
- VideoPhy 2: Challenging Action-Centric Physical Commonsense Evaluation of Video Generation. arXiv 2025. [Paper] [Code] [Website]
- PhyGenBench: Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation. ICML 2025. [Paper] [Code] [Website]
- WorldModelBench: Judging Video Generation Models as World Models. arXiv 2025. [Paper] [Code] [Website]
- EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models. arXiv 2025. [Paper] [Code]
- DreamerV3: Mastering Diverse Domains through World Models. Nature 2025. [Paper] [Code]
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. Meta 2025. [Paper] [Code] [Website] [Blog]
- WorldSimBench: Towards Video Generation Models as World Simulators. ICML 2025. [Paper] [Website]
- EWM: Evaluating Robot Policies in a World Model. arXiv 2025. [Paper]
- WorldEval: World Model as Real-World Robot Policies Evaluator. arXiv 2025. [Paper] [Code] [Website]
- RoboScape: Physics-Informed Embodied World Model. arXiv 2025. [Paper] [Code]
- DreamGen: Unlocking Generalization in Robot Learning through Neural Trajectories. arXiv 2025. [Paper] [Code] [Website]
- RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer. arXiv 2025. [Paper] [Code] [Website]
- GenSim: Generating Robotic Simulation Tasks via Large Language Models. ICLR 2024 Spotlight. [Paper] [Code]
- WorldGPT: Empowering LLM as Multimodal World Model. ACM MM 2024. [Paper] [Code]
- Traj-LLM: A New Exploration for Empowering Trajectory Prediction with Pre-Trained Large Language Models. ICLR 2025 Poster. [Paper] [Code]
- RoboScape: Physics-Informed Embodied World Model. arXiv 2025. [Paper] [Code]