Awesome VLM-based VLA for Robotic Manipulation


🛠️ We're still cooking — Stay tuned!🛠️
⭐ Give us a star if you like it! ⭐
✨If you find this work useful for your research, please kindly cite our paper.✨


🔥 Large VLM-based Vision-Language-Action (VLA) models have recently emerged as a transformative paradigm for robotic manipulation by tightly coupling perception, language understanding, and action generation. Built upon large Vision-Language Models (VLMs), they enable robots to interpret natural language instructions, perceive complex environments, and perform diverse manipulation tasks with strong generalization.

📍 We present the first systematic survey on large VLM-based VLA models for robotic manipulation. This repository serves as the companion resource to our survey: "Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey", and includes all the research papers, benchmarks, and resources reviewed in the paper, organized for easy access and reference.

📌 We will keep updating this repository with newly published works to reflect the latest progress in the field.

Table of Contents

- Monolithic Models
  - Single-System
  - Dual-System
- Hierarchical Models
  - Planner Only
  - Planner + Policy
- Other Advanced Fields
  - Reinforcement Learning-based Methods
  - Training-Free Methods
  - Learning from Human Videos
  - World Model-based VLA
- Datasets and Benchmarks
  - Real-world Robot Datasets
  - Simulation Environments and Benchmarks
  - Human Behavior Datasets
  - Embodied Datasets and Benchmarks
- Star History Chart
- Citation
- Contact Us

Monolithic Models

Single-System

Year Venue Paper Website Code
2023 CoRL RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control 🌐 -
2024 ICRA RT-2-X: Open X-Embodiment: Robotic Learning Datasets and RT-X Models 🌐 -
2024 ICLR RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators 🌐 💻
2024 ICML LEO Agent: An Embodied Generalist Agent in 3D World 🌐 💻
2024 NeurIPS RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation 🌐 💻
2024 CoRL OpenVLA: An Open-Source Vision-Language-Action Model 🌐 💻
2024 CoRL ECoT: Robotic Control via Embodied Chain-of-Thought Reasoning 🌐 💻
2024 ICRA ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models 🌐 -
2024 NeurIPS DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution - 💻
2024 ICLR TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies 🌐 💻
2025 ICRA FuSe: Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding 🌐 💻
2025 CVPR UniAct: Universal Actions for Enhanced Embodied Foundation Models 🌐 💻
2025 arXiv SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model 🌐 💻
2025 ICML UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent - 💻
2025 ICLR VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation - -
2025 arXiv OpenVLA-OFT: Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success 🌐 💻
2025 arXiv PD-VLA: Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding - -
2025 arXiv HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model 🌐 💻
2025 arXiv MoLe-VLA: Dynamic Layer-Skipping Vision-Language-Action Model via Mixture-of-Layers for Efficient Robot Manipulation 🌐 💻
2025 CVPR CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models 🌐 -
2025 arXiv NORA: A Small Open-Sourced Generalist Vision-Language-Action Model for Embodied Tasks 🌐 💻
2025 arXiv VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation 🌐 -
2025 arXiv OE-VLA: Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions - -
2025 arXiv ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning - -
2025 arXiv FlashVLA: Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models - -
2025 arXiv LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks - -
2025 arXiv BitVLA: 1-Bit Vision-Language-Action Models for Robotics Manipulation - 💻
2025 arXiv BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models 🌐 💻
2025 arXiv UniVLA: Unified Vision-Language-Action Model 🌐 💻
2025 arXiv WorldVLA: Towards Autoregressive Action World Model - 💻
2025 arXiv 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration - 💻
2025 ICCV VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers 🌐 💻
2025 arXiv VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting - -
2025 arXiv SpecVLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance - -
2025 arXiv ST-VLA: Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding 🌐 -
2025 arXiv Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies - -
2025 arXiv LLaDA-VLA: Vision Language Diffusion Action Models 🌐 -
2025 arXiv OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision - -
2025 arXiv TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models 🌐 -
2025 arXiv STARE-VLA: Progressive Stage-Aware Reinforcement for Fine-Tuning Vision-Language-Action Models 🌐 -

Dual-System

Year Venue Paper Website Code
2024 arXiv A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM - -
2024 arXiv Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation 🌐 -
2024 IROS From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control - -
2025 arXiv GR00T N1: An Open Foundation Model for Generalist Humanoid Robots - -
2024 arXiv CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation 🌐 💻
2024 CoRL HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers - -
2025 arXiv GraspVLA: A Grasping Foundation Model Pre-trained on Billion-Scale Synthetic Action Data 🌐 💻
2025 arXiv Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning 🌐 💻
2025 arXiv OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation 🌐 💻
2025 arXiv ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model 🌐 💻
2025 arXiv ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge 🌐 -
2025 ICML DiffusionVLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression 🌐 💻
2025 arXiv TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control 🌐 -
2025 arXiv Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control - -
2025 arXiv RationalVLA: A Rational Vision-Language-Action Model with Dual System 🌐 -
2025 RA-L TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation 🌐 💻
2025 RSS π0: A Vision-Language-Action Flow Model for General Robot Control 🌐 -
2025 RSS FAST: Efficient Action Tokenization for Vision-Language-Action Models 🌐 -
2025 arXiv π0.5: A Vision-Language-Action Model with Open-World Generalization 🌐 -
2025 arXiv Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better 🌐 -
2025 arXiv ForceVLA: Enhancing VLA Models with a Force-Aware MoE for Contact-Rich Manipulation 🌐 -
2025 arXiv SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics - 💻
2025 arXiv OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning 🌐 💻
2025 arXiv Tactile-VLA: Unlocking Vision-Language-Action Model’s Physical Knowledge for Tactile Generalization - -
2025 arXiv GR-3 Technical Report 🌐 -
2025 arXiv villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models 🌐 💻
2025 arXiv ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning 🌐 -
2025 arXiv F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions 🌐 💻
2025 arXiv iFlyBot-VLA Technical Report 🌐 -
2025 arXiv Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment 🌐 💻
2025 arXiv NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards 🌐 💻
2025 arXiv π*0.6: a VLA that Learns from Experience 🌐 -
2025 arXiv ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation 🌐 -
2025 arXiv METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model 🌐 -

Hierarchical Models

Planner Only

Year Venue Paper Website Code
2023 ICML PaLM-E: An embodied multimodal language model 🌐 -
2023 arXiv ViLa: Look before you leap: Unveiling the power of GPT-4V in robotic vision-language planning 🌐 -
2024 CVPR MoManipVLA: Transferring vision-language-action models for general mobile manipulation 🌐 -
2024 CoRL RoboPoint: A vision-language model for spatial affordance prediction in robotics 🌐 💻
2025 arXiv ManipLVM-R1: Reinforcement learning for reasoning in embodied manipulation with large vision-language models - -
2025 arXiv Embodied-Reasoner: Synergizing visual search, reasoning, and action for embodied interactive tasks 🌐 💻
2025 arXiv Reinforced Planning: Solving long-horizon tasks via imitation and reinforcement learning - 💻
2025 ICRA Chain-of-Modality: Learning manipulation programs from multimodal human videos with vision-language-models 🌐 -
2025 arXiv Embodied-R: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning 🌐 💻
2025 CVPR RoVI: Robotic Visual Instruction 🌐 -
2025 arXiv ReLEP: Long-horizon embodied planning with implicit logical inference and hallucination mitigation - -
2025 CVPR RoboBrain: A unified brain model for robotic manipulation from abstract to concrete 🌐 💻

Planner + Policy

Year Venue Paper Website Code
2023 arXiv Instruct2Act: Mapping multi-modality instructions to robotic actions with large language model - 💻
2023 CoRL VoxPoser: Composable 3D value maps for robotic manipulation with language models 🌐 💻
2024 CVPR SkillDiffuser: Interpretable hierarchical planning via skill abstractions in diffusion-based task execution 🌐 💻
2024 arXiv RoboMatrix: A skill-centric hierarchical framework for scalable robot task planning and execution in open-world - 💻
2024 CoRL RT-Affordance: Reasoning about robotic manipulation with affordances 🌐 -
2024 CoRL LLARVA: Vision-action instruction tuning enhances robot learning 🌐 💻
2024 CVPR MALMM: Multi-agent large language models for zero-shot robotics manipulation 🌐 💻
2024 arXiv RT-H: Action Hierarchies Using Language 🌐 -
2024 CoRL ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation 🌐 💻
2025 ICLR HAMSTER: Hierarchical action models for open-world robot manipulation 🌐 💻
2025 ICML HiRobot: Open-ended instruction following with hierarchical vision-language-action models 🌐 -
2025 arXiv Agentic Robot: A brain-inspired framework for vision-language-action models in embodied agents 🌐 💻
2025 arXiv DexVLA: Vision-language model with plug-in diffusion expert for general robot control 🌐 💻
2025 arXiv PointVLA: Injecting the 3D world into vision-language-action models 🌐 -
2025 arXiv A0: An affordance-aware hierarchical model for general robotic manipulation 🌐 💻
2025 arXiv From seeing to doing: Bridging reasoning and decision for robotic manipulation 🌐 💻
2025 ICCV RoBridge: A hierarchical architecture bridging cognition and execution for general robotic manipulation 🌐 💻
2025 arXiv RoboCerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation 🌐 -
2025 arXiv π0.5: A vision-language-action model with open-world generalization 🌐 -
2025 arXiv DexGraspVLA: A vision-language-action framework towards general dexterous grasping 🌐 💻
2025 arXiv HiBerNAC: Hierarchical brain-emulated robotic neural agent collective for disentangling complex manipulation - -
2025 arXiv Robix: A Unified Model for Robot Interaction, Reasoning and Planning 🌐 -

Other Advanced Fields

Reinforcement Learning-based Methods

Year Venue Paper Website Code
2025 ICLR(Workshop) GRAPE: Generalizing Robot Policy via Preference Alignment 🌐 💻
2025 arXiv Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning - 💻
2025 RSS ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations 🌐 -
2025 RSS ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy 🌐 💻
2025 RSS RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning 🌐 -
2025 arXiv TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization - -
2025 ICRA Improving Vision-Language-Action Model with Online Reinforcement Learning - -
2025 CVPR(Workshop) Interactive Post‑Training for Vision‑Language‑Action Models 🌐 💻
2025 arXiv Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation 🌐 💻
2025 arXiv VLAC: A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning 🌐 💻
2025 arXiv SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning - 💻

Training-Free Methods

Year Venue Paper Website Code
2025 arXiv VLA‑Cache: Towards Efficient Vision‑Language‑Action Model via Adaptive Token Caching in Robotic Manipulation - 💻
2025 arXiv Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding - -
2025 arXiv Think Twice, Act Once: Token‑Aware Compression and Action Reuse for Efficient Inference in Vision‑Language‑Action Models - -
2025 arXiv EfficientVLA: Training‑Free Acceleration and Compression for Vision‑Language‑Action Models - -
2025 arXiv SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration - -
2025 arXiv Block-wise Adaptive Caching for Accelerating Diffusion Policy - -
2025 RSS FAST: Efficient Action Tokenization for Vision-Language-Action Models 🌐 -
2025 arXiv Real-Time Execution of Action Chunking Flow Policies 🌐 -
2025 arXiv SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning - -

Learning from Human Videos

Year Venue Paper Website Code
2024 ICML 3D‑VLA: A 3D Vision‑Language‑Action Generative World Model 🌐 💻
2024 NeurIPS Learning an Actionable Discrete Diffusion Policy via Large‑Scale Actionless Video Pre‑Training 🌐 💻
2025 CVPR Mitigating the Human‑Robot Domain Discrepancy in Visual Pre‑training for Robotic Manipulation 🌐 💻
2025 RSS UniVLA: Learning to Act Anywhere with Task‑centric Latent Actions - 💻
2025 ICLR Latent Action Pretraining from Videos 🌐 💻
2025 arXiv Humanoid‑VLA: Towards Universal Humanoid Control with Visual Integration - -

World Model-based VLA

Year Venue Paper Website Code
2024 CVPR FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects 🌐 💻
2024 ICML 3D‑VLA: A 3D Vision‑Language‑Action Generative World Model 🌐 💻
2025 arXiv WorldVLA: Towards Autoregressive Action World Model - 💻
2025 arXiv World4Omni: A Zero‑Shot Framework from Image Generation World Model to Robotic Manipulation 🌐 -
2025 arXiv Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations 🌐 💻
2025 arXiv V‑JEPA 2: Self‑Supervised Video Models Enable Understanding, Prediction and Planning 🌐 💻
2025 arXiv FlowVLA: Thinking in Motion with a Visual Chain of Thought 🌐 -
2025 arXiv Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation 🌐 💻
2025 arXiv WoW: Towards a World-omniscient World-model Through Embodied Interaction 🌐 💻
2025 arXiv RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL - -

Datasets and Benchmarks

Real-world Robot Datasets

Year Venue Paper Website Code Data
2021 CoRL BC-Z: Zero-shot task generalization with robotic imitation learning 🌐 💻 📦
2023 RSS RT‑1: Robotics Transformer for Real‑World Control at Scale 🌐 💻 -
2023 CoRL RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control 🌐 💻 -
2022 RSS Bridge Data: Boosting Generalization of Robotic Skills with Cross‑Domain Datasets 🌐 💻 📦
2023 CoRL BridgeData V2: A Dataset for Robot Learning at Scale 🌐 💻 📦
2024 ICRA RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One‑Shot 🌐 💻 📦
2024 RSS DROID: A Large-Scale In‑The‑Wild Robot Manipulation Dataset 🌐 💻 📦
2024 ICRA Open X‑Embodiment: Robotic Learning Datasets and RT‑X Models 🌐 💻 📦
2025 RSS RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation 🌐 💻 📦
2025 IROS AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems 🌐 💻 📦
2025 arXiv BRMData: Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks 🌐 💻 📦

Simulation Environments and Benchmarks

Year Venue Paper Website Code Data
2022 CoRL BEHAVIOR‑1K: A Human‑Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation 🌐 💻 📦
2020 CVPR ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks 🌐 💻 📦
2020 RA-L RLBench: The Robot Learning Benchmark & Learning Environment 🌐 💻 📦
2024 arXiv PerAct²: Benchmarking and Learning for Robotic Bimanual Manipulation Tasks 🌐 💻 📦
2020 CoRL Meta‑World: A Benchmark and Evaluation for Multi‑Task and Meta Reinforcement Learning 🌐 💻 📦
2019 CoRL Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning 🌐 💻 📦
2023 NeurIPS LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning 🌐 💻 📦
2022 RA-L CALVIN: A Benchmark for Language‑Conditioned Policy Learning for Long‑Horizon Robot Manipulation Tasks 🌐 💻 📦
2024 arXiv MIKASA: Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning 🌐 💻 📦
2024 CoRL SIMPLER: Evaluating Real‑World Robot Manipulation Policies in Simulation 🌐 💻 📦
2019 ICCV Habitat: A Platform for Embodied AI Research 🌐 💻 📦
2021 NeurIPS Habitat 2.0: Training Home Assistants to Rearrange their Habitat 🌐 💻 📦
2024 ICLR Habitat 3.0: A Co‑Habitat for Humans, Avatars and Robots 🌐 💻 📦
2020 CVPR SAPIEN: A Simulated Part-based Interactive Environment 🌐 💻 📦
2024 RSS The Colosseum: A Benchmark for Evaluating Generalization for Robotic Manipulation 🌐 💻 📦
2025 ICCV VLABench: A Large‑Scale Benchmark for Language‑Conditioned Robotics Manipulation with Long‑Horizon Reasoning Tasks 🌐 💻 📦

Human Behavior Datasets

Year Venue Paper Website Code Data
2022 CVPR Ego4D: Around the World in 3,000 Hours of Egocentric Video 🌐 💻 📦
2024 CVPR Ego‑Exo4D: Understanding Skilled Human Activity from First‑ and Third‑Person Perspectives 🌐 💻 📦
2024 arXiv EgoPlan‑Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models 🌐 💻 📦
2024 arXiv EgoVid‑5M: A Large‑Scale Video‑Action Dataset for Egocentric Video Generation 🌐 💻 📦
2018 ECCV Scaling Egocentric Vision: The EPIC‑KITCHENS Dataset 🌐 💻 📦
2024 ECCV COM Kitchens: An Unedited Overhead‑View Video Dataset as a Vision‑Language Benchmark 🌐 💻 📦
2019 ICCV EgoVQA: An Egocentric Video Question Answering Benchmark Dataset 🌐 💻 📦
2022 NeurIPS EgoTaskQA: Understanding Human Tasks in Egocentric Videos 🌐 💻 📦
2025 arXiv EgoDex: Learning Dexterous Manipulation from Large‑Scale Egocentric Video - - -
2024 RSS DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation 🌐 💻 📦
2024 arXiv EgoMimic: Scaling Imitation Learning via Egocentric Video 🌐 💻 📦
2025 CoRL Humanoid Policy ~ Human Policy 🌐 💻 📦
2025 arXiv Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware 🌐 💻 -

Embodied Datasets and Benchmarks

Year Venue Paper Website Code Data
2018 CVPR EQA: Embodied Question Answering 🌐 💻 📦
2018 CVPR IQA: Visual Question Answering in Interactive Environments - 💻 -
2019 CVPR MT‑EQA: Multi‑Target Embodied Question Answering 🌐 💻 📦
2019 CVPR Embodied Question Answering in Photorealistic Environments with Point Cloud Perception 🌐 💻 📦
2023 ICLR EQA‑MX: Embodied Question Answering using Multimodal Expression - - -
2024 CVPR OpenEQA: Embodied Question Answering in the Era of Foundation Models 🌐 💻 📦
2024 ICLR LoTa‑Bench: Benchmarking Language‑oriented Task Planners for Embodied Agents 🌐 💻 📦

Star History Chart

Citation

If you find this survey helpful for your research or applications, please consider citing it using the following BibTeX entry:

@misc{shao2025largevlmbasedvisionlanguageactionmodels,
      title={Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey}, 
      author={Rui Shao and Wei Li and Lingsen Zhang and Renshan Zhang and Zhiyang Liu and Ran Chen and Liqiang Nie},
      year={2025},
      eprint={2508.13073},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2508.13073}, 
}

Contact Us

For any questions or suggestions, please feel free to contact us at:

Email: shaorui@hit.edu.cn and liwei2024@stu.hit.edu.cn
