Awesome VLM-based VLA for Robotic Manipulation


🛠️ We're still cooking — Stay tuned!🛠️
⭐ Give us a star if you like it! ⭐
✨If you find this work useful for your research, please kindly cite our paper.✨


🔥 Large VLM-based Vision-Language-Action (VLA) models have recently emerged as a transformative paradigm for robotic manipulation by tightly coupling perception, language understanding, and action generation. Built upon large Vision-Language Models (VLMs), they enable robots to interpret natural language instructions, perceive complex environments, and perform diverse manipulation tasks with strong generalization.

📍 We present the first systematic survey on large VLM-based VLA models for robotic manipulation. This repository serves as the companion resource to our survey: "Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey", and includes all the research papers, benchmarks, and resources reviewed in the paper, organized for easy access and reference.

📌 We will keep updating this repository with newly published works to reflect the latest progress in the field.

Table of Contents

- Monolithic Models
  - Single-System
  - Dual-System
- Hierarchical Models
  - Planner Only
  - Planner + Policy
- Other Advanced Fields
  - Reinforcement Learning-based Methods
  - Training-Free Methods
  - Learning from Human Videos
  - World Model-based VLA
- Datasets and Benchmarks
  - Real-world Robot Datasets
  - Simulation Environments and Benchmarks
  - Human Behavior Datasets
  - Embodied Datasets and Benchmarks
- Star History Chart
- Citation
- Contact Us

Monolithic Models

Single-System

Year Venue Paper Website Code
2023 CoRL RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control 🌐 -
2024 ICRA RT-2-X: Open X-Embodiment: Robotic Learning Datasets and RT-X Models 🌐 -
2024 ICLR RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators 🌐 💻
2024 ICML LEO Agent: An Embodied Generalist Agent in 3D World 🌐 💻
2024 NeurIPS RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation 🌐 💻
2024 CoRL OpenVLA: An Open-Source Vision-Language-Action Model 🌐 💻
2024 CoRL ECoT: Robotic Control via Embodied Chain-of-Thought Reasoning 🌐 💻
2024 ICRA ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models 🌐 -
2024 NeurIPS DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution - 💻
2024 ICLR TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies 🌐 💻
2025 ICRA FuSe: Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding 🌐 💻
2025 CVPR UniAct: Universal Actions for Enhanced Embodied Foundation Models 🌐 💻
2025 arXiv SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model 🌐 💻
2025 ICML UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent - 💻
2025 ICLR VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation - -
2025 arXiv OpenVLA-OFT: Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success 🌐 💻
2025 arXiv PD-VLA: Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding - -
2025 arXiv HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model 🌐 💻
2025 arXiv MoLe-VLA: Dynamic Layer-Skipping Vision-Language-Action Model via Mixture-of-Layers for Efficient Robot Manipulation 🌐 💻
2025 CVPR CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models 🌐 -
2025 arXiv NORA: A Small Open-Sourced Generalist Vision-Language-Action Model for Embodied Tasks 🌐 💻
2025 arXiv VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation 🌐 -
2025 arXiv OE-VLA: Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions - -
2025 arXiv ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning - -
2025 arXiv FlashVLA: Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models - -
2025 arXiv LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks - -
2025 arXiv BitVLA: 1-Bit Vision-Language-Action Models for Robotics Manipulation - 💻
2025 arXiv BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models 🌐 💻
2025 arXiv UniVLA: Unified Vision-Language-Action Model 🌐 💻
2025 arXiv WorldVLA: Towards Autoregressive Action World Model - 💻
2025 arXiv 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration - 💻
2025 ICCV VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers 🌐 💻
2025 arXiv VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting - -
2025 arXiv SpecVLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance - -
2025 arXiv ST-VLA: Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding 🌐 -
2025 arXiv Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies - -
2025 arXiv LLaDA-VLA: Vision Language Diffusion Action Models 🌐 -
2025 arXiv OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision - -
2025 arXiv TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models 🌐 -
2025 arXiv STARE-VLA: Progressive Stage-Aware Reinforcement for Fine-Tuning Vision-Language-Action Models 🌐 -

Dual-System

Year Venue Paper Website Code
2024 arXiv A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM - -
2024 arXiv Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation 🌐 -
2024 IROS From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control - -
2025 arXiv GR00T N1: An Open Foundation Model for Generalist Humanoid Robots - -
2024 arXiv CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation 🌐 💻
2024 CoRL HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers - -
2025 arXiv GraspVLA: A Grasping Foundation Model Pre-trained on Billion-Scale Synthetic Action Data 🌐 💻
2025 arXiv Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning 🌐 💻
2025 arXiv OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation 🌐 💻
2025 arXiv ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model 🌐 💻
2025 arXiv ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge 🌐 -
2025 ICML DiffusionVLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression 🌐 💻
2025 arXiv TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control 🌐 -
2025 arXiv Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control - -
2025 arXiv RationalVLA: A Rational Vision-Language-Action Model with Dual System 🌐 -
2025 RA-L TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation 🌐 💻
2025 RSS π0: A Vision-Language-Action Flow Model for General Robot Control 🌐 -
2025 RSS FAST: Efficient Action Tokenization for Vision-Language-Action Models 🌐 -
2025 arXiv π0.5: A Vision-Language-Action Model with Open-World Generalization 🌐 -
2025 arXiv Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better 🌐 -
2025 arXiv ForceVLA: Enhancing VLA Models with a Force-Aware MoE for Contact-Rich Manipulation 🌐 -
2025 arXiv SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics - 💻
2025 arXiv OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning 🌐 💻
2025 arXiv Tactile-VLA: Unlocking Vision-Language-Action Model’s Physical Knowledge for Tactile Generalization - -
2025 arXiv GR-3 Technical Report 🌐 -
2025 arXiv villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models 🌐 💻
2025 arXiv ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning 🌐 -
2025 arXiv F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions 🌐 💻
2025 arXiv iFlyBot-VLA Technical Report 🌐 -
2025 arXiv Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment 🌐 💻
2025 arXiv NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards 🌐 💻
2025 arXiv π*0.6: a VLA that Learns from Experience 🌐 -
2025 arXiv ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation 🌐 -
2025 arXiv METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model 🌐 -

Hierarchical Models

Planner Only

Year Venue Paper Website Code
2023 ICML PaLM-E: An embodied multimodal language model 🌐 -
2023 arXiv ViLa: Look before you leap: Unveiling the power of GPT-4V in robotic vision-language planning 🌐 -
2024 CVPR MoManipVLA: Transferring vision-language-action models for general mobile manipulation 🌐 -
2024 CoRL RoboPoint: A vision-language model for spatial affordance prediction in robotics 🌐 💻
2025 arXiv ManipLVM-R1: Reinforcement learning for reasoning in embodied manipulation with large vision-language models - -
2025 arXiv Embodied-Reasoner: Synergizing visual search, reasoning, and action for embodied interactive tasks 🌐 💻
2025 arXiv Reinforced Planning: Solving long-horizon tasks via imitation and reinforcement learning - 💻
2025 ICRA Chain-of-Modality: Learning manipulation programs from multimodal human videos with vision-language-models 🌐 -
2025 arXiv Embodied-R: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning 🌐 💻
2025 CVPR RoVI: Robotic Visual Instruction 🌐 -
2025 arXiv ReLEP: Long-horizon embodied planning with implicit logical inference and hallucination mitigation - -
2025 CVPR RoboBrain: A unified brain model for robotic manipulation from abstract to concrete 🌐 💻

Planner + Policy

Year Venue Paper Website Code
2023 arXiv Instruct2Act: Mapping multi-modality instructions to robotic actions with large language model - 💻
2023 CoRL VoxPoser: Composable 3D value maps for robotic manipulation with language models 🌐 💻
2024 CVPR SkillDiffuser: Interpretable hierarchical planning via skill abstractions in diffusion-based task execution 🌐 💻
2024 arXiv RoboMatrix: A skill-centric hierarchical framework for scalable robot task planning and execution in open-world - 💻
2024 CoRL RT-Affordance: Reasoning about robotic manipulation with affordances 🌐 -
2024 CoRL LLARVA: Vision-action instruction tuning enhances robot learning 🌐 💻
2024 CVPR MALMM: Multi-agent large language models for zero-shot robotics manipulation 🌐 💻
2024 arXiv RT-H: Action Hierarchies Using Language 🌐 -
2024 CoRL ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation 🌐 💻
2025 ICLR HAMSTER: Hierarchical action models for open-world robot manipulation 🌐 💻
2025 ICML HiRobot: Open-ended instruction following with hierarchical vision-language-action models 🌐 -
2025 arXiv Agentic Robot: A brain-inspired framework for vision-language-action models in embodied agents 🌐 💻
2025 arXiv DexVLA: Vision-language model with plug-in diffusion expert for general robot control 🌐 💻
2025 arXiv PointVLA: Injecting the 3D world into vision-language-action models 🌐 -
2025 arXiv A0: An affordance-aware hierarchical model for general robotic manipulation 🌐 💻
2025 arXiv From seeing to doing: Bridging reasoning and decision for robotic manipulation 🌐 💻
2025 ICCV RoBridge: A hierarchical architecture bridging cognition and execution for general robotic manipulation 🌐 💻
2025 arXiv RoboCerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation 🌐 -
2025 arXiv π0.5: A vision-language-action model with open-world generalization 🌐 -
2025 arXiv DexGraspVLA: A vision-language-action framework towards general dexterous grasping 🌐 💻
2025 arXiv HiBerNAC: Hierarchical brain-emulated robotic neural agent collective for disentangling complex manipulation - -
2025 arXiv Robix: A Unified Model for Robot Interaction, Reasoning and Planning 🌐 -

Other Advanced Fields

Reinforcement Learning-based Methods

Year Venue Paper Website Code
2025 ICLR(Workshop) GRAPE: Generalizing Robot Policy via Preference Alignment 🌐 💻
2025 arXiv Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning - 💻
2025 RSS ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations 🌐 -
2025 RSS ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy 🌐 💻
2025 RSS RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning 🌐 -
2025 arXiv TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization - -
2025 ICRA Improving Vision-Language-Action Model with Online Reinforcement Learning - -
2025 CVPR(Workshop) Interactive Post‑Training for Vision‑Language‑Action Models 🌐 💻
2025 arXiv Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation 🌐 💻
2025 arXiv VLAC: A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning 🌐 💻
2025 arXiv SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning - 💻

Training-Free Methods

Year Venue Paper Website Code
2025 arXiv VLA‑Cache: Towards Efficient Vision‑Language‑Action Model via Adaptive Token Caching in Robotic Manipulation - 💻
2025 arXiv Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding - -
2025 arXiv Think Twice, Act Once: Token‑Aware Compression and Action Reuse for Efficient Inference in Vision‑Language‑Action Models - -
2025 arXiv EfficientVLA: Training‑Free Acceleration and Compression for Vision‑Language‑Action Models - -
2025 arXiv SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration - -
2025 arXiv Block-wise Adaptive Caching for Accelerating Diffusion Policy - -
2025 RSS FAST: Efficient Action Tokenization for Vision-Language-Action Models 🌐 -
2025 arXiv Real-Time Execution of Action Chunking Flow Policies 🌐 -
2025 arXiv SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning - -

Learning from Human Videos

Year Venue Paper Website Code
2024 ICML 3D‑VLA: A 3D Vision‑Language‑Action Generative World Model 🌐 💻
2024 NeurIPS Learning an Actionable Discrete Diffusion Policy via Large‑Scale Actionless Video Pre‑Training 🌐 💻
2025 CVPR Mitigating the Human‑Robot Domain Discrepancy in Visual Pre‑training for Robotic Manipulation 🌐 💻
2025 RSS UniVLA: Learning to Act Anywhere with Task‑centric Latent Actions - 💻
2025 ICLR Latent Action Pretraining from Videos 🌐 💻
2025 arXiv Humanoid‑VLA: Towards Universal Humanoid Control with Visual Integration - -

World Model-based VLA

Year Venue Paper Website Code
2024 CVPR FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects 🌐 💻
2024 ICML 3D‑VLA: A 3D Vision‑Language‑Action Generative World Model 🌐 💻
2025 arXiv WorldVLA: Towards Autoregressive Action World Model - 💻
2025 arXiv World4Omni: A Zero‑Shot Framework from Image Generation World Model to Robotic Manipulation 🌐 -
2025 arXiv Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations 🌐 💻
2025 arXiv V‑JEPA 2: Self‑Supervised Video Models Enable Understanding, Prediction and Planning 🌐 💻
2025 arXiv FlowVLA: Thinking in Motion with a Visual Chain of Thought 🌐 -
2025 arXiv Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation 🌐 💻
2025 arXiv WoW: Towards a World-omniscient World-model Through Embodied Interaction 🌐 💻
2025 arXiv RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL - -

Datasets and Benchmarks

Real-world Robot Datasets

Year Venue Paper Website Code Data
2021 CoRL BC-Z: Zero-shot task generalization with robotic imitation learning 🌐 💻 📦
2023 RSS RT‑1: Robotics Transformer for Real‑World Control at Scale 🌐 💻 -
2023 CoRL RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control 🌐 💻 -
2022 RSS Bridge Data: Boosting Generalization of Robotic Skills with Cross‑Domain Datasets 🌐 💻 📦
2023 CoRL BridgeData V2: A Dataset for Robot Learning at Scale 🌐 💻 📦
2024 ICRA RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One‑Shot 🌐 💻 📦
2024 RSS DROID: A Large-Scale In‑The‑Wild Robot Manipulation Dataset 🌐 💻 📦
2024 ICRA Open X‑Embodiment: Robotic Learning Datasets and RT‑X Models 🌐 💻 📦
2025 RSS RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation 🌐 💻 📦
2025 IROS AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems 🌐 💻 📦
2025 arXiv BRMData: Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks 🌐 💻 📦

Simulation Environments and Benchmarks

Year Venue Paper Website Code Data
2022 CoRL BEHAVIOR‑1K: A Human‑Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation 🌐 💻 📦
2020 CVPR ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks 🌐 💻 📦
2020 RA-L RLBench: The Robot Learning Benchmark & Learning Environment 🌐 💻 📦
2024 arXiv PerAct²: Benchmarking and Learning for Robotic Bimanual Manipulation Tasks 🌐 💻 📦
2020 CoRL Meta‑World: A Benchmark and Evaluation for Multi‑Task and Meta Reinforcement Learning 🌐 💻 📦
2019 CoRL Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning 🌐 💻 📦
2023 NeurIPS LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning 🌐 💻 📦
2022 RA-L CALVIN: A Benchmark for Language‑Conditioned Policy Learning for Long‑Horizon Robot Manipulation Tasks 🌐 💻 📦
2024 arXiv MIKASA: Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning 🌐 💻 📦
2024 CoRL SIMPLER: Evaluating Real‑World Robot Manipulation Policies in Simulation 🌐 💻 📦
2019 ICCV Habitat: A Platform for Embodied AI Research 🌐 💻 📦
2021 NeurIPS Habitat 2.0: Training Home Assistants to Rearrange their Habitat 🌐 💻 📦
2024 ICLR Habitat 3.0: A Co‑Habitat for Humans, Avatars and Robots 🌐 💻 📦
2020 CVPR SAPIEN: A Simulated Part-based Interactive Environment 🌐 💻 📦
2024 RSS The Colosseum: A Benchmark for Evaluating Generalization for Robotic Manipulation 🌐 💻 📦
2025 ICCV VLABench: A Large‑Scale Benchmark for Language‑Conditioned Robotics Manipulation with Long‑Horizon Reasoning Tasks 🌐 💻 📦

Human Behavior Datasets

Year Venue Paper Website Code Data
2022 CVPR Ego4D: Around the World in 3,000 Hours of Egocentric Video 🌐 💻 📦
2024 CVPR Ego‑Exo4D: Understanding Skilled Human Activity from First‑ and Third‑Person Perspectives 🌐 💻 📦
2024 arXiv EgoPlan‑Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models 🌐 💻 📦
2024 arXiv EgoVid‑5M: A Large‑Scale Video‑Action Dataset for Egocentric Video Generation 🌐 💻 📦
2018 ECCV Scaling Egocentric Vision: The EPIC‑KITCHENS Dataset 🌐 💻 📦
2024 ECCV COM Kitchens: An Unedited Overhead‑View Video Dataset as a Vision‑Language Benchmark 🌐 💻 📦
2019 ICCV EgoVQA: An Egocentric Video Question Answering Benchmark Dataset 🌐 💻 📦
2022 NeurIPS EgoTaskQA: Understanding Human Tasks in Egocentric Videos 🌐 💻 📦
2025 arXiv EgoDex: Learning Dexterous Manipulation from Large‑Scale Egocentric Video - - -
2024 RSS DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation 🌐 💻 📦
2024 arXiv EgoMimic: Scaling Imitation Learning via Egocentric Video 🌐 💻 📦
2025 CoRL Humanoid Policy ~ Human Policy 🌐 💻 📦
2025 arXiv Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware 🌐 💻 -

Embodied Datasets and Benchmarks

Year Venue Paper Website Code Data
2018 CVPR EQA: Embodied Question Answering 🌐 💻 📦
2018 CVPR IQA: Visual Question Answering in Interactive Environments - 💻 -
2019 CVPR MT‑EQA: Multi‑Target Embodied Question Answering 🌐 💻 📦
2019 CVPR Embodied Question Answering in Photorealistic Environments with Point Cloud Perception 🌐 💻 📦
2023 ICLR EQA‑MX: Embodied Question Answering using Multimodal Expression - - -
2024 CVPR OpenEQA: Embodied Question Answering in the Era of Foundation Models 🌐 💻 📦
2024 ICLR LoTa‑Bench: Benchmarking Language‑oriented Task Planners for Embodied Agents 🌐 💻 📦

Star History Chart

Citation

If you find this survey helpful for your research or applications, please consider citing it using the following BibTeX entry:

@misc{shao2025largevlmbasedvisionlanguageactionmodels,
      title={Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey}, 
      author={Rui Shao and Wei Li and Lingsen Zhang and Renshan Zhang and Zhiyang Liu and Ran Chen and Liqiang Nie},
      year={2025},
      eprint={2508.13073},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2508.13073}, 
}

Contact Us

For any questions or suggestions, please feel free to contact us at:

Email: shaorui@hit.edu.cn and liwei2024@stu.hit.edu.cn
