🛠️ We're still cooking — Stay tuned!🛠️
⭐ Give us a star if you like it! ⭐
✨If you find this work useful for your research, please kindly cite our paper.✨
🔥 Vision-Language-Action (VLA) models built on large Vision-Language Models (VLMs) have recently emerged as a transformative paradigm for robotic manipulation, tightly coupling perception, language understanding, and action generation. By leveraging the broad knowledge of large VLMs, these models enable robots to interpret natural language instructions, perceive complex environments, and perform diverse manipulation tasks with strong generalization.
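To make the perception-language-action coupling above concrete, below is a minimal, purely illustrative Python sketch of a generic VLA control loop. It is not taken from any surveyed method; all names (`ToyVLAPolicy`, `predict_action`, `Observation`) are hypothetical placeholders.

```python
# Illustrative sketch only: a generic VLA control loop that maps an image
# observation plus a language instruction to a low-level robot action.
# All class and function names are hypothetical placeholders.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray     # H x W x 3 camera image
    instruction: str    # natural-language task description

class ToyVLAPolicy:
    """Stand-in for a large VLM-based VLA model: in real systems, visual and
    language tokens are fused by a transformer backbone that decodes actions."""
    def predict_action(self, obs: Observation) -> np.ndarray:
        # Return a dummy 7-DoF action (Δpose + gripper) to keep the sketch runnable.
        return np.zeros(7, dtype=np.float32)

def control_loop(policy: ToyVLAPolicy, steps: int = 5) -> None:
    for t in range(steps):
        obs = Observation(rgb=np.zeros((224, 224, 3), dtype=np.uint8),
                          instruction="pick up the red block")
        action = policy.predict_action(obs)
        print(f"step {t}: action = {action}")  # in practice, sent to the robot controller

if __name__ == "__main__":
    control_loop(ToyVLAPolicy())
```

In the models surveyed below, the toy policy would be replaced by a large VLM backbone, often decoding discretized or continuous action tokens conditioned on the fused vision-language representation.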
📍 We present the first systematic survey on large VLM-based VLA models for robotic manipulation. This repository serves as the companion resource to our survey: "Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey", and includes all the research papers, benchmarks, and resources reviewed in the paper, organized for easy access and reference.
📌 We will keep updating this repository with newly published works to reflect the latest progress in the field.
- 🤖 Awesome VLA for Robotic Manipulation
| Year | Venue | Paper | Website | Code |
|---|---|---|---|---|
| 2024 | ICML | 3D‑VLA: A 3D Vision‑Language‑Action Generative World Model | 🌐 | 💻 |
| 2024 | NeurIPS | Learning an Actionable Discrete Diffusion Policy via Large‑Scale Actionless Video Pre‑Training | 🌐 | 💻 |
| 2025 | CVPR | Mitigating the Human‑Robot Domain Discrepancy in Visual Pre‑training for Robotic Manipulation | 🌐 | 💻 |
| 2025 | RSS | UniVLA: Learning to Act Anywhere with Task‑centric Latent Actions | - | 💻 |
| 2025 | ICLR | Latent Action Pretraining from Videos | 🌐 | 💻 |
| 2025 | arXiv | Humanoid‑VLA: Towards Universal Humanoid Control with Visual Integration | - | - |
- 📊 Benchmarks

| Year | Venue | Paper | Website | Code | Data |
|---|---|---|---|---|---|
| 2018 | CVPR | EQA: Embodied Question Answering | 🌐 | 💻 | 📦 |
| 2018 | CVPR | IQA: Visual Question Answering in Interactive Environments | - | 💻 | - |
| 2019 | CVPR | MT‑EQA: Multi‑Target Embodied Question Answering | 🌐 | 💻 | 📦 |
| 2019 | CVPR | Embodied Question Answering in Photorealistic Environments with Point Cloud Perception | 🌐 | 💻 | 📦 |
| 2023 | ICLR | EQA‑MX: Embodied Question Answering using Multimodal Expression | - | - | - |
| 2024 | CVPR | OpenEQA: Embodied Question Answering in the Era of Foundation Models | 🌐 | 💻 | 📦 |
| 2024 | ICLR | LoTa‑Bench: Benchmarking Language‑oriented Task Planners for Embodied Agents | 🌐 | 💻 | 📦 |
If you find this survey helpful for your research or applications, please consider citing it using the following BibTeX entry:
```bibtex
@misc{shao2025largevlmbasedvisionlanguageactionmodels,
      title={Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey},
      author={Rui Shao and Wei Li and Lingsen Zhang and Renshan Zhang and Zhiyang Liu and Ran Chen and Liqiang Nie},
      year={2025},
      eprint={2508.13073},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2508.13073},
}
```
For any questions or suggestions, please feel free to contact us at:
Email: shaorui@hit.edu.cn and liwei2024@stu.hit.edu.cn