This is a collection of research papers on Vision-Language-Action (VLA) models with memory, designed to solve long-horizon, partially observable tasks. The repository will be updated regularly to track the frontier. As there is no generally accepted definition of memory in the context of VLA, we include all works related to the ability of VLA systems to process information over horizons longer than a single step.
Since most VLAs are built on Vision-Language Models (VLMs), we also include a section on VLMs with memory.
TODOs:
- Separate navigation and manipulation models with memory
Contributions are welcome: submit a PR with relevant papers or resources you consider significant.
Format:
- [title](paper link)
- author1, author2, and author3...
- Action-Sketcher: From Reasoning to Action via Visual Sketches for Long-Horizon Robotic Manipulation
- Huajie Tan, Peterson Co, Yijie Xu, Shanyu Rong, Yuheng Ji, Cheng Chi, Xiansheng Chen, Qiongyu Zhang, Zhongxia Zhao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
- HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models
- Minghui Lin, Pengxiang Ding, Shu Wang, Zifeng Zhuang, Yang Liu, Xinyang Tong, Wenxuan Song, Shangke Lyu, Siteng Huang, Donglin Wang
- LoLA: Long Horizon Latent Action Learning for General Robot Manipulation
- Xiaofan Wang, Xingyu Gao, Jianlong Fu, Zuolei Li, Dean Fortier, Galen Mullins, Andrey Kolobov, Baining Guo
- Affordance Field Intervention: Enabling VLAs to Escape Memory Traps in Robotic Manipulation
- Siyu Xu, Zijian Wang, Yunke Wang, Chenghao Xia, Tao Huang, Chang Xu
- Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective
- Nhat Chung, Taisei Hanyu, Toan Nguyen, Huy Le, Frederick Bumgarner, Duy Minh Ho Nguyen, Khoa Vo, Kashu Yamazaki, Chase Rainwater, Tung Kieu, Anh Nguyen, Ngan Le
- HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy
- Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin
- LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks
- Yi Yang, Jiaxuan Sun, Siqi Kou, Yihan Wang, Zhijie Deng
- RoboHiMan: A Hierarchical Evaluation Paradigm for Compositional Generalization in Long-Horizon Manipulation
- Yangtao Chen, Zixuan Chen, Nga Teng Chan, Junting Chen, Junhui Yin, Jieqi Shi, Yang Gao, Yong-Lu Li, Jing Huo
- Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding
- Maxim A. Patratskiy, Alexey K. Kovalev, Aleksandr I. Panov
- KV-Efficient VLA: A Method to Speed up Vision Language Models with RNN-Gated Chunked KV Cache
- Wanshun Xu, Long Zhuang, Lianlei Shan
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
- Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, Jiangmiao Pang
- Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation
- Yiguo Fan, Pengxiang Ding, Shuanghao Bai, Xinyang Tong, Yuyang Zhu, Hongchao Lu, Fengqi Dai, Wei Zhao, Yang Liu, Siteng Huang, Zhaoxin Fan, Badong Chen, Donglin Wang
- RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Interactive Environmental Learning in Physical Embodied Systems
- Mingcong Lei, Honghao Cai, Zezhou Cui, Liangchen Tan, Junkun Hong, Gehan Hu, Shuangyu Zhu, Yimou Wu, Shaohan Jiang, Ge Wang, Yuyuan Yang, Junyuan Tan, Zhenglin Wan, Zhen Li, Shuguang Cui, Yiming Zhao, Yatong Han
- DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping
- Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Zhang Chen, Tianrui Guan, Fanlian Zeng, Ka Num Lui, Yuyao Ye, Yitao Liang, Yaodong Yang, Yuanpei Chen
- RoboOS: A Hierarchical Embodied Framework for Cross-Embodiment and Multi-Agent Collaboration
- Huajie Tan, Xiaoshuai Hao, Cheng Chi, Minglan Lin, Yaoxu Lyu, Mingyu Cao, Dong Liang, Zhuo Chen, Mengsi Lyu, Cheng Peng, Chenrui He, Yulong Ao, Yonghua Lin, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
- EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
- Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, Hongxu Yin, Sifei Liu, Song Han, Yao Lu, Xiaolong Wang
- CycleManip: Enabling Cyclic Task Manipulation via Effective Historical Perception and Understanding
- Yi-Lin Wei, Haoran Liao, Yuhao Lin, Pengyue Wang, Zhizhao Liang, Guiliang Liu, Wei-Shi Zheng
- History-Aware Visuomotor Policy Learning via Point Tracking
- Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, Cewu Lu
- MemER: Scaling Up Memory for Robot Control via Experience Retrieval
- Ajay Sridhar, Jennifer Pan, Satvik Sharma, Chelsea Finn
- MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
- Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang
- Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning
- Egor Cherepanov, Nikita Kachaev, Alexey K. Kovalev, Aleksandr I. Panov
- MAP-VLA: Memory-Augmented Prompting for Vision-Language-Action Model in Robotic Manipulation
- Runhao Li, Wenkai Guo, Zhenyu Wu, Changyuan Wang, Haoyuan Deng, Zhenyu Weng, Yap-Peng Tan, Ziwei Wang
- MVP: Memory-enhanced Vision-Language-Action Policy with Feedback Learning
- Anonymous authors. Paper under double-blind review
- Towards Fast, Memory-based and Data-Efficient Vision-Language Policy
- Haoxuan Li, Sixu Yan, Yuhan Li, Xinggang Wang
- EvoVLA: Self-Evolving Vision-Language-Action Model
- Zeting Liu, Zida Yang, Zeyu Zhang, Hao Tang
- EchoVLA: Robotic Vision-Language-Action Model with Synergistic Declarative Memory for Mobile Manipulation
- Min Lin, Xiwen Liang, Bingqian Lin, Liu Jingzhi, Zijian Jiao, Kehan Li, Yuhan Ma, Yuecheng Liu, Shen Zhao, Yuzheng Zhuang, Xiaodan Liang
- Mixture of Horizons in Action Chunking
- Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun, Yunchao Yao, Zhenyu Wei, Yunhui Liu, Zhiwu Lu, Mingyu Ding
- AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
- Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin, Jingjing Qian, Jing Zhang, Yong Wu, Xiaoyuan Yu
- Vision-Language Memory for Spatial Reasoning
- Zuntao Liu, Yi Du, Taimeng Fu, Shaoshu Su, Cherie Ho, Chen Wang
- CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
- Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, Jiangmiao Pang
- ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context
- Huiwon Jang, Sihyun Yu, Heeseung Kwon, Hojin Jeon, Younggyo Seo, Jinwoo Shin
- GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation
- Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui, Zhigang Han, Liqun Huang, Tao Kong, Yuxiao Liu, Hao Niu, Wanli Peng, Jingchao Qiao, Zeyu Ren, Haixin Shi, Zhi Su, Jiawen Tian, Yuyang Xiao, Shenyu Zhang, Liwei Zheng, Hang Li, Yonghui Wu
- CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
- Oier Mees, Lukas Hermann, Erick Rosete-Beas, Wolfram Burgard
- VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management
- Hongbo Jin, Qingyuan Wang, Wenhao Zhang, Yang Liu, Sijie Cheng