We introduce F1, a Vision-Language-Action (VLA) model that bridges understanding and generation to actions.
demo.mp4
Best viewed with sound on
- Predictive Inverse Dynamics: Visual foresight generation for planning-based control (see the sketch after this list)
- Mixture-of-Transformer: Three specialized experts (Understanding, Generation, Action)
- Three-Stage Training: Progressive alignment, pretraining, and adaptation
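To make the interplay of these pieces concrete, below is a minimal PyTorch-style sketch of the predictive-inverse-dynamics flow. All class names, dimensions, and the action horizon are hypothetical placeholders rather than the actual F1 implementation: the Understanding expert fuses observation and instruction tokens, the Generation expert produces a visual foresight, and the Action expert decodes an action chunk conditioned on that foresight.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for one expert; names and sizes are illustrative only.
class Expert(nn.Module):
    def __init__(self, dim: int = 512, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.blocks(tokens)

class F1Sketch(nn.Module):
    """Predictive inverse dynamics: observation -> foresight -> action."""
    def __init__(self, dim: int = 512, action_dim: int = 7, horizon: int = 8):
        super().__init__()
        self.understanding = Expert(dim)   # fuses visual + language tokens
        self.generation = Expert(dim)      # predicts future visual (foresight) tokens
        self.action = Expert(dim)          # decodes an action chunk from the foresight
        self.action_head = nn.Linear(dim, action_dim)
        self.horizon = horizon

    def forward(self, obs_tokens: torch.Tensor, lang_tokens: torch.Tensor) -> torch.Tensor:
        context = self.understanding(torch.cat([obs_tokens, lang_tokens], dim=1))
        foresight = self.generation(context)               # imagined future observation
        acts = self.action(torch.cat([context, foresight], dim=1))
        return self.action_head(acts[:, -self.horizon:])   # (B, horizon, action_dim)

# Toy usage with random tokens.
model = F1Sketch()
obs = torch.randn(1, 64, 512)    # e.g. image patch tokens
lang = torch.randn(1, 16, 512)   # e.g. instruction tokens
print(model(obs, lang).shape)    # torch.Size([1, 8, 7])
```

Refer to the paper for the real Mixture-of-Transformer design and the three-stage training recipe.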
Genie-1.mov
9 diverse manipulation tasks including pick-and-place, handover, and complex object manipulation
Franka.mov
Sweep and sort tasks demonstrating rapid embodiment adaptation capabilities
Long-horizon.mp4
10-step sequential task over 2 minutes, showcasing long-term planning and execution
Dynamic-Env.mp4
Moving conveyor belt manipulation, demonstrating dynamic scene handling capabilities
| Task | Platform | F1 | Baseline | Improvement |
|---|---|---|---|---|
| Multi-task | Genie-1 | 82.2% | 65.2% | +17.0% |
| Adaptation | Franka | 66.7% | 53.3% | +13.4% |
| Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0% |
| Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4% |
- Python ≥ 3.10
- torch ≥ 2.6.0
- CUDA ≥ 12.4
```bash
# Clone repository
git clone https://github.com/InternRobotics/F1-VLA.git
export VLA_HOME=$(pwd)
cd F1-VLA/f1_vla

# Create environment
conda create -n f1_vla python==3.10
conda activate f1_vla

# Install dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 torchcodec==0.2.1 --index-url https://download.pytorch.org/whl/cu124

# Install f1_vla
pip install -e .
pip install numpy==1.26.4
```
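A quick, optional sanity check (a minimal sketch that only assumes the packages pinned above installed correctly) is to confirm the versions and that PyTorch can see the GPU:

```python
import numpy
import torch
import torchvision

# Verify the pinned versions and that the CUDA 12.4 build can see a GPU.
print("torch:", torch.__version__, "| torchvision:", torchvision.__version__, "| numpy:", numpy.__version__)
print("CUDA available:", torch.cuda.is_available())
```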
For optimal performance and compatibility, we highly recommend using FFmpeg alongside TorchCodec.
- FFmpeg is an industry-standard multimedia framework that provides robust, general-purpose video and audio processing.
- TorchCodec is a library designed for deep learning workflows in PyTorch, offering highly optimized video I/O.

Together, these tools substantially accelerate loading of the video dataset.
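As a reference for how TorchCodec is typically used to read frames, here is a minimal sketch (the file name is a placeholder, and only basic indexing is shown; consult the TorchCodec documentation for the full decoding API):

```python
from torchcodec.decoders import VideoDecoder

# Decode the first frame of a local video file (path is a placeholder).
decoder = VideoDecoder("demo.mp4")
frame = decoder[0]  # uint8 tensor in (C, H, W) layout
print(frame.shape, frame.dtype)
```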
| Name | Link |
|---|---|
| LIBERO_SPATIAL_NO_NOOPS_PATH | IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot |
| STAGE2_CKPT_PATH | F1_pretrain |
| LEROBOT_PI0_PATH | lerobot/pi0_base |
| PALIGEMMA_PATH | google/paligemma-3b-pt-224 |
| VAE_PATH | vae_ch160v4096z32.pth |
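The assets in the table above are hosted on the Hugging Face Hub. One way to fetch them locally is `huggingface_hub.snapshot_download`; the sketch below shows two of the entries, and the local target directories are arbitrary choices rather than paths the code expects:

```python
from huggingface_hub import snapshot_download

# LIBERO dataset in LeRobot format (note repo_type="dataset").
snapshot_download(
    repo_id="IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot",
    repo_type="dataset",
    local_dir="data/libero_spatial_no_noops",
)

# PaliGemma backbone (gated on the Hub; you may need to accept its license first).
snapshot_download(
    repo_id="google/paligemma-3b-pt-224",
    local_dir="ckpts/paligemma-3b-pt-224",
)
```

The remaining checkpoints follow the same pattern; afterwards, set the corresponding names (e.g. `LIBERO_SPATIAL_NO_NOOPS_PATH`, `PALIGEMMA_PATH`) to the downloaded locations wherever the config expects them.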
```
f1_vla
├── config
│   ├── debug_test.yaml
│   └── f1_config.json
├── requirements.txt
├── setup.py
├── src
│   ├── configs
│   ├── models
│   ├── policies
│   ├── processors
│   └── utils
└── train_hf.py
```

```bash
# 1. edit config file
vim f1_vla/config/debug_test.yaml
# 2. run the program
cd $VLA_HOME
python train_hf.py --config-file f1_vla/config/debug_test.yaml
```
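If you prefer to adjust settings programmatically rather than in an editor, a small PyYAML sketch works too (assuming PyYAML is installed; `batch_size` is a hypothetical key, so substitute whatever keys `debug_test.yaml` actually defines):

```python
import yaml

cfg_path = "f1_vla/config/debug_test.yaml"

# Load, tweak, and write back the training config.
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["batch_size"] = 8  # hypothetical key, shown for illustration only

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)
```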
If you use this work in your research, please cite our paper:

```bibtex
@article{lv2025f1,
title={F1: A vision-language-action model bridging understanding and generation to actions},
author={Lv, Qi and Kong, Weijie and Li, Hao and Zeng, Jia and Qiu, Zherui and Qu, Delin and Song, Haoming and Chen, Qizhi and Deng, Xiang and Pang, Jiangmiao},
journal={arXiv preprint arXiv:2509.06951},
year={2025}
}
```

This project is licensed under the MIT License.