F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Paper | Website | Demo | License


We introduce $\mathcal{F}_1$, a novel paradigm that integrates visual foresight generation into the decision-making pipeline. Our model employs a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and action through predictive inverse dynamics modeling.

demo.mp4

🏁 Best viewed with sound on

🚀 Key Innovations

  • 🧠 Predictive Inverse Dynamics: Visual foresight generation for planning-based control (sketched below)
  • 🏗️ Mixture-of-Transformer: Three specialized experts (Understanding, Generation, Action)
  • 📈 Three-Stage Training: Progressive alignment, pretraining, and adaptation
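The control loop implied by predictive inverse dynamics is: generate a foresight image of the desired future observation, then infer the action that bridges the current and predicted observations. Below is a minimal sketch of this idea; all class and method names are hypothetical illustrations, not the actual f1_vla API (the real interfaces live under f1_vla/src).

import torch

class ForesightPolicy:
    """Hypothetical sketch of a predictive-inverse-dynamics control loop."""

    def __init__(self, generator, inverse_dynamics):
        self.generator = generator                 # predicts a future observation
        self.inverse_dynamics = inverse_dynamics   # maps (o_t, o_{t+k}) -> actions

    @torch.no_grad()
    def act(self, observation: torch.Tensor, instruction: str) -> torch.Tensor:
        # 1. Generation expert: imagine what the scene should look like
        #    after making progress on the instruction.
        goal_image = self.generator(observation, instruction)
        # 2. Action expert: recover the action chunk that transforms the
        #    current observation into the imagined one.
        return self.inverse_dynamics(observation, goal_image)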

🤖 Real-World Robot Experiments

Multi-task Manipulation

Genie-1.mov

Nine diverse manipulation tasks, including pick-and-place, handover, and complex object manipulation

Rapid Adaptation

Franka.mov

Sweep and sort tasks demonstrating rapid embodiment adaptation capabilities

Long-horizon Planning

Long-horizon.mp4

A 10-step sequential task lasting over 2 minutes, showcasing long-term planning and execution

Dynamic Environment

Dynamic-Env.mp4

Moving conveyor belt manipulation, demonstrating dynamic scene handling capabilities

Performance Summary

Task           Platform      $\mathcal{F}_1$   $\pi_0$   Improvement
Multi-task     Genie-1       82.2%             65.2%     +17.0%
Adaptation     Franka        66.7%             53.3%     +13.4%
Long-horizon   ARX LIFT II   40.0%             0.0%      +40.0%
Dynamic Env    ARX LIFT II   66.7%             33.3%     +33.4%

🚀 Quick Start

Prerequisites

  • Python β‰₯ 3.10
  • torch β‰₯ 2.6.0
  • CUDA β‰₯ 12.4
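
A quick way to confirm the environment meets the versions listed above (an optional check script, not part of the repo):

# check_env.py -- sanity-check the prerequisite versions.
import sys
import torch

assert sys.version_info >= (3, 10), f"Python >= 3.10 required, got {sys.version}"
major, minor = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert (major, minor) >= (2, 6), f"torch >= 2.6.0 required, got {torch.__version__}"
print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)   # should report 12.4 or newer
print("GPU available:", torch.cuda.is_available())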

Installation

# Clone the repository
git clone https://github.com/InternRobotics/F1-VLA.git
cd F1-VLA
export VLA_HOME=$(pwd)

# Create the conda environment
conda create -n f1_vla python=3.10
conda activate f1_vla

# Install PyTorch with CUDA 12.4 wheels
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 torchcodec==0.2.1 --index-url https://download.pytorch.org/whl/cu124

# Install f1_vla in editable mode (setup.py lives in f1_vla/)
cd f1_vla
pip install -e .

# Pin numpy last so it is not replaced by a newer version pulled in above
pip install numpy==1.26.4

For optimal performance and compatibility, we highly recommend using FFmpeg alongside TorchCodec.

  • FFmpeg is an industry-standard multimedia framework that provides robust, all-purpose video and audio processing.
  • TorchCodec is a library specifically designed for deep learning workflows in PyTorch, offering highly optimized video I/O.

Together, these two tools greatly accelerate loading of video datasets.
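
A quick smoke test that TorchCodec can see your FFmpeg install (based on the torchcodec 0.2 VideoDecoder API; point it at any local video file):

# Decode one frame to verify the TorchCodec + FFmpeg setup.
from torchcodec.decoders import VideoDecoder

decoder = VideoDecoder("sample.mp4")   # any local video file
print(decoder.metadata)                # codec, fps, duration, ...
frame = decoder[0]                     # first frame as a uint8 CHW tensor
print(frame.shape, frame.dtype)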

Download Pretrained Datasets and Models

Name                           Link
LIBERO_SPATIAL_NO_NOOPS_PATH   IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot
STAGE2_CKPT_PATH               F1_pretrain
LEROBOT_PI0_PATH               lerobot/pi0_base
PALIGEMMA_PATH                 google/paligemma-3b-pt-224
VAE_PATH                       vae_ch160v4096z32.pth
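
Assuming the dataset and model entries above are Hugging Face Hub identifiers, they can be fetched with huggingface_hub (a sketch; the hub locations of the F1_pretrain checkpoint and the VAE weights are not given in the table, so they are omitted here):

# Hypothetical download helper; repo IDs are taken from the table above
# and assumed to be Hugging Face Hub identifiers.
from huggingface_hub import snapshot_download

libero_path = snapshot_download(
    "IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot",
    repo_type="dataset",
)
pi0_path = snapshot_download("lerobot/pi0_base")
paligemma_path = snapshot_download("google/paligemma-3b-pt-224")
print(libero_path, pi0_path, paligemma_path, sep="\n")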

Basic Usage

f1_vla
β”œβ”€β”€ config
β”‚   β”œβ”€β”€ debug_test.yaml
β”‚   └── f1_config.json
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ setup.py
β”œβ”€β”€ src
β”‚   β”œβ”€β”€ configs
β”‚   β”œβ”€β”€ models
β”‚   β”œβ”€β”€ policies
β”‚   β”œβ”€β”€ processors
β”‚   └── utils
└── train_hf.py

Finetune

# 1. Edit the config file (point the paths at the assets downloaded above)
vim f1_vla/config/debug_test.yaml

# 2. Run training from the repository root
cd $VLA_HOME
python f1_vla/train_hf.py --config-file f1_vla/config/debug_test.yaml

📚 Citation

If you use this work in your research, please cite our paper:

@article{lv2025f1,
  title={F1: A vision-language-action model bridging understanding and generation to actions},
  author={Lv, Qi and Kong, Weijie and Li, Hao and Zeng, Jia and Qiu, Zherui and Qu, Delin and Song, Haoming and Chen, Qizhi and Deng, Xiang and Pang, Jiangmiao},
  journal={arXiv preprint arXiv:2509.06951},
  year={2025}
}

📄 License

This project is licensed under the MIT License.

πŸ™ Acknowledgments
