We introduce F1, a Vision-Language-Action (VLA) model that bridges understanding and generation to actions.
demo.mp4
Best viewed with sound on
- Predictive Inverse Dynamics: Visual foresight generation for planning-based control (see the sketch after this list)
- Mixture-of-Transformer: Three specialized experts (Understanding, Generation, Action)
- Three-Stage Training: Progressive alignment, pretraining, and adaptation
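To make the interplay of these pieces concrete, below is a minimal PyTorch-style sketch of the predictive-inverse-dynamics flow. All class names, dimensions, and the action horizon are hypothetical placeholders rather than the actual F1 implementation: the Understanding expert fuses observation and instruction tokens, the Generation expert produces a visual foresight, and the Action expert decodes an action chunk conditioned on that foresight.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for one expert; names and sizes are illustrative only.
class Expert(nn.Module):
    def __init__(self, dim: int = 512, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.blocks(tokens)

class F1Sketch(nn.Module):
    """Predictive inverse dynamics: observation -> foresight -> action."""
    def __init__(self, dim: int = 512, action_dim: int = 7, horizon: int = 8):
        super().__init__()
        self.understanding = Expert(dim)   # fuses visual + language tokens
        self.generation = Expert(dim)      # predicts future visual (foresight) tokens
        self.action = Expert(dim)          # decodes an action chunk from the foresight
        self.action_head = nn.Linear(dim, action_dim)
        self.horizon = horizon

    def forward(self, obs_tokens: torch.Tensor, lang_tokens: torch.Tensor) -> torch.Tensor:
        context = self.understanding(torch.cat([obs_tokens, lang_tokens], dim=1))
        foresight = self.generation(context)               # imagined future observation
        acts = self.action(torch.cat([context, foresight], dim=1))
        return self.action_head(acts[:, -self.horizon:])   # (B, horizon, action_dim)

# Toy usage with random tokens.
model = F1Sketch()
obs = torch.randn(1, 64, 512)    # e.g. image patch tokens
lang = torch.randn(1, 16, 512)   # e.g. instruction tokens
print(model(obs, lang).shape)    # torch.Size([1, 8, 7])
```

Refer to the paper for the real Mixture-of-Transformer design and the three-stage training recipe.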
Genie-1.mov
9 diverse manipulation tasks including pick-and-place, handover, and complex object manipulation
Franka.mov
Sweep and sort tasks demonstrating rapid embodiment adaptation capabilities
Long-horizon.mp4
10-step sequential task over 2 minutes, showcasing long-term planning and execution
Dynamic-Env.mp4
Moving conveyor belt manipulation, demonstrating dynamic scene handling capabilities
| Task | Platform | F1 | Baseline | Improvement |
|---|---|---|---|---|
| Multi-task | Genie-1 | 82.2% | 65.2% | +17.0% |
| Adaptation | Franka | 66.7% | 53.3% | +13.4% |
| Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0% |
| Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4% |
- Python ≥ 3.10
- torch ≥ 2.6.0
- CUDA ≥ 12.4
```bash
# Clone repository
git clone https://github.com/InternRobotics/F1-VLA.git
export VLA_HOME=$(pwd)
cd F1-VLA/f1_vla

# Create environment
conda create -n f1_vla python==3.10
conda activate f1_vla

# Install dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 torchcodec==0.2.1 --index-url https://download.pytorch.org/whl/cu124

# Install f1_vla
pip install -e .
pip install numpy==1.26.4
```
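A quick, optional sanity check (a minimal sketch that only assumes the packages pinned above installed correctly) is to confirm the versions and that PyTorch can see the GPU:

```python
import numpy
import torch
import torchvision

# Verify the pinned versions and that the CUDA 12.4 build can see a GPU.
print("torch:", torch.__version__, "| torchvision:", torchvision.__version__, "| numpy:", numpy.__version__)
print("CUDA available:", torch.cuda.is_available())
```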
For optimal performance and compatibility, we highly recommend using FFmpeg alongside TorchCodec.
- FFmpeg is an industry-standard multimedia framework that provides robust, general-purpose video and audio processing.
- TorchCodec is a library designed for deep learning workflows in PyTorch, offering highly optimized video I/O.

Together, these tools substantially accelerate loading of the video dataset.
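As a reference for how TorchCodec is typically used to read frames, here is a minimal sketch (the file name is a placeholder, and only basic indexing is shown; consult the TorchCodec documentation for the full decoding API):

```python
from torchcodec.decoders import VideoDecoder

# Decode the first frame of a local video file (path is a placeholder).
decoder = VideoDecoder("demo.mp4")
frame = decoder[0]  # uint8 tensor in (C, H, W) layout
print(frame.shape, frame.dtype)
```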
| Name | Link |
|---|---|
| LIBERO_SPATIAL_NO_NOOPS_PATH | IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot |
| STAGE2_CKPT_PATH | F1_pretrain |
| LEROBOT_PI0_PATH | lerobot/pi0_base |
| PALIGEMMA_PATH | google/paligemma-3b-pt-224 |
| VAE_PATH | vae_ch160v4096z32.pth |
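The assets in the table above are hosted on the Hugging Face Hub. One way to fetch them locally is `huggingface_hub.snapshot_download`; the sketch below shows two of the entries, and the local target directories are arbitrary choices rather than paths the code expects:

```python
from huggingface_hub import snapshot_download

# LIBERO dataset in LeRobot format (note repo_type="dataset").
snapshot_download(
    repo_id="IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot",
    repo_type="dataset",
    local_dir="data/libero_spatial_no_noops",
)

# PaliGemma backbone (gated on the Hub; you may need to accept its license first).
snapshot_download(
    repo_id="google/paligemma-3b-pt-224",
    local_dir="ckpts/paligemma-3b-pt-224",
)
```

The remaining checkpoints follow the same pattern; afterwards, set the corresponding names (e.g. `LIBERO_SPATIAL_NO_NOOPS_PATH`, `PALIGEMMA_PATH`) to the downloaded locations wherever the config expects them.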
```
f1_vla
├── config
│   ├── debug_test.yaml
│   └── f1_config.json
├── requirements.txt
├── setup.py
├── src
│   ├── configs
│   ├── models
│   ├── policies
│   ├── processors
│   └── utils
└── train_hf.py
```

```bash
# 1. edit config file
vim f1_vla/config/debug_test.yaml
# 2. run the program
cd $VLA_HOME
python train_hf.py --config-file f1_vla/config/debug_test.yaml
```
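If you prefer to adjust settings programmatically rather than in an editor, a small PyYAML sketch works too (assuming PyYAML is installed; `batch_size` is a hypothetical key, so substitute whatever keys `debug_test.yaml` actually defines):

```python
import yaml

cfg_path = "f1_vla/config/debug_test.yaml"

# Load, tweak, and write back the training config.
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["batch_size"] = 8  # hypothetical key, shown for illustration only

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)
```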
If you use this work in your research, please cite our paper:

```bibtex
@article{lv2025f1,
title={F1: A vision-language-action model bridging understanding and generation to actions},
author={Lv, Qi and Kong, Weijie and Li, Hao and Zeng, Jia and Qiu, Zherui and Qu, Delin and Song, Haoming and Chen, Qizhi and Deng, Xiang and Pang, Jiangmiao},
journal={arXiv preprint arXiv:2509.06951},
year={2025}
}
```

This project is licensed under the MIT License.