We introduce F1, a Vision-Language-Action (VLA) model that bridges understanding and generation to actions.
demo.mp4
Best viewed with sound on
- Predictive Inverse Dynamics: Visual foresight generation for planning-based control
- Mixture-of-Transformer: Three specialized experts (Understanding, Generation, Action)
- Three-Stage Training: Progressive alignment, pretraining, and adaptation
Genie-1.mov
9 diverse manipulation tasks including pick-and-place, handover, and complex object manipulation
Franka.mov
Sweep and sort tasks demonstrating rapid embodiment adaptation capabilities
Long-horizon.mp4
10-step sequential task over 2 minutes, showcasing long-term planning and execution
Dynamic-Env.mp4
Moving conveyor belt manipulation, demonstrating dynamic scene handling capabilities
Task | Platform | F1 (Ours) | Baseline | Improvement
---|---|---|---|---
Multi-task | Genie-1 | 82.2% | 65.2% | +17.0%
Adaptation | Franka | 66.7% | 53.3% | +13.4%
Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0%
Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4%
- Python β₯ 3.10
- torch β₯ 2.6.0
- CUDA β₯ 12.4
# Clone repository and set the project root
git clone https://github.com/aopolin-lv/F1-VLA.git
cd F1-VLA
export VLA_HOME=$(pwd)
cd f1_vla
# Create environment
conda create -n f1_vla python=3.10
conda activate f1_vla
# Install dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 torchcodec==0.2.1 --index-url https://download.pytorch.org/whl/cu124
# Install f1_vla in editable mode
pip install -e .
pip install numpy==1.26.4
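After installation, a quick sanity check (a minimal sketch; it only assumes the environment created above is active) confirms that the pinned PyTorch build can see the CUDA runtime:

```bash
# Print the torch version, the CUDA version it was built against, and GPU availability
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```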
For optimal performance and compatibility, we highly recommend using FFmpeg alongside TorchCodec.
- FFmpeg is an industry-standard multimedia framework that provides robust, all-purpose video and audio processing.
- TorchCodec is a library specifically designed for deep learning workflows in PyTorch, offering highly optimized video I/O.
Together, these two tools greatly speed up loading of the video datasets, as sketched below.
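If FFmpeg is not already installed, one option (a sketch assuming a conda-based setup and the TorchCodec `VideoDecoder` API; the sample file name is a placeholder) is to install it into the same environment and verify that TorchCodec can decode a clip:

```bash
# Install FFmpeg shared libraries into the active conda environment
conda install -c conda-forge ffmpeg

# Decode the first frame of a local video to confirm TorchCodec and FFmpeg work together
python -c "from torchcodec.decoders import VideoDecoder; print(VideoDecoder('sample.mp4')[0].shape)"
```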
Name | Link |
---|---|
LIBERO_SPATIAL_NO_NOOPS_PATH | IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot |
STAGE2_CKPT_PATH | F1_pretrain |
LEROBOT_PI0_PATH | lerobot/pi0 |
PALIGEMMA_PATH | google/paligemma-3b-pt-224 |
VAE_PATH | var_d16.pth |
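One possible way to fetch the publicly hosted assets is sketched below, using `huggingface-cli` (the local directory names are placeholders; whether these paths are read from environment variables or from `debug_test.yaml` depends on your configuration, and `F1_pretrain` plus `var_d16.pth` should likewise be downloaded and pointed to wherever the config expects them):

```bash
# Download the LIBERO-Spatial dataset and the base model weights from the Hugging Face Hub
# (google/paligemma-3b-pt-224 is gated and may require `huggingface-cli login` first)
huggingface-cli download IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot --repo-type dataset --local-dir data/libero_spatial_no_noops
huggingface-cli download lerobot/pi0 --local-dir ckpts/pi0
huggingface-cli download google/paligemma-3b-pt-224 --local-dir ckpts/paligemma-3b-pt-224

# Point the corresponding entries at the local copies
export LIBERO_SPATIAL_NO_NOOPS_PATH=$(pwd)/data/libero_spatial_no_noops
export LEROBOT_PI0_PATH=$(pwd)/ckpts/pi0
export PALIGEMMA_PATH=$(pwd)/ckpts/paligemma-3b-pt-224
```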
f1_vla
├── config
│   ├── debug_test.yaml
│   └── f1_config.json
├── requirements.txt
├── setup.py
├── src
│   ├── configs
│   ├── models
│   ├── policies
│   ├── processors
│   └── utils
└── train_hf.py
# 1. Edit the config file
cd $VLA_HOME
vim f1_vla/config/debug_test.yaml

# 2. Run the training script
python f1_vla/train_hf.py --config-file f1_vla/config/debug_test.yaml
If you use this work in your research, please cite our paper:
@article{f1_vla_2025,
title={F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions},
author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Michael Yu Wang and Liqiang Nie and Jiangmiao Pang},
eprint={2509.06951},
archivePrefix={arXiv},
year={2025},
url={https://arxiv.org/abs/2509.06951}
}
This project is licensed under the MIT License.