feat(policies): add autoregressive VLAs with tokenization PiFast #2734
Pull request overview
This PR introduces autoregressive Vision-Language-Action (VLA) models to LeRobot, implementing PiFast alongside existing flow-matching policies. Unlike flow matching which predicts actions in parallel over a horizon, this implementation models actions sequentially as discrete tokens using the FAST (Fast Action Sequence Tokenization) tokenizer. The PR provides a complete reference implementation including model architecture, training scripts, and processor pipelines.
Key Changes:
- Implements PI0Fast policy with autoregressive action token prediction using cross-entropy loss
- Adds FAST (Frequency-space Action Sequence Tokenization) tokenizer integration for converting continuous actions to discrete tokens via DCT coefficients and BPE
- Introduces custom attention masking patterns supporting bidirectional attention for images/language and causal attention for action tokens
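The masking pattern described above can be sketched as follows. This is an illustrative numpy reconstruction, not the PR's actual implementation; `make_prefix_lm_mask` is a hypothetical helper name.

```python
import numpy as np

def make_prefix_lm_mask(num_prefix: int, num_action: int) -> np.ndarray:
    """Build a (T, T) boolean attention mask where True means "may attend".

    Image/language tokens form a prefix that attends bidirectionally among
    itself; action tokens attend to the full prefix and causally to earlier
    action tokens only.
    """
    T = num_prefix + num_action
    mask = np.zeros((T, T), dtype=bool)
    # Prefix (images + language): full bidirectional attention within itself.
    mask[:num_prefix, :num_prefix] = True
    # Action tokens attend to the whole prefix...
    mask[num_prefix:, :num_prefix] = True
    # ...and causally to themselves (lower-triangular, including self).
    mask[num_prefix:, num_prefix:] = np.tril(
        np.ones((num_action, num_action), dtype=bool)
    )
    return mask

mask = make_prefix_lm_mask(num_prefix=3, num_action=2)
```

In a real model this boolean mask would be converted to additive form (0 / -inf) and passed to the attention layers.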
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| src/lerobot/utils/constants.py | Adds constants for action tokens and token masks |
| src/lerobot/processor/tokenizer_processor.py | Implements ActionTokenizerProcessorStep for tokenizing actions using FAST with PaliGemma token space conversion |
| src/lerobot/processor/__init__.py | Exports ActionTokenizerProcessorStep for use in pipelines |
| src/lerobot/policies/pi0_fast/train_fast_tokenizer.py | Provides training script for FAST tokenizer with delta transforms, normalization, and compression statistics |
| src/lerobot/policies/pi0_fast/processor_pi0_fast.py | Creates pre/post-processor pipelines including state discretization and language tokenization |
| src/lerobot/policies/pi0_fast/modeling_pi0_fast.py | Implements core PI0FastPytorch model with PaliGemma+Gemma expert architecture and autoregressive decoding |
| src/lerobot/policies/pi0_fast/configuration_pi0_fast.py | Defines PI0FastConfig with model hyperparameters and training settings |
| src/lerobot/policies/pi0_fast/__init__.py | Exports PI0Fast components for module access |
| src/lerobot/policies/factory.py | Registers PI0FastPolicy in the policy factory |
| src/lerobot/policies/__init__.py | Exports PI0FastConfig at package level |
This PR brings autoregressive Vision-Language-Action (VLA) models back to LeRobot, alongside the existing flow-matching–based policies.
Unlike flow matching, which predicts actions in parallel over a horizon, autoregressive VLAs model actions sequentially as discrete tokens.
As a first step toward supporting multiple action tokenizers, this PR introduces PiFast together with a training script for FAST tokenization; together, these provide a concrete reference implementation for autoregressive action modeling in LeRobot.
Future work will extend this framework to additional tokenizers and autoregressive variants.
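To make the tokenization idea concrete: FAST compresses an action chunk into discrete tokens by taking DCT coefficients, quantizing them, and then applying BPE. The sketch below is a simplified numpy illustration of the DCT + quantization round trip only (the BPE compression stage and the trained tokenizer are omitted); the helper names and the `scale` parameter are assumptions, not the PR's API.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    # Orthonormal DCT-II basis; orthonormality means the inverse transform
    # is simply the transpose.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    M[0] /= np.sqrt(2.0)
    return M

def tokenize_chunk(actions: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Quantized DCT coefficients of a 1-D action trajectory.

    In FAST, these integer streams are further compressed with BPE before
    being mapped into the language model's token space.
    """
    coeffs = dct_matrix(len(actions)) @ actions
    return np.round(coeffs * scale).astype(int)

def detokenize_chunk(tokens: np.ndarray, scale: float = 10.0) -> np.ndarray:
    return dct_matrix(len(tokens)).T @ (tokens / scale)

actions = np.sin(np.linspace(0, np.pi, 8))
reconstructed = detokenize_chunk(tokenize_chunk(actions))
```

Because low-frequency DCT coefficients dominate smooth trajectories, quantization keeps reconstruction error small while making the sequence compressible.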
TODO:
1- Provide PiFast pretrained checkpoints, and unveil the new HF LeRobot AR VLA work.
2- Add testing and docs.
DONE:
1- Trained and evaluated successfully on LIBERO; we will share the checkpoints along with the results.
2- Support KV-caching for faster inference (a must for this PR): https://mett29.github.io/posts/kv-cache/
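The KV-caching decode loop can be sketched as below. This is a toy illustration of the control flow, not the PR's code: `step_fn(token_ids, cache)` is a hypothetical interface that returns the logits for the last position plus an updated cache, so that after prefilling the prefix each call only processes the newly appended token.

```python
import numpy as np

def decode_with_kv_cache(step_fn, prefix, max_new_tokens, eos_id):
    """Greedy autoregressive decoding with a key/value cache."""
    tokens = list(prefix)
    logits, cache = step_fn(tokens, None)           # prefill the whole prefix once
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(logits))            # greedy pick
        tokens.append(next_id)
        if next_id == eos_id:
            break
        logits, cache = step_fn([next_id], cache)   # decode one token per step
    return tokens

def toy_step(token_ids, cache):
    # Stand-in "model": the cache is just the tokens seen so far, and the
    # logits always favor (last token + 1) mod 4.
    cache = (cache or []) + list(token_ids)
    logits = np.zeros(4)
    logits[(cache[-1] + 1) % 4] = 1.0
    return logits, cache

out = decode_with_kv_cache(toy_step, prefix=[0], max_new_tokens=10, eos_id=3)
# out == [0, 1, 2, 3]
```

In the real model the cache holds per-layer key/value tensors, so each decode step costs O(T) attention against cached states instead of O(T^2) recomputation over the full sequence.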