
Add audit draft1: Action Tokenization #27

Open
Hhy903 wants to merge 2 commits into staging from
audit/heyangmel-actiontokenization-draft1

Conversation


@Hhy903 commented Jan 23, 2026

First draft of paper audit by Heyang Huang and Mel Krusniak


github-actions bot commented Jan 23, 2026

🚀 Preview Deployed

Your preview is ready for review!

🔗 Preview URL: https://arpg.github.io/vla-foundations/staging/pulls/27/textbook/audits/heyangmel/

Review Checklist

  • LaTeX equations render correctly
  • All sections are complete per the template
  • References are formatted properly
  • Figures/diagrams display correctly

Next Steps

  1. Review your rendered content using the preview link above
  2. Tag @crheckman when ready for instructor review
  3. Push updates to auto-refresh the preview

This preview will be removed when the PR is closed.


Can we avoid having to use discrete actions entirely?

This remains an open question. In late 2024, there was significant work ([arXiv:2409.12514](https://arxiv.org/abs/2409.12514)) using diffusion models to avoid tokenizing actions at all. The approach, adopted as part of an optimized fine-tuning regime, also yielded a success-rate increase in [OpenVLA-OFT](https://arxiv.org/pdf/2502.19645). Initially, [the $\pi_0$ model used a similar approach](https://arxiv.org/html/2410.24164v1), but it is outperformed in some respects by $\pi_0$-FAST, which uses the FAST action tokenizer.
Contributor

I'm not 100% sure I'm interpreting the FAST paper correctly. My impression is that they tear out all of pi0's "action expert" / flow matching system and use FAST instead in pi0-FAST. But it's made more confusing because the original pi0 paper has a somewhat overloaded meaning of "token" (see first footnote of that paper).


- **Dexterous manipulation:** This is the most straightforward and commonly recognized failure mode, caused when there is too little precision in the action token vocabulary to specify very precise motions.
- **Literal actuator awareness:** Most of the time, natural language is substantially vaguer than motion, so action tokens are sufficient to represent most desired motions. But if a (byzantine) roboticist were to prompt, in natural language, "set gripper to state 50," there is no guarantee that the instruction would be exactly executed as requested. (In fact, if the fine-tuning dataset used refers primarily to gross actions, as many do, there is no guarantee that the system would associate a "gripper" language token with a gripper action token at all.)
- **Token conflation:** Few papers discuss in detail how non-action tokens are masked out of the output, if they happen to be generated erroneously.
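One common way this conflation is handled in practice is to count only action-token positions in the supervised fine-tuning loss, a practice the OpenVLA paper alludes to. A minimal NumPy sketch of such loss masking (function and variable names are ours, purely illustrative):

```python
import numpy as np

def masked_action_loss(logits, targets, action_mask):
    """Cross-entropy over a token sequence, counted only at action-token
    positions. All names here are illustrative, not from any real codebase.

    logits:      (T, V) unnormalized scores per position
    targets:     (T,)   ground-truth token ids
    action_mask: (T,)   True where the target is an action token
    """
    # numerically stable log-softmax over the vocabulary
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    # positions holding language (non-action) tokens contribute nothing
    nll = np.where(action_mask, nll, 0.0)
    return nll.sum() / max(action_mask.sum(), 1)
```

Erroneously generated non-action tokens then carry no gradient signal, so the model is never rewarded for emitting them in action slots.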
Contributor

We discussed this in class earlier - it's straightforwardly done in the loss function during SFT; i.e., only action tokens are factored into the loss. The OpenVLA paper alludes to this practice.

Contributor Author

Hhy903 commented Jan 28, 2026

@crheckman The preview looks good and is ready for instructor review.


### How it works

VQ-BeT decomposes action generation into two explicitly separated stages: (i) offline action tokenization via residual vector quantization, and (ii) online autoregressive prediction of discrete latent codes conditioned on observations (and optionally goals). This separation reflects a deliberate reorganization of the control stack, in which representation learning, sequence modeling, and continuous execution are assigned distinct roles.
Contributor

This is just a general curiosity about latent code books question. What is to stop a VQ approach from having the same types of issues as discrete binning for action tokens? Does the VQ-BeT paper discuss the code book size and the overall performance of the system as that size grows/shrinks? How would one determine the optimal code book size? Trial and error? Or are there some type of scaling laws that impact what should be chosen?


Can we avoid having to use discrete actions entirely?

This remains an open question. In late 2024, there was significant work ([arXiv:2409.12514](https://arxiv.org/abs/2409.12514)) using diffusion models to avoid tokenizing actions at all. The approach, adopted as part of an optimized fine-tuning regime, also yielded a success-rate increase in [OpenVLA-OFT](https://arxiv.org/pdf/2502.19645). Initially, [the $\pi_0$ model used a similar approach](https://arxiv.org/html/2410.24164v1), but it is outperformed in some respects by $\pi_0$-FAST, which uses the FAST action tokenizer.
Contributor

RE: "...but is outperformed in some respects". I would love to know in which respects it was outperformed, and what measure of performance was used for evaluation. In particular, I'm trying to understand whether there are specific metrics or evaluation benchmarks on which some schemes repeatedly do well while others perform worse.

Collaborator

@crheckman left a comment

First review (pre-class).


*What's the core technical challenge?*

Traditional robot control relies on continuous action spaces that align closely with the underlying physics of robotic systems. Torques, velocities, and end-effector motions are naturally continuous. Classical control theory is built around preserving smoothness, stability, and reactivity under this assumption. But motivated by the success of token-based generative modeling, a growing body of work replaces continuous control with action tokens: latent discrete symbols that are decoded into continuous actions at execution time.
Collaborator

It's also that the traditional control stacks have to be discretized, since update frequencies on a computer end up being discrete anyway (how frequently you push a new action).


The practice of action discretization is not new. In the VLA context, however, it is substantially more important than in the past, credited with stabilizing training, enabling reuse of language-model architectures, and simplifying long-horizon credit assignment. It still comes with the fundamental tradeoff of degrading the mechanical precision of a robot. As such, it's natural to ask:

- What assumptions about dynamics, smoothness, and temporal structure make discrete actions viable?
Collaborator

I really like these three questions (and their follow-ups). They don't seem to be explicitly addressed elsewhere in the doc, or maybe they're not linked to these questions. Could you organize the doc around answering these questions, rather than jumping into methods?

---
# Evaluating Vector Quantization for VLA Action Tokenization

When using vision-language-action (VLA) models for robotics, we typically replace continuous control variables, like the real-valued positions of actuators, with discrete action tokens. This is **action tokenization**. But there is no single, obviously correct way to carry out this tokenization.
Collaborator

Missing from this description is some kind of interaction-level discussion between the decoder and the tokenizer, especially as it relates to eliminating tokens from the dictionary. Some kind of discussion of BPE-like tokenizations, from a reductive/agglomerative approach at the decoder performance level, is crucial to understanding the trade-offs.


1. Brohan, A., et al. (2022). *RT-1: Robotics Transformer for Real-World Control at Scale.* arXiv:2212.06817.
2. Brohan, A., et al. (2023). *RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.* arXiv:2307.15818.
3. Shi, W., Zhao, H., Liu, Y., et al. (2023). *MotionLM: Multi-Agent Motion Forecasting as Language Modeling.* arXiv:2309.16534.
Collaborator

I am thinking this write-up has a deep dive on action tokenization for grasping, and the motivations for that. Grasping has sharp discontinuities in action space, specifically at contact points. Frequency-space tokenization handles discontinuities elegantly. However there exist many domains that are rarely discontinuous (e.g., navigation). MotionLM handles this kind of domain. Long horizon planning is another. Do FAST, BEAST, etc. have any relevance to those domains?

Contributor

Hmm, yes, we do focus mostly on grasping and similar tasks. We should expand on other tasks. We hardly talk about MotionLM at all.

I don't totally understand why FAST handles discontinuities well, though. Isn't part of the premise of DCT that it uses some smoothness assumptions (which to my understanding is why it's used for JPEG compression, etc)?

---
# Evaluating Vector Quantization for VLA Action Tokenization

When using vision-language-action (VLA) models for robotics, we typically replace continuous control variables, like the real-valued positions of actuators, with discrete action tokens. This is **action tokenization**. But there is no single, obviously correct way to carry out this tokenization.
Contributor

I would like to see more description on what these real-valued positions are on the actuators, or even what continuous control variables would be.


---

# [6] References
Contributor

This might be a really helpful "action token taxonomy" reference: A Survey on Vision-Language-Action Models: An Action Tokenization Perspective


However, it was long suspected that poor action tokenization prevented dexterous performance in RT-2. Indeed, in 2025, Physical Intelligence released a performance comparison of a "naive" (binning) tokenizer with a new, bespoke alternative ([FAST](https://arxiv.org/abs/2501.09747)), suggesting serious deterioration of performance specifically with increased sampling rate:

![Performance comparison of naive tokenizer vs FAST](https://hackmd.io/_uploads/B1OfcT0HZe.png)
Contributor

@yi-shiuan-tung Jan 29, 2026

What is DCT in the image? It would be helpful to get more explanation for how FAST works.

$$
\hat{a}_{t:t+n} = \psi(z_q(x)).
$$

VQ-BeT uses a small number of residual quantization layers (typically two), interpreting the first as capturing coarse action modes (*primary codes*) and subsequent layers as encoding finer-grained residual structure (*secondary codes*).
Contributor

Silly question, but is there any slowdown observed during inference time? Or is the overhead mostly during the training phase?


High-frequency control exposes where tokenization delays or abstracts feedback. Discrete action tokens necessarily operate at a coarser temporal granularity than physical dynamics, forcing systems to assume that short-horizon correction can be deferred or handled outside the tokenized decision loop.

Under these conditions, different tokenization strategies fail differently. Primitive discretization degrades precision as update rates increase; latent action tokenization relies on decoders or offset pathways to absorb rapid corrections; continuous-action approaches retain immediate feedback at the cost of heavier computation and tighter coupling. Performance at high frequency therefore reflects not model capacity, but whether the tokenization boundary aligns with the timescale at which control errors must be corrected.
Contributor

This color coding scheme and the breakdown is very nice. I don't have a concept of what "fast" would mean though, or what it means to be 2.5x speedup over another method. I think it would help to have the control frequencies here so that the reader knows what frequencies are being compared.


#### Stage 3: Offset head and continuous correction

To compensate for the loss of precision introduced by discretization, VQ-BeT adds a continuous **offset head** that predicts a residual correction to the decoded action:
Contributor

Out of curiosity, do the authors specify how much this helps or how much offset is typically applied? Adding a residual component outside the VLA to improve precision makes sense, but I was wondering whether it was added because it empirically helps or to address a problem in a clean way.


From a scaling perspective, an action tokenizer should function as a reusable abstraction rather than a task-specific artifact. Scaling stresses the tokenization boundary first: as data diversity, model capacity, and deployment scope increase, weaknesses in how actions are abstracted tend to amplify rather than average out. The question is whether scaling reduces control error, or merely relocates it elsewhere in the system.

- 🔵 **Residual VQ:** According to the VQ-BeT paper, VQ-BeT's performance seems to vary only marginally with the size of the VQ codebook - an exciting data-scaling result when it is the complexity of the robot data (not necessarily the amount) that is scaled. Training the tokenizer does not appear to be a major inconvenience. (It could potentially be done with simulated data, at least for gross quantization in the primary VQ-residual layer.)
Contributor

Which parts of the performance degrade with the size changes of the VQ codebook? For example, does it fail at one particular task at a much higher rate? Or does it fail more frequently across all tasks? Is it possible that the latent code book captures certain things very well which make some tasks successful while it misses capturing important data on other tasks?


When using vision-language-action (VLA) models for robotics, we typically replace continuous control variables, like the real-valued positions of actuators, with discrete action tokens. This is **action tokenization**. But there is no single, obviously correct way to carry out this tokenization.

In this paper audit, we analyze action tokenization as a systems-level design choice. Our primary focus is the _vector quantized behavior transformer_ ([VQ-BeT](https://arxiv.org/pdf/2403.03181)), which uses a method called _vector quantization_ to tokenize actions. On the way, we'll discuss some popular alternatives - some newer, some older - focusing on their modeling assumptions and potential failure modes. Then, after a deep dive into VQ-BeT, we'll compare all of these options and provide some discussion about ongoing and future action tokenization work.
Contributor

What is vector quantization? and why does this paper use it to tokenize actions?


$$
z_q^{(i)} = \arg\min_{e_j^{(i)}} \left\| r^{(i)} - e_j^{(i)} \right\|_2,
$$
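Concretely, the per-layer nearest-code lookup and residual pass-down above can be sketched as follows (illustrative NumPy, not the authors' implementation):

```python
import numpy as np

def residual_vq(x, codebooks):
    """Residual VQ: each layer quantizes the residual left over by the
    layers before it. `codebooks` is a list of (K, D) arrays; two layers
    would yield the primary and secondary codes. Illustrative sketch only.
    """
    residual = np.asarray(x, dtype=float).copy()
    codes, quantized = [], np.zeros_like(residual)
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # ||r - e_j|| per code
        j = int(dists.argmin())                        # nearest-code index
        codes.append(j)
        quantized += cb[j]
        residual -= cb[j]                              # pass remainder down
    return codes, quantized
```

With a coarse primary codebook and a fine secondary one, the reconstruction error shrinks at each layer, which is the coarse-to-fine structure the text describes.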
Collaborator

I don't have a reference for some of the variables used, like $N_q$. I recommend defining the variables in place or adopting a textual description instead.


- 🔵 **Residual VQ:** According to the VQ-BeT paper, VQ-BeT's performance seems to vary only marginally with the size of the VQ codebook - an exciting data-scaling result when it is the complexity of the robot data (not necessarily the amount) that is scaled. Training the tokenizer does not appear to be a major inconvenience. (It could potentially be done with simulated data, at least for gross quantization in the primary VQ-residual layer.)
- 🟢 **DCT (FAST):** FAST comes with a fantastic promise: that it (FAST+ specifically) can be applied as an action tokenizer for any VLA, no modifications necessary. Whether it delivers on this promise seems like an open question to us. FAST+ required a tremendous amount of robot data to train (1M trajectories across dozens of embodiments), and it's difficult to imagine acquiring even more should that prove to be insufficient for any particular morphology.
- 🟡 **Diffusion ($\pi_0$):** It's not possible to separately train an action tokenizer in this setup, leaving one completely at the mercy of training a flow-matching model.
Contributor

I'm not sure about this...I'm fairly certain action tokens are generated and fed to the action expert in Pi0 which means you could hypothetically change the tokenization before feeding it to the action expert/flow matching model. I could be misunderstanding it though.

I'm also not sure about the framing of being "at the mercy of training a flow-matching model". That seems like strongly negative language without the justification for what you dislike about flow-matching models.


*What's the core technical challenge?*

Traditional robot control relies on continuous action spaces that align closely with the underlying physics of robotic systems. Torques, velocities, and end-effector motions are naturally continuous. Classical control theory is built around preserving smoothness, stability, and reactivity under this assumption. But motivated by the success of token-based generative modeling, a growing body of work replaces continuous control with action tokens: latent discrete symbols that are decoded into continuous actions at execution time.
Contributor

Why is the field moving towards action tokenization if the control-based approach is working well? How would you compare these two approaches?

Collaborator

@crheckman left a comment

Comments made during reading period.


---

# [2] The competitors
Collaborator

Let's call this "Methods" and focus on three elements. Reformulate into some bullets with a list of papers supporting each one.

These kinds of paragraphs read like a book report, but what we need is a technical report.


Depending on the architecture, these 255 (or so) action tokens might be usable as-is, or (as a hack if available tokens are limited) might overwrite the 255 *least used* tokens in the upstream language model.

There is some nuance in how binning is applied. [RT-1](https://arxiv.org/abs/2212.06817)’s action tokenizer discretized uniformly across the entire action space (based on minimum and maximum values seen in the data) into 256 equal-size bins. This choice was, apparently, good enough: in the corresponding ablation study, they note a -25% success rate delta when this action tokenizer is removed (though no comparison was given with other action tokenization schemes). A near-identical scheme was reused for [RT-2](https://arxiv.org/abs/2307.15818) in 2023.
Collaborator

Good enough for what?

Collaborator

I need to see a deep dive on the evaluations. Did they do any ablations against other tokenization schemes?


There is some nuance in how binning is applied. [RT-1](https://arxiv.org/abs/2212.06817)’s action tokenizer discretized uniformly across the entire action space (based on minimum and maximum values seen in the data) into 256 equal-size bins. This choice was, apparently, good enough: in the corresponding ablation study, they note a -25% success rate delta when this action tokenizer is removed (though no comparison was given with other action tokenization schemes). A near-identical scheme was reused for [RT-2](https://arxiv.org/abs/2307.15818) in 2023.

Later in 2023, Waymo's [MotionLM](https://arxiv.org/pdf/2309.16534) introduced some slight modifications, using a "Verlet wrapper" around uniformly binned deltas for each coordinate. In practice, this resulted in a substantial reduction in the number of distinct tokens required, though how this reduction arose was not fully specified. Later, during work on [OpenVLA](https://arxiv.org/pdf/2406.09246) in early 2024, it was noticed that computing bins from the minimum and maximum actuator values was vulnerable to outliers: although bins were of equal numerical size, the majority of data fell in a small subset of them, so precision was wasted. To address this issue, the authors of OpenVLA opted for a quantile-based approach instead, such that each bin covered the same amount of training data:
Collaborator

What exactly is a Verlet wrapper, or what was the intuition behind it?

Later in 2023, Waymo's [MotionLM](https://arxiv.org/pdf/2309.16534) introduced some slight modifications, using a "Verlet wrapper" around uniformly binned deltas for each coordinate. In practice, this resulted in a substantial reduction in the number of distinct tokens required, though how this reduction arose was not fully specified. Later, during work on [OpenVLA](https://arxiv.org/pdf/2406.09246) in early 2024, it was noticed that computing bins from the minimum and maximum actuator values was vulnerable to outliers: although bins were of equal numerical size, the majority of data fell in a small subset of them, so precision was wasted. To address this issue, the authors of OpenVLA opted for a quantile-based approach instead, such that each bin covered the same amount of training data:

$$
\text{action\_token} = \lfloor \text{quantile}(a) \cdot 255 \rfloor
$$
Collaborator

nit: this is a brilliant move, since it directly connects training data distribution with output distribution, allowing data balancing with a simple change of vocabulary!

$$
\text{action\_token} = \lfloor \text{quantile}(a) \cdot 255 \rfloor
$$

Why did these simple action tokenizers survive for so long, when tokenizers in other components of the VLA stack were more complex? While it's hard to say for certain, we suspect it was discovered early that action-tokenizer performance does not necessarily improve with scale or complexity. Binning is good enough in many cases, and choosing a more complex action tokenizer can hurt inference times (see [high-frequency performance](#high-frequency-performance)).
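To make the uniform-vs-quantile distinction concrete, here is an illustrative sketch (our own code, not from RT-1 or OpenVLA) showing how a single outlier wastes uniform bins while quantile bins keep nearby actions distinguishable:

```python
import numpy as np

def uniform_bin(a, lo, hi, n_bins=256):
    """RT-1-style: equal-width bins spanning [lo, hi] seen in the data."""
    t = (np.asarray(a, dtype=float) - lo) / (hi - lo)
    return np.clip((t * n_bins).astype(int), 0, n_bins - 1)

def quantile_bin(a, train_actions, n_bins=256):
    """OpenVLA-style: each bin covers an equal share of the training data."""
    sorted_train = np.sort(np.asarray(train_actions, dtype=float))
    # empirical quantile of each query action within the training set
    q = np.searchsorted(sorted_train, np.asarray(a, dtype=float)) / len(sorted_train)
    return np.clip((q * (n_bins - 1)).astype(int), 0, n_bins - 1)
```

If the training actions mostly lie in [0, 1] but a single outlier stretches the range to 100, uniform binning maps nearby actions like 0.5 and 0.6 to the same token, while quantile binning keeps them apart.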
Collaborator

Connecting this section with the later improvements (frequency-space tokenization) gives a clue: it all comes down to evaluations. If you are behavior cloning and computing error with an L2 configuration-space metric, then any inferred trajectory that ends up close enough to the reference is considered good. But if you zoom in on the discontinuities of actions, you see they aren't good. This is a similar notion to keyframing, which extends to both the training data distribution and the metric calculations.

On OpenVLA tasks, the correlation wasn't clear. On RT-2 tasks, they broke glass a lot -> metrics may not have been obvious enough, but evaluations demonstrated a clear problem.


In this audit, we use VQ-BeT as a particularly interesting case study - not for its optimality (though it performs well in several respects), but because it crystallizes the core design assumptions behind latent action quantization in VLA systems.

### How it works
Collaborator

It might really help to move the figure to up here, and explain through the figure. :)


VQ-BeT decomposes action generation into two explicitly separated stages: (i) offline action tokenization via residual vector quantization, and (ii) online autoregressive prediction of discrete latent codes conditioned on observations (and optionally goals). This separation reflects a deliberate reorganization of the control stack, in which representation learning, sequence modeling, and continuous execution are assigned distinct roles.

#### Stage 1: Chunk tokenization via residual VQ-VAE
Collaborator

This is a good description of the how, but it's missing the why. This could be in the form of an explanation of the gap, an ablation, or something similar.


where $g$ is an optional goal signal.

Rather than predicting a single token per timestep, the model predicts one categorical distribution per quantization layer. Training loss weights errors in primary code prediction more heavily than secondary codes, reflecting the intended coarse-to-fine structure of the latent space.
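The weighted coarse-to-fine objective described above might be sketched as follows (the 0.5 secondary-layer weight is our illustrative choice, not necessarily the paper's value):

```python
import numpy as np

def code_prediction_loss(logits_per_layer, targets_per_layer, secondary_weight=0.5):
    """Cross-entropy summed over quantization layers, with the primary
    layer weighted most heavily. Illustrative sketch; the actual
    weighting scheme may differ from the paper's.
    """
    def ce(logits, target):
        z = logits - logits.max()                      # stable log-softmax
        return float(-(z[target] - np.log(np.exp(z).sum())))
    losses = [ce(l, t) for l, t in zip(logits_per_layer, targets_per_layer)]
    # layer 0 = primary codes; later (secondary) layers count for less
    weights = [1.0] + [secondary_weight] * (len(losses) - 1)
    return sum(w * l for w, l in zip(weights, losses))
```

Down-weighting the secondary codes means a wrong coarse action mode is penalized more than a wrong fine-grained correction, matching the intended hierarchy of the latent space.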
Collaborator

@crheckman Jan 29, 2026

This is an interesting point and I realize it is not something clearly stated in the Introduction (but should be). Consider an action tokenization where each action is a digit string, $a_t = [z_0, z_1, \ldots, z_k]$ with each $z_i \in \{0, \ldots, 9\}$. The order of the digits induces a hierarchy of significance (example: 0.923146 has high significance at the 9 and almost none by the time you get to the 6), yet a token-level loss weights every digit the same against a reference. This should make clear why tokenization of actions is so important in an autoregressive-generation / masked-token training context.

Contributor

"Literal actuator awareness: Most of the time, natural language is substantially vaguer than motion, so action tokens are sufficient to represent most desired motions. But if a (byzantine) roboticist were to prompt, in natural language, "set gripper to state 50," there is no guarantee that the instruction would be exactly executed as requested. (In fact, if the fine-tuning dataset used refers primarily to gross actions, as many do, there is no guarantee that the system would associate a "gripper" language token with a gripper action token at all.)"

I think the failure modes section is interesting, especially the above. Are there any statistical examples, or cases where people have tried to measure the frequency of these occurrences?
