
Add audit draft1: Action Tokenization #27

Open
Hhy903 wants to merge 2 commits into staging from
audit/heyangmel-actiontokenization-draft1

Conversation


@Hhy903 commented Jan 23, 2026

First draft of paper audit by Heyang Huang and Mel Krusniak


github-actions bot commented Jan 23, 2026

🚀 Preview Deployed

Your preview is ready for review!

🔗 Preview URL: https://arpg.github.io/vla-foundations/staging/pulls/27/textbook/audits/heyangmel/

Review Checklist

  • LaTeX equations render correctly
  • All sections are complete per the template
  • References are formatted properly
  • Figures/diagrams display correctly

Next Steps

  1. Review your rendered content using the preview link above
  2. Tag @crheckman when ready for instructor review
  3. Push updates to auto-refresh the preview

This preview will be removed when the PR is closed.


Can we avoid having to use discrete actions entirely?

This remains an open question. In late 2024, there was significant work ([arXiv:2409.12514](https://arxiv.org/abs/2409.12514)) using diffusion models to avoid tokenizing actions at all. The approach, adopted as part of an optimized fine-tuning regime, also yielded a success-rate increase in [OpenVLA-OFT](https://arxiv.org/pdf/2502.19645). Initially, [the $\pi_0$ model used a similar approach](https://arxiv.org/html/2410.24164v1), but it is outperformed in some respects by $\pi_0$-FAST, which uses the FAST action tokenizer.
Contributor

I'm not 100% sure I'm interpreting the FAST paper correctly. My impression is that they tear out all of pi0's "action expert" / flow matching system and use FAST instead in pi0-FAST. But it's made more confusing because the original pi0 paper has a somewhat overloaded meaning of "token" (see first footnote of that paper).


- **Dexterous manipulation:** This is the most straightforward and commonly recognized failure mode, caused when there is too little precision in the action token vocabulary to specify very precise motions.
- **Literal actuator awareness:** Most of the time, natural language is substantially vaguer than motion, so action tokens are sufficient to represent most desired motions. But if a (byzantine) roboticist were to prompt, in natural language, "set gripper to state 50," there is no guarantee that the instruction would be exactly executed as requested. (In fact, if the fine-tuning dataset used refers primarily to gross actions, as many do, there is no guarantee that the system would associate a "gripper" language token with a gripper action token at all.)
- **Token conflation:** Few papers discuss in detail how non-action tokens are masked out of the output, if they happen to be generated erroneously.
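One common way this conflation is handled in practice is to count only action-token positions in the supervised fine-tuning loss, a practice the OpenVLA paper alludes to. A minimal NumPy sketch of such loss masking (function and variable names are ours, purely illustrative):

```python
import numpy as np

def masked_action_loss(logits, targets, action_mask):
    """Cross-entropy over a token sequence, counted only at action-token
    positions. All names here are illustrative, not from any real codebase.

    logits:      (T, V) unnormalized scores per position
    targets:     (T,)   ground-truth token ids
    action_mask: (T,)   True where the target is an action token
    """
    # numerically stable log-softmax over the vocabulary
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    # positions holding language (non-action) tokens contribute nothing
    nll = np.where(action_mask, nll, 0.0)
    return nll.sum() / max(action_mask.sum(), 1)
```

Erroneously generated non-action tokens then carry no gradient signal, so the model is never rewarded for emitting them in action slots.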
Contributor

We discussed this in class earlier - it's straightforwardly done in the loss function during SFT; i.e., only action tokens are factored into the loss. The OpenVLA paper alludes to this practice.

Contributor Author

Hhy903 commented Jan 28, 2026

@crheckman The preview looks good and is ready for instructor review.


### How it works

VQ-BeT decomposes action generation into two explicitly separated stages: (i) offline action tokenization via residual vector quantization, and (ii) online autoregressive prediction of discrete latent codes conditioned on observations (and optionally goals). This separation reflects a deliberate reorganization of the control stack, in which representation learning, sequence modeling, and continuous execution are assigned distinct roles.
Contributor

This is just a general curiosity about latent code books question. What is to stop a VQ approach from having the same types of issues as discrete binning for action tokens? Does the VQ-BeT paper discuss the code book size and the overall performance of the system as that size grows/shrinks? How would one determine the optimal code book size? Trial and error? Or are there some type of scaling laws that impact what should be chosen?


Can we avoid having to use discrete actions entirely?

This remains an open question. In late 2024, there was significant work ([arXiv:2409.12514](https://arxiv.org/abs/2409.12514)) using diffusion models to avoid tokenizing actions at all. The approach, adopted as part of an optimized fine-tuning regime, also yielded a success-rate increase in [OpenVLA-OFT](https://arxiv.org/pdf/2502.19645). Initially, [the $\pi_0$ model used a similar approach](https://arxiv.org/html/2410.24164v1), but it is outperformed in some respects by $\pi_0$-FAST, which uses the FAST action tokenizer.
Contributor

RE: "...but is outperformed in some respects". I would love to know in which respects it was outperformed, and what measure of performance was used for evaluation. In particular, I'm trying to understand whether there are specific metrics or evaluation benchmarks on which some schemes repeatedly do well while others perform worse.

Collaborator

@crheckman left a comment

First review (pre-class).


*What's the core technical challenge?*

Traditional robot control relies on continuous action spaces that align closely with the underlying physics of robotic systems. Torques, velocities, and end-effector motions are naturally continuous. Classical control theory is built around preserving smoothness, stability, and reactivity under this assumption. But motivated by the success of token-based generative modeling, a growing body of work replaces continuous control with action tokens: latent discrete symbols that are decoded into continuous actions at execution time.
Collaborator

It's also that the traditional control stacks have to be discretized, since update frequencies on a computer end up being discrete anyway (how frequently you push a new action).


The practice of action discretization is not new. In the VLA context, however, it is substantially more important than in the past, credited with stabilizing training, enabling reuse of language-model architectures, and simplifying long-horizon credit assignment. It still comes with the fundamental tradeoff of degrading the mechanical precision of a robot. As such, it's natural to ask:

- What assumptions about dynamics, smoothness, and temporal structure make discrete actions viable?
Collaborator

I really like these three questions (and their follow-ups). They don't seem to be explicitly addressed elsewhere in the doc, or maybe they're not linked to these questions. Could you organize the doc around answering these questions, rather than jumping into methods?

---
# Evaluating Vector Quantization for VLA Action Tokenization

When using vision-language-action (VLA) models for robotics, we typically replace continuous control variables, like the real-valued positions of actuators, with discrete action tokens. This is **action tokenization**. But there is no single, obviously correct way to carry out this tokenization.
Collaborator

Missing from this description is some kind of interaction-level discussion between the decoder and the tokenizer, especially as it relates to eliminating tokens from the dictionary. Some kind of discussion of BPE-like tokenizations, from a reductive/agglomerative approach at the decoder performance level, is crucial to understanding the trade-offs.


1. Brohan, A., et al. (2022). *RT-1: Robotics Transformer for Real-World Control at Scale.* arXiv:2212.06817.
2. Brohan, A., et al. (2023). *RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.* arXiv:2307.15818.
3. Shi, W., Zhao, H., Liu, Y., et al. (2023). *MotionLM: Multi-Agent Motion Forecasting as Language Modeling.* arXiv:2309.16534.
Collaborator

I am thinking this write-up has a deep dive on action tokenization for grasping, and the motivations for that. Grasping has sharp discontinuities in action space, specifically at contact points. Frequency-space tokenization handles discontinuities elegantly. However there exist many domains that are rarely discontinuous (e.g., navigation). MotionLM handles this kind of domain. Long horizon planning is another. Do FAST, BEAST, etc. have any relevance to those domains?

Contributor

Hmm, yes, we do focus mostly on grasping and similar tasks. We should expand on other tasks. We hardly talk about MotionLM at all.

I don't totally understand why FAST handles discontinuities well, though. Isn't part of the premise of DCT that it uses some smoothness assumptions (which to my understanding is why it's used for JPEG compression, etc)?

---
# Evaluating Vector Quantization for VLA Action Tokenization

When using vision-language-action (VLA) models for robotics, we typically replace continuous control variables, like the real-valued positions of actuators, with discrete action tokens. This is **action tokenization**. But there is no single, obviously correct way to carry out this tokenization.
Contributor

I would like to see more description on what these real-valued positions are on the actuators, or even what continuous control variables would be.


---

# [6] References
Contributor

This might be a really helpful "action token taxonomy" reference: A Survey on Vision-Language-Action Models: An Action Tokenization Perspective


However, it was long suspected that poor action tokenization prevented dexterous performance in RT-2. Indeed, in 2025, Physical Intelligence released a performance comparison of a "naive" (binning) tokenizer with a new, bespoke alternative ([FAST](https://arxiv.org/abs/2501.09747)), suggesting serious deterioration of performance specifically with increased sampling rate:

![Performance comparison of naive tokenizer vs FAST](https://hackmd.io/_uploads/B1OfcT0HZe.png)
Contributor

@yi-shiuan-tung Jan 29, 2026

What is DCT in the image? It would be helpful to get more explanation for how FAST works.

$$
\hat{a}_{t:t+n} = \psi(z_q(x)).
$$

VQ-BeT uses a small number of residual quantization layers (typically two), interpreting the first as capturing coarse action modes (*primary codes*) and subsequent layers as encoding finer-grained residual structure (*secondary codes*).
Contributor

Silly question, but is there any slowdown observed during inference time? Or is the overhead mostly during the training phase?


High-frequency control exposes where tokenization delays or abstracts feedback. Discrete action tokens necessarily operate at a coarser temporal granularity than physical dynamics, forcing systems to assume that short-horizon correction can be deferred or handled outside the tokenized decision loop.

Under these conditions, different tokenization strategies fail differently. Primitive discretization degrades precision as update rates increase; latent action tokenization relies on decoders or offset pathways to absorb rapid corrections; continuous-action approaches retain immediate feedback at the cost of heavier computation and tighter coupling. Performance at high frequency therefore reflects not model capacity, but whether the tokenization boundary aligns with the timescale at which control errors must be corrected.
Contributor

This color coding scheme and the breakdown is very nice. I don't have a concept of what "fast" would mean though, or what it means to be 2.5x speedup over another method. I think it would help to have the control frequencies here so that the reader knows what frequencies are being compared.


#### Stage 3: Offset head and continuous correction

To compensate for the loss of precision introduced by discretization, VQ-BeT adds a continuous **offset head** that predicts a residual correction to the decoded action:
Contributor

Out of curiosity, do the authors specify how much this helps or how much offset is typically applied? Adding a residual component outside the VLA to improve precision makes sense, but I was wondering whether it was added because it empirically helps or to address a problem in a clean way.


From a scaling perspective, an action tokenizer should function as a reusable abstraction rather than a task-specific artifact. Scaling stresses the tokenization boundary first: as data diversity, model capacity, and deployment scope increase, weaknesses in how actions are abstracted tend to amplify rather than average out. The question is whether scaling reduces control error, or merely relocates it elsewhere in the system.

- 🔵 **Residual VQ:** According to the VQ-BeT paper, VQ-BeT's performance seems to vary only marginally with the size of the VQ codebook - an exciting data-scaling result when it is the complexity of the robot data (not necessarily the amount) that is scaled. Training the tokenizer does not appear to be a major inconvenience. (It could potentially be done with simulated data, at least for gross quantization in the primary VQ-residual layer.)
Contributor

Which parts of the performance degrade with the size changes of the VQ codebook? For example, does it fail at one particular task at a much higher rate? Or does it fail more frequently across all tasks? Is it possible that the latent code book captures certain things very well which make some tasks successful while it misses capturing important data on other tasks?


When using vision-language-action (VLA) models for robotics, we typically replace continuous control variables, like the real-valued positions of actuators, with discrete action tokens. This is **action tokenization**. But there is no single, obviously correct way to carry out this tokenization.

In this paper audit, we analyze action tokenization as a systems-level design choice. Our primary focus is the _vector quantized behavior transformer_ ([VQ-BeT](https://arxiv.org/pdf/2403.03181)), which uses a method called _vector quantization_ to tokenize actions. On the way, we'll discuss some popular alternatives - some newer, some older - focusing on their modeling assumptions and potential failure modes. Then, after a deep dive into VQ-BeT, we'll compare all of these options and provide some discussion about ongoing and future action tokenization work.
Contributor

What is vector quantization? and why does this paper use it to tokenize actions?


$$
z_q^{(i)} = \arg\min_{e_j^{(i)}} \left\| r^{(i)} - e_j^{(i)} \right\|_2,
$$
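Concretely, the per-layer nearest-code lookup and residual pass-down above can be sketched as follows (illustrative NumPy, not the authors' implementation):

```python
import numpy as np

def residual_vq(x, codebooks):
    """Residual VQ: each layer quantizes the residual left over by the
    layers before it. `codebooks` is a list of (K, D) arrays; two layers
    would yield the primary and secondary codes. Illustrative sketch only.
    """
    residual = np.asarray(x, dtype=float).copy()
    codes, quantized = [], np.zeros_like(residual)
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # ||r - e_j|| per code
        j = int(dists.argmin())                        # nearest-code index
        codes.append(j)
        quantized += cb[j]
        residual -= cb[j]                              # pass remainder down
    return codes, quantized
```

With a coarse primary codebook and a fine secondary one, the reconstruction error shrinks at each layer, which is the coarse-to-fine structure the text describes.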
Collaborator

I don't have a reference for some of the variables used, like $N_q$. I recommend defining the variables in place or adopting a textual description instead.


- 🔵 **Residual VQ:** According to the VQ-BeT paper, VQ-BeT's performance seems to vary only marginally with the size of the VQ codebook - an exciting data-scaling result when it is the complexity of the robot data (not necessarily the amount) that is scaled. Training the tokenizer does not appear to be a major inconvenience. (It could potentially be done with simulated data, at least for gross quantization in the primary VQ-residual layer.)
- 🟢 **DCT (FAST):** FAST comes with a fantastic promise: that it (FAST+ specifically) can be applied as an action tokenizer for any VLA, no modifications necessary. Whether it delivers on this promise seems like an open question to us. FAST+ required a tremendous amount of robot data to train (1M trajectories across dozens of embodiments), and it's difficult to imagine acquiring even more should that prove to be insufficient for any particular morphology.
- 🟡 **Diffusion ($\pi_0$):** It's not possible to separately train an action tokenizer in this setup, leaving one completely at the mercy of training a flow-matching model.
Contributor

I'm not sure about this...I'm fairly certain action tokens are generated and fed to the action expert in Pi0 which means you could hypothetically change the tokenization before feeding it to the action expert/flow matching model. I could be misunderstanding it though.

I'm also not sure about the framing of being "at the mercy of training a flow-matching model". That seems like strongly negative language without the justification for what you dislike about flow-matching models.


*What's the core technical challenge?*

Traditional robot control relies on continuous action spaces that align closely with the underlying physics of robotic systems. Torques, velocities, and end-effector motions are naturally continuous. Classical control theory is built around preserving smoothness, stability, and reactivity under this assumption. But motivated by the success of token-based generative modeling, a growing body of work replaces continuous control with action tokens: latent discrete symbols that are decoded into continuous actions at execution time.
Contributor

Why is the field moving towards action tokenization if the control-based approach is working well? How would you compare these two approaches?

Collaborator

@crheckman left a comment

Comments made during reading period.


---

# [2] The competitors
Collaborator

Let's call this "Methods" and focus on three elements. Reformulate into some bullets with a list of papers supporting each one.

These kinds of paragraphs read like a book report, but what we need is a technical report.


Depending on the architecture, these 255 (or so) action tokens might be usable as-is, or (as a hack if available tokens are limited) might overwrite the 255 *least used* tokens in the upstream language model.

There is some nuance in how binning is applied. [RT-1](https://arxiv.org/abs/2212.06817)’s action tokenizer discretized uniformly across the entire action space (based on minimum and maximum values seen in the data) into 256 equal-size bins. This choice was, apparently, good enough: in the corresponding ablation study, they note a -25% success rate delta when this action tokenizer is removed (though no comparison was given with other action tokenization schemes). A near-identical scheme was reused for [RT-2](https://arxiv.org/abs/2307.15818) in 2023.
Collaborator

Good enough for what?

Collaborator

I need to see a deep dive on the evaluations. Did they do any ablations against other tokenization schemes?


There is some nuance in how binning is applied. [RT-1](https://arxiv.org/abs/2212.06817)’s action tokenizer discretized uniformly across the entire action space (based on minimum and maximum values seen in the data) into 256 equal-size bins. This choice was, apparently, good enough: in the corresponding ablation study, they note a -25% success rate delta when this action tokenizer is removed (though no comparison was given with other action tokenization schemes). A near-identical scheme was reused for [RT-2](https://arxiv.org/abs/2307.15818) in 2023.

Later in 2023, Waymo's [MotionLM](https://arxiv.org/pdf/2309.16534) introduced some slight modifications, using a "Verlet wrapper" around uniformly binned deltas for each coordinate. In practice, this resulted in a substantial reduction in the number of distinct tokens required, though how this reduction arose was not fully specified. Later, during work on [OpenVLA](https://arxiv.org/pdf/2406.09246) in early 2024, it was noticed that computing bins from the minimum and maximum actuator values was vulnerable to outliers: although bins were of equal numerical size, the majority of data fell in a small subset of them, so precision was wasted. To address this issue, the authors of OpenVLA opted for a quantile-based approach instead, such that each bin covered the same amount of training data:
Collaborator

What exactly is a Verlet wrapper, or what was the intuition behind it?

Later in 2023, Waymo's [MotionLM](https://arxiv.org/pdf/2309.16534) introduced some slight modifications, using a "Verlet wrapper" around uniformly binned deltas for each coordinate. In practice, this resulted in a substantial reduction in the number of distinct tokens required, though how this reduction arose was not fully specified. Later, during work on [OpenVLA](https://arxiv.org/pdf/2406.09246) in early 2024, it was noticed that computing bins from the minimum and maximum actuator values was vulnerable to outliers: although bins were of equal numerical size, the majority of data fell in a small subset of them, so precision was wasted. To address this issue, the authors of OpenVLA opted for a quantile-based approach instead, such that each bin covered the same amount of training data:

$$
\text{action\_token} = \lfloor \text{quantile}(a) \cdot 255 \rfloor
$$
Collaborator

nit: this is a brilliant move, since it directly connects training data distribution with output distribution, allowing data balancing with a simple change of vocabulary!

$$
\text{action\_token} = \lfloor \text{quantile}(a) \cdot 255 \rfloor
$$

Why did these simple action tokenizers survive for so long, when tokenizers in other components of the VLA stack were more complex? While it's hard to say for certain, we suspect it was discovered early that action-tokenizer performance does not necessarily improve with scale or complexity. Binning is good enough in many cases, and choosing a more complex action tokenizer can hurt inference times (see [high-frequency performance](#high-frequency-performance)).
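To make the uniform-vs-quantile distinction concrete, here is an illustrative sketch (our own code, not from RT-1 or OpenVLA) showing how a single outlier wastes uniform bins while quantile bins keep nearby actions distinguishable:

```python
import numpy as np

def uniform_bin(a, lo, hi, n_bins=256):
    """RT-1-style: equal-width bins spanning [lo, hi] seen in the data."""
    t = (np.asarray(a, dtype=float) - lo) / (hi - lo)
    return np.clip((t * n_bins).astype(int), 0, n_bins - 1)

def quantile_bin(a, train_actions, n_bins=256):
    """OpenVLA-style: each bin covers an equal share of the training data."""
    sorted_train = np.sort(np.asarray(train_actions, dtype=float))
    # empirical quantile of each query action within the training set
    q = np.searchsorted(sorted_train, np.asarray(a, dtype=float)) / len(sorted_train)
    return np.clip((q * (n_bins - 1)).astype(int), 0, n_bins - 1)
```

If the training actions mostly lie in [0, 1] but a single outlier stretches the range to 100, uniform binning maps nearby actions like 0.5 and 0.6 to the same token, while quantile binning keeps them apart.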
Collaborator

Connecting this section with the later improvements (frequency-space tokenization) gives a clue: it all comes down to evaluations. If you are behavior cloning and computing error with an L2 configuration-space metric, then any inferred trajectory that ends up close enough to the reference is considered good. But if you zoom in on the discontinuities of actions, you see they aren't good. This is a similar notion to keyframing, which extends to both the training data distribution and the metric calculations.

On OpenVLA tasks, the correlation wasn't clear. On RT-2 tasks, they broke glass a lot -> metrics may not have been obvious enough, but evaluations demonstrated a clear problem.


In this audit, we use VQ-BeT as a particularly interesting case study - not for its optimality (though it performs well in several respects), but because it crystallizes the core design assumptions behind latent action quantization in VLA systems.

### How it works
Collaborator

It might really help to move the figure to up here, and explain through the figure. :)


VQ-BeT decomposes action generation into two explicitly separated stages: (i) offline action tokenization via residual vector quantization, and (ii) online autoregressive prediction of discrete latent codes conditioned on observations (and optionally goals). This separation reflects a deliberate reorganization of the control stack, in which representation learning, sequence modeling, and continuous execution are assigned distinct roles.

#### Stage 1: Chunk tokenization via residual VQ-VAE
Collaborator

This is a good description of the how, but it's missing the why. This could be in the form of an explanation of the gap, an ablation, or something similar.


where $g$ is an optional goal signal.

Rather than predicting a single token per timestep, the model predicts one categorical distribution per quantization layer. Training loss weights errors in primary code prediction more heavily than secondary codes, reflecting the intended coarse-to-fine structure of the latent space.
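The weighted coarse-to-fine objective described above might be sketched as follows (the 0.5 secondary-layer weight is our illustrative choice, not necessarily the paper's value):

```python
import numpy as np

def code_prediction_loss(logits_per_layer, targets_per_layer, secondary_weight=0.5):
    """Cross-entropy summed over quantization layers, with the primary
    layer weighted most heavily. Illustrative sketch; the actual
    weighting scheme may differ from the paper's.
    """
    def ce(logits, target):
        z = logits - logits.max()                      # stable log-softmax
        return float(-(z[target] - np.log(np.exp(z).sum())))
    losses = [ce(l, t) for l, t in zip(logits_per_layer, targets_per_layer)]
    # layer 0 = primary codes; later (secondary) layers count for less
    weights = [1.0] + [secondary_weight] * (len(losses) - 1)
    return sum(w * l for w, l in zip(weights, losses))
```

Down-weighting the secondary codes means a wrong coarse action mode is penalized more than a wrong fine-grained correction, matching the intended hierarchy of the latent space.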
Collaborator

@crheckman Jan 29, 2026

This is an interesting point and I realize it is not something clearly stated in the Introduction (but should be). Consider an action tokenization where each action is a digit string, $a_t = [z_0, z_1, \ldots, z_k]$ with each $z_i \in \{0, \ldots, 9\}$. The order of the digits induces a hierarchy of significance (example: 0.923146 has high significance at the 9 and almost none by the time you get to the 6), yet a token-level loss weights every digit the same against a reference. This should make clear why tokenization of actions is so important in an autoregressive-generation / masked-token training context.

Contributor

"Literal actuator awareness: Most of the time, natural language is substantially vaguer than motion, so action tokens are sufficient to represent most desired motions. But if a (byzantine) roboticist were to prompt, in natural language, "set gripper to state 50," there is no guarantee that the instruction would be exactly executed as requested. (In fact, if the fine-tuning dataset used refers primarily to gross actions, as many do, there is no guarantee that the system would associate a "gripper" language token with a gripper action token at all.)"

I think the failure modes section is interesting, especially the above. Are there any statistical examples, or cases where people have tried to measure the frequency of these occurrences?
