Inspired by the training paradigms of Large Language Models (LLMs), SmolVLA goes through a pretraining phase on general manipulation data, followed by task-specific post-training. Architecturally, it combines Transformers with **flow-matching decoders**, and is optimized for speed and low-latency inference with the following design choices:
* Skipping half of the layers of the vision model for faster inference and smaller size
* Interleaving self-attention and cross-attention blocks
- **Cross-attention (CA)**, where action tokens attend to the VLM’s features
- **Self-attention (SA)**, where action tokens attend to each other (causally—only to the past)
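The layer-skipping design choice above can be illustrated with a tiny framework-agnostic sketch (not the actual SmolVLA code; the `skip_half_layers` helper and the toy 12-layer encoder are hypothetical):

```python
def skip_half_layers(layers):
    """Keep only the first half of a layer stack (hypothetical helper)."""
    return layers[: len(layers) // 2]

# Toy "vision encoder": 12 layers, each a callable applied in sequence.
full_encoder = [lambda x, i=i: x + i for i in range(12)]
half_encoder = skip_half_layers(full_encoder)

def run(layers, x):
    for layer in layers:
        x = layer(x)
    return x

print(len(half_encoder))     # 6 layers kept
print(run(half_encoder, 0))  # 0 + 1 + 2 + 3 + 4 + 5 = 15
```

With a real transformer encoder the same idea means truncating the module list of stacked blocks, roughly halving that component's compute and parameters.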
We found that this **interleaved design** is both lighter and more effective than using full attention blocks. Models that rely only on CA or only on SA tend to sacrifice either smoothness or grounding.
In SmolVLA, CA ensures that actions are well-conditioned on perception and instructions, while SA improves **temporal smoothness**—especially critical for real-world control, where jittery predictions can result in unsafe or unstable behavior.
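A minimal NumPy sketch of this interleaved pattern (an illustration only, not the actual SmolVLA implementation; real blocks also include learned projections, normalization, and the flow-matching objective):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, causal=False):
    # Scaled dot-product attention; optional causal mask for self-attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)  # block attention to the future
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8
action_tokens = rng.normal(size=(4, d))  # 4 action tokens (toy sizes)
vlm_features = rng.normal(size=(6, d))   # 6 feature tokens from the VLM

x = action_tokens
for i in range(4):  # alternate CA and SA blocks
    if i % 2 == 0:
        # CA: action tokens attend to the VLM's features
        x = x + attention(x, vlm_features, vlm_features)
    else:
        # SA: action tokens attend causally to each other
        x = x + attention(x, x, x, causal=True)

print(x.shape)  # (4, 8)
```

The alternation is the point: CA layers inject perception and language context, while the causal SA layers let each action stay consistent with the actions before it.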
Our async stack decouples action execution from chunk prediction, yielding higher adaptability and eliminating execution lags at runtime. It relies on the following key mechanisms:
- **1. Early trigger:** When the queue length falls below a threshold (e.g., 70%), we send an observation to a **Policy Server**, calling for a new action chunk.
- **2. Decoupled threads:** The control loop keeps executing while inference runs in parallel (non-blocking).
198
198
- **3. Chunk fusion:** Overlapping actions from successive chunks are stitched together with a simple merge rule to avoid jitter.
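The text does not spell out the merge rule, so the sketch below uses one plausible choice, a fixed-weight average over the overlapping actions; `fuse_chunks` and its `w_new` weight are hypothetical, not the lerobot implementation:

```python
import numpy as np

def fuse_chunks(old_tail, new_head, w_new=0.5):
    # Fixed-weight blend of the overlapping actions (hypothetical merge rule).
    return (1 - w_new) * old_tail + w_new * new_head

old_chunk = np.array([[0.0], [1.0], [2.0], [3.0]])  # actions still in the queue
new_chunk = np.array([[3.0], [4.0], [5.0], [6.0]])  # freshly predicted chunk
overlap = 2  # the last 2 queued actions overlap the first 2 new ones

fused = fuse_chunks(old_chunk[-overlap:], new_chunk[:overlap])
queue = np.concatenate([old_chunk[:-overlap], fused, new_chunk[overlap:]])
print(queue.ravel().tolist())  # [0.0, 1.0, 2.5, 3.5, 5.0, 6.0]
```

Blending the overlap instead of switching chunks abruptly is what smooths out the discontinuity (and hence the jitter) at the chunk boundary.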
We are really excited about releasing asynchronous inference because it guarantees greater adaptability and improved performance without changing the model. In short, async inference keeps the robot responsive by overlapping execution and remote prediction.
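The three mechanisms can be sketched with standard Python threading (a toy illustration under assumed parameters, not the actual lerobot stack; `predict_chunk`, the chunk size, and the queue stand in for the real Policy Server interface, and chunk fusion is omitted for brevity):

```python
import queue
import threading
import time

CHUNK = 8        # actions per predicted chunk (hypothetical)
THRESHOLD = 0.7  # early-trigger fraction from the text

def predict_chunk(obs):
    """Stand-in for a remote Policy Server call."""
    time.sleep(0.01)  # simulated inference latency
    return [obs + i for i in range(CHUNK)]

actions = queue.Queue()
for a in predict_chunk(0):
    actions.put(a)

executed, inflight = [], None

def refill(obs):
    for a in predict_chunk(obs):
        actions.put(a)

for step in range(20):  # the control loop never stops to wait for inference
    # 1. Early trigger: request a new chunk before the queue runs dry.
    low = actions.qsize() < THRESHOLD * CHUNK
    if low and (inflight is None or not inflight.is_alive()):
        inflight = threading.Thread(target=refill, args=(len(executed),))
        inflight.start()  # 2. Decoupled threads: inference runs in parallel
    executed.append(actions.get())  # execution continues from the queue
    time.sleep(0.005)  # simulated control period

if inflight is not None:
    inflight.join()
print(len(executed))  # 20 actions executed without pausing for inference
```

Because the refill thread is launched while the queue still holds actions, the control loop keeps consuming at a steady rate instead of stalling at each chunk boundary.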
## Community Datasets
This results in more responsive and robust real-world performance, especially in dynamic environments with shifting objects or external disturbances.
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/640e21ef3c82bd463ee5a76d/Goxb9y5cE_Ty1SWCetCoT.png" alt="Asynchronous vs. Synchronous Inference in Real-World Tasks." width="500"/>
<p>Figure 5. Asynchronous vs. Synchronous Inference in Real-World Tasks.
(a) Task success rates (%), (b) average completion time (s), and (c) number of tasks completed within a fixed time window.