Commit 0f8e1f1

Update smolvla.md (#2895)
* Changed “Skipping **the** half of the layers” → “Skipping half of the layers”
* Changed “70 %” → “70%”
* Changed “a greater adaptability” → “greater adaptability”
* Renamed duplicate caption “Figure 4.” (second occurrence) → “Figure 5.”
* Changed “found out that this interleaved design” → “found that this interleaved design”
1 parent 9aa7f93 commit 0f8e1f1

File tree

1 file changed (+5 −5 lines changed)


smolvla.md

Lines changed: 5 additions & 5 deletions
```diff
@@ -78,7 +78,7 @@ SmolVLA addresses this gap by offering an open-source, compact, and efficient VL
 
 Inspired by the training paradigms of Large Language Models (LLMs), SmolVLA goes through a pretraining phase on general manipulation data, followed by task-specific post-training. Architecturally, it combines Transformers with **flow-matching decoders**, and is optimized for speed and low-latency inference with the following design choices:
 
-* Skipping the half of the layers of the vision model for faster inference and smaller size
+* Skipping half of the layers of the vision model for faster inference and smaller size
 * Interleaving self-attention and cross-attention blocks
 * Using fewer visual tokens
 * Leveraging smaller pretrained VLMs
```
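The first design choice in this hunk, skipping half of the vision model's layers, amounts to truncating the layer stack before running it. A minimal sketch with hypothetical helper names (not the SmolVLA implementation):

```python
def skip_half(layers):
    """Keep only the first half of a stack of layer callables."""
    return layers[: len(layers) // 2]

def encode(x, layers):
    """Run an input through the (possibly truncated) layer stack."""
    for layer in layers:
        x = layer(x)
    return x

# A toy "vision model" of 8 identical layers; the truncated stack
# applies only 4 of them, roughly halving compute for this stage.
full_stack = [lambda v: v + 1] * 8
half_stack = skip_half(full_stack)
print(encode(0, half_stack))  # → 4
```

The trade-off is the usual one: later encoder layers refine features, so dropping them buys speed and size at some cost in representation quality.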
```diff
@@ -176,7 +176,7 @@ Inside the action expert, attention layers alternate between:
 - **Cross-attention (CA)**, where action tokens attend to the VLM’s features
 - **Self-attention (SA)**, where action tokens attend to each other (causally—only to the past)
 
-We found out that this **interleaved design** is both lighter and more effective than using full attention blocks. Models that rely only on CA or only on SA tend to sacrifice either smoothness or grounding.
+We found that this **interleaved design** is both lighter and more effective than using full attention blocks. Models that rely only on CA or only on SA tend to sacrifice either smoothness or grounding.
 
 In SmolVLA, CA ensures that actions are well-conditioned on perception and instructions, while SA improves **temporal smoothness**—especially critical for real-world control, where jittery predictions can result in unsafe or unstable behavior.
```
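The alternation described in this hunk can be sketched as a simple layout rule; the block labels and function name here are illustrative, not the real block classes:

```python
def interleaved_layout(n_blocks, start="CA"):
    """Alternate cross-attention ("CA") and self-attention ("SA") blocks,
    so action tokens are repeatedly re-grounded in VLM features (CA)
    and smoothed against each other (SA)."""
    other = "SA" if start == "CA" else "CA"
    return [start if i % 2 == 0 else other for i in range(n_blocks)]

print(interleaved_layout(6))  # → ['CA', 'SA', 'CA', 'SA', 'CA', 'SA']
```

A CA-only layout has no step where actions coordinate with each other; an SA-only layout has no step where they look back at perception. Interleaving covers both at half the attention cost per type.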

```diff
@@ -193,11 +193,11 @@ Modern visuomotor policies output **action chunks**—sequences of actions to ex
 
 Our async stack decouples action execution from chunk prediction, resulting in higher adaptability, and the complete lack of execution lags at runtime. It relies on the following key mechanisms:
 
-- **1. Early trigger:** When the queue length falls below a threshold (e.g., 70 %), we send an observation to a **Policy Server**, calling for a new action chunk.
+- **1. Early trigger:** When the queue length falls below a threshold (e.g., 70%), we send an observation to a **Policy Server**, calling for a new action chunk.
 - **2. Decoupled threads:** Control loop keeps executing → inference happens in parallel (non-blocking).
 - **3. Chunk fusion:** Overlapping actions from successive chunks are stitched with a simple merge rule to avoid jitter.
 
-We are really excited about releasing asynchronous inference because it guarantees a greater adaptability and improved performance without changing the model. In short, async inference keeps the robot responsive by overlapping execution and remote prediction.
+We are really excited about releasing asynchronous inference because it guarantees greater adaptability and improved performance without changing the model. In short, async inference keeps the robot responsive by overlapping execution and remote prediction.
 
 ## Community Datasets
```
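The early trigger and chunk fusion described in this hunk fit together as a small control-loop sketch. The 70% fill threshold and the averaging merge rule are illustrative assumptions here, not the exact LeRobot implementation:

```python
from collections import deque

CHUNK_SIZE = 10
TRIGGER_FRACTION = 0.7  # request a new chunk below 70% queue fill (example value)

def should_request_chunk(queue, chunk_size=CHUNK_SIZE, fraction=TRIGGER_FRACTION):
    """Early trigger: fire when the remaining action queue drops
    below a fraction of the chunk size."""
    return len(queue) < fraction * chunk_size

def fuse_chunks(old_tail, new_chunk):
    """Chunk fusion: average actions on the overlap (assumed merge rule),
    then append the non-overlapping remainder of the new chunk."""
    overlap = min(len(old_tail), len(new_chunk))
    merged = [(a + b) / 2 for a, b in zip(old_tail[:overlap], new_chunk[:overlap])]
    return merged + list(new_chunk[overlap:])

queue = deque([0.1] * 6)  # 6 of 10 actions remain -> below the 70% trigger
assert should_request_chunk(queue)
print(fuse_chunks([1.0, 3.0], [3.0, 5.0, 7.0]))  # → [2.0, 4.0, 7.0]
```

In the real stack the request would run on a separate thread (mechanism 2), so the control loop keeps popping actions from the queue while the Policy Server computes the next chunk.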

```diff
@@ -278,7 +278,7 @@ Finally, we evaluate SmolVLA under synchronous and asynchronous inference modes.
 This results in more responsive and robust real-world performance, especially in dynamic environments with shifting objects or external disturbances.
 <div align="center">
 <img src="https://cdn-uploads.huggingface.co/production/uploads/640e21ef3c82bd463ee5a76d/Goxb9y5cE_Ty1SWCetCoT.png" alt="Asynchronous vs. Synchronous Inference in Real-World Tasks." width="500"/>
-<p>Figure 4. Asynchronous vs. Synchronous Inference in Real-World Tasks.
+<p>Figure 5. Asynchronous vs. Synchronous Inference in Real-World Tasks.
 (a) Task success rates (%), (b) average completion time(s), and (c) number of tasks completed within a fixed time window.
 </p>
 </div>
```
