Inspired by the training paradigms of Large Language Models (LLMs), SmolVLA goes through a pretraining phase on general manipulation data, followed by task-specific post-training. Architecturally, it combines Transformers with **flow-matching decoders**, and is optimized for speed and low-latency inference with the following design choices:
* Skipping half of the layers of the vision model for faster inference and smaller size
* Interleaving self-attention and cross-attention blocks
- **Cross-attention (CA)**, where action tokens attend to the VLM’s features
- **Self-attention (SA)**, where action tokens attend to each other (causally—only to the past)
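The layer-skipping design choice above can be illustrated with a tiny framework-agnostic sketch (not the actual SmolVLA code; the `skip_half_layers` helper and the toy 12-layer encoder are hypothetical):

```python
def skip_half_layers(layers):
    """Keep only the first half of a layer stack (hypothetical helper)."""
    return layers[: len(layers) // 2]

# Toy "vision encoder": 12 layers, each a callable applied in sequence.
full_encoder = [lambda x, i=i: x + i for i in range(12)]
half_encoder = skip_half_layers(full_encoder)

def run(layers, x):
    for layer in layers:
        x = layer(x)
    return x

print(len(half_encoder))     # 6 layers kept
print(run(half_encoder, 0))  # 0 + 1 + 2 + 3 + 4 + 5 = 15
```

With a real transformer encoder the same idea means truncating the module list of stacked blocks, roughly halving that component's compute and parameters.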
We found that this **interleaved design** is both lighter and more effective than using full attention blocks. Models that rely only on CA or only on SA tend to sacrifice either smoothness or grounding.
In SmolVLA, CA ensures that actions are well-conditioned on perception and instructions, while SA improves **temporal smoothness**—especially critical for real-world control, where jittery predictions can result in unsafe or unstable behavior.
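A minimal NumPy sketch of this interleaved pattern (an illustration only, not the actual SmolVLA implementation; real blocks also include learned projections, normalization, and the flow-matching objective):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, causal=False):
    # Scaled dot-product attention; optional causal mask for self-attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)  # block attention to the future
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8
action_tokens = rng.normal(size=(4, d))  # 4 action tokens (toy sizes)
vlm_features = rng.normal(size=(6, d))   # 6 feature tokens from the VLM

x = action_tokens
for i in range(4):  # alternate CA and SA blocks
    if i % 2 == 0:
        # CA: action tokens attend to the VLM's features
        x = x + attention(x, vlm_features, vlm_features)
    else:
        # SA: action tokens attend causally to each other
        x = x + attention(x, x, x, causal=True)

print(x.shape)  # (4, 8)
```

The alternation is the point: CA layers inject perception and language context, while the causal SA layers let each action stay consistent with the actions before it.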
Our async stack decouples action execution from chunk prediction, yielding higher adaptability and eliminating execution lags at runtime. It relies on the following key mechanisms:
- **1. Early trigger:** When the queue length falls below a threshold (e.g., 70%), we send an observation to a **Policy Server**, calling for a new action chunk.
- **2. Decoupled threads:** The control loop keeps executing while inference runs in parallel (non-blocking).
198
198
- **3. Chunk fusion:** Overlapping actions from successive chunks are stitched together with a simple merge rule to avoid jitter.
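The text does not spell out the merge rule, so the sketch below uses one plausible choice, a fixed-weight average over the overlapping actions; `fuse_chunks` and its `w_new` weight are hypothetical, not the lerobot implementation:

```python
import numpy as np

def fuse_chunks(old_tail, new_head, w_new=0.5):
    # Fixed-weight blend of the overlapping actions (hypothetical merge rule).
    return (1 - w_new) * old_tail + w_new * new_head

old_chunk = np.array([[0.0], [1.0], [2.0], [3.0]])  # actions still in the queue
new_chunk = np.array([[3.0], [4.0], [5.0], [6.0]])  # freshly predicted chunk
overlap = 2  # the last 2 queued actions overlap the first 2 new ones

fused = fuse_chunks(old_chunk[-overlap:], new_chunk[:overlap])
queue = np.concatenate([old_chunk[:-overlap], fused, new_chunk[overlap:]])
print(queue.ravel().tolist())  # [0.0, 1.0, 2.5, 3.5, 5.0, 6.0]
```

Blending the overlap instead of switching chunks abruptly is what smooths out the discontinuity (and hence the jitter) at the chunk boundary.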
We are really excited about releasing asynchronous inference because it guarantees greater adaptability and improved performance without changing the model. In short, async inference keeps the robot responsive by overlapping execution and remote prediction.
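The three mechanisms can be sketched with standard Python threading (a toy illustration under assumed parameters, not the actual lerobot stack; `predict_chunk`, the chunk size, and the queue stand in for the real Policy Server interface, and chunk fusion is omitted for brevity):

```python
import queue
import threading
import time

CHUNK = 8        # actions per predicted chunk (hypothetical)
THRESHOLD = 0.7  # early-trigger fraction from the text

def predict_chunk(obs):
    """Stand-in for a remote Policy Server call."""
    time.sleep(0.01)  # simulated inference latency
    return [obs + i for i in range(CHUNK)]

actions = queue.Queue()
for a in predict_chunk(0):
    actions.put(a)

executed, inflight = [], None

def refill(obs):
    for a in predict_chunk(obs):
        actions.put(a)

for step in range(20):  # the control loop never stops to wait for inference
    # 1. Early trigger: request a new chunk before the queue runs dry.
    low = actions.qsize() < THRESHOLD * CHUNK
    if low and (inflight is None or not inflight.is_alive()):
        inflight = threading.Thread(target=refill, args=(len(executed),))
        inflight.start()  # 2. Decoupled threads: inference runs in parallel
    executed.append(actions.get())  # execution continues from the queue
    time.sleep(0.005)  # simulated control period

if inflight is not None:
    inflight.join()
print(len(executed))  # 20 actions executed without pausing for inference
```

Because the refill thread is launched while the queue still holds actions, the control loop keeps consuming at a steady rate instead of stalling at each chunk boundary.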
## Community Datasets
This results in more responsive and robust real-world performance, especially in dynamic environments with shifting objects or external disturbances.
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/640e21ef3c82bd463ee5a76d/Goxb9y5cE_Ty1SWCetCoT.png" alt="Asynchronous vs. Synchronous Inference in Real-World Tasks." width="500"/>
<p>Figure 5. Asynchronous vs. Synchronous Inference in Real-World Tasks.
(a) Task success rates (%), (b) average completion time (s), and (c) number of tasks completed within a fixed time window.