Conversation
@crheckman I didn't squash commits since this is a work in progress and we won't be merging into staging yet. LMK if you're holding strictly to that requirement and I can fix it.
Force-pushed from be4e825 to c817928
🚀 Audit Rendered
Your paper audit preview is ready for review!
🔗 Preview URL: https://arpg.github.io/vla-foundations/staging/pulls/23/textbook/audits/staging/
Review Checklist
Next Steps
This preview will be removed when the PR is closed.
| Before 2023, vision-language alignment relied on expensive, human-labeled datasets. With the introduction of LLaVA in 2023, general-purpose LLMs were able to follow visual instructions by treating visual tokens as a "foreign language" prefix to a conversation. | ||
| Since LLaVA, the landscape of Vision-Language Models (VLMs) has transitioned from modular bridging architectures to native multimodal/omni-architecture systems. Early innovations like **LLaVA** introduced the concept of visual instruction tuning, treating image features as "foreign language" tokens. **Prismatic** later refined this by auditing the design space and optimizing its Prisms (doublecheck what their models are called) by fusing semantic and geometric encoders to minimize information decay. And lastly, the state of the art (kali note: this might be overstated) is defined by **Qwen3-VL**, which replaces the bridge with a unified latent space for text, images, and video, enabling long-context agency and a self-correcting thinking mode. |
the state of the art
it is indeed the state of the art, but we all know how fast this field moves. This document should be "timeless" in that statements of the state of the art should be qualified (e.g., "it is the state of the art as of early 2026").
| where $W_1 \in \mathbb{R}^{1024 \times 4096}$ and $W_2 \in \mathbb{R}^{4096 \times 4096}$. | ||
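For readers, here is a minimal sketch of a 2-layer GELU projector with exactly those shapes (a PyTorch-style illustration; the module and variable names are ours, not LLaVA's actual identifiers):

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """2-layer GELU MLP mapping CLIP features (1024-d) into the LLM embedding space (4096-d).

    Dimensions follow the W_1 (1024x4096) and W_2 (4096x4096) shapes quoted above;
    names are illustrative, not LLaVA's actual module names.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, llm_dim)   # W_1: 1024 -> 4096
        self.act = nn.GELU()
        self.fc2 = nn.Linear(llm_dim, llm_dim)      # W_2: 4096 -> 4096

    def forward(self, z_v: torch.Tensor) -> torch.Tensor:
        # z_v: (num_patches, 1024) visual tokens from the vision encoder
        return self.fc2(self.act(self.fc1(z_v)))    # (num_patches, 4096)

# Example: project the 576 patch tokens of a 336x336 image
h_v = MLPProjector()(torch.randn(576, 1024))
print(h_v.shape)  # torch.Size([576, 4096])
```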
| LLaVA's patch calculation (how it processes images through its vision encoder): | ||
| For ViT-L/14 with patch size $p = 14$ and input resolution 336×336: |
Is there any reasoning/justification provided for the chosen input resolution and patch size?
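As a sanity check on the arithmetic quoted above, the patch count for any resolution/patch-size pair can be computed directly (plain Python; the values just restate the quoted ViT-L/14 setups):

```python
def num_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patches a ViT produces for a given input resolution."""
    assert height % patch == 0 and width % patch == 0, "resolution must be divisible by patch size"
    return (height // patch) * (width // patch)

print(num_patches(336, 336, 14))  # 576 (24 x 24 grid, LLaVA 1.5)
print(num_patches(224, 224, 14))  # 256 (16 x 16 grid, LLaVA 1.0)
```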
| ## Introduction: The Multimodal Trajectory (2023–2025) | ||
| Multi-modality is the framework that enables a model to process and generate information across disparate data types; for VLMs, this is the gap between pixels and linguistic/textual tokens. In robotics, this represents the shift from a robot that sees its environment versus one that can understand or reason about its physical interactions within it. |
In robotics, this represents the shift from a robot that sees its environment versus one that can understand or reason about its physical interactions within it.
this is very fluffy. Your first sentence is clear: "gap between vision and text." What is the gap in robotics explicitly?
my recommendation: move statements about robotics to later areas, and focus on multimodality as an enhancement beyond vision e.g. lidar encoders or scene encoders.
| Multi-modality is the framework that enables a model to process and generate information across disparate data types; for VLMs, this is the gap between pixels and linguistic/textual tokens. In robotics, this represents the shift from a robot that sees its environment versus one that can understand or reason about its physical interactions within it. | ||
| Before 2023, vision-language alignment relied on expensive, human-labeled datasets. With the introduction of LLaVA in 2023, general-purpose LLMs were able to follow visual instructions by treating visual tokens as a "foreign language" prefix to a conversation. |
by treating visual tokens as a "foreign language" prefix to a conversation
While I appreciate this comparison/metaphor, if it's going to be in this document it shouldn't be in the abstract of a technical write-up. These first paragraphs should be direct and technical: "tokens going in are directly compared with language tokens to utilize cross attention more effectively."
| | Feature | LLaVA 1.0 | LLaVA 1.5 | | ||
| | :--- | :--- | :--- | | ||
| | **Connector** | **Linear Projection**: A single trainable matrix aligning feature spaces. | **2-layer MLP (GELU)**: A non-linear bridge that better interprets visual features. | |
Did they give any reason why they didn't use a nonlinear function other than GELU?
| ## Part I. LLaVA: Visual Instruction Tuning | ||
| ### 1. Novelty & Contribution | ||
| The primary novelty of LLaVA was the introduction of Visual Instruction Tuning, the process of using a language-only GPT-4 to generate multimodal instruction-following data (158K samples) from text-only image captions. |
This is a HUGE part of LLaVA and needs to be explicitly discussed in this write-up. There is a great figure in the original LLaVA paper that speaks to this. The LLaVA-1.5 paper focuses more on architecture and image-slicing choices, which are kind of minor in comparison and have the correct amount of content here.
Just a quick reminder to add figures like this one to the writeup.
| → No benefit from multi-stage. Single-stage saves 20-30% compute. | ||
| **RQ2: Freeze or fine-tune vision encoder?** | ||
| → **Freeze** the vision encoder. Unfreezing degrades performance (contradicts some prior work; Prismatic attributes this to lacking LoRA). |
Not sure I understand what Prismatic is attributing to "lacking LoRA". Are they suggesting that LoRA is needed to keep from degrading performance when unfrozen? Or is there a current LoRA implementation that is somehow lacking?
| | Feature | LLaVA 1.0 | LLaVA 1.5 | | ||
| | :--- | :--- | :--- | | ||
| | **Connector** | **Linear Projection**: A single trainable matrix aligning feature spaces. | **2-layer MLP (GELU)**: A non-linear bridge that better interprets visual features. | | ||
| | **Input Res.** | 224px (standard CLIP input). | 336px (upscaled for fine-grained OCR). | |
May want to spell out OCR explicitly here, then use the acronym in the remainder.
| $$Z_v = Z_v^{\text{full}}[1:] \in \mathbb{R}^{576 \times 1024}$$ | ||
| The [CLS] Token/Departure from CLIP: | ||
| CLIP adds a special [CLS] token at the beginning (common in transformer models). This token is meant to represent the entire image globally. So CLIP actually outputs 577 tokens (1 [CLS] + 576 patches). By default, LLaVA drops the [CLS] token and only uses the 576 spatial patches. The [CLS] token is a global summary but loses spatial details; LLaVA preserves more fine-grained spatial information. |
I appreciate the detail of these calculations here, but the punchline is not obvious. The [CLS] token helps and is the standard (although there exist plenty of alternatives/research against this line of work), and this section should explain why. The exact number of tokens and arithmetic should probably be moved to an appendix.
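For concreteness, the slicing $Z_v = Z_v^{\text{full}}[1:]$ quoted above is literally just dropping the first of the 577 CLIP output tokens (illustrative tensor shapes; not LLaVA's actual code):

```python
import torch

# Suppose the CLIP ViT-L/14 encoder returns 577 hidden states per image:
# index 0 is the global [CLS] token, indices 1..576 are the spatial patch tokens.
clip_hidden_states = torch.randn(1, 577, 1024)   # (batch, 1 + 24*24 patches, feature dim)

cls_token = clip_hidden_states[:, :1, :]         # (1, 1, 1024) global summary, dropped by LLaVA
patch_tokens = clip_hidden_states[:, 1:, :]      # (1, 576, 1024) spatial grid kept as Z_v

print(cls_token.shape, patch_tokens.shape)
```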
| | **Input Res.** | 224px (standard CLIP input). | 336px (upscaled for fine-grained OCR). | | ||
| | **Tokens ($N_p$)** | 256 patches. | 576 patches ($24 \times 24$ grid). | | ||
| [add image showing the CLIP encoder, MLP connector, and LLM backbone] |
| $$Z_v = Z_v^{\text{full}}[1:] \in \mathbb{R}^{576 \times 1024}$$ | ||
| The [CLS] Token/Departure from CLIP: | ||
| CLIP adds a special [CLS] token at the beginning (common in transformer models). This token is meant to represent the entire image globally. So CLIP actually outputs 577 tokens (1 [CLS] + 576 patches). By default, LLaVA drops the [CLS] token and only uses the 576 spatial patches. The [CLS] token is a global summary but loses spatial details; LLaVA preserves more fine-grained spatial information. |
What is the advantage to dropping the CLS token? Is this intentional?
| [add image showing the CLIP encoder, MLP connector, and LLM backbone] | ||
| The mapping $W: \mathbb{R}^{1024} \rightarrow \mathbb{R}^{4096}$ appears to increase dimensionality, but this does **not** increase information content. |
It's unclear how the following theorem supports this statement. Could you write this in English first (where does 576 come into play wrt 4096 and 1024, for the architecture?) and then the theorem/proof? (nice touch btw)
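One way to state the rank argument in English before the theorem: a linear map from $\mathbb{R}^{1024}$ to $\mathbb{R}^{4096}$ has rank at most 1024, so the projected tokens live in (at most) a 1024-dimensional subspace of the 4096-dimensional embedding space, and since each image contributes only 576 tokens, the per-image span is at most 576-dimensional. A quick numerical illustration, with random matrices standing in for the learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 4096))    # stand-in for the learned projection, W in R^{1024 x 4096}
Z_v = rng.standard_normal((576, 1024))   # 576 visual tokens, 1024-d each

H_v = Z_v @ W                            # (576, 4096): each token now lives in the 4096-d LLM space
print(np.linalg.matrix_rank(W))          # 1024 (not 4096): the image of the map is a 1024-d subspace
print(np.linalg.matrix_rank(H_v))        # <= min(576, 1024) = 576: the projection creates no new directions
```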
| ## Introduction: The Multimodal Trajectory (2023–2025) | ||
| Multi-modality is the framework that enables a model to process and generate information across disparate data types; for VLMs, this is the gap between pixels and linguistic/textual tokens. In robotics, this represents the shift from a robot that sees its environment versus one that can understand or reason about its physical interactions within it. |
What are some examples of multi-modal data? In robotics it could mean different sources of data (sensors, exo/ego cameras, audio, ++), but more broadly, multi-modal data is used across AI/ML. The first statement could be expanded further to explain the bridge and what the data types are.
Force-pushed from eb459b8 to 57cbed1
Following new audit workflow. File now in staging/ for review.
Changes:
- Split long sentences into individual lines for easier PR review
- Reformatted table to avoid overly long cells
- Added connector evolution section with proper line breaks
All linter checks now pass.
The MDX file was missing YAML frontmatter which caused metadata (title, author, paper, topic) to not be parsed correctly.
Force-pushed from 55db0d2 to e728f59
The `<3.0 seconds>` text was being parsed as an HTML tag by MDX. Escaped the angle brackets so it renders as plain text.
MDX has strict parsing rules and doesn't handle HTML comments well in all contexts. Removed commented-out sections to fix build errors.
Removed final HTML comment block and 'GPT idea' placeholder text that was causing MDX parsing errors.
| **Why it’s hard (first principles):** | ||
| * **Long context** amplifies compute cost and attention memory. |
I think it would be nice to mention just how much it amplifies the cost and memory, with specific values.
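For rough intuition (generic transformer arithmetic, not Qwen3-VL's published configuration): naive self-attention materializes an $n \times n$ score matrix per head, so score memory grows quadratically with context length. A toy calculator, with head count and dtype as illustrative assumptions:

```python
def attention_score_memory_gib(seq_len: int, num_heads: int = 32, bytes_per_elem: int = 2) -> float:
    """Memory (GiB) for the raw n x n attention score matrices of one layer, one batch element.

    Generic transformer arithmetic for intuition only; the head count and fp16 dtype are
    illustrative assumptions, not Qwen3-VL's actual configuration.
    """
    return num_heads * seq_len * seq_len * bytes_per_elem / 2**30

for n in (4_096, 32_768, 262_144):
    print(f"{n:>8} tokens -> {attention_score_memory_gib(n):10.1f} GiB per layer (naive, no FlashAttention)")
```

Doubling the context quadruples this term, which is why long-context training leans on attention kernels that never materialize the full score matrix.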
| ### 2.2 Vision encoder choice and dynamic resolution | ||
| Qwen3-VL uses the **SigLIP-2** architecture as the vision encoder and continues training with **dynamic input resolutions**, using **2D-RoPE** and interpolated absolute embeddings (following CoMP). | ||
| They mention using specific SigLIP-2 variants (SO-400M default; Large 300M for small LLMs). |
Do they mention why they use different variants and what the outcomes are from those different variants?
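For readers unfamiliar with 2D-RoPE: one common formulation splits each head's channels in half and rotates one half by the patch's row index and the other by its column index. A minimal sketch of that idea (not Qwen3-VL's or CoMP's actual implementation):

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Standard rotary embedding over the last dim of x, for integer positions `pos`."""
    d = x.shape[-1]
    inv_freq = 1.0 / theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    ang = pos[:, None].float() * inv_freq[None, :]                              # (n, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """2D RoPE sketch: rotate the first half of channels by row index, the second half by column index."""
    d = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :d], rows), rope_1d(x[..., d:], cols)], dim=-1)

# Example: 24 x 24 patch grid, 64-dim per-head features (shapes are illustrative)
grid = 24
rows, cols = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
tokens = torch.randn(grid * grid, 64)
print(rope_2d(tokens, rows.flatten(), cols.flatten()).shape)  # torch.Size([576, 64])
```

The appeal over interpolated absolute embeddings is that the rotation depends only on relative row/column offsets, so the same weights transfer across input resolutions.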
| ## Part I. LLaVA: Visual Instruction Tuning | ||
| ### 1. Novelty & Contribution | ||
| The primary novelty of LLaVA was the introduction of Visual Instruction Tuning, the process of using a language-only GPT-4 to generate multimodal instruction-following data (158K samples) from text-only image captions. |
Just a quick reminder to add figures like this one to the writeup.
| 2. **Projection Module** $W$: Linear layer (LLaVA 1.0) or 2-layer MLP (LLaVA-1.5+) | ||
| 3. **Language Model** $f_\phi(\cdot)$: Vicuna-v1.5 (fine-tuned LLaMA-2), parameterized by $\phi$ | ||
| And its state space is defined as: |
| The projection maps visual features to the LLM's embedding space: | ||
| **LLaVA 1.0:** $H_v = W \cdot Z_v$ where $W \in \mathbb{R}^{1024 \times 4096}$ |
no reason to include the old version here; just include SOTA 1.5 or whatever has come since then
| LLaVA's patch calculation (how it processes images through its vision encoder): | ||
| For ViT-L/14 with patch size $p = 14$ and input resolution 336×336: | ||
| $$N_p = \left(\frac{H}{p}\right) \times \left(\frac{W}{p}\right) = \frac{336}{14} \times \frac{336}{14} = 24 \times 24 = 576 \text{ patches}$$ |
perfect place to add an animation of this. recommend creating your own with sora/banana
| $$N_p = \left(\frac{H}{p}\right) \times \left(\frac{W}{p}\right) = \frac{336}{14} \times \frac{336}{14} = 24 \times 24 = 576 \text{ patches}$$ | ||
| **Critical Detail:** CLIP ViT-L/14 prepends a learnable [CLS] token, producing: |
when you say "the [CLS] token is a global summary but loses spatial details," what does this mean? You also say it is a "learnable token." What is a learnable token? The token is a dictionary. Please devote some space to this.
| | TimeMarker | video dialogue & localization | explicit time reasoning | Video-LLM fusion | varies | No | | ||
| | RT-2 | robotics policy via VLA | action tokens | end-to-end VLA | VLM + robotics data | Yes | | ||
| | Octo | policy model | diffusion policy | task-conditioned | robotics obs | Yes | | ||
| | OpenVLA | open VLA | action tokens | end-to-end | robotics obs | Yes | |
What do "policy model" and "open VLA" mean here? Why are the comments for some of these under "long context strategy"?
| \text{input} = [\tau_1, v_1, \tau_2, v_2, \dots] | ||
| $$ | ||
| > **Audit critique:** This likely improves time-localization tasks (grounding, dense captioning), but it makes time **linguistic** rather than a continuous latent tied to dynamics. Great for QA; potentially weak for control. |
Sure, and this is a good point. Are there innovations beyond this that have been explored to address this?
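For reference, the interleaving above is purely a sequence-construction choice; a minimal sketch, where the `<t=...s>` literal and per-frame token lists are illustrative placeholders rather than the model's actual special tokens:

```python
from typing import List

def interleave_timestamps(frame_tokens: List[List[str]], timestamps_s: List[float]) -> List[str]:
    """Build an interleaved sequence [tau_1, v_1, tau_2, v_2, ...] of timestamp text and frame tokens.

    The "<t=...s>" literal and the per-frame token lists are illustrative placeholders,
    not the actual formatting used by any particular model.
    """
    seq: List[str] = []
    for t, frame in zip(timestamps_s, frame_tokens):
        seq.append(f"<t={t:.1f}s>")   # tau_i rendered as text, i.e. time made "linguistic"
        seq.extend(frame)             # v_i: the visual tokens for that frame
    return seq

# Example: two frames, three visual tokens each
print(interleave_timestamps([["v11", "v12", "v13"], ["v21", "v22", "v23"]], [0.0, 0.5]))
```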
| ## 5. Training Details | ||
| Qwen3-VL pretraining is structured into **four stages** with growing context windows: |
This is a finding that probably cost Qwen $100MMs to figure out, if they did a full evaluation. Make sure not to sidetrack these findings.
| * fine insertion / alignment tasks | ||
| ### 6.3 Semantic-motor gap: “reasoning” ≠ “motor primitives” |
This is very fluffy, as is the merger bottleneck section. Explain why these concerns are here in terms of your critique. Maybe sharpen it to a few statements you can really get behind, rather than surveying a bunch.
| > **Audit takeaway:** Great for *understanding* and *planning narratives*. Not automatically a robot policy. | ||
| --- | ||
I stopped reviewing at this point; the remainder of this is very sketchy and needs to be solidified or removed.