
first pass on multimodality draft #23

Open
kalhamilton wants to merge 10 commits into staging from audit/gyanigkali-multimodality-draft1

Conversation

@kalhamilton
Contributor

No description provided.

@kalhamilton
Contributor Author

@crheckman I didn't squash commits since this is a work in progress and we won't be merging into staging yet. LMK if you're holding strict to that requirement and I can fix

@crheckman force-pushed the audit/gyanigkali-multimodality-draft1 branch 2 times, most recently from be4e825 to c817928 on January 22, 2026 16:19
@github-actions

github-actions bot commented Jan 22, 2026

🚀 Audit Rendered

Your paper audit preview is ready for review!

🔗 Preview URL: https://arpg.github.io/vla-foundations/staging/pulls/23/textbook/audits/staging/

Review Checklist

  • LaTeX equations render correctly
  • All sections are complete per the template
  • References are formatted properly
  • Figures/diagrams display correctly

Next Steps

  1. Review your rendered audit using the preview link above
  2. Tag @crheckman when ready for instructor review
  3. Push updates to auto-refresh the preview

This preview will be removed when the PR is closed.


Before 2023, vision-language alignment relied on expensive, human-labeled datasets. With the introduction of LLaVA in 2023, general-purpose LLMs were able to follow visual instructions by treating visual tokens as a "foreign language" prefix to a conversation.

Since LLaVA, the landscape of Vision-Language Models (VLMs) has transitioned from modular bridging architectures to native multimodal/omni architectures. Early innovations like **LLaVA** introduced the concept of visual instruction tuning, treating image features as "foreign language" tokens. **Prismatic** later refined this by auditing the design space and optimizing its Prisms (double-check what their models are called) by fusing semantic and geometric encoders to minimize information decay. And lastly, the state-of-the-art (kali note: this might be overstated) is defined by **Qwen3-VL**, which replaces the bridge with a unified latent space for text, images, and video, enabling long-context agency and a self-correcting thinking mode.
Collaborator
@crheckman Jan 22, 2026

the state of the art

it is indeed the state of the art, but we all know how fast this field moves. This document should be "timeless" in that statements of the state of the art should be qualified (e.g., "it is the state of the art as of early 2026").

where $W_1 \in \mathbb{R}^{1024 \times 4096}$ and $W_2 \in \mathbb{R}^{4096 \times 4096}$.

LLaVA's patch calculation (how it processes images through its vision encoder):
For ViT-L/14 with patch size $p = 14$ and input resolution 336×336:
Contributor

Is there any reasoning/justification provided for the chosen input resolution and patch size?


## Introduction: The Multimodal Trajectory (2023–2025)

Multi-modality is the framework that enables a model to process and generate information across disparate data types; for VLMs, this is the gap between pixels and linguistic/textual tokens. In robotics, this represents the shift from a robot that merely sees its environment to one that can understand or reason about its physical interactions within it.
Collaborator

In robotics, this represents the shift from a robot that sees its environment versus one that can understand or reason about its physical interactions within it.

this is very fluffy. Your first sentence is clear: "gap between vision and text." What is the gap in robotics explicitly?

my recommendation: move statements about robotics to later areas, and focus on multimodality as an enhancement beyond vision e.g. lidar encoders or scene encoders.


Multi-modality is the framework that enables a model to process and generate information across disparate data types; for VLMs, this is the gap between pixels and linguistic/textual tokens. In robotics, this represents the shift from a robot that merely sees its environment to one that can understand or reason about its physical interactions within it.

Before 2023, vision-language alignment relied on expensive, human-labeled datasets. With the introduction of LLaVA in 2023, general-purpose LLMs were able to follow visual instructions by treating visual tokens as a "foreign language" prefix to a conversation.
Collaborator

by treating visual tokens as "foreign language" prefix to a conversation

While I appreciate this comparison/metaphor, if it's going to be in this document it shouldn't be in the abstract of a technical write-up. These first paragraphs should be direct and technical: "tokens going in are directly compared with language tokens to utilize cross attention more effectively."


| Feature | LLaVA 1.0 | LLaVA 1.5 |
| :--- | :--- | :--- |
| **Connector** | **Linear Projection**: A single trainable matrix aligning feature spaces. | **2-layer MLP (GELU)**: A non-linear bridge that better interprets visual features. |
Contributor
@aritrach Jan 22, 2026

Did they give any reason why they didn't use a nonlinear function other than GELU?

## Part I. LLaVA: Visual Instruction Tuning

### 1. Novelty & Contribution
The primary novelty of LLaVA was the introduction of Visual Instruction Tuning, the process of using a language-only GPT-4 to generate multimodal instruction-following data (158K samples) from text-only image captions.
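
For concreteness, a minimal sketch of what one such GPT-4-generated instruction-following sample might look like; the field names, path, and wording below are illustrative assumptions, not the exact schema released with LLaVA:

```python
# Hypothetical shape of one visual-instruction-tuning sample (illustrative only).
sample = {
    "image": "coco/train2017/000000123456.jpg",   # placeholder path to the source image
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the person in the photo doing?"},
        {"from": "gpt",   "value": "The person is riding a bicycle along a tree-lined path."},
    ],
}
```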
Collaborator
@crheckman Jan 22, 2026

This is a HUGE part of llava and needs to be explicitly discussed in this write-up. There is a great figure in the original llava paper that goes to this. llava-1.5 paper focuses more on architecture and image slicing choices, which are kind of minor in comparison and have the correct amount of content here.

Collaborator

Just a quick reminder to add figures like this one to the writeup.

→ No benefit from multi-stage. Single-stage saves 20-30% compute.

**RQ2: Freeze or fine-tune vision encoder?**
→ **Freeze** the vision encoder. Unfreezing degrades performance (contradicts some prior work; Prismatic attributes this to lacking LoRA).
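
A minimal PyTorch-style sketch of what "freeze the vision encoder" means in practice; the modules here are stand-ins, not Prismatic's training code:

```python
import torch

# Stand-in modules; in a real pipeline these would be the CLIP/SigLIP tower,
# the projector, and the LLM backbone.
vision_encoder = torch.nn.Linear(1024, 1024)   # stand-in for the ViT
projector      = torch.nn.Linear(1024, 4096)   # stand-in for the connector
llm            = torch.nn.Linear(4096, 4096)   # stand-in for the language model

# RQ2 recipe: keep the vision encoder frozen, train only the projector and LLM.
for p in vision_encoder.parameters():
    p.requires_grad = False

trainable = [p for m in (projector, llm) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)
```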
Contributor

Not sure I understand what Prismatic is attributing to "lacking LoRA". Are they suggesting that LoRA is needed to keep from degrading performance when unfrozen? Or is there a current LoRA implementation that is somehow lacking?

| Feature | LLaVA 1.0 | LLaVA 1.5 |
| :--- | :--- | :--- |
| **Connector** | **Linear Projection**: A single trainable matrix aligning feature spaces. | **2-layer MLP (GELU)**: A non-linear bridge that better interprets visual features. |
| **Input Res.** | 224px (standard CLIP input). | 336px (upscaled for fine-grained OCR). |
Contributor

May want to spell out OCR explicitly here, then use the acronym in the remainder.

$$Z_v = Z_v^{\text{full}}[1:] \in \mathbb{R}^{576 \times 1024}$$
The [CLS] Token/Departure from CLIP:

CLIP adds a special [CLS] token at the beginning (common in transformer models). This token is meant to represent the entire image globally. So CLIP actually outputs 577 tokens (1 [CLS] + 576 patches). By default, LLaVA drops the [CLS] token and only uses the 576 spatial patches. The [CLS] token is a global summary but loses spatial details; LLaVA preserves more fine-grained spatial information.
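
A small sketch of the slicing described above, using a dummy tensor with the CLIP ViT-L/14 output shape (577 tokens of width 1024); the tensor is random, purely to show the indexing:

```python
import torch

batch = 1
vision_out = torch.randn(batch, 577, 1024)   # [CLS] + 24x24 = 576 patch tokens from ViT-L/14 @ 336px

cls_token    = vision_out[:, :1, :]   # global summary token (1 x 1024)
patch_tokens = vision_out[:, 1:, :]   # what LLaVA keeps: the 576 spatial patch tokens

print(patch_tokens.shape)   # torch.Size([1, 576, 1024])
```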
Collaborator
@crheckman Jan 22, 2026

I appreciate the detail of these calculations here, but the punchline is not obvious. The [CLS] token helps and is the standard (although there exist plenty of alternatives/research against this line of work), and this section should explain why. The exact number of tokens and arithmetic should probably be moved to an appendix.

| **Input Res.** | 224px (standard CLIP input). | 336px (upscaled for fine-grained OCR). |
| **Tokens ($N_p$)** | 256 patches. | 576 patches ($24 \times 24$ grid). |

[add image showing the CLIP encoder, MLP connector, and LLM backbone]
Collaborator

yes, figures please.

$$Z_v = Z_v^{\text{full}}[1:] \in \mathbb{R}^{576 \times 1024}$$
The [CLS] Token/Departure from CLIP:

CLIP adds a special [CLS] token at the beginning (common in transformer models). This token is meant to represent the entire image globally. So CLIP actually outputs 577 tokens (1 [CLS] + 576 patches). By default, LLaVA drops the [CLS] token and only uses the 576 spatial patches. The [CLS] token is a global summary but loses spatial details; LLaVA preserves more fine-grained spatial information.
Collaborator

What is the advantage to dropping the CLS token? Is this intentional?

[add image showing the CLIP encoder, MLP connector, and LLM backbone]


The mapping $W: \mathbb{R}^{1024} \rightarrow \mathbb{R}^{4096}$ appears to increase dimensionality, but this does **not** increase information content.
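
A quick numerical check of this claim (illustrative NumPy, not part of the draft): features projected from 1024 to 4096 dimensions still live on a subspace of rank at most 1024, so the mapping creates no new information.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 4096))    # linear projection R^1024 -> R^4096

print(np.linalg.matrix_rank(W))          # 1024: the image of W is a 1024-dim subspace of R^4096

# Any batch of projected features therefore has rank at most 1024,
# even though each output vector has 4096 entries.
Z = rng.standard_normal((2000, 1024))    # 2000 example feature vectors
H = Z @ W                                # shape (2000, 4096)
print(np.linalg.matrix_rank(H))          # 1024
```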
Collaborator

It's unclear how the following theorem supports this statement. Could you write this in English first (where does 576 come into play wrt 4096 and 1024, for the architecture?) and then the theorem/proof? (nice touch btw)


## Introduction: The Multimodal Trajectory (2023–2025)

Multi-modality is the framework that enables a model to process and generate information across disparate data types; for VLMs, this is the gap between pixels and linguistic/textual tokens. In robotics, this represents the shift from a robot that merely sees its environment to one that can understand or reason about its physical interactions within it.
Contributor

What are some examples of multi-modal data? In robotics it could mean different sources of data (sensors, exo/ego camera, audio, ++) but in the broader vision multi-modal data is used across AI/ML. The first statement could be expanded further to explain the bridge and what the data types are.

@crheckman force-pushed the audit/gyanigkali-multimodality-draft1 branch from eb459b8 to 57cbed1 on January 22, 2026 17:36
Kali Hamilton and others added 6 commits January 22, 2026 13:37
Following new audit workflow. File now in staging/ for review.
Changes:
- Split long sentences into individual lines for easier PR review
- Reformatted table to avoid overly long cells
- Added connector evolution section with proper line breaks

All linter checks now pass.
The MDX file was missing YAML frontmatter which caused metadata
(title, author, paper, topic) to not be parsed correctly.
@crheckman force-pushed the audit/gyanigkali-multimodality-draft1 branch from 55db0d2 to e728f59 on January 22, 2026 20:37
The <3.0 seconds> text was being parsed as an HTML tag by MDX.
Changed to &lt;3.0 seconds&gt; to properly escape the angle brackets.
MDX has strict parsing rules and doesn't handle HTML comments
well in all contexts. Removed commented-out sections to fix
build errors.
Removed final HTML comment block and 'GPT idea' placeholder text
that was causing MDX parsing errors.

**Why it’s hard (first principles):**

* **Long context** amplifies compute cost and attention memory.
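
To make this concrete, a back-of-the-envelope sketch; the layer count, head count, and head dimension below are illustrative assumptions, not Qwen3-VL's actual configuration. KV-cache memory grows linearly with context length, while attention score computation grows quadratically.

```python
# Illustrative scaling arithmetic; dims are assumptions, not Qwen3-VL's config.
layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2   # fp16/bf16

def kv_cache_gb(seq_len: int) -> float:
    # Two tensors (K and V) per layer, each of shape seq_len x kv_heads x head_dim.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val / 1e9

for n in (8_192, 32_768, 262_144):
    print(f"{n:>8} tokens: KV cache ~ {kv_cache_gb(n):6.2f} GB, attention scores scale with n^2 = {n**2:.2e}")
```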
Contributor

I think it would be nice to mention just how much it amplifies the cost and memory, with specific values.

### 2.2 Vision encoder choice and dynamic resolution

Qwen3-VL uses the **SigLIP-2** architecture as the vision encoder and continues training with **dynamic input resolutions**, using **2D-RoPE** and interpolated absolute embeddings (following CoMP).
They mention using specific SigLIP-2 variants (SO-400M default; Large 300M for small LLMs).
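
As a rough illustration of the 2D-RoPE idea (a generic sketch of two-axis rotary embeddings, not Qwen3-VL's or CoMP's actual implementation; the helper names and dimensions are assumptions): one half of each head's channels is rotated by the patch's row index and the other half by its column index.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard rotary embedding over the last dim of x, driven by integer positions `pos`."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = pos.float()[:, None] * freqs[None, :]                     # (N, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """Rotate one half of the channels by the row index, the other half by the column index."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], dim=-1)

# Example: a 24x24 patch grid with head dimension 64 (illustrative numbers).
grid, head_dim = 24, 64
x = torch.randn(grid * grid, head_dim)
rows = torch.arange(grid).repeat_interleave(grid)  # row index of each patch
cols = torch.arange(grid).repeat(grid)             # column index of each patch
print(rope_2d(x, rows, cols).shape)                # torch.Size([576, 64])
```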
Contributor

Do they mention why they use different variants and what the outcomes are from those different variants?

## Part I. LLaVA: Visual Instruction Tuning

### 1. Novelty & Contribution
The primary novelty of LLaVA was the introduction of Visual Instruction Tuning, the process of using a language-only GPT-4 to generate multimodal instruction-following data (158K samples) from text-only image captions.
Collaborator

Just a quick reminder to add figures like this one to the writeup.

2. **Projection Module** $W$: Linear layer (LLaVA 1.0) or 2-layer MLP (LLaVA-1.5+)
3. **Language Model** $f_\phi(\cdot)$: Vicuna-v1.5 (fine-tuned LLaMA-2), parameterized by $\phi$

And its state space is defined as:
Collaborator

its


The projection maps visual features to the LLM's embedding space:

**LLaVA 1.0:** $H_v = W \cdot Z_v$ where $W \in \mathbb{R}^{1024 \times 4096}$
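
A minimal sketch of the two connector variants discussed here, using the dimensions given in this write-up (1024-dim CLIP features, 4096-dim LLM embeddings); this is illustrative code, not the reference implementation:

```python
import torch.nn as nn

d_vision, d_llm = 1024, 4096

# LLaVA 1.0: a single trainable projection matrix W (1024 x 4096).
linear_connector = nn.Linear(d_vision, d_llm, bias=False)

# LLaVA 1.5: a 2-layer MLP with GELU, i.e. W_2 * GELU(W_1 * z_v).
mlp_connector = nn.Sequential(
    nn.Linear(d_vision, d_llm),   # W_1 in R^{1024 x 4096}
    nn.GELU(),
    nn.Linear(d_llm, d_llm),      # W_2 in R^{4096 x 4096}
)
```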
Collaborator

no reason to include old version here, just include sota 1.5 or whatever is since then

LLaVA's patch calculation (how it processes images through its vision encoder):
For ViT-L/14 with patch size $p = 14$ and input resolution 336×336:

$$N_p = \left(\frac{H}{p}\right) \times \left(\frac{W}{p}\right) = \frac{336}{14} \times \frac{336}{14} = 24 \times 24 = 576 \text{ patches}$$
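
The same arithmetic as a tiny helper (illustrative; assumes the image has already been resized/cropped to a multiple of the patch size):

```python
def num_patches(height: int, width: int, patch: int = 14) -> int:
    # ViT-style non-overlapping patching: one token per patch x patch tile.
    return (height // patch) * (width // patch)

print(num_patches(336, 336))   # 576  (24 x 24 grid, LLaVA 1.5)
print(num_patches(224, 224))   # 256  (16 x 16 grid, LLaVA 1.0 / standard CLIP input)
```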
Collaborator

perfect place to add an animation of this. recommend creating your own with sora/banana


$$N_p = \left(\frac{H}{p}\right) \times \left(\frac{W}{p}\right) = \frac{336}{14} \times \frac{336}{14} = 24 \times 24 = 576 \text{ patches}$$

**Critical Detail:** CLIP ViT-L/14 prepends a learnable [CLS] token, producing:
Collaborator

when you say "the [CLS] token is a global summary but loses spatial details," what does this mean? You also say it is a "learnable token." What is a learnable token? The token is a dictionary. Please devote some space to this.

| TimeMarker | video dialogue & localization | explicit time reasoning | Video-LLM fusion | varies | No |
| RT-2 | robotics policy via VLA | action tokens | end-to-end VLA | VLM + robotics data | Yes |
| Octo | policy model | diffusion policy | task-conditioned | robotics obs | Yes |
| OpenVLA | open VLA | action tokens | end-to-end | robotics obs | Yes |
Collaborator

What does this "policy model," "open VLA" mean? Why are the comments of some of these in "long context strategy"?

\text{input} = [\tau_1, v_1, \tau_2, v_2, \dots]
$$

> **Audit critique:** This likely improves time-localization tasks (grounding, dense captioning), but it makes time **linguistic** rather than a continuous latent tied to dynamics. Great for QA; potentially weak for control.
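
A rough sketch of the interleaving described above; the `<t seconds>` text rendering follows the commit note earlier in this PR, while the helper, tokenizer, and per-frame encoder below are stand-ins introduced for illustration:

```python
def interleave_timestamps(frame_tokens, timestamps, tokenize):
    """Build [tau_1, v_1, tau_2, v_2, ...]: each frame's visual tokens are preceded
    by its timestamp rendered as ordinary text tokens."""
    seq = []
    for t, v in zip(timestamps, frame_tokens):
        seq.extend(tokenize(f"<{t:.1f} seconds>"))  # timestamp as text, e.g. "<3.0 seconds>"
        seq.extend(v)                               # visual tokens for that frame
    return seq

# Toy usage with stand-in tokenizer and frame features:
fake_tokenize = lambda s: [f"txt:{s}"]
frames = [["img:f0_p0", "img:f0_p1"], ["img:f1_p0", "img:f1_p1"]]
print(interleave_timestamps(frames, [0.0, 3.0], fake_tokenize))
# ['txt:<0.0 seconds>', 'img:f0_p0', 'img:f0_p1', 'txt:<3.0 seconds>', 'img:f1_p0', 'img:f1_p1']
```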
Collaborator

Sure, and this is a good point. Are there innovations beyond this that have been explored to address it?


## 5. Training Details

Qwen3-VL pretraining is structured into **four stages** with growing context windows:
Collaborator

This is a finding that probably cost Qwen $100MMs to figure out, if they did a full evaluation. Make sure not to sidetrack these findings.

* fine insertion / alignment tasks


### 6.3 Semantic-motor gap: “reasoning” ≠ “motor primitives”
Collaborator

This is very fluffy, as is the merger bottleneck section. Explain why these concerns are here in terms of your critique. Maybe sharpen it to a few statements you can really get behind, rather than surveying a bunch.

> **Audit takeaway:** Great for *understanding* and *planning narratives*. Not automatically a robot policy.

---

Collaborator

I stopped reviewing at this point, the remainder of this is very sketchy and needs to be solidified or removed.
