
first pass on multimodality draft #23

Open
kalhamilton wants to merge 10 commits into staging from audit/gyanigkali-multimodality-draft1

Conversation

@kalhamilton
Contributor

No description provided.

@kalhamilton
Contributor Author

@crheckman I didn't squash commits since this is a work in progress and we won't be merging into staging yet. LMK if you're holding strict to that requirement and I can fix

@crheckman force-pushed the audit/gyanigkali-multimodality-draft1 branch 2 times, most recently from be4e825 to c817928 on January 22, 2026 16:19
@github-actions

github-actions bot commented Jan 22, 2026

🚀 Audit Rendered

Your paper audit preview is ready for review!

🔗 Preview URL: https://arpg.github.io/vla-foundations/staging/pulls/23/textbook/audits/staging/

Review Checklist

  • LaTeX equations render correctly
  • All sections are complete per the template
  • References are formatted properly
  • Figures/diagrams display correctly

Next Steps

  1. Review your rendered audit using the preview link above
  2. Tag @crheckman when ready for instructor review
  3. Push updates to auto-refresh the preview

This preview will be removed when the PR is closed.


Before 2023, vision-language alignment relied on expensive, human-labeled datasets. With the introduction of LLaVA in 2023, general-purpose LLMs were able to follow visual instructions by treating visual tokens as a "foreign language" prefix to a conversation.

Since LLaVA, the landscape of Vision-Language Models (VLMs) has transitioned from modular bridging architectures to native multimodal/omni architectures. Early innovations like **LLaVA** introduced the concept of visual instruction tuning, treating image features as "foreign language" tokens. **Prismatic** later refined this by auditing the design space and optimizing its Prisms (double-check what their models are called) by fusing semantic and geometric encoders to minimize information decay. And lastly, the state-of-the-art (kali note: this might be overstated) is defined by **Qwen3-VL**, which replaces the bridge with a unified latent space for text, images, and video, enabling long-context agency and a self-correcting thinking mode.
Collaborator
@crheckman Jan 22, 2026

the state of the art

it is indeed the state of the art, but we all know how fast this field moves. This document should be "timeless" in that statements of the state of the art should be qualified (e.g., "it is the state of the art as of early 2026").

where $W_1 \in \mathbb{R}^{1024 \times 4096}$ and $W_2 \in \mathbb{R}^{4096 \times 4096}$.

LLaVA's patch calculation (how it processes images through its vision encoder):
For ViT-L/14 with patch size $p = 14$ and input resolution 336×336:
Contributor

Is there any reasoning/justification provided for the chosen input resolution and patch size?


## Introduction: The Multimodal Trajectory (2023–2025)

Multi-modality is the framework that enables a model to process and generate information across disparate data types; for VLMs, this is the gap between pixels and linguistic/textual tokens. In robotics, this represents the shift from a robot that merely sees its environment to one that can understand or reason about its physical interactions within it.
Collaborator

In robotics, this represents the shift from a robot that sees its environment versus one that can understand or reason about its physical interactions within it.

this is very fluffy. Your first sentence is clear: "gap between vision and text." What is the gap in robotics explicitly?

my recommendation: move statements about robotics to later areas, and focus on multimodality as an enhancement beyond vision e.g. lidar encoders or scene encoders.


Multi-modality is the framework that enables a model to process and generate information across disparate data types; for VLMs, this is the gap between pixels and linguistic/textual tokens. In robotics, this represents the shift from a robot that merely sees its environment to one that can understand or reason about its physical interactions within it.

Before 2023, vision-language alignment relied on expensive, human-labeled datasets. With the introduction of LLaVA in 2023, general-purpose LLMs were able to follow visual instructions by treating visual tokens as a "foreign language" prefix to a conversation.
Collaborator

by treating visual tokens as "foreign language" prefix to a conversation

While I appreciate this comparison/metaphor, if it's going to be in this document it shouldn't be in the abstract of a technical write-up. These first paragraphs should be direct and technical: "tokens going in are directly compared with language tokens to utilize cross attention more effectively."


| Feature | LLaVA 1.0 | LLaVA 1.5 |
| :--- | :--- | :--- |
| **Connector** | **Linear Projection**: A single trainable matrix aligning feature spaces. | **2-layer MLP (GELU)**: A non-linear bridge that better interprets visual features. |
Contributor
@aritrach Jan 22, 2026

Did they give any reason why they didn't use a nonlinear function other than GELU?

## Part I. LLaVA: Visual Instruction Tuning

### 1. Novelty & Contribution
The primary novelty of LLaVA was the introduction of Visual Instruction Tuning, the process of using a language-only GPT-4 to generate multimodal instruction-following data (158K samples) from text-only image captions.
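
For concreteness, a minimal sketch of what one such GPT-4-generated instruction-following sample might look like; the field names, path, and wording below are illustrative assumptions, not the exact schema released with LLaVA:

```python
# Hypothetical shape of one visual-instruction-tuning sample (illustrative only).
sample = {
    "image": "coco/train2017/000000123456.jpg",   # placeholder path to the source image
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the person in the photo doing?"},
        {"from": "gpt",   "value": "The person is riding a bicycle along a tree-lined path."},
    ],
}
```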
Collaborator
@crheckman Jan 22, 2026

This is a HUGE part of llava and needs to be explicitly discussed in this write-up. There is a great figure in the original llava paper that goes to this. llava-1.5 paper focuses more on architecture and image slicing choices, which are kind of minor in comparison and have the correct amount of content here.

Collaborator

Just a quick reminder to add figures like this one to the writeup.

→ No benefit from multi-stage. Single-stage saves 20-30% compute.

**RQ2: Freeze or fine-tune vision encoder?**
→ **Freeze** the vision encoder. Unfreezing degrades performance (contradicts some prior work; Prismatic attributes this to lacking LoRA).
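
A minimal PyTorch-style sketch of what "freeze the vision encoder" means in practice; the modules here are stand-ins, not Prismatic's training code:

```python
import torch

# Stand-in modules; in a real pipeline these would be the CLIP/SigLIP tower,
# the projector, and the LLM backbone.
vision_encoder = torch.nn.Linear(1024, 1024)   # stand-in for the ViT
projector      = torch.nn.Linear(1024, 4096)   # stand-in for the connector
llm            = torch.nn.Linear(4096, 4096)   # stand-in for the language model

# RQ2 recipe: keep the vision encoder frozen, train only the projector and LLM.
for p in vision_encoder.parameters():
    p.requires_grad = False

trainable = [p for m in (projector, llm) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)
```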
Contributor

Not sure I understand what Prismatic is attributing to "lacking LoRA". Are they suggesting that LoRA is needed to keep from degrading performance when unfrozen? Or is there a current LoRA implementation that is somehow lacking?

| Feature | LLaVA 1.0 | LLaVA 1.5 |
| :--- | :--- | :--- |
| **Connector** | **Linear Projection**: A single trainable matrix aligning feature spaces. | **2-layer MLP (GELU)**: A non-linear bridge that better interprets visual features. |
| **Input Res.** | 224px (standard CLIP input). | 336px (upscaled for fine-grained OCR). |
Contributor

May want to spell out OCR explicitly here, then use the acronym in the remainder.

$$Z_v = Z_v^{\text{full}}[1:] \in \mathbb{R}^{576 \times 1024}$$
The [CLS] Token/Departure from CLIP:

CLIP adds a special [CLS] token at the beginning (common in transformer models). This token is meant to represent the entire image globally. So CLIP actually outputs 577 tokens (1 [CLS] + 576 patches). By default, LLaVA drops the [CLS] token and only uses the 576 spatial patches. The [CLS] token is a global summary but loses spatial details; LLaVA preserves more fine-grained spatial information.
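
A small sketch of the slicing described above, using a dummy tensor with the CLIP ViT-L/14 output shape (577 tokens of width 1024); the tensor is random, purely to show the indexing:

```python
import torch

batch = 1
vision_out = torch.randn(batch, 577, 1024)   # [CLS] + 24x24 = 576 patch tokens from ViT-L/14 @ 336px

cls_token    = vision_out[:, :1, :]   # global summary token (1 x 1024)
patch_tokens = vision_out[:, 1:, :]   # what LLaVA keeps: the 576 spatial patch tokens

print(patch_tokens.shape)   # torch.Size([1, 576, 1024])
```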
Collaborator
@crheckman Jan 22, 2026

I appreciate the detail of these calculations here, but the punchline is not obvious. The [CLS] token helps and is the standard (although there exist plenty of alternatives/research against this line of work), and this section should explain why. The exact number of tokens and arithmetic should probably be moved to an appendix.

| **Input Res.** | 224px (standard CLIP input). | 336px (upscaled for fine-grained OCR). |
| **Tokens ($N_p$)** | 256 patches. | 576 patches ($24 \times 24$ grid). |

[add image showing the CLIP encoder, MLP connector, and LLM backbone]
Collaborator

yes, figures please.

$$Z_v = Z_v^{\text{full}}[1:] \in \mathbb{R}^{576 \times 1024}$$
The [CLS] Token/Departure from CLIP:

CLIP adds a special [CLS] token at the beginning (common in transformer models). This token is meant to represent the entire image globally. So CLIP actually outputs 577 tokens (1 [CLS] + 576 patches). By default, LLaVA drops the [CLS] token and only uses the 576 spatial patches. The [CLS] token is a global summary but loses spatial details; LLaVA preserves more fine-grained spatial information.
Collaborator

What is the advantage to dropping the CLS token? Is this intentional?

[add image showing the CLIP encoder, MLP connector, and LLM backbone]


The mapping $W: \mathbb{R}^{1024} \rightarrow \mathbb{R}^{4096}$ appears to increase dimensionality, but this does **not** increase information content.
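
A quick numerical check of this claim (illustrative NumPy, not part of the draft): features projected from 1024 to 4096 dimensions still live on a subspace of rank at most 1024, so the mapping creates no new information.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 4096))    # linear projection R^1024 -> R^4096

print(np.linalg.matrix_rank(W))          # 1024: the image of W is a 1024-dim subspace of R^4096

# Any batch of projected features therefore has rank at most 1024,
# even though each output vector has 4096 entries.
Z = rng.standard_normal((2000, 1024))    # 2000 example feature vectors
H = Z @ W                                # shape (2000, 4096)
print(np.linalg.matrix_rank(H))          # 1024
```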
Collaborator

It's unclear how the following theorem supports this statement. Could you write this in English first (where does 576 come into play wrt 4096 and 1024, for the architecture?) and then the theorem/proof? (nice touch btw)


## Introduction: The Multimodal Trajectory (2023–2025)

Multi-modality is the framework that enables a model to process and generate information across disparate data types; for VLMs, this is the gap between pixels and linguistic/textual tokens. In robotics, this represents the shift from a robot that merely sees its environment to one that can understand or reason about its physical interactions within it.
Contributor

What are some examples of multi-modal data? In robotics it could mean different sources of data (sensors, exo/ego camera, audio, ++) but in the broader vision multi-modal data is used across AI/ML. The first statement could be expanded further to explain the bridge and what the data types are.

@crheckman force-pushed the audit/gyanigkali-multimodality-draft1 branch from eb459b8 to 57cbed1 on January 22, 2026 17:36
Kali Hamilton and others added 6 commits January 22, 2026 13:37
Following new audit workflow. File now in staging/ for review.
Changes:
- Split long sentences into individual lines for easier PR review
- Reformatted table to avoid overly long cells
- Added connector evolution section with proper line breaks

All linter checks now pass.
The MDX file was missing YAML frontmatter which caused metadata
(title, author, paper, topic) to not be parsed correctly.
@crheckman force-pushed the audit/gyanigkali-multimodality-draft1 branch from 55db0d2 to e728f59 on January 22, 2026 20:37
The <3.0 seconds> text was being parsed as an HTML tag by MDX.
Changed to &lt;3.0 seconds&gt; to properly escape the angle brackets.
MDX has strict parsing rules and doesn't handle HTML comments
well in all contexts. Removed commented-out sections to fix
build errors.
Removed final HTML comment block and 'GPT idea' placeholder text
that was causing MDX parsing errors.

**Why it’s hard (first principles):**

* **Long context** amplifies compute cost and attention memory.
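
To make this concrete, a back-of-the-envelope sketch; the layer count, head count, and head dimension below are illustrative assumptions, not Qwen3-VL's actual configuration. KV-cache memory grows linearly with context length, while attention score computation grows quadratically.

```python
# Illustrative scaling arithmetic; dims are assumptions, not Qwen3-VL's config.
layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2   # fp16/bf16

def kv_cache_gb(seq_len: int) -> float:
    # Two tensors (K and V) per layer, each of shape seq_len x kv_heads x head_dim.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val / 1e9

for n in (8_192, 32_768, 262_144):
    print(f"{n:>8} tokens: KV cache ~ {kv_cache_gb(n):6.2f} GB, attention scores scale with n^2 = {n**2:.2e}")
```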
Contributor

I think it would be nice to mention just how much it amplifies the cost and memory, with specific values.

### 2.2 Vision encoder choice and dynamic resolution

Qwen3-VL uses the **SigLIP-2** architecture as the vision encoder and continues training with **dynamic input resolutions**, using **2D-RoPE** and interpolated absolute embeddings (following CoMP).
They mention using specific SigLIP-2 variants (SO-400M default; Large 300M for small LLMs).
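
As a rough illustration of the 2D-RoPE idea (a generic sketch of two-axis rotary embeddings, not Qwen3-VL's or CoMP's actual implementation; the helper names and dimensions are assumptions): one half of each head's channels is rotated by the patch's row index and the other half by its column index.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard rotary embedding over the last dim of x, driven by integer positions `pos`."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = pos.float()[:, None] * freqs[None, :]                     # (N, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """Rotate one half of the channels by the row index, the other half by the column index."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], dim=-1)

# Example: a 24x24 patch grid with head dimension 64 (illustrative numbers).
grid, head_dim = 24, 64
x = torch.randn(grid * grid, head_dim)
rows = torch.arange(grid).repeat_interleave(grid)  # row index of each patch
cols = torch.arange(grid).repeat(grid)             # column index of each patch
print(rope_2d(x, rows, cols).shape)                # torch.Size([576, 64])
```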
Contributor

Do they mention why they use different variants and what the outcomes are from those different variants?

## Part I. LLaVA: Visual Instruction Tuning

### 1. Novelty & Contribution
The primary novelty of LLaVA was the introduction of Visual Instruction Tuning, the process of using a language-only GPT-4 to generate multimodal instruction-following data (158K samples) from text-only image captions.
Collaborator

Just a quick reminder to add figures like this one to the writeup.

2. **Projection Module** $W$: Linear layer (LLaVA 1.0) or 2-layer MLP (LLaVA-1.5+)
3. **Language Model** $f_\phi(\cdot)$: Vicuna-v1.5 (fine-tuned LLaMA-2), parameterized by $\phi$

And its state space is defined as:
Collaborator

its


The projection maps visual features to the LLM's embedding space:

**LLaVA 1.0:** $H_v = W \cdot Z_v$ where $W \in \mathbb{R}^{1024 \times 4096}$
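
A minimal sketch of the two connector variants discussed here, using the dimensions given in this write-up (1024-dim CLIP features, 4096-dim LLM embeddings); this is illustrative code, not the reference implementation:

```python
import torch.nn as nn

d_vision, d_llm = 1024, 4096

# LLaVA 1.0: a single trainable projection matrix W (1024 x 4096).
linear_connector = nn.Linear(d_vision, d_llm, bias=False)

# LLaVA 1.5: a 2-layer MLP with GELU, i.e. W_2 * GELU(W_1 * z_v).
mlp_connector = nn.Sequential(
    nn.Linear(d_vision, d_llm),   # W_1 in R^{1024 x 4096}
    nn.GELU(),
    nn.Linear(d_llm, d_llm),      # W_2 in R^{4096 x 4096}
)
```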
Collaborator

no reason to include old version here, just include sota 1.5 or whatever is since then

LLaVA's patch calculation (how it processes images through its vision encoder):
For ViT-L/14 with patch size $p = 14$ and input resolution 336×336:

$$N_p = \left(\frac{H}{p}\right) \times \left(\frac{W}{p}\right) = \frac{336}{14} \times \frac{336}{14} = 24 \times 24 = 576 \text{ patches}$$
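
The same arithmetic as a tiny helper (illustrative; assumes the image has already been resized/cropped to a multiple of the patch size):

```python
def num_patches(height: int, width: int, patch: int = 14) -> int:
    # ViT-style non-overlapping patching: one token per patch x patch tile.
    return (height // patch) * (width // patch)

print(num_patches(336, 336))   # 576  (24 x 24 grid, LLaVA 1.5)
print(num_patches(224, 224))   # 256  (16 x 16 grid, LLaVA 1.0 / standard CLIP input)
```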
Collaborator

perfect place to add an animation of this. recommend creating your own with sora/banana


$$N_p = \left(\frac{H}{p}\right) \times \left(\frac{W}{p}\right) = \frac{336}{14} \times \frac{336}{14} = 24 \times 24 = 576 \text{ patches}$$

**Critical Detail:** CLIP ViT-L/14 prepends a learnable [CLS] token, producing:
Collaborator

when you say "the [CLS] token is a global summary but loses spatial details," what does this mean? You also say it is a "learnable token." What is a learnable token? The token is a dictionary. Please devote some space to this.

| TimeMarker | video dialogue & localization | explicit time reasoning | Video-LLM fusion | varies | No |
| RT-2 | robotics policy via VLA | action tokens | end-to-end VLA | VLM + robotics data | Yes |
| Octo | policy model | diffusion policy | task-conditioned | robotics obs | Yes |
| OpenVLA | open VLA | action tokens | end-to-end | robotics obs | Yes |
Collaborator

What does this "policy model," "open VLA" mean? Why are the comments of some of these in "long context strategy"?

\text{input} = [\tau_1, v_1, \tau_2, v_2, \dots]
$$

> **Audit critique:** This likely improves time-localization tasks (grounding, dense captioning), but it makes time **linguistic** rather than a continuous latent tied to dynamics. Great for QA; potentially weak for control.
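
A rough sketch of the interleaving described above; the `<t seconds>` text rendering follows the commit note earlier in this PR, while the helper, tokenizer, and per-frame encoder below are stand-ins introduced for illustration:

```python
def interleave_timestamps(frame_tokens, timestamps, tokenize):
    """Build [tau_1, v_1, tau_2, v_2, ...]: each frame's visual tokens are preceded
    by its timestamp rendered as ordinary text tokens."""
    seq = []
    for t, v in zip(timestamps, frame_tokens):
        seq.extend(tokenize(f"<{t:.1f} seconds>"))  # timestamp as text, e.g. "<3.0 seconds>"
        seq.extend(v)                               # visual tokens for that frame
    return seq

# Toy usage with stand-in tokenizer and frame features:
fake_tokenize = lambda s: [f"txt:{s}"]
frames = [["img:f0_p0", "img:f0_p1"], ["img:f1_p0", "img:f1_p1"]]
print(interleave_timestamps(frames, [0.0, 3.0], fake_tokenize))
# ['txt:<0.0 seconds>', 'img:f0_p0', 'img:f0_p1', 'txt:<3.0 seconds>', 'img:f1_p0', 'img:f1_p1']
```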
Collaborator

Sure, and this is a good point. Are there innovations beyond this that have been explored to address it?


## 5. Training Details

Qwen3-VL pretraining is structured into **four stages** with growing context windows:
Collaborator

This is a finding that probably cost Qwen $100MMs to figure out, if they did a full evaluation. Make sure not to sidetrack these findings.

* fine insertion / alignment tasks


### 6.3 Semantic-motor gap: “reasoning” ≠ “motor primitives”
Collaborator

This is very fluffy, as is the merger bottleneck section. Explain why these concerns are here in terms of your critique. Maybe sharpen it to a few statements you can really get behind, rather than surveying a bunch.

> **Audit takeaway:** Great for *understanding* and *planning narratives*. Not automatically a robot policy.

---

Collaborator

I stopped reviewing at this point, the remainder of this is very sketchy and needs to be solidified or removed.
