Prerequisite: ../01_Architecture/01_Transformer.md. Covers LLaVA, Qwen-VL, and GPT-4V architectures. See Also: ../07_Paper_Tracking/05_Multimodal_Frontiers.md (latest multimodal papers), ../../04_Solutions/09_Vertical_Scenario_Templates.md (multimodal in industry verticals).
All modern Vision-Language Models (VLMs) share a three-component design:
Image → [Vision Encoder] → [Connector/Projector] → [LLM Backbone] → Text
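The pipeline above can be traced at the shape level. A minimal sketch of the token accounting (dimensions taken from the table below; the function name is illustrative):

```python
# Shape-level sketch of the vision-encoder stage: a ViT splits the image
# into a grid of patches, and each patch becomes one visual token.

def num_visual_tokens(image_size: int, patch_size: int) -> int:
    assert image_size % patch_size == 0
    side = image_size // patch_size
    return side * side

print(num_visual_tokens(336, 14))  # 576: the 24x24 grid of CLIP ViT-L/14 at 336px
```

The same arithmetic reproduces every row of the table: 448/14 gives a 32x32 grid (1024 tokens), 224/14 gives 16x16 (256 tokens).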
| Encoder | Resolution | Patch Size | Output | Used By |
|---|---|---|---|---|
| CLIP ViT-L/14 | 336×336 | 14×14 | 576 tokens | LLaVA 1.5 |
| EVA-CLIP | 448×448 | 14×14 | 1024 tokens | InternVL |
| SigLIP-So400m | 224 / 448 | 14×14 | 256 / 1024 tokens | PaliGemma |
| Native ViT | Dynamic | 14×14 | Variable | GPT-4V |
Crucial Detail: Most VLMs (e.g., LLaVA) use features from the penultimate layer of the vision encoder; the final layer is over-specialized for contrastive matching and lacks the spatial detail needed for reasoning.
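In code, this is typically just an index of `-2` into the encoder's per-layer outputs. A sketch (the `hidden_states` list stands in for whatever per-layer outputs your encoder returns; names are illustrative, not a specific library API):

```python
# Sketch: selecting penultimate-layer features from a vision encoder.

def select_features(hidden_states, layer: int = -2):
    # -1 would be the final, contrastive-specialized layer;
    # -2 keeps more spatial detail, as LLaVA does.
    return hidden_states[layer]

# ViT-L/14 exposes 25 outputs: the patch embeddings plus 24 transformer blocks.
layers = [f"layer_{i}_features" for i in range(25)]
print(select_features(layers))  # "layer_23_features"
```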
The connector bridges the vision and language representation spaces:
- Linear Projection (LLaVA v1): Single linear layer $W \in \mathbb{R}^{d_{vis} \times d_{llm}}$. Simple but effective.
- MLP Projector (LLaVA v1.5): Two-layer MLP with GELU activation. Standard choice.
- Cross-Attention Resampler (Flamingo/Qwen-VL): Uses learnable queries to compress visual tokens to a fixed count (e.g., 256 → 64). Reduces compute cost.
- C-Abstractor (Honeybee): Convolutional abstractor preserving spatial locality.
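The first two connector styles are small enough to sketch directly. A minimal numpy version, with deliberately small dimensions (real models use on the order of $d_{vis} = 1024$, $d_{llm} = 4096$):

```python
import numpy as np

# Sketch of the linear (LLaVA v1) and MLP (LLaVA v1.5) connectors.
rng = np.random.default_rng(0)
d_vis, d_llm, n_tokens = 64, 128, 576  # toy sizes; see lead-in for real ones

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# LLaVA v1: a single linear map W in R^{d_vis x d_llm}
W = rng.standard_normal((d_vis, d_llm)) * 0.02
# LLaVA v1.5: two-layer MLP with GELU in between
W1 = rng.standard_normal((d_vis, d_llm)) * 0.02
W2 = rng.standard_normal((d_llm, d_llm)) * 0.02

vis = rng.standard_normal((n_tokens, d_vis))  # one image's visual tokens
linear_out = vis @ W            # (576, 128): now in the LLM's embedding space
mlp_out = gelu(vis @ W1) @ W2   # (576, 128)
print(linear_out.shape, mlp_out.shape)
```

Note that neither connector changes the token count; only resampler-style connectors (Flamingo, Qwen-VL) compress 576 tokens down to a fixed budget.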
Training Stage 1 (Alignment Pretraining):
- Goal: Align the vision encoder's output space with the LLM's input space.
- Data: Large-scale image-caption pairs (e.g., LAION-5B, CC-12M).
- What's trained: Only the connector. Vision encoder and LLM are frozen.
Training Stage 2 (Visual Instruction Tuning):
- Goal: Teach the model to follow visual instructions (VQA, OCR, reasoning).
- Data: High-quality visual instruction data (LLaVA-Instruct-150K, ShareGPT4V).
- What's trained: Connector + LLM (often via LoRA). Vision encoder typically stays frozen.
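The two-stage freezing recipe can be summarized as a lookup of which components receive gradients. A pure-Python sketch (real code would toggle `requires_grad` on the corresponding modules; names here are illustrative):

```python
# Which components are trained at each stage of the standard VLM recipe.
COMPONENTS = {"vision_encoder", "connector", "llm"}

def trainable_parts(stage: int) -> set:
    if stage == 1:  # alignment pretraining: connector only
        return {"connector"}
    if stage == 2:  # instruction tuning: connector + LLM (often via LoRA)
        return {"connector", "llm"}
    raise ValueError(f"unknown stage {stage}")

# The vision encoder typically stays frozen throughout both stages.
print(COMPONENTS - (trainable_parts(1) | trainable_parts(2)))  # {'vision_encoder'}
```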
Fixed-resolution encoders waste compute on simple images and lose detail on complex ones. AnyRes (LLaVA-NeXT, InternVL 1.5) addresses this:
- Divide the input image into tiles matching the encoder's native resolution.
- Encode each tile independently.
- Concatenate all tile tokens + a downscaled global view.
- Impact: Enables OCR and fine-grained document understanding.
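The three steps above imply a simple token budget. A sketch of AnyRes token accounting, assuming a 336px encoder with 14px patches (function name and tiling details are illustrative):

```python
import math

# AnyRes token budget: tiles at native resolution + one downscaled global view.

def anyres_token_count(img_w: int, img_h: int, tile: int = 336, patch: int = 14) -> int:
    tokens_per_tile = (tile // patch) ** 2                    # 576 for 336/14
    n_tiles = math.ceil(img_w / tile) * math.ceil(img_h / tile)
    global_view = tokens_per_tile                             # downscaled full image
    return n_tiles * tokens_per_tile + global_view

print(anyres_token_count(672, 672))  # 4 tiles * 576 + 576 = 2880
```

Doubling the resolution roughly quadruples the tile count, which is exactly why the token-compression techniques below matter.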
High-resolution images produce thousands of visual tokens, overwhelming the LLM context. Solutions:
- Pixel Shuffle (InternVL 2): Spatially merge adjacent tokens, reducing count by 4x.
- Perceiver Resampler (Flamingo): Fixed-length output regardless of input resolution.
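Pixel shuffle is just a reshape: each 2x2 neighborhood of visual tokens is folded into one token with 4x the channels, cutting the count by 4x. A numpy sketch (dimensions are illustrative):

```python
import numpy as np

# Pixel-shuffle token merging (InternVL 2 style): fold r x r neighborhoods
# of the token grid into single, wider tokens.

def pixel_shuffle(tokens: np.ndarray, h: int, w: int, r: int = 2) -> np.ndarray:
    # tokens: (h*w, c) grid of visual tokens in row-major order
    n, c = tokens.shape
    assert n == h * w and h % r == 0 and w % r == 0
    x = tokens.reshape(h, w, c)
    x = x.reshape(h // r, r, w // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group each r x r block together
    return x.reshape((h // r) * (w // r), r * r * c)

merged = pixel_shuffle(np.zeros((576, 64)), h=24, w=24)
print(merged.shape)  # (144, 256): 4x fewer tokens, 4x wider channels
```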
- LLaVA 1.0 (2023): Proved that a simple linear projection + visual instruction tuning is surprisingly effective.
- LLaVA 1.5: MLP projector + higher resolution (336px) → significant gains.
- LLaVA-NeXT: AnyRes + more data → competitive with proprietary models.
GPT-4V: Architecture not disclosed. Key capabilities:
- Native multi-image understanding.
- Interleaved image-text reasoning.
- OCR and spatial reasoning at production quality.
Qwen-VL:
- Cross-attention resampler with 256 learnable queries.
- Supports bounding box grounding (outputting coordinates).
- Liu et al. (2023): Visual Instruction Tuning (LLaVA).
- Alayrac et al. (2022): Flamingo: a Visual Language Model for Few-Shot Learning.
- Bai et al. (2023): Qwen-VL: A Versatile Vision-Language Model.
- Chen et al. (2023): InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (v1 report; 2024 for v1.5/v2).