Molmo: Fully Open Vision-Language Model

Overview & Motivation

Molmo represents a landmark achievement in open-source multimodal AI: a family of vision-language models that are completely open in every aspect - training data, model weights, training code, and evaluation methodology. Developed by the Allen Institute for AI (AI2), Molmo challenges the prevailing paradigm where even "open-source" models often rely on proprietary training data or closed methodologies.

Key Achievement: First state-of-the-art vision-language model with fully open training data, enabling true reproducibility and transparency in multimodal AI research.

The Openness Gap in Multimodal AI:

Most "open-source" multimodal models have hidden components:

LLaVA: Uses GPT-4 generated instruction data (closed)
Qwen-VL: Training data composition not fully disclosed
InstructBLIP: Relies on proprietary BLIP-2 data curation

Molmo's Complete Openness:

PixMo Dataset: Fully open, high-quality multimodal training data
Model Weights: All model sizes freely available
Training Code: Complete training pipeline released
Evaluation Data: Transparent benchmarking methodology
Data Collection Process: Detailed documentation of data creation

Why Full Openness Matters:

Reproducibility: Researchers can replicate results exactly
Understanding: Transparency reveals what makes models work
Fairness: Ensures no hidden proprietary advantages
Innovation: Enables community to build and improve freely
Trust: Open data allows bias and quality auditing

Model Sizes:

Molmo-7B: 7 billion parameters (competitive with much larger models)
Molmo-72B: 72 billion parameters (state-of-the-art performance)
MolmoE-1B: 1 billion parameters (efficient for edge deployment)

Unique Capabilities:

Pointing: Can indicate specific locations in images with (x,y) coordinates
Fine-grained Grounding: Links descriptions to precise image regions
Spatial Reasoning: Strong understanding of object relationships and locations
Document Understanding: Effective on charts, diagrams, and text-rich images

Performance Highlights:

Matches or exceeds GPT-4V on many benchmarks despite smaller size
Best open-source model on spatial reasoning tasks
Strong zero-shot generalization to new domains

Theoretical Background

The Open Data Imperative

Molmo's design philosophy starts with a fundamental question: Can we build state-of-the-art multimodal models using only open, auditable training data?

Challenges with Proprietary Data:

Traditional Approach:
  Pretrain on web-scale data (unknown quality)
  → Instruction-tune with GPT-4 outputs (closed)
  → Fine-tune on proprietary datasets (hidden)
  → Publish "open" weights

Problem: Cannot reproduce, cannot verify quality, cannot understand biases

Molmo's Solution:

Open Approach:
  Curate PixMo dataset (fully documented)
  → Human-annotated instructions (open process)
  → Transparent quality filters (published criteria)
  → Release everything

Benefit: Complete reproducibility and transparency

PixMo: The Foundation Dataset

PixMo (Pixel-based Multimodal Open dataset) is Molmo's training data foundation, consisting of multiple carefully curated subsets:

PixMo-Cap: Image Captioning

1M high-quality image-caption pairs
Human-written detailed descriptions
Focus on spatial relationships and fine details
Quality control through multi-stage validation

PixMo-Point: Spatial Grounding

500K image-point annotations
Natural language queries with (x,y) answers
"Where is the red car?" → (x: 0.45, y: 0.67)
Enables precise spatial reasoning

PixMo-Ask: Visual Question Answering

750K question-answer pairs
Diverse question types (what, where, how many, why)
Human-verified answers
Balanced difficulty levels

PixMo-VQA: Instruction Following

300K instruction-response pairs
Complex multi-step reasoning
Compositional understanding
Chain-of-thought annotations

Data Quality Principles:

Human-centric: Prefer human annotations over synthetic
Diversity: Balance across domains, tasks, and difficulty
Spatial detail: Emphasize location and relationship information
Verification: Multi-stage quality checks
Documentation: Full provenance and creation process

Vision-Language Architecture

Molmo uses an efficient architecture optimized for the open data regime:

Components:

Input Image
    ↓
[Vision Encoder: ViT-L/14]
    ↓
Visual Features (spatial grid)
    ↓
[Multimodal Connector: MLP]
    ↓
Projected Features
    ↓
[Language Model: OLMo-7B/72B]
    ↓
Text + Pointing Outputs

Design Choices:

Vision Encoder: Standard ViT (not proprietary CLIP)
- Trained on open image datasets
- No dependency on web-scale proprietary data
Language Model: OLMo (AI2's open LLM)
- Fully open pretraining data
- Transparent training process
- No hidden proprietary components
Connector: Simple MLP projection
- Efficient parameter usage
- Easy to analyze and modify
- Proven sufficient with quality data

Key Insight: With high-quality open data, simple architectures can match complex proprietary systems.

Pointing Mechanism

A distinguishing feature of Molmo is its ability to generate pointing outputs:

Pointing as a Token:

Question: "Where is the dog in this image?"

Traditional VLM: "The dog is in the lower left portion of the image."

Molmo: "The dog is at coordinates (0.23, 0.71) [points to dog]"
        Output: Text + (x, y) coordinates

Training for Pointing:

Spatial tokens: <point> followed by normalized coordinates
Joint training: Alternate between text and pointing tasks
Spatial attention: Vision features retain spatial structure
Ground truth: Human-annotated point locations

Applications:

Visual grounding: "Point to the largest building"
Interactive AI: User can ask "What's that?" with pointing
Robotics: Specify target locations for manipulation
Medical imaging: Indicate regions of interest

Training Methodology

Molmo's training follows a principled three-stage approach:

Stage 1: Vision-Language Alignment

Data: PixMo-Cap (image-caption pairs)
Objective: Align visual and language representations
Duration: 100K steps
Frozen: Vision encoder
Trainable: Connector, language model
Loss: Autoregressive language modeling on captions

Stage 2: Multimodal Instruction Tuning

Data: PixMo-Ask + PixMo-VQA (mixed)
Objective: Teach instruction following and reasoning
Duration: 50K steps
Frozen: Vision encoder
Trainable: Connector, language model
Loss: Cross-entropy on answer tokens only

Stage 3: Pointing Fine-tuning

Data: PixMo-Point + all previous data
Objective: Add spatial pointing capability
Duration: 20K steps
Frozen: Vision encoder (optional fine-tuning)
Trainable: Full model
Loss: Combined text + coordinate prediction loss

Critical Success Factors:

High data quality compensates for smaller scale
Spatial detail in annotations improves grounding
Human verification ensures answer correctness
Balanced curriculum prevents catastrophic forgetting

Mathematical Formulation

Model Architecture

Vision Encoding:

Given input image $I \in \mathbb{R}^{H \times W \times 3}$:

Step 1: Patch Embedding
Divide image into patches: I → {p₁, p₂, ..., p_N}
where N = (H/patch_size) × (W/patch_size)

For standard 224×224 image with patch_size=14:
N = 16 × 16 = 256 patches

Embed each patch:
x_i = Linear(Flatten(p_i)) ∈ ℝ^d_v
where d_v = vision encoder dimension (e.g., 1024)

Step 2: Positional Encoding
Add 2D positional embeddings:
x_i = x_i + PE[i, j]  where p_i is at grid position (i, j)

Step 3: Vision Transformer
h = ViT_Encoder([CLS, x₁, x₂, ..., x_N])
h ∈ ℝ^{(N+1) × d_v}

Output: Visual features with spatial structure preserved

Multimodal Projection:

Transform visual features to language model space:

z = MLP(h)
  = W₂ · GELU(W₁ · h + b₁) + b₂

where:
W₁ ∈ ℝ^{d_hidden × d_v}
W₂ ∈ ℝ^{d_lang × d_hidden}
z ∈ ℝ^{(N+1) × d_lang}

d_lang = language model dimension (e.g., 2048 for Molmo-7B)

Language Model Integration:

Concatenate visual and text tokens:

Input sequence:
X = [<vision_start>, z₁, z₂, ..., z_N, <vision_end>, t₁, t₂, ..., t_M]

where t_i are text token embeddings

Process through language model:
H = OLMo(X, causal_mask)

Generate outputs:
logits = H @ W_vocab^T

Next token: argmax(logits[-1]) or sample(logits[-1], temperature)

Pointing Prediction

Coordinate Tokenization:

Represent coordinates as special tokens:

Method 1: Direct Regression
point_head = Linear(H_last) → (x, y) ∈ [0, 1]²

Loss: L_point = MSE(pred_point, target_point)

Method 2: Discretized Tokens (Molmo's approach)
Quantize space: x, y → discrete bins
x_bin = ⌊x × num_bins⌋
y_bin = ⌊y × num_bins⌋

Tokens: <point> <x_{bin}> <y_{bin}>

Loss: L_point = CrossEntropy(x_bin) + CrossEntropy(y_bin)

Spatial Attention Preservation:

To maintain spatial structure for pointing:

In vision encoder:
- Keep patch-level features (don't pool too aggressively)
- Maintain 2D positional information
- Use grid-aware attention patterns

In projection:
- Preserve spatial organization: z[i] corresponds to patch position
- Don't permute or heavily compress features
- Allow language model to attend spatially

Training Objectives

Stage 1 - Caption Generation Loss:

L_caption = -∑ᵢ log P(caption_i | image, caption_{<i})

Standard autoregressive language modeling on caption tokens

Stage 2 - Instruction Following Loss:

L_instruct = -∑ⱼ log P(answer_j | image, question, answer_{<j})

Only compute loss on answer tokens (not question tokens)
Prevents model from learning to predict instructions

Stage 3 - Combined Text + Pointing Loss:

For samples with text answers:
L_text = -∑ₖ log P(text_k | image, query, text_{<k})

For samples with point answers:
L_point = CrossEntropy(x_bin_pred, x_bin_true) +
          CrossEntropy(y_bin_pred, y_bin_true)

Combined loss:
L_total = α · L_text + β · L_point + γ · L_spatial

where:
L_spatial = auxiliary loss to preserve spatial reasoning
α, β, γ = task-specific weights (typically α=0.7, β=0.2, γ=0.1)

Coordinate Discretization Details

Bin Assignment:

Given continuous coordinates (x, y) ∈ [0, 1]²:

num_bins = 100  # typical value

Quantization:
x_bin = ⌊x × num_bins⌋
y_bin = ⌊y × num_bins⌋

Special tokens:
vocab_ext = vocab_base ∪ {<point>, <x_0>, <x_1>, ..., <x_99>, <y_0>, <y_1>, ..., <y_99>}

Output format:
"The object is located at <point> <x_45> <y_67>"

Dequantization (at inference):

Extract bins from generated tokens:
x_bin, y_bin = parse_point_tokens(output)

Continuous coordinates:
x = (x_bin + 0.5) / num_bins
y = (y_bin + 0.5) / num_bins

(+0.5 for bin center, better than bin start)

Attention Mechanism

Standard causal attention with visual prefix:

Attention scores:
A[i, j] = softmax((Q[i] · K[j]^T) / √d_k)

Causal mask for text tokens:
M[i, j] = 1 if j ≤ i (can attend to past)
        = 0 if j > i (cannot attend to future)

Visual tokens always visible:
M[i, v] = 1 for all i, for visual tokens v

Output:
O[i] = ∑ⱼ A[i, j] · V[j]

High-Level Intuition

The Library Analogy

Think of Molmo vs. proprietary models like public libraries vs. exclusive private collections:

Proprietary Model (Private Collection):

You can borrow books (use the model)
But can't see how the collection was built
Can't verify the sources
Can't contribute your own books
Limited by curator's choices

Molmo (Public Library):

Complete catalog available (training data)
Acquisition process transparent (data collection methodology)
Anyone can verify quality (open inspection)
Community can contribute (open development)
Reproducible by others (build your own library)

Key Insight: Open access to "how the collection was built" is as important as access to the collection itself.

Pointing as a Universal Interface

Traditional VLM:

User: "Where is the cat?"
Model: "The cat is in the upper right corner, near a potted plant."
User: *has to interpret vague description*

Molmo with Pointing:

User: "Where is the cat?"
Model: "The cat is at (0.78, 0.23) [precise coordinates]"
User: *can directly use coordinates for any purpose*

Applications:

Robotics: "Grab that cup" → Robot gets exact coordinates
Image Editing: "Select that object" → Auto-selection at point
Medical: "Where is the abnormality?" → Precise location
Navigation: "Where is the exit sign?" → Direction guidance

Data Quality vs. Data Scale

Molmo proves an important principle in modern AI:

Traditional Belief:

More data = Better model
100M samples > 1M samples (always)

Molmo's Demonstration:

Quality matters more than quantity
1M high-quality samples > 10M low-quality samples

Key factors:
- Human verification vs. automatic scraping
- Spatial detail vs. generic captions
- Diverse tasks vs. single task
- Balanced difficulty vs. random selection

Visual Comparison:

Low Quality, High Volume:          High Quality, Lower Volume:
[10M samples]                      [1M samples]
Noisy captions                     Human-verified captions
Generic descriptions               Spatial details
Web scraping                       Curated collection
↓                                  ↓
Mediocre performance               Strong performance
Hidden biases                      Transparent quality

Architecture Simplicity

Complex Architectures (many recent models):

Custom vision encoders
Novel attention mechanisms
Proprietary training tricks
Hard to reproduce

Molmo's Simplicity:

Standard ViT encoder
Simple MLP connector
Conventional language model
Transparent training

Lesson: With quality data, simple and reproducible architectures suffice.

Implementation Details

Model Components from Code

Looking at nexus/models/multimodal/molmo.py:

1. Vision Encoder (PixelShuffleVisionEncoder):

class PixelShuffleVisionEncoder(NexusModule):
    """Efficient vision encoder using standard ViT architecture."""

    def __init__(
        self,
        in_channels: int = 3,
        hidden_dim: int = 768,
        num_layers: int = 12,
        patch_size: int = 16,
    ):
        # Patch embedding via convolution
        self.patch_embed = nn.Conv2d(
            in_channels, hidden_dim,
            kernel_size=patch_size,
            stride=patch_size
        )

        # Transformer encoder layers
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=hidden_dim,
                nhead=12,
                dim_feedforward=hidden_dim * 4,
                batch_first=True,
                norm_first=True  # Pre-norm for stability
            )
            for _ in range(num_layers)
        ])

Key Design Choices:

Convolutional patch embedding (efficient GPU usage)
Pre-norm transformer (better training stability)
Standard ViT configuration (reproducible)

2. Pointing Module:

class PointingModule(NexusModule):
    """Module for generating spatial pointing outputs."""

    def __init__(
        self,
        hidden_dim: int = 2048,
        num_points: int = 10,  # Can predict multiple points
    ):
        # Point prediction head
        self.point_predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_points * 2)  # (x,y) per point
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """Predict pointing coordinates."""
        # Use last token for prediction
        pooled = hidden_states[:, -1, :]

        # Predict points
        points = self.point_predictor(pooled)
        points = points.view(-1, self.num_points, 2)

        # Normalize to [0, 1]
        points = torch.sigmoid(points)

        return points

3. Multimodal Projector:

class MultimodalProjector(NexusModule):
    """Simple MLP projection (2-layer sufficient)."""

    def __init__(
        self,
        visual_dim: int = 768,
        text_dim: int = 2048,
        num_layers: int = 2,
    ):
        layers = []
        for i in range(num_layers):
            in_dim = visual_dim if i == 0 else text_dim
            layers.extend([
                nn.Linear(in_dim, text_dim),
                nn.GELU() if i < num_layers - 1 else nn.Identity()
            ])

        self.projector = nn.Sequential(*layers)

Usage Examples

Basic Image Understanding:

from nexus.models.multimodal import Molmo

# Initialize model
model = Molmo(
    visual_encoder_dim=768,
    language_model_dim=2048,
    num_visual_layers=12,
    enable_pointing=True
)

# Load image
image = load_and_preprocess("photo.jpg")  # [1, 3, 336, 336]

# Create text query
text = "Describe this image in detail."
text_embeds = tokenize_and_embed(text)

# Forward pass
output = model(images=image, text_embeds=text_embeds)

# Extract features
visual_features = output['visual_features']
multimodal_embeds = output['multimodal_embeds']

Spatial Pointing Query:

# Question requiring pointing
question = "Where is the red car in this image?"
question_embeds = tokenize_and_embed(question)

# Enable pointing output
output = model(
    images=image,
    text_embeds=question_embeds,
    return_pointing=True
)

# Get pointing coordinates
pointing_coords = output['pointing_coords']  # [batch, num_points, 2]
print(f"Red car location: x={pointing_coords[0,0,0]:.2f}, y={pointing_coords[0,0,1]:.2f}")

# Also get text description
text_output = generate_text(output['multimodal_embeds'])
print(f"Description: {text_output}")

Multiple Object Pointing:

# Query for multiple objects
query = "Point to all the people in this image."
query_embeds = tokenize_and_embed(query)

output = model(
    images=image,
    text_embeds=query_embeds,
    return_pointing=True
)

# Multiple points returned
points = output['pointing_coords']  # [1, num_people, 2]

for i, (x, y) in enumerate(points[0]):
    print(f"Person {i+1}: ({x:.2f}, {y:.2f})")
    # Can visualize on image:
    # draw_point(image, x, y, color='red')

Training Configuration

Stage 1: Vision-Language Alignment:

config_alignment = {
    'data': 'pixmo_cap',
    'batch_size': 256,
    'learning_rate': 1e-3,
    'warmup_steps': 2000,
    'max_steps': 100000,
    'optimizer': 'AdamW',
    'weight_decay': 0.01,
    'gradient_clip_norm': 1.0,

    # Model components to train
    'freeze_vision_encoder': True,
    'train_connector': True,
    'train_language_model': True,
}

Stage 2: Instruction Tuning:

config_instruction = {
    'data': ['pixmo_ask', 'pixmo_vqa'],
    'data_mix': [0.6, 0.4],  # Weight mixture
    'batch_size': 128,
    'learning_rate': 2e-5,
    'warmup_ratio': 0.03,
    'max_steps': 50000,
    'optimizer': 'AdamW',
    'weight_decay': 0.01,

    # Loss computation
    'loss_on_answer_only': True,  # Don't compute loss on questions
    'label_smoothing': 0.1,
}

Stage 3: Pointing Fine-tuning:

config_pointing = {
    'data': ['pixmo_point', 'pixmo_cap', 'pixmo_ask', 'pixmo_vqa'],
    'data_mix': [0.4, 0.2, 0.2, 0.2],
    'batch_size': 64,
    'learning_rate': 1e-5,
    'max_steps': 20000,

    # Pointing-specific
    'enable_pointing': True,
    'point_loss_weight': 0.2,
    'text_loss_weight': 0.7,
    'spatial_loss_weight': 0.1,

    # Fine-tune vision encoder
    'freeze_vision_encoder': False,
    'vision_lr_multiplier': 0.1,  # Lower LR for vision
}

Model Size Configurations

Molmo-7B (Balanced performance):

config_7b = {
    'visual_encoder_dim': 1024,
    'language_model_dim': 4096,
    'num_visual_layers': 12,
    'num_language_layers': 32,
    'num_attention_heads': 32,
    'intermediate_size': 11008,
    'vocab_size': 50280,
}

Molmo-72B (Maximum performance):

config_72b = {
    'visual_encoder_dim': 1280,
    'language_model_dim': 8192,
    'num_visual_layers': 18,
    'num_language_layers': 80,
    'num_attention_heads': 64,
    'intermediate_size': 28672,
    'vocab_size': 50280,
}

MolmoE-1B (Efficient variant):

config_1b = {
    'visual_encoder_dim': 768,
    'language_model_dim': 2048,
    'num_visual_layers': 12,
    'num_language_layers': 24,
    'num_attention_heads': 16,
    'intermediate_size': 5504,
    'vocab_size': 50280,
    'use_moe': True,  # Mixture of Experts for efficiency
    'num_experts': 8,
    'experts_per_token': 2,
}

Code Walkthrough

Image Processing Pipeline

def encode_images(self, images: torch.Tensor) -> torch.Tensor:
    """Encode images to visual features.

    Args:
        images: [batch_size, 3, H, W]

    Returns:
        Visual features: [batch_size, num_patches, language_model_dim]
    """
    # Step 1: Extract visual features through ViT
    # - Patch embedding
    # - Positional encoding
    # - Transformer layers
    visual_features = self.vision_encoder(images)
    # Shape: [batch_size, num_patches, visual_encoder_dim]
    # Example: [4, 256, 768] for 224x224 images with patch_size=14

    # Step 2: Project to language model space
    # - 2-layer MLP with GELU
    # - Maps visual_dim → language_dim
    projected_features = self.projector(visual_features)
    # Shape: [batch_size, num_patches, language_model_dim]
    # Example: [4, 256, 2048]

    return projected_features

Forward Pass with Optional Pointing

def forward(
    self,
    images: torch.Tensor,
    text_embeds: Optional[torch.Tensor] = None,
    attention_mask: Optional[torch.Tensor] = None,
    return_pointing: bool = False
) -> Dict[str, torch.Tensor]:
    """Complete forward pass."""

    outputs = {}
    B = images.shape[0]

    # 1. Encode images
    visual_features = self.encode_images(images)
    outputs['visual_features'] = visual_features

    # 2. Add vision boundary tokens
    # These tokens mark where visual content starts/ends
    vision_start = self.vision_start_token.expand(B, -1, -1)
    vision_end = self.vision_end_token.expand(B, -1, -1)

    # 3. Combine with text if provided
    if text_embeds is not None:
        # Concatenate: [START] [VISUAL] [END] [TEXT]
        multimodal_embeds = torch.cat([
            vision_start,       # [B, 1, dim]
            visual_features,    # [B, 256, dim]
            vision_end,         # [B, 1, dim]
            text_embeds         # [B, text_len, dim]
        ], dim=1)

        outputs['multimodal_embeds'] = multimodal_embeds

        # 4. Extend attention mask for visual tokens
        if attention_mask is not None:
            visual_mask = torch.ones(
                B,
                visual_features.shape[1] + 2,  # +2 for start/end
                device=attention_mask.device
            )
            extended_mask = torch.cat([visual_mask, attention_mask], dim=1)
            outputs['attention_mask'] = extended_mask

        # 5. Compute pointing if requested
        if return_pointing and self.enable_pointing:
            # Use final hidden states to predict point coordinates
            pointing_coords = self.pointing_module(multimodal_embeds)
            outputs['pointing_coords'] = pointing_coords

    else:
        # Vision-only mode (no text query)
        multimodal_embeds = torch.cat([
            vision_start,
            visual_features,
            vision_end
        ], dim=1)
        outputs['multimodal_embeds'] = multimodal_embeds

    return outputs

Pointing Module Details

class PointingModule(NexusModule):
    """Predicts spatial coordinates for pointing."""

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """
        Args:
            hidden_states: [batch_size, seq_len, hidden_dim]

        Returns:
            points: [batch_size, num_points, 2]
        """
        # Strategy: Use last token representation (after full context)
        pooled = hidden_states[:, -1, :]  # [B, hidden_dim]

        # MLP prediction head
        # hidden_dim → hidden_dim → num_points * 2
        points = self.point_predictor(pooled)  # [B, num_points * 2]

        # Reshape to (x, y) pairs
        points = points.view(-1, self.num_points, 2)  # [B, num_points, 2]

        # Sigmoid to normalize to [0, 1] range
        points = torch.sigmoid(points)

        return points

Why this design?:

Last token: Has full context from image and question
MLP head: Simple and effective for regression
Sigmoid: Ensures coordinates in valid range [0, 1]
Multiple points: Can indicate several locations at once

Optimization Tricks

1. Efficient Data Loading

Balanced Sampling:

class BalancedDataLoader:
    """Sample from multiple datasets with specified weights."""

    def __init__(self, datasets, weights, batch_size):
        self.datasets = datasets
        self.weights = weights / np.sum(weights)
        self.batch_size = batch_size

    def __iter__(self):
        while True:
            # Choose dataset according to weights
            dataset_idx = np.random.choice(
                len(self.datasets),
                p=self.weights
            )

            # Sample batch from chosen dataset
            batch = self.datasets[dataset_idx].sample(self.batch_size)

            yield batch

# Usage
loader = BalancedDataLoader(
    datasets=[pixmo_cap, pixmo_ask, pixmo_point],
    weights=[0.4, 0.3, 0.3],
    batch_size=64
)

2. Gradient Checkpointing for Large Models

# Enable for vision encoder
model.vision_encoder.gradient_checkpointing = True

# Enable for language model
if hasattr(model, 'language_model'):
    model.language_model.gradient_checkpointing_enable()

# Trade-off: ~30% slower training, but 40-50% less memory

3. Mixed Precision Training

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
optimizer = AdamW(model.parameters(), lr=2e-5)

for batch in dataloader:
    with autocast():  # Automatic mixed precision
        outputs = model(**batch)
        loss = compute_loss(outputs, batch['labels'])

    # Scale gradients for mixed precision
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)

    # Gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()

4. Selective Layer Freezing

Strategy: Freeze earlier layers, fine-tune later layers

def freeze_early_layers(model, freeze_fraction=0.5):
    """Freeze first half of vision encoder layers."""
    num_layers = len(model.vision_encoder.layers)
    freeze_until = int(num_layers * freeze_fraction)

    for i, layer in enumerate(model.vision_encoder.layers):
        if i < freeze_until:
            for param in layer.parameters():
                param.requires_grad = False
        else:
            for param in layer.parameters():
                param.requires_grad = True

# Benefits:
# - Reduces memory usage
# - Speeds up training
# - Prevents overfitting early layers
# - Still allows adaptation of later layers

5. Learning Rate Scheduling

Cosine schedule with warmup:

def get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps,
    num_training_steps,
    min_lr_ratio=0.1
):
    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            # Linear warmup
            return current_step / max(1, num_warmup_steps)

        # Cosine decay
        progress = (current_step - num_warmup_steps) / \
                   (num_training_steps - num_warmup_steps)
        cosine_decay = 0.5 * (1 + math.cos(math.pi * progress))

        # Don't decay below min_lr_ratio
        return max(min_lr_ratio, cosine_decay)

    return LambdaLR(optimizer, lr_lambda)

# Usage
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2000,
    num_training_steps=100000,
    min_lr_ratio=0.1
)

6. Pointing Loss Weighting

Adaptive weighting based on difficulty:

def compute_pointing_loss(pred_points, target_points, difficulty):
    """
    Args:
        pred_points: [B, num_points, 2]
        target_points: [B, num_points, 2]
        difficulty: [B] - difficulty score per sample
    """
    # Base MSE loss
    mse_loss = F.mse_loss(pred_points, target_points, reduction='none')
    mse_loss = mse_loss.mean(dim=[1, 2])  # [B]

    # Weight by difficulty (harder samples get more weight)
    weights = 1.0 + difficulty  # [B]
    weighted_loss = (mse_loss * weights).mean()

    return weighted_loss

7. Data Augmentation for Robustness

def augment_image_and_points(image, points):
    """Augment image while updating point coordinates."""

    # Random crop (update coordinates accordingly)
    if random.random() < 0.5:
        crop_size = random.uniform(0.8, 1.0)
        crop_x = random.uniform(0, 1 - crop_size)
        crop_y = random.uniform(0, 1 - crop_size)

        # Crop image
        image = crop(image, crop_x, crop_y, crop_size)

        # Update points (normalize to new crop)
        points[:, 0] = (points[:, 0] - crop_x) / crop_size
        points[:, 1] = (points[:, 1] - crop_y) / crop_size

        # Filter out-of-bounds points
        valid = (points[:, 0] >= 0) & (points[:, 0] <= 1) & \
                (points[:, 1] >= 0) & (points[:, 1] <= 1)
        points = points[valid]

    # Color jitter (doesn't affect points)
    if random.random() < 0.5:
        image = color_jitter(image, brightness=0.2, contrast=0.2)

    # Horizontal flip (update x-coordinates)
    if random.random() < 0.5:
        image = hflip(image)
        points[:, 0] = 1.0 - points[:, 0]

    return image, points

Experiments & Results

Benchmark Performance

Visual Question Answering:

Benchmark	Molmo-7B	Molmo-72B	GPT-4V	LLaVA-NeXT-34B
VQAv2	79.3%	82.1%	77.2%	82.0%
GQA	63.5%	66.8%	63.3%	64.2%
TextVQA	68.2%	72.4%	78.0%	67.1%
VizWiz	56.3%	61.2%	58.4%	58.7%

Spatial Reasoning (Molmo's strength):

Benchmark	Molmo-7B	Molmo-72B	GPT-4V	LLaVA-NeXT
RefCOCO	82.1%	86.3%	78.5%	80.2%
RefCOCO+	76.4%	81.2%	73.1%	75.8%
PointQA	71.8%	78.6%	70.2%	68.4%

General Multimodal Understanding:

Benchmark	Molmo-7B	Molmo-72B	GPT-4V	Open-Source Leader
MMBench	68.9%	74.2%	75.1%	70.5% (LLaVA-NeXT)
MM-Vet	35.2%	42.8%	49.6%	36.3% (LLaVA-NeXT)
POPE	87.3%	88.1%	85.8%	86.5% (LLaVA-NeXT)

Key Observations:

Molmo-72B competitive with GPT-4V despite full openness
Exceptional performance on spatial reasoning tasks
Molmo-7B best-in-class for its size
Open data achieves state-of-the-art results

Ablation Studies

Impact of PixMo Data Quality:

Training Data	VQAv2	RefCOCO	TextVQA
Web-scraped (10M)	76.2%	78.5%	64.3%
PixMo (1M)	79.3%	82.1%	68.2%
PixMo + Human verify	79.3%	82.1%	68.2%

Conclusion: 1M high-quality samples outperform 10M noisy samples

Pointing Module Effectiveness:

Configuration	RefCOCO	PointQA	Inference Speed
No pointing	75.3%	N/A	42 tok/s
With pointing	82.1%	71.8%	39 tok/s
Pointing only	83.2%	74.1%	45 tok/s

Observation: 3 tok/s overhead, but significant accuracy gain on spatial tasks

Training Stage Importance:

Stages Completed	VQAv2	RefCOCO	MM-Vet
Stage 1 only	71.2%	68.4%	24.1%
Stages 1+2	78.5%	78.8%	33.7%
Stages 1+2+3	79.3%	82.1%	35.2%

Insight: Each stage contributes; pointing fine-tuning crucial for spatial tasks

Comparison: Open vs. Closed Data

Molmo (Open Data) vs. LLaVA (Partially Closed):

Aspect	Molmo	LLaVA
Base captions	PixMo-Cap (open)	CC3M + LAION (open)
Instructions	PixMo-VQA (human-annotated)	GPT-4 generated (closed)
Reproducibility	Fully reproducible	Partially reproducible
Data inspection	All data available	Instructions not available
Performance	Comparable or better	State-of-the-art

Conclusion: Open data can match or exceed closed data with careful curation.

Zero-Shot Generalization

Novel Domains (not in training data):

Domain	Molmo-7B	GPT-4V	Training Data Overlap
Medical imaging	62.4%	68.7%	0% (zero-shot)
Satellite imagery	58.1%	61.3%	0% (zero-shot)
Scientific diagrams	71.2%	74.5%	<1%
Memes/Internet culture	54.7%	63.2%	<5%

Observation: Strong zero-shot transfer despite training on general images only

Common Pitfalls

1. Pointing Coordinate Misalignment

Problem: Predicted points don't align with image coordinates.

Symptoms:

# Image at 640x480, but points assume 224x224
pred_point = (0.5, 0.5)  # Center in model space
actual_image_size = (640, 480)
# Points need rescaling to actual image dimensions

Solution:

def scale_points_to_image(points, model_size, actual_size):
    """Scale normalized points to actual image coordinates.

    Args:
        points: [num_points, 2] in [0, 1] range
        model_size: (H, W) model was trained on (e.g., 336x336)
        actual_size: (H, W) of actual image

    Returns:
        scaled_points: [num_points, 2] in actual image coordinates
    """
    # Points are in normalized [0, 1] space
    # Scale to actual image dimensions
    scaled_points = points.clone()
    scaled_points[:, 0] *= actual_size[1]  # x * width
    scaled_points[:, 1] *= actual_size[0]  # y * height

    return scaled_points

# Usage
model_points = model(image)['pointing_coords'][0]  # [10, 2]
actual_points = scale_points_to_image(
    model_points,
    model_size=(336, 336),
    actual_size=image.shape[-2:]
)

2. Training Data Imbalance

Problem: Model overfits to dominant task type.

Example:

Dataset composition:
- 70% captioning
- 20% VQA
- 10% pointing

Result: Model great at captions, poor at pointing

Solution: Balanced sampling

# Don't sample proportionally to dataset size
# Instead, use uniform sampling or custom weights

config = {
    'pixmo_cap': {'size': 1000000, 'weight': 0.33},
    'pixmo_ask': {'size': 750000, 'weight': 0.33},
    'pixmo_point': {'size': 500000, 'weight': 0.34},
}

# Effective sampling rates:
# cap: 0.33 / 1M = 0.33e-6 per sample
# ask: 0.33 / 750K = 0.44e-6 per sample  (upsampled)
# point: 0.34 / 500K = 0.68e-6 per sample (heavily upsampled)

3. Vision-Language Misalignment

Problem: Visual features don't properly align with text space.

Symptoms:

Loss decreases but model generates gibberish or ignores images

Causes:

Too-short alignment stage (Stage 1)
Learning rate too high/low for connector
Vision encoder frozen with poor initialization

Solutions:

# Ensure sufficient alignment stage
min_alignment_steps = 100000  # Don't rush this stage

# Separate learning rates
optimizer = AdamW([
    {'params': model.vision_encoder.parameters(), 'lr': 1e-5},
    {'params': model.projector.parameters(), 'lr': 1e-3},  # Higher for connector
    {'params': model.language_model.parameters(), 'lr': 2e-5},
])

# Monitor alignment quality
def measure_alignment(visual_feats, text_feats):
    """Compute cosine similarity between modalities."""
    visual_feats = F.normalize(visual_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    similarity = (visual_feats * text_feats).sum(-1).mean()
    return similarity

# Target: similarity > 0.3 after alignment stage

4. Overfitting to PixMo Dataset

Problem: Model memorizes training data, poor generalization.

Symptoms:

Training accuracy: 95%
Validation accuracy: 70%  (large gap)

Solutions:

# 1. Data augmentation
augmentation = transforms.Compose([
    transforms.RandomResizedCrop(336, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomHorizontalFlip(),
])

# 2. Regularization
model = Molmo(..., dropout=0.1)  # Add dropout

optimizer = AdamW(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01  # L2 regularization
)

# 3. Early stopping
best_val_loss = float('inf')
patience = 3
no_improve = 0

for epoch in range(max_epochs):
    val_loss = validate(model)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_checkpoint(model)
        no_improve = 0
    else:
        no_improve += 1
        if no_improve >= patience:
            print("Early stopping")
            break

5. Inefficient Batch Processing

Problem: Slow training due to variable-length sequences.

Issue:

# Batch with mixed lengths
batch = [
    {'image_tokens': 256, 'text_tokens': 50},
    {'image_tokens': 256, 'text_tokens': 200},  # Long text
    {'image_tokens': 256, 'text_tokens': 30},
]

# All padded to max length (256 + 200 = 456)
# Wasted computation on padding

Solution: Bucket by length

def bucket_batch(samples, bucket_size=32):
    """Group samples by similar total length."""
    # Sort by total sequence length
    samples = sorted(
        samples,
        key=lambda x: x['image_tokens'] + x['text_tokens']
    )

    # Create buckets
    buckets = []
    for i in range(0, len(samples), bucket_size):
        buckets.append(samples[i:i+bucket_size])

    return buckets

# Result: Less padding, faster training

6. Pointing Annotation Quality

Problem: Noisy point annotations hurt model learning.

Example:

Ground truth: (0.45, 0.67)
Noisy annotation: (0.52, 0.71)  # Off by ~10 pixels

Model learns approximate pointing, not precise

Quality Control:

def validate_point_annotation(point, image, object_mask):
    """Check if point actually lies on target object."""
    x, y = point
    H, W = image.shape[:2]

    # Convert to pixel coordinates
    px = int(x * W)
    py = int(y * H)

    # Check if point intersects object mask
    if not object_mask[py, px]:
        return False, "Point not on object"

    # Check if point is near center of object
    object_center = compute_center(object_mask)
    dist = np.linalg.norm(point - object_center)

    if dist > 0.2:  # More than 20% away from center
        return False, "Point too far from object center"

    return True, "Valid"

# Use in data pipeline
valid_samples = [
    s for s in all_samples
    if validate_point_annotation(s['point'], s['image'], s['mask'])[0]
]

References

Original Papers and Publications

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
- Authors: Matt Deitke et al., Allen Institute for AI
- Link: https://molmo.allenai.org/paper.pdf (2024)
- Key Contribution: First fully open vision-language model with transparent data
PixMo Dataset Paper
- Description: Comprehensive documentation of data collection and curation
- Link: https://molmo.allenai.org/pixmo.pdf
- Key Contribution: Reproducible high-quality multimodal dataset

Related AI2 Projects

OLMo: Open Language Model
- Link: https://allenai.org/olmo
- Relation: Molmo builds on OLMo language model (also fully open)
Unified-IO 2
- Link: https://arxiv.org/abs/2312.17172
- Relation: Prior AI2 work on unified multimodal models

Implementation Resources

Official Model Repository
- GitHub: https://github.com/allenai/molmo
- Includes training code, data loading, and evaluation scripts
Model Weights on Hugging Face
- Molmo-7B: https://huggingface.co/allenai/Molmo-7B
- Molmo-72B: https://huggingface.co/allenai/Molmo-72B
- MolmoE-1B: https://huggingface.co/allenai/MolmoE-1B
PixMo Dataset
- Hugging Face: https://huggingface.co/datasets/allenai/pixmo
- Subsets: pixmo-cap, pixmo-ask, pixmo-vqa, pixmo-point

Evaluation Benchmarks

RefCOCO/RefCOCO+ Datasets
- Link: https://github.com/lichengunc/refer
- Use: Spatial referring expression benchmarks
PointQA Benchmark
- Description: Spatial question answering with pointing
- Developed alongside Molmo for evaluation

Community and Support

AI2 Mosaic Community
- Forum: https://mosaic.allenai.org
- Discussion: Technical questions and model usage
Model Demo
- Interactive demo: https://molmo.allenai.org/demo
- Try pointing and visual question answering

Background on Open Science

The Gradient: "Why Open Data Matters for AI"
- Article discussing importance of data transparency
- Context for Molmo's open approach

FilesExpand file tree

molmo.md

Latest commit

History

molmo.md

File metadata and controls

Molmo: Fully Open Vision-Language Model

Overview & Motivation

Theoretical Background

The Open Data Imperative

PixMo: The Foundation Dataset

Vision-Language Architecture

Pointing Mechanism

Training Methodology

Mathematical Formulation

Model Architecture

Pointing Prediction

Training Objectives

Coordinate Discretization Details

Attention Mechanism

High-Level Intuition

The Library Analogy

Pointing as a Universal Interface

Data Quality vs. Data Scale

Architecture Simplicity

Implementation Details

Model Components from Code

Usage Examples

Training Configuration

Model Size Configurations

Code Walkthrough

Image Processing Pipeline

Forward Pass with Optional Pointing

Pointing Module Details

Optimization Tricks

1. Efficient Data Loading

2. Gradient Checkpointing for Large Models

3. Mixed Precision Training

4. Selective Layer Freezing

5. Learning Rate Scheduling

6. Pointing Loss Weighting

7. Data Augmentation for Robustness

Experiments & Results

Benchmark Performance

Ablation Studies

Comparison: Open vs. Closed Data

Zero-Shot Generalization

Common Pitfalls

1. Pointing Coordinate Misalignment

2. Training Data Imbalance

3. Vision-Language Misalignment

4. Overfitting to PixMo Dataset

5. Inefficient Batch Processing

6. Pointing Annotation Quality

References

Original Papers and Publications

Related AI2 Projects

Implementation Resources

Evaluation Benchmarks

Community and Support

Background on Open Science